Finding Influencers in Social Networks
Carolina de Figueiredo Bento
Dissertation submitted to obtain the Master Degree in
Information Systems and Computer Engineering
Jury
President: Prof. Dr. Mário Jorge Costa Gaspar da Silva
Supervisor: Prof. Dr. Bruno Emanuel da Graça Martins
Co-Supervisor: Prof. Dr. Pavel Pereira Calado
Member: Prof. Dr. Alexandre Paulo Lourenço Francisco
November 2012
Abstract
Among the millions of users of social platforms, the activities of a select few are perceived and spread
through the network more rapidly than those of others. These users are the influencers: they generate
trends and shape opinions in social networks, making them crucial in areas such as marketing and
opinion mining.
In my MSc thesis, I studied network analysis methods to identify influencers, experimenting with different
types of networks, namely location-based networks from services like FourSquare or Twitter, that include
relationships between users and between users and the locations they have visited, and academic
citation networks, i.e., networks that relate scientific papers through citations.
Within location-based networks, I estimated the most influential nodes through a set of network analysis
techniques. Since no ground-truth list exists, i.e., a list containing a set of well-known, widely accepted
influencers, the results were validated by comparison to traditional measures (e.g., the number of friends
a user has). The majority of the influencers are not the users with the highest number of friends.
Within academic citation networks, the most influential papers identified were indeed important publications:
they were authored by renowned researchers and recipients of important awards, and constituted either
fundamental reading or recent developments on a topic. I also developed a framework to predict future
influence scores and download counts through a combination of features. Accurate estimates were
obtained with learning methods such as RT-Rank.
Keywords: Social Networks, Network Analysis, Impact Scores, Finding Influencers
Resumo

Social networking services have millions of users; even so, one notices that the activity of a select
group of users is captured and propagated through the network more rapidly than that of others. We
call this group the influencers. They create trends and dominate opinions in social networks, being
crucial in areas such as marketing or opinion mining.

In my thesis, I studied network analysis methods to identify influencers, analyzing two types of networks,
namely location-based networks, originating from services such as FourSquare or Twitter, which include
relations between users and between users and the locations they have visited, and academic citation
networks, i.e., networks relating scientific papers through citations.

In location-based networks, the most influential nodes were estimated through a set of network analysis
techniques. The veracity of these results was assessed by comparison with traditional measures (e.g.,
the number of friends of a user), given that no validation list of influencers exists, i.e., a list containing a
set of unanimously recognized influencers.

In academic citation networks, the papers obtained as most influential are indeed important publications,
due to being authored by renowned, previously awarded scientists, and to being essential publications
or recent developments on a specific topic. I also developed a framework that predicts future influence
scores and future download totals, combining features such as previous influence scores. Through the
use of learning methods such as RT-Rank, accurate estimates are possible.

Keywords (Palavras-chave): Social Networks, Network Analysis, Influence Scores, Finding Influencers
Acknowledgments
First and foremost, I have to thank my parents, sister and brother-in-law for their unconditional support
and selflessness throughout these years, and especially during my MSc thesis.
I must thank my advisors, Prof. Dr. Bruno Martins and Prof. Dr. Pavel Calado, for all the support,
motivation, patience and availability. It is very comforting to be able to share ideas and openly discuss
new ways of addressing a problem with such ease. I must also thank them for giving me the opportunity
of being part of projects such as the European Digital Mathematics Library (EuDML) and the Services for
Intelligent Geographical Information Systems (SInteliGIS), both funded by the Portuguese Foundation for
Science and Technology (FCT) through the project grants with reference 250503 in CIP-ICT-PSP.2009.2.4
and PTDC/EIA-EIA/109840/2009, respectively.
I thank all the colleagues and close friends that have accompanied me throughout the years, and
especially the ones who have filled these last couple of years with so much joy, laughter and camaraderie.
So, to Ana Silva, João Lobato Dias, Luís Santos, João Amaro, Pedro Cruz, Jacqueline Jardim, Maria
Rosa, Luís Luciano, Carlos Simões, Mafalda Abreu, Célia Tavares and, thankfully, many others, I express
my enormous gratitude for keeping me (in)sane.
Last, but definitely not least, I must thank my boyfriend, João Fernandes, for his unconditional love,
support, patience and confidence, for helping me be more creative and sharp during the stressful
times, and for showing me there is always a light at the end of the tunnel.
Contents
Abstract i
Resumo iii
Acknowledgments v
1 Introduction 1
1.1 Hypothesis and Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Fundamental Concepts 5
2.1 Fundamental Concepts in Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Influencers in Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Prestige, Popularity and Attention in Social Networks . . . . . . . . . . . . . . . . . . . . . 9
2.4 Recognition, Novelty, Homophily and Reciprocity . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Active versus Inactive Users, User Retention, Confounding, Social Influence and Social
Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Information Cascades . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.7 Information Diffusion Models and Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.8 Graph Centrality Measures and Bibliographic Indexes . . . . . . . . . . . . . . . . . . . . 14
2.9 Unsupervised Rank Aggregation Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.10 Supervised Learning for Rank Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Related Work 27
3.1 The Hyperlinked Induced Topic Search (HITS) Algorithm . . . . . . . . . . . . . . . . . . . 27
3.2 The PageRank algorithm and its Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Weighted PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.2 Topic-Sensitive PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.3 TwitterRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 The Influence-Passivity (IP) Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Citation and Co-Authorship Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Temporal Issues in Ranking Scientific Articles . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Finding Influencers in Social Networks 43
4.1 Available Resources for Finding Influencers . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1.1 Characterizing Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Analysis of Location-based Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Data Collection from Online Services . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.2 Adaptation of the Influence-Passivity (IP) Algorithm . . . . . . . . . . . . . . . . . . 49
4.3 Analysis of Academic Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.1 Predicting Future Influence Scores and Download Counts . . . . . . . . . . . . . . 51
4.3.2 The Learning Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5 Validation Experiments 57
5.1 The Considered Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 The Obtained Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.1 Finding Influencers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.2 Predicting Future PageRank Scores and Download Counts . . . . . . . . . . . . . 67
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6 Conclusions 71
6.1 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Bibliography 75
Appendices 83
A Important Awards in Computer Science 83
List of Tables
5.1 Characterization of the FourSquare and Twitter networks. . . . . . . . . . . . . . . . . . . 58
5.2 Characterization of the DBLP dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3 Characterization of the DBLP network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4 User influence scores for PageRank and HITS algorithms, for the User+Spot Graph, built
from the FourSquare dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5 User influence scores for PageRank and HITS algorithms, for the User Graph, built from
the FourSquare dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.6 User influence scores for the IP algorithm, built from the FourSquare dataset. . . . . . . . 64
5.7 Spot influence scores for PageRank and HITS algorithms (that present the exact same
top-10), for the User+Spot Graph, built from the FourSquare dataset. . . . . . . . . . . . . 65
5.8 User influence scores for PageRank and HITS algorithms, for the User+Spot Graph, built
from the Twitter dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.9 User influence scores for PageRank and HITS algorithms, for the User Graph, built from
the Twitter dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.10 Spot influence scores for PageRank and HITS algorithms, for the User+Spot Graph, built
from the Twitter dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.11 PageRank scores for top-10 highest ranked papers of the DBLP dataset. . . . . . . . . . 67
5.12 Results for the prediction of impact PageRank scores for papers in the DBLP dataset. . . 68
5.13 Results for the prediction of download numbers for papers in the DBLP dataset. . . . . . . 69
List of Figures
2.1 A graph with the set of vertices V={1, ..., 8}, the set of edges E={(1, 2), (2, 4), (3, 4), ...}
and encoding a path P with length 6 (adapted from (Diestel, 2005)). . . . . . . . . . . . . 7
2.2 Graph with three components and two SCC’s denoted by dashed lines (adapted from
Easley & Kleinberg (2010) and Cormen et al. (2001)). . . . . . . . . . . . . . . . . . . . . 8
2.3 Flowchart for the Single Transferable Vote rule. . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Learning-To-Rank (L2R) Framework (adapted from Liu (2009)). . . . . . . . . . . . . . . . 24
3.5 A graph with hubs and authorities (adapted from Kleinberg (1998)). . . . . . . . . . . . . . 28
3.6 A graph illustrating the computation of PageRank (adapted from Page et al. (1998)). . . . 29
3.7 The general TwitterRank framework (adapted from Weng et al. (2010)). . . . . . . . . . . 34
4.8 Example of a location-based social network (adapted from Zheng & Zhou (2011)). . . . . 46
4.9 A sequence of subdivisions of the world sphere, starting from the octahedron, down to
level 5 corresponding to 8192 spherical triangles. The circular triangles have been plotted
as planar ones, for simplicity (adapted from Szalay et al. (2007)). . . . . . . . . . . . . . . 48
4.10 The HTM recursive division process (adapted from Szalay et al. (2007)). . . . . . . . . . . 49
4.11 Transformation of the original network graph (left) to our IP algorithm graph (right). . . . . 51
4.12 Structure of the citation graph built upon the DBLP data. . . . . . . . . . . . . . . . . . . . 51
4.13 Framework for predicting future PageRank scores and download counts. . . . . . . . . . . 52
5.14 Degree distribution for nodes in the User+Spot Graph and the User Graph, from the
FourSquare and Twitter datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.15 Degree distribution for the DBLP dataset from 2008 to 2011. . . . . . . . . . . . . . . . . 62
Chapter 1
Introduction
The rise of social media platforms such as Twitter1 or Google+2, with their focus on user-generated
content and social networks, has brought the study of authority and influence over social networks
to the forefront of current research. For companies and other public entities, identifying and engaging
with influential authors in social media is critical, since any opinions they express can rapidly spread far
and wide. For users, when presented with a vast amount of content relevant to a topic of interest, sorting
content by the source’s authority or influence can also assist in information retrieval.
There has been a substantial amount of recent work studying influence and the diffusion of informa-
tion in social networks. Moreover, there has also been much work in the field of network analysis that
has focused explicitly on sociometry, including quantitative measures of influence, authority, centrality or
prestige. These measures (e.g., degree centrality or betweenness centrality) are essentially heuristics,
usually based on intuitive notions such as access and control over resources, or brokerage of informa-
tion.
In the context of my MSc thesis I conducted a thorough study on the problem of identifying the most
influential nodes in a social network. With two different types of networks at hand, namely location-based
social networks from services such as FourSquare or Twitter, and academic citation networks encoding
relations between papers, the main focus was to use well-known social network analysis techniques and
algorithms.
One of the most important contributions of this work consisted in adapting the Influence-Passivity (IP)
algorithm, initially strictly intended for Twitter data and relying on re-tweets to capture information flow,
to be used in the context of location-based social networks, where the propagation of information is
done via the locations that users visit, i.e., exploiting patterns in which a user j visits a location l after
one of his friends i has already visited l.
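As a rough sketch of this idea (the tuple formats, function name, and data layout below are illustrative assumptions, not the thesis implementation), the directed influence edges can be derived from check-in timestamps:

```python
from collections import defaultdict

def influence_edges(checkins, friendships):
    """Derive directed influence edges i -> j whenever user j first visits a
    location l after a friend i had already visited l.

    checkins: iterable of (timestamp, user, location) tuples (assumed format).
    friendships: set of frozenset({u, v}) pairs for undirected friendship.
    """
    # Earliest visit of each user at each location.
    first_visit = {}
    for t, u, l in sorted(checkins):
        first_visit.setdefault((u, l), t)

    # Group first visits per location, in time order.
    by_location = defaultdict(list)
    for (u, l), t in first_visit.items():
        by_location[l].append((t, u))

    edges = defaultdict(int)  # (i, j) -> number of propagated locations
    for l, visits in by_location.items():
        visits.sort()
        for idx, (tj, j) in enumerate(visits):
            for ti, i in visits[:idx]:
                if frozenset({i, j}) in friendships:
                    edges[(i, j)] += 1
    return dict(edges)
```

The resulting weighted edge list is the kind of graph over which the IP algorithm's influence and passivity scores can then be iterated.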
1 http://twitter.com/
2 https://plus.google.com/
Regarding the study of influence in academic social networks, I studied techniques for estimating
future influence scores and future download counts. In this context, I specifically developed a
framework to predict the future PageRank scores and future download counts of scientific articles
downloaded from the ACM Digital Library1, for a specific year, through a combination of features that
include the age of the article and previous PageRank scores.
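RT-Rank belongs to the family of regression-tree ensembles. As a self-contained illustration of that family (not the actual RT-Rank code, and with toy feature vectors standing in for the real article features), gradient boosting over one-level regression trees can be sketched as:

```python
def fit_stump(X, y):
    """Best single-feature threshold split minimizing squared error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= thr]
            right = [y[i] for i, row in enumerate(X) if row[f] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if best is None or err < best[0]:
                best = (err, f, thr, lm, rm)
    if best is None:                      # all rows identical: predict the mean
        m = sum(y) / len(y)
        return 0, X[0][0], m, m
    return best[1:]

def gbrt_fit(X, y, rounds=200, lr=0.1):
    """Gradient boosting for squared loss: each stump fits current residuals."""
    base = sum(y) / len(y)
    residual = [v - base for v in y]
    stumps = []
    for _ in range(rounds):
        f, thr, lm, rm = fit_stump(X, residual)
        stumps.append((f, thr, lm, rm))
        for i, row in enumerate(X):
            residual[i] -= lr * (lm if row[f] <= thr else rm)
    return base, lr, stumps

def gbrt_predict(model, row):
    base, lr, stumps = model
    return base + sum(lr * (lm if row[f] <= thr else rm)
                      for f, thr, lm, rm in stumps)
```

In practice each row of X would hold features such as article age and past PageRank scores, and y the score observed one year later.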
1.1 Hypothesis and Methodology
In the context of my MSc thesis, I focused on the task of identifying the most influential users in a social
network, working with two types of networks, namely (1) location-based networks from services like
FourSquare or Twitter, that include relationships between users in the network and between users and
the locations they have visited, and (2) academic citation networks, i.e., networks that relate scientific
papers according to their citation relationships. The main hypothesis I tried to validate was that we
can identify the most influential users through social network analysis techniques and algorithms. More
specifically, with location-based networks, the presence of locations aids in the propagation of influence
scores through the network and, on the other hand, with academic citation networks, one can focus on
the temporal dynamics of networks and use networks from the past to predict future networks, assessing
how influence scores evolve through time.
In order to validate the research hypothesis, we began by collecting real and up-to-date data from two
social networking platforms, namely FourSquare2 and Twitter. For the location-based social networks,
different ranking algorithms were computed and the top-10 highest ranked users and the top-10 highest
ranked spots were extracted and analyzed. To assess the accuracy of these results, we conducted an
empirical analysis of the top-10, looking into the user profiles and spot check-ins, in order to understand
how profile characteristics relate to influence in the network. Regarding academic social networks, a
citation network was built with data from the DBLP3 digital library. The quality of the results from the
academic citation network was assessed by cross-checking the authors of the top-10 highest ranked
scientific papers in the DBLP collection against the recipients of various renowned scientific awards -
see Appendix A. For the experiments on estimating the future influence scores and future download
counts of scientific papers, a set of evaluation metrics, including the normalized root mean squared
error and the Spearman correlation, was used to assess the quality of our predictions against the real
influence scores.
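Both metrics are standard; a small illustrative sketch (simplified, with no tie handling in the rank computation) of how they can be computed:

```python
def nrmse(pred, real):
    """Normalized root mean squared error: RMSE divided by the range of the
    real values."""
    n = len(real)
    rmse = (sum((p - r) ** 2 for p, r in zip(pred, real)) / n) ** 0.5
    return rmse / (max(real) - min(real))

def spearman(pred, real):
    """Spearman rank correlation via the classic 1 - 6*sum(d^2)/(n*(n^2-1))
    formula; assumes no tied values."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rp, rr = ranks(pred), ranks(real)
    n = len(rp)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rr))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

NRMSE measures how far predictions deviate from the true values, while the Spearman correlation checks whether the predicted ranking of papers matches the true ranking, regardless of the scores' magnitudes.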
1 http://dl.acm.org/
2 https://foursquare.com/
3 http://www.informatik.uni-trier.de/~ley/db/
1.2 Main Contributions
The following are the most important contributions of this thesis, according to their relevance:
• I conducted a thorough study regarding ranking algorithms, with special focus on the PageRank
algorithm and its variants. I specifically implemented the HITS and the Influence-Passivity (IP)
algorithms. The IP algorithm was adapted to the context of location-based social networks. I
computed the influence for each node and extracted the highest ranked nodes in each type of
network. The code implementation of HITS and IP algorithms was made available as an open-
source project1, so that it can be re-used by others researching this topic.
• I implemented crawlers to extract data from FourSquare and from Twitter, from which I built net-
works with two types of nodes, namely users and spots (i.e., the locations users have visited).
These networks were used in the context of experiments for finding the most influential nodes
through algorithms such as HITS, PageRank or IP. The source code for the FourSquare crawler
was made available as an open-source project2, so it can be re-used by others researching this
topic.
• I built a citation network with data from the DBLP digital library, from which I extracted the most
influential papers after computing PageRank scores. The accuracy of these results was
assessed by cross-checking the authors of these papers against a list of the recipients of various
renowned scientific awards. From this experiment, we could conclude that the majority of the most
influential papers in this network are authored by recipients of important scientific awards.
• I developed a framework to predict the future PageRank scores and the future download counts of
scientific papers, for a specific year, using the citation network built from DBLP data. This task was
addressed through an ensemble learning regression algorithm. I assessed the impact that different
features have in the accuracy of the results. Our predictions were compared to the real PageRank
scores and the real number of downloads from the ACM Digital Library for each specific paper
and year. We concluded that in some cases, depending on the combination of features that we
used, adding more information can actually degrade the results, while in others, as we combine
more information, predictions become closer to the real values. Globally, this prediction approach
proved to be accurate, with the results being very close to the real values.
1 http://code.google.com/p/ezgraph/
2 http://code.google.com/p/fscrawler/
1.3 Organization of the Dissertation
The structure for the rest of this document is the following: Chapter 2 presents fundamental concepts
in social network analysis. Chapter 3 describes the most significant work related to the task of finding
influencers in social networks, and related to the analysis of location-based social networks. Chapter 4
details the work that was developed in the context of my MSc thesis, namely, the methodology for data
collection, how the networks were built, the specific implementation and adaptation of the IP algorithm,
as well as the methodology to find the influential nodes in the networks. Regarding the experiment on
the prediction of future PageRank scores, Chapter 4 also includes the description of the features and
the learning approach that was used. Chapter 5 describes the validation experiments and the obtained
results, alongside a brief discussion. Finally, Chapter 6 closes this document, highlighting the most
important conclusions of this MSc thesis, and presenting possible paths for improvement and future
work.
Chapter 2
Fundamental Concepts
This chapter introduces the fundamental concepts related to the problem of finding influencers in
social networks. After a brief introduction to graph theory, more specific concepts are then presented,
such as what it means to be an influencer, the distinction between popularity and prestige, what one
means when discussing social gestures, and the social gestures that are most relevant in the context of
this MSc thesis, namely homophily and reciprocity. Finally, this chapter introduces fundamental concepts
behind graph centrality measures, bibliometric indexes and rank aggregation approaches, the latter
concerning the combination of the outputs of various ranking methods to generate a consensual ranked
list.
2.1 Fundamental Concepts in Graph Theory
A graph G can be represented as a pair G = (V,E), where V or V (G) is the set of vertices or nodes
and E or E(G) is the set of edges or links between the nodes (Figure 2.1). The number of vertices of a
graph indicates the graph’s order (Diestel, 2005). Graphs are usually used when representing networks,
either undirected (Figure 2.1) or directed (i.e., digraphs, in which the edges have a direction from a node
A to a node B). A way of representing a directed graph D is with an adjacency matrix, which is a square
matrix A = A(D) where each cell (i, j) has a value equal to 1 if there is an edge from i to j, and a value
equal to 0 otherwise (Harary, 1962).
In what regards graph measures, the degree d_G(i) or valency of a vertex i in an undirected graph G
is the number |E(i)| of edges at i, which is equal to the number of neighbours of i, i.e., the number of
vertices that are adjacent to i. It can be mathematically expressed as follows, where a(i, j) denotes a
cell in the graph's adjacency matrix:
d_G(i) = \sum_j a(i, j) = \sum_j a(j, i)    (2.1)
In what regards directed graphs, we have the same notation as in undirected graphs, with the exception
that, when specifying the set of edges E, all pairs of connected vertices have to be oriented. Besides
the measure of degree, one can also measure the in-degree d_G^{in}(i) and out-degree d_G^{out}(i) of a
vertex i, which are, respectively, the number of incoming edges and outgoing edges of that vertex (Clark &
Holton, 1991). The in-degree and out-degree can also represent the cardinality of, respectively, the set of
predecessors and successors of a node, and can be formally expressed as follows:

d_G^{in}(i) = \sum_j a(j, i)    (2.2)

d_G^{out}(i) = \sum_j a(i, j)    (2.3)
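Equations 2.2 and 2.3 translate directly into code; a minimal sketch over a 0/1 adjacency matrix stored as a list of lists (an assumed representation):

```python
def degrees(A):
    """In-degree and out-degree of every vertex of a directed graph,
    read off the columns and rows of the adjacency matrix A."""
    n = len(A)
    out_deg = [sum(A[i][j] for j in range(n)) for i in range(n)]  # row sums
    in_deg = [sum(A[j][i] for j in range(n)) for i in range(n)]   # column sums
    return in_deg, out_deg
```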
One might also want to represent a weighted network, i.e., a network in which each edge is assigned
a specific weight. A weighted network can be expressed as an adjacency matrix with each entry
indicating the weight of the corresponding edge (w_{ij}), as follows (Newman, 2004):

A_{ij} = w_{ij}    (2.4)

When representing a weighted network with a graph, one just has to add the weights to each edge, thus
defining a weighted graph. For a weighted network, besides the in-degree and out-degree of a vertex
i, one is usually more interested in the strength of i, i.e., the sum of the weights w of the corresponding
edges. The in-strength s^{in}_i and out-strength s^{out}_i of a vertex i are expressed as follows (Luciano
et al., 2005):

s^{in}_i = \sum_j w(j, i)    (2.5)

s^{out}_i = \sum_j w(i, j)    (2.6)
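The in-strength and out-strength of equations 2.5 and 2.6 are the weighted counterparts of the degree sums; a minimal sketch:

```python
def strengths(W):
    """In-strength and out-strength of every vertex from the weighted
    adjacency matrix W; with 0/1 weights these reduce to the degrees."""
    n = len(W)
    s_out = [sum(W[i][j] for j in range(n)) for i in range(n)]  # row sums
    s_in = [sum(W[j][i] for j in range(n)) for i in range(n)]   # column sums
    return s_in, s_out
```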
Also important in graph analysis is the notion of a stochastic matrix. A square matrix A = (a_{k\lambda}) can
only be called stochastic if all its elements are non-negative and if the following condition is verified
(Brauer, 1952):

\sum_{\lambda=1}^{n} a_{k\lambda} = 1,  k = 1, 2, ..., n    (2.7)

Stochastic matrices can be used to encode weighted graphs where the in-degree or the out-degree
correspond to probability distributions.
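For instance, a weight matrix can be made row-stochastic (each row of out-edge weights becomes a probability distribution) by normalizing every row by its sum; in this sketch, dangling rows, which sum to zero, are left untouched:

```python
def row_stochastic(W):
    """Normalize each row of a non-negative weight matrix so it sums to 1,
    turning out-edge weights into a probability distribution."""
    result = []
    for row in W:
        s = sum(row)
        result.append([w / s for w in row] if s > 0 else row[:])
    return result
```

This is exactly the kind of transition matrix used later by random-walk based measures such as PageRank.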
A path within a graph is a non-empty sub-graph P = (V, E) such that V = {x_0, x_1, ..., x_k} and
E = {(x_0, x_1), (x_1, x_2), ..., (x_{k-1}, x_k)}, where the x_i are all distinct from one another – see
Figure 2.1. The nodes x_0 and x_k are called the ends of path P (Bondy & Murty, 1976). For undirected
and unweighted graphs, the number of edges (|E|) in a path is the length of the path.

Figure 2.1: A graph with the set of vertices V = {1, ..., 8}, the set of edges E = {(1, 2), (2, 4), (3, 4), ...}, encoding a path P with length 6 (adapted from Diestel, 2005).
One might also be interested in determining the geodesic path, i.e., the shortest path, between two
vertices. The geodesic path between vertices i and j is the path between them that has the minimum
length (Luciano et al., 2005).
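For unweighted graphs, the geodesic distance can be computed with a breadth-first search; a minimal sketch, with the graph given as an adjacency dictionary (an assumed representation):

```python
from collections import deque

def geodesic_length(adj, src, dst):
    """Length of the geodesic (shortest) path between src and dst in an
    unweighted graph, via breadth-first search; None if dst is unreachable."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None
```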
When describing the structure of a graph, one can parcel it out into components or connected compo-
nents, i.e., subsets of nodes in which every node has a path to every other node, but are not part of
a larger set that is also internally connected (Gibbons, 1985) – see Figure 2.2. A directed graph can
have strongly connected components (SCCs), which are sets of nodes such that, for any nodes i and j
belonging to the set, there is a path from i to j and a path from j to i (Gibbons, 1985). Dangling nodes
are defined as nodes that have no outlinks. Figure 2.2 illustrates both these concepts in a graph.
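Connected components of an undirected graph can be found by repeated flood fill; a minimal sketch (finding the SCCs of a directed graph needs more machinery, e.g., Tarjan's algorithm, and is omitted here):

```python
def components(adj):
    """Connected components of an undirected graph given as an adjacency
    dictionary: repeatedly flood-fill from an unvisited vertex."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj.get(u, []))
        seen |= comp
        comps.append(comp)
    return comps
```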
2.2 Influencers in Social Networks
Influence in social networks is very important, not only from the perspective of information flow, but also
for network analysis applications aimed at business and marketing purposes. As for what it means to be
influential, many authors have their own particular definitions.
Watts & Dodds (2007) define an influential person or an opinion leader as an individual that is part of a
minority and who has influence over a great number of peers. This influential individual belongs to the
top q% of the influential distribution p(n), having as a premise that an individual i, within a population of
size N, influences n_i other randomly chosen individuals, where n_i comes from p(n) and refers to how
many people i influences, regarding a specific topic X.

Figure 2.2: Graph with three components and two SCCs denoted by dashed lines (adapted from Easley & Kleinberg (2010) and Cormen et al. (2001)).
From work developed in the Web Ecology Project, in the context of Twitter, an influential is defined as
a user who, through his actions (i.e., interactions such as replies, retweets, mentions or attributions),
has the potential to initiate an action from another user (Leavitt et al., 2009). These actions are called
markers of influence and should be taken into account when assessing the influence of Twitter users,
instead of the simplistic measure of follower count, which assumes that the user with the greatest
number of followers is the most influential.
Bakshy et al. (2011), also on Twitter, consider that if a person B is following a person A, if person A
posted a URL earlier than person B did, and if person A is the only one of B's friends who has posted
that specific URL, then person A has influenced person B to post that URL. Regarding the computation
of influence, the authors recognize that three different approaches can be considered if person B has
more than one friend who has posted the same URL:
i. First Influence, crediting exclusively the person who first posted the content, thus assuming that
individuals are influenced when they first see novel information, even if they do not act on it imme-
diately;
ii. Last Influence, crediting the last person who posted the content;
iii. Split Influence, crediting equally all friends that posted that specific content before its most recent
post. This last approach assumes that either the likelihood of noticing novel content or the intention
of acting upon it steadily accumulates, as the information is reposted by more and more friends.
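The three credit-assignment schemes can be stated compactly in code; in this sketch (the function name and input format are illustrative), `posters` is the time-ordered list of B's friends who posted the URL before B did:

```python
def influence_credit(posters, mode):
    """Assign influence credit for one repost among the friends who posted
    the URL earlier; posters is a time-ordered list of those friends."""
    if mode == "first":                 # First Influence: earliest poster
        return {posters[0]: 1.0}
    if mode == "last":                  # Last Influence: most recent poster
        return {posters[-1]: 1.0}
    if mode == "split":                 # Split Influence: equal shares
        share = 1.0 / len(posters)
        return {p: share for p in posters}
    raise ValueError(mode)
```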
In turn, still in the realm of Twitter, Cha et al. (2010) defined three types of influence for a user, rather
than just one. These metrics are directly related to interpersonal activities:
i. Indegree Influence, counting the total number of followers to determine the size of the user's
audience in the network;

ii. Retweet Influence, counting the total number of retweets containing a user's name, to measure the
ability of a user to generate content that is spread by others through the network (i.e., his pass-along
value);

iii. Mention Influence, counting the total number of mentions of a user's name, to measure the ability
of engaging other users in a conversation.
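Given a log of interactions, the three counts are straightforward tallies; a sketch over a hypothetical event-tuple format (the tuple layout is an assumption for illustration):

```python
def cha_influence(events, user):
    """The three Cha et al. (2010) influence counts for one user, tallied from
    a hypothetical log of (kind, source, target) interaction tuples."""
    counts = {"indegree": 0, "retweet": 0, "mention": 0}
    kind_map = {"follow": "indegree", "retweet": "retweet", "mention": "mention"}
    for kind, _, target in events:
        if target == user and kind in kind_map:
            counts[kind_map[kind]] += 1
    return counts
```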
Another important aspect of influence in social networks is that influence is determined by the information
flow, i.e., the flow of user-generated content and its propagation through the network (Romero et al.,
2011).
2.3 Prestige, Popularity and Attention in Social Networks
Although popularity and prestige are two distinct concepts, they are commonly mistaken one for the
other. Both these concepts are related to influence, since prestigious and/or popular users are more
likely to be influential.
One can define popularity as a direct quantification of the level of attention someone has in a social
network (Romero et al., 2011). One can, for instance, assess popularity in Digg1 or in YouTube2 by,
respectively, the number of votes (Diggs) and the number of views that the content of a given user has
(Szabo & Huberman, 2010).
As for the notion of prestige, it is most commonly associated with scholarly networks, such as paper
citation and journal citation networks. In this realm, there is also a distinction between journal popularity
and journal prestige: a popular journal may be frequently cited by journals with little prestige, while a
prestigious journal may have few citations, but coming only from highly prestigious journals (Bollen et
al., 2006).
In a scholarly network, the popularity of an author, journal or paper is the number of times it was cited
by other nodes in the network, while its prestige is the number of times it was cited by other highly cited
nodes in the network (Ding & Cronin, 2011).
In the academic realm, attention is seen as a form of payment, as well as the main input to scientific
production (Franck, 1999). Scientific publications earn attention when cited by other authors in their
1 http://digg.com/
2 http://youtube.com/
publications. Also in other social networks, attention is regarded as a form of value and as a catalyst for
more contributions in the social network (Wu et al., 2009).
2.4 Recognition, Novelty, Homophily and Reciprocity
Influential and popular people are recognized by their peers and also by many others outside their
communities. Recognition, be it in blogs, academia or social media, comes from referencing a person’s
work, opinions or ideas, and it can have a bidirectional relationship with influence, since the more
influential the sources a user references, the more influential the user can become (Agarwal et al.,
2008).
Novelty is also correlated with influence, in that novel ideas generally exert more influence. In
the blogosphere, novelty is also correlated with the number of outlinks of a blog post. Nevertheless, this
is a negative correlation, as a greater number of outlinks indicates that the post refers to many other
blog posts, revealing that the post is not likely to be novel (Agarwal et al., 2008).
In the context of human interaction, homophily refers to the observation that people with similar
characteristics, interests and/or preferences tend to be more in contact with each other than with
people with whom they have fewer characteristics and/or preferences in common. As stated in
the work of McPherson et al. (2001), homophily implies that distance in terms of social characteristics
translates into network distance, the number of relationships through which a piece of information must
travel to connect two individuals.
Another important social phenomenon is reciprocity, arising from following relationships in social net-
works, such as Twitter, where a user has the tendency to follow back a user that followed him in the
first place. This is revealed by the high correlation between the number of friends and followers,
meaning that the more friends a user has, the more followers he usually has, and vice-versa (Weng
et al., 2010).
Weng et al. (2010), in the study of TwitterRank, addressed the presence of homophily and reciprocity
on Twitter, considering that these characteristics are behind the following relationships, giving more
meaning to social ties and to the identification of influential people on Twitter.
2.5 Active versus Inactive Users, User Retention, Confounding,
Social Influence and Social Correlation
When a user performs an action for the first time, such as purchasing a product or visiting a website,
one can state that the user has become active. With a number a of already active friends, a user
has an activation probability p(a), which can be modeled with a logistic function expressed as follows:

p(a) = e^(α ln(a+1)+β) / (1 + e^(α ln(a+1)+β))    (2.8)
In the formula, α and β are coefficients, with α measuring social correlation. Both can be estimated
using maximum likelihood logistic regression (Anagnostopoulos et al., 2008).
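A minimal sketch of Equation 2.8 follows; the coefficient values are purely illustrative (in practice, α and β would be estimated by logistic regression, as noted above):

```python
import math

def activation_probability(a, alpha, beta):
    """Equation 2.8: logistic probability of activation, given a active friends."""
    z = alpha * math.log(a + 1) + beta
    return math.exp(z) / (1 + math.exp(z))

# With positive alpha, more active friends raise the activation probability.
p1 = activation_probability(1, alpha=0.8, beta=-2.0)
p10 = activation_probability(10, alpha=0.8, beta=-2.0)
```

With α > 0, p(a) grows with the number of active friends, which is exactly the social correlation effect the coefficient is meant to capture.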
An active user can become a retained user if he stays active in the network, therefore affecting the
retention of other users and keeping them from leaving the network (Heidemann et al., 2010). This can
also be used as an evaluation metric to identify influential users in social networks, as Heidemann et al.
(2010) proposed to do.
Also, one can state that two adjacent nodes u and v in a social network have a social correlation tie if
the events that turned u into an active user are correlated with the events that turned v into an active
user as well. This behavioral correlation can be accounted for by homophily, confounding factors (i.e., the
environment) and social influence (Anagnostopoulos et al., 2008).
Confounding factors are influences from external elements, which end up affecting individuals that are
close in a social network. Mathematically, confounding corresponds to the presence of a confounding
variable X and a set of active individuals W, both in a social network G, such that the set of active
individuals W comes from a distribution that is correlated with X (Anagnostopoulos et al., 2008).
In confounding, the individuals’ choices of becoming friends with others and of becoming active are
exclusively affected by the same unobserved variable X.
The phenomenon of social influence is also one of the causes of social correlation. With social influence,
the actions of individuals can induce their friends to act the same way, which can occur via (i) setting
an example for their friends, (ii) informing friends about the action taken, or (iii) increasing the value of
an action for their friends (Anagnostopoulos et al., 2008).
2.6 Information Cascades
In the theory behind information cascades, we assume that agents observe private signals of some
inherent state and make public decisions. Subsequent decision-makers will face the difficulty of knowing
if their own private signal is significant in the choice of a state that is unlikely, given the public decisions
that were previously observed (Anderson & Holt, 1995).
An information cascade occurs when all decisions (initial and subsequent) coincide, in such a way that
it is optimal for subsequent decision-makers to ignore their private signals and follow a
pattern that has been established. For example, suppose that a worker is not hired by several prospec-
tive employers because of poor interview performances. With this public decision information, a
subsequent prospective employer may not hire the worker, because the worker’s information is
dominated by the negative signals inferred from previous rejections, even if the candidate does well in his
interview (i.e., a positive private signal). Therefore, an information cascade can result from rational
inferences that others’ decisions are based on information that dominates one’s own signal (Anderson &
Holt, 1995).
In the work of Papagelis et al. (2009), in the context of the blogosphere, a cascade is characterized
by its (i) size, i.e., the number of nodes involved in the cascade, excluding
its initiator; (ii) height, i.e., the height of the resulting spanning tree, after a depth first search traversal
on the cascade; (iii) minimum reaction time of all posts in the cascade, excluding its initiator; (iv) mean
reaction time of all posts in the cascade, excluding its initiator; and (v) maximum reaction time of all
posts in the cascade, excluding its initiator.
In social networks, there are many factors that influence information cascades, such as the graphical
interface used to interact with the network (Millen & Patterson, 2002), the fact that an in-topic conversa-
tion/interaction is being maintained (Arguello et al., 2006), or positive attention and feedback (Huberman
et al., 2009).
The analysis of information cascades can provide insight on public opinion over a variety of topics
(Papagelis et al., 2009). Therefore, this is related to the task of finding influential users on a social
network, since those influential users are the ones who tend to shape, i.e., influence, the opinions of
other users in the social network.
2.7 Information Diffusion Models and Measures
Young (2009) presents three information diffusion models, arising, respectively, from the realms of
marketing, sociology and economics: (i) social contagion, (ii) social influence and (iii) social learning.
In social contagion, information spreads like in an epidemic, i.e., people spread information when they
come into contact with others who have already been in contact with that same information (Young, 2009). This
model is, thus, based on exposure. The homogeneous contagion model at time t can be mathematically
described by the following ordinary differential equation:

ṗ(t) = (λp(t) + γ)(1 − p(t))    (2.9)
In the formula, λ and γ are non-negative parameters, not both equal to zero, respectively corresponding
to the instantaneous rates at which a current non-adopter hears about the information from a previous
adopter within the group and from outside the group.
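Equation 2.9 can be integrated numerically to observe the adoption curve. The following sketch uses forward-Euler integration with illustrative parameter values:

```python
def simulate_contagion(lam, gamma, p0=0.0, dt=0.01, steps=2000):
    """Forward-Euler integration of dp/dt = (lam * p + gamma) * (1 - p)."""
    p = p0
    for _ in range(steps):
        p += dt * (lam * p + gamma) * (1 - p)
    return p

# Adoption rises monotonically towards 1 as time passes (parameters illustrative).
p_early = simulate_contagion(lam=0.5, gamma=0.1, steps=100)
p_late = simulate_contagion(lam=0.5, gamma=0.1, steps=2000)
```

The factor (1 − p) makes growth slow down as the pool of non-adopters shrinks, producing the familiar S-shaped diffusion curve.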
In social influence, users spread information when enough other people in their group have already
been in contact with it. In a standard model, it is assumed that users have different social thresholds,
which determine if they will spread that information or not, as a function of the number of others that
have already spread it. Users are, thus, moved by social pressure, in a way that the aforementioned
thresholds refer to their degree of responsiveness to social influence. Also, the threshold of user i is
the minimum proportion ri ≥ 0, such that i will only spread information if, at least, a proportion ri of the
members of the group already have done the same. If ri > 1, it is implied that, for user i to spread the
information, at least, the whole group had to have spread it as well. Therefore, in this latter case, i never
spreads the information. With F (r) being the cumulative distribution function of thresholds in some given
population, at time t, the proportion of people whose thresholds have been crossed is F (p(t)). Having
λ as the instantaneous rate at which people are converted to spread the information, and assuming that
a proportion p(t) of the population has already spread it, the proportion of users whose thresholds have
already been crossed, but who have not yet spread the information, is F(p(t)) − p(t) (Young, 2009). Thus, this model can be expressed as
follows:
ṗ(t) = λ[F(p(t)) − p(t)],  λ > 0    (2.10)
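A discrete-time sketch of these threshold dynamics, in the spirit of Equation 2.10: at each step, everyone whose threshold has been crossed spreads the information. This simplifies the continuous-time model but exhibits the same fixed-point behaviour; the threshold values are illustrative:

```python
def threshold_cascade(thresholds, steps=100):
    """Discrete-time threshold dynamics: p_{t+1} = F(p_t), with F the
    empirical CDF of the population's thresholds."""
    n = len(thresholds)
    p = 0.0
    for _ in range(steps):
        p = sum(1 for r in thresholds if r <= p) / n
    return p

# A few zero-threshold "innovators" trigger a full cascade...
full = threshold_cascade([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
# ...while without them nobody ever starts spreading.
none = threshold_cascade([0.2, 0.4, 0.6, 0.8])
```

The two runs illustrate how the distribution of thresholds, not just their average, determines whether a cascade takes off.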
In a social learning model, users spread information once they have enough empirical evidence to
convince them that the information is worth spreading. Thus, users make rational use of previously
gathered evidence in order to reach a decision (e.g., when a new smartphone is out in the market,
people tend to see how it works for others over some period of time before trying for themselves).
Due to sources of heterogeneity, such as discrepancies in their prior beliefs, the amount of information
they have gathered, or idiosyncratic costs, people may spread information at different times. In this
type of model, which explains why people spread information given that others have already done so,
the adoption decision flows directly from the rational evaluation of evidence. There
are two types of social learning models, namely (i) social learning models with direct observation, where
the evidence comes from other people’s experiences, i.e., people believe that the information is worth
spreading because other people have done it, and their spreading payoff is fully observable, and (ii)
herding models, where only the spreading act is observable (Young, 2009).
In a social learning model with direct observation, one can assume that (Young, 2009):
i. Payoffs are observable;
ii. Payoffs generated by different individuals and/or at different points in time are independent and
equally informative;
iii. Users are risk-neutral and myopic (i.e., they care only about immediate payoffs);
iv. There is no idiosyncratic component to payoffs due to differences in user’s types, although users
may have different costs (not necessarily observable);
v. There are differences in users’ prior beliefs about how good the information is relative to the status
quo;
vi. There are differences in the average number of people users observe, and hence in the amount of
information they have;
vii. The population is fully mixed.
In this case, the system becomes very simple and the various types of heterogeneity are reduced to
a composite index that measures the probability of a given user spreading, conditional on the amount
of information that has been generated so far in the population (Young, 2009). Regarding information
diffusion measures, the most common are (i) speed, which considers whether and when a diffusion
instance will take place, (ii) scale, i.e., the number of instances that were affected at first degree, and
(iii) range, which measures how far the diffusion chain can continue in depth (Yang & Counts, 2010).
2.8 Graph Centrality Measures and Bibliographic Indexes
In graph theory, graph centrality measures provide a way of measuring the varying importance of network
vertices, according to specific criteria and the role played by the nodes of a network. In Bibliometrics,
an area concerned with the analysis of patterns in scientific literature, bibliometric indexes are used to
evaluate the quality, impact and relevance of the work of a particular scientist, usually by analyzing the
citation graph. In the context of this MSc thesis, both these areas are particularly important, because
they can provide robust approaches for estimating influence. Some of the most important network
centrality metrics are as follows:
i. Degree Centrality : Degree centrality is a measure of the popularity of a node in a network (New-
man, 2003). It is defined according to the number of edges connected to a particular vertex in the
network, and is mathematically expressed as follows:
CD(v) = dG(v) / (n − 1)    (2.11)

In the formula, dG(v) is the degree of vertex v and n is the total number of vertices in the network.
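A small sketch of Equation 2.11 over an adjacency-dict representation; the star graph below is a toy example:

```python
def degree_centrality(adj):
    """Equation 2.11: C_D(v) = deg(v) / (n - 1), for an undirected graph
    stored as an adjacency dict {vertex: set(neighbours)}."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

# Star graph on 4 vertices: the hub touches every other vertex.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
cd = degree_centrality(star)   # hub -> 1.0, each leaf -> 1/3
```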
ii. Betweenness Centrality: This measure is based on the number of shortest paths that pass through
a vertex. For instance, the betweenness of a vertex i is the fraction of geodesic paths between
pairs of vertices of the network that happen to be passing through i. In case of more than one
shortest path between a pair of vertices, each path is given an equal weight such that their sum is
equal to one (Newman, 2003). Assuming that g(jk)i is the number of geodesic paths from vertex
j to vertex k that pass through i, that njk is the total number of geodesic paths from vertex j to
vertex k, and that n is the total number of vertices in the network, the betweenness of vertex i is
computed as follows:

bi = ( ∑_{j<k} g(jk)i / njk ) / ( (1/2) n (n − 1) )    (2.12)
With the betweenness measure, the extent to which a node has control over the information that
flows between others can be estimated.
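Equation 2.12 can be computed by enumerating the geodesics between all pairs. The brute-force sketch below is only suitable for toy graphs (production code would use Brandes’ algorithm, as implemented in the libraries listed later in this section):

```python
def all_shortest_paths(adj, s, t):
    """Enumerate all shortest simple paths from s to t (brute force; toy graphs only)."""
    best, paths = None, []
    stack = [(s, [s])]
    while stack:
        v, path = stack.pop()
        if v == t:
            if best is None or len(path) < best:
                best, paths = len(path), [path]
            elif len(path) == best:
                paths.append(path)
            continue
        if best is not None and len(path) >= best:
            continue  # cannot become a shortest path any more
        for w in adj[v]:
            if w not in path:
                stack.append((w, path + [w]))
    return paths

def betweenness(adj):
    """Equation 2.12: fraction of geodesics through i, normalised by (1/2)n(n-1)."""
    vs = sorted(adj)
    n = len(vs)
    b = dict.fromkeys(vs, 0.0)
    for x in range(n):
        for y in range(x + 1, n):
            paths = all_shortest_paths(adj, vs[x], vs[y])
            if not paths:
                continue
            for i in vs:
                if i not in (vs[x], vs[y]):
                    b[i] += sum(1 for p in paths if i in p) / len(paths)
    norm = 0.5 * n * (n - 1)
    return {i: b[i] / norm for i in vs}

# Path graph a - b - c: only the a-c geodesic has an intermediate vertex.
path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
bt = betweenness(path3)   # b -> 1/3, a and c -> 0
```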
iii. Closeness Centrality: This measure is defined as the average geodesic distance, i.e., the average
shortest path, between a vertex and all the other vertices that are reachable from it. By measuring
a vertex’s closeness, we can measure how long it will take to spread information from this par-
ticular vertex to the other vertices in the network (Freeman, 1978). Closeness Centrality can be
mathematically expressed as follows:
CC(i) = 1 / ∑_{j∈V\{i}} g(i,j)    (2.13)

In the formula, V represents the total set of vertices of the network and g(i,j) is the distance of the
geodesic path between vertices i and j.
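A sketch of Equation 2.13, using breadth-first search to obtain the geodesic distances from i:

```python
def closeness(adj, i):
    """Equation 2.13: reciprocal of the sum of geodesic distances from i
    (graph given as an adjacency dict; BFS computes the distances)."""
    dist, frontier = {i: 0}, [i]
    while frontier:
        nxt = []
        for v in frontier:
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    nxt.append(w)
        frontier = nxt
    return 1 / sum(d for v, d in dist.items() if v != i)

path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
cc_b = closeness(path3, "b")   # distances 1 and 1 -> 1/2
```

Note that some libraries normalise closeness differently (e.g., multiplying by n − 1); the sketch follows Equation 2.13 literally.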
iv. Eigenvector Centrality: This measure weights the contacts according to their centralities, taking
into account the whole pattern of the network and computing the weighted sum of both direct and
indirect connections of every length. Therefore, having the graph G(E, V ), the adjacency matrix
A, λ as the largest eigenvalue of A, and n as the number of vertices, the eigenvector centrality xi
of node i can be expressed as follows (Bonacich, 2007):
λxi = ∑_{j=1}^{n} aij xj ,   i = 1, ..., n    (2.14)
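Equation 2.14 can be solved by power iteration. The sketch below iterates on A + I (which has the same eigenvectors as A) so that it also converges on bipartite graphs; the star graph is a toy example:

```python
def eigenvector_centrality(adj, iters=200):
    """Power iteration for Equation 2.14, run on A + I (same eigenvectors
    as A) to avoid oscillation on bipartite graphs."""
    x = {v: 1.0 for v in adj}
    for _ in range(iters):
        y = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = max(y.values())
        x = {v: y[v] / norm for v in y}
    return x

# Star graph: the hub is adjacent to every other vertex and gets the top score.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
ev = eigenvector_centrality(star)   # leaves converge to 1/sqrt(3) of the hub's score
```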
v. Clustering Coefficient: As a measure for transitivity, Watts & Strogatz (1998) introduced the clustering
coefficient. This coefficient measures the degree to which the neighbours of a vertex tend to be
connected to one another, and it can be globally expressed as follows (Kaiser, 2008):

C = ∑_{i∈V} Γi / ∑_{i∈V} dG(i)(dG(i) − 1)    (2.15)
In the formula, i is a vertex of graph G, which has V as its set of vertices, dG(i) is the degree of i, and
Γi is the number of connections among the neighbours of vertex i. The above global definition of the
clustering coefficient is obtained through the computation of a local clustering coefficient which,
for undirected graphs, is defined as in Equation 2.16 and, for directed graphs, as in Equation 2.17:

C(i) = 2|ejk| / ( dG(i)(dG(i) − 1) )    (2.16)

C(i) = |ejk| / ( dG(i)(dG(i) − 1) )    (2.17)
In both formulas, i, j and k are vertices of graph G, dG(i) is the degree of i, and |ejk| represents
the total number of existing edges between the neighbours of vertex i.
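A sketch of the local coefficient in Equation 2.16 for an undirected graph stored as an adjacency dict; the toy example is a triangle with a pendant vertex:

```python
def local_clustering(adj, i):
    """Equation 2.16 (undirected): C(i) = 2*|e_jk| / (deg(i)*(deg(i)-1)),
    where |e_jk| counts edges among i's neighbours."""
    nbrs = adj[i]
    d = len(nbrs)
    if d < 2:
        return 0.0
    # Each undirected edge among the neighbours is seen twice, hence // 2.
    links = sum(1 for j in nbrs for k in adj[j] if k in nbrs) // 2
    return 2 * links / (d * (d - 1))

# Triangle a-b-c plus a pendant vertex d attached to a:
g = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
ca = local_clustering(g, "a")   # one edge among three neighbours -> 1/3
```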
vi. Average Path Length: This network topology measure determines the distance between any pair
of vertices, and it can be used to determine if the graph is characteristic of a social network (Reka
& Barabasi, 2002). It is computed as the average length over all shortest paths between pairs of
vertices (Luciano et al., 2006), and it can be mathematically expressed as follows:
〈L〉 = ( 1 / (n(n − 1)) ) ∑_{i,k∈V} gik    (2.18)

In the formula, V is the set of vertices in the network, gik represents the distance of the geodesic
path between vertices i and k, and the parameter n represents the total number of vertices in the
graph.
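A sketch of Equation 2.18, averaging BFS distances over all ordered pairs of a small connected graph:

```python
def average_path_length(adj):
    """Equation 2.18: mean geodesic distance over all ordered pairs."""
    n = len(adj)
    total = 0
    for src in adj:
        dist, frontier = {src: 0}, [src]
        while frontier:               # breadth-first search from src
            nxt = []
            for v in frontier:
                for w in adj[v]:
                    if w not in dist:
                        dist[w] = dist[v] + 1
                        nxt.append(w)
            frontier = nxt
        total += sum(dist.values())
    return total / (n * (n - 1))

path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
apl = average_path_length(path3)   # distances 1,1,1,1,2,2 -> 8/6 = 4/3
```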
To compute these network centrality measures, some readily available open-source libraries can be
used. These include:
i. Gephi1 (Bastian et al., 2009): A Java library for social network analysis and data visualization;
ii. NetworkX 2 (Hagberg et al., 2008): A Python library to create, manipulate and analyze complex
networks;
iii. Network Workbench3: A Java framework for large-scale network analysis and data visualization;
1 http://gephi.org/developers/
2 http://networkx.lanl.gov/
3 http://nwb.cns.iu.edu/
iv. iGraph1: A C library for graph analysis which integrates with the R package2 for data visualization
and statistical computing, which also provides other methods for social network analysis;
v. CiteSpace3 (Chen, 2006): A Java application for visualizing and analyzing trends and patterns in
scientific literature;
vi. NetKit-SL4 (Macskassy & Provost, 2007): A set of Java packages which provide an implementation
of several graph centrality measures;
vii. CytoScape5 (Shannon et al., 2003): A Java software platform for complex network visualization,
which also provides network analysis via plugins.
As for bibliometric indexes, some of the most widely used are as follows:
i. The h-index and its variants : Proposed by Hirsch (2010) to quantitatively represent the output
of a researcher, this index measures the productivity and total impact of a scientist, supporting
comparisons between scientists of different ages (Hirsch, 2010). A researcher has an h-index of h
if h of his/her Np papers (i.e., the total number of published papers) have at least h citations each,
and the other (Np − h) papers have ≤ h citations each.
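The definition translates directly into a few lines of code; a sketch:

```python
def h_index(citations):
    """Largest h such that h of the papers have at least h citations each."""
    cits = sorted(citations, reverse=True)
    h = 0
    while h < len(cits) and cits[h] >= h + 1:
        h += 1
    return h

# A researcher with papers cited [10, 8, 5, 4, 3] times has h = 4:
# four papers with at least 4 citations each, but not five with at least 5.
```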
Several variants of this metric have been proposed, in order to deal with some of the problems of
the original h-index. One such extension is the contemporary h-index (Sidiropoulos et al., 2007),
which takes into account the age of an article and allows us to acknowledge the work of young
promising scientists and of senior scientists, who happen to still be active. The contemporary
h-index score Sc(i) for article i depends on the value of:
Sc(i) = γ · (Y(now) − Y(i) + 1)^(−δ) · |C(i)|    (2.19)
In the formula, Y(i) represents the year of publication of article i and C(i) represents the set of
articles that cite article i. The parameter δ is set to 1, so that Sc(i) is the total number of citations
received by article i, divided by the age of the article. Since this division makes the score Sc(i)
small, the coefficient γ is set to 4, so that the citations of an article published in the current year
count four times as much, while the citations of an article published 4 years ago count only once.
With this approach, as time goes by, older articles gradually lose their value.
1 http://igraph.sourceforge.net/
2 http://www.r-project.org/
3 http://cluster.cis.drexel.edu/~cchen/citespace/
4 http://netkit-srl.sourceforge.net/
5 http://www.cytoscape.org/
In brief, a researcher has a contemporary h-index of hc if hc of his/her Np articles have a score of
Sc(i) ≥ hc each, and the remaining (Np − hc) articles each have a score of Sc(i) ≤ hc.
Another variant is the trend h-index, addressing the fact that the h-index does not take into account
the age of a citation (Sidiropoulos et al., 2007). Articles that continue to be cited along the years
indicate that the topic/solution is still up to date and that the respective scientist can be an influential
mind, who still has an impact on younger scientists. As an article is continually cited, we can also
be in the presence of a trend-setter, i.e., a scientist whose work is, in some way, pioneering and/or
is currently working on something that is considered trendy. Hence, the trend h-index, with γ, δ
and Y(i) as defined in Equation 2.19, can be expressed with basis in the value of:

St(i) = γ · ∑_{x∈C(i)} (Y(now) − Y(x) + 1)^(−δ)    (2.20)
In brief, a researcher has a trend h-index of ht if ht of his/her Np articles have a score of St(i) ≥ ht
each, and the remaining (Np − ht) articles each have a score of St(i) ≤ ht.
There is also the normalized h-index, which mitigates the fact that scientists from different research
areas do not publish the same number of articles, providing a fairer h-index metric (Sidiropoulos
et al., 2007). A researcher has a normalized h-index of hn = h/Np if h of its Np articles have
received at least h citations each, and the remaining (Np − h) articles have received no more than
h citations.
Recent work developed by Devezas et al. (2011) applied the h-index to the task of ranking web
blogs. Analogously to Bibliometrics, blogs can be seen as the authors and the posts as the papers
published by them. Therefore, a blog has an index h if h of its N posts have at least h inlinks each
and the remaining (N − h) posts have no more than h inlinks each. The h-index turned out to be
a more balanced metric, compared to the use of the indegree, for assessing the importance of a blog.
ii. The g-index: This index is an improvement over the h-index, measuring the global citation performance
of a list of articles (Egghe, 2006). A set of papers has a g-index of g if g is the highest
unique rank such that the top g papers have, together, at least g² citations. This requires the list
of articles to be sorted in decreasing order of the number of citations received by each article, and
implies that the top g + 1 papers have, together, fewer than (g + 1)² citations. Thus, with α > 2 denoting the Lotkaian exponent
(Lotka, 1926) and with T denoting the total number of sources, i.e., articles, the g-index can be
mathematically expressed as follows:
g = ( (α − 1) / (α − 2) )^((α−1)/α) · T^(1/α) ,  with α = 1 + ln(growth rate of sources) / ln(growth rate of items)    (2.21)
In the formula, the sources are the scientific articles and the items are the citations between those
articles (Egghe, 2009).
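A sketch of the g-index computation, scanning the citation counts in decreasing order and keeping the largest rank g whose cumulative citations reach g² (here g is capped at the number of papers):

```python
def g_index(citations):
    """Largest g such that the g most-cited papers together have >= g^2 citations."""
    cits = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(cits, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

# [10, 6, 2, 1]: cumulative sums 10, 16, 18, 19 reach 1, 4, 9 and 16, so g = 4
# (whereas the h-index of the same list is only 2).
```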
iii. The a-index: The a-index is a derived index, dependent on the h-index, whose value typically
ranges between 3 and 5, and which helps us to better understand the relation between the total
number of citations of a scientist (Nc,tot) and the h-index (Zhang, 2009). The a-index allows us to
describe the magnitude of the hit contributions of individual scientists and is defined as follows
(Sidiropoulos et al., 2007):

Nc,tot = a·h²    (2.22)

This index can be used as a secondary metric to rank and evaluate scientists, due to the fact that
h² underestimates the Nc,tot of the h most cited papers, which is usually greater than h², and
disregards the papers that have fewer than h citations (Hirsch, 2010).
iv. The e-index: This metric was proposed by Zhang (2009) to address two specific drawbacks from
the original h-index, namely:
• Loss of citation information - excess citations are ignored, making the comparisons based
only on the h-index misleading;
• Low resolution - the h-index is composed of natural numbers, instead of real numbers, hence
confining the results to a relatively narrow range.
The e-index can formally be defined as follows:

e² = ∑_{j=1..h} citj − h²    (2.23)

In the formula, citj is the number of citations received by the j-th paper and the e² value is
expressed as a real number. This index is also related to the aforementioned a-index in the following
way:

a = h + e²/h    (2.24)
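A sketch combining Equations 2.23 and 2.24, reusing the h-index definition given earlier in this section:

```python
def e_and_a_index(citations):
    """e^2 = (citations of the h-core) - h^2 (Eq. 2.23); a = h + e^2/h (Eq. 2.24)."""
    cits = sorted(citations, reverse=True)
    h = 0
    while h < len(cits) and cits[h] >= h + 1:
        h += 1
    e2 = sum(cits[:h]) - h * h        # excess citations ignored by the h-index
    a = h + e2 / h if h else 0.0
    return h, e2, a

h, e2, a = e_and_a_index([10, 8, 5, 4, 3])   # h = 4, e^2 = 27 - 16 = 11, a = 6.75
```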
v. The ISI Impact Factor: This index measures the popularity of a journal in a specific year. It is
defined as the mean number of citations that occurred in the specified year to articles that were
published in the journal during the prior two years (Bollen et al., 2006).

IF(vi, t) = ∑_j c(vj, vi, t) / n(vi)    (2.25)
In the formula, c(vj, vi, t) is the number of citations from journal vj to journal vi in year t, and n(vi)
corresponds to the number of publications in journal vi during the two years prior to t, which
normalizes the resulting citation count into a mean 2-year citation rate (Bollen et al., 2006).
vi. The Y-Factor: Because there can be discrepancies between the values of the ISI Impact Factor
and of the Weighted PageRank, introduced in Section 3.2.1 (i.e., a journal may have a high ISI
Impact Factor, but a low Weighted PageRank value), this measure results from the multiplication
of both these values. The Y-factor of journal vj can be mathematically expressed as follows:

Y(vj) = ISI IF(vj) × PRw(vj)    (2.26)
When assessing the authority of an individual, it is important to use these measures not only individually,
but also in combination, since scientific impact can be seen as a multi-dimensional construct (Bollen
et al., 2009).
2.9 Unsupervised Rank Aggregation Approaches
Given that each metric introduced in the previous section produces an ordering for the nodes in a graph,
we can leverage rank aggregation methods from social choice theory (i.e., voting protocols) to combine
the individual rankings.
In the realm of voting protocols, we consider that there are voters who submit votes over their favorite
alternatives, i.e., the candidates. Determining the winner, or the best ordering of candidates, requires
the aggregation of the rankings of all voters. This process depends on the voting rule that is used, and
it can be defined as follows: let C be the set of candidates, R(C) the set of all possible rankings of the
candidates, and n the number of voters. A voting rule is a mapping from R(C)^n to C, if one wishes to
produce a winner, and from R(C)^n to R(C), if one wishes to produce an aggregate ranking (Conitzer,
2006a). The most common voting rules are as follows:
i. Scoring Rules - Borda Rule, Plurality Rule and Veto Rule: Let α⃗ = 〈α1, ..., αm〉 be a vector of
integers. For each voter, α1 is the number of points that a candidate gets if the voter ranks him
first, α2 the number of points that candidate gets if the voter ranks him second, and so on.
With the Plurality Rule, candidates are ranked simply in terms of how often voters have ranked
them in first place, thus having a system of scores corresponding to α⃗ = 〈1, 0, ..., 0〉. With this
rule, it is irrelevant how voters rank the candidates below the top candidate.
The Veto Rule is the opposite of the Plurality Rule, because it is based on a system of scores with
α⃗ = 〈1, 1, ..., 1, 0〉, i.e., it only takes into account how often the candidate is not ranked in last place.
As such, each voter vetoes a single candidate and the least vetoed candidate wins the election
(Procaccia et al., 2006).
The Borda Rule is based on a system of scores with α⃗ = 〈m − 1, m − 2, ..., 0〉, which means that a
candidate obtains m − 1 points for the first position in the preference of a voter, m − 2 points for the
second position, and so forth, with m representing the total number of candidates. The candidate
who accumulates the maximum number of points from all voters is the winner (Kiselev, 2008).
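All three scoring rules differ only in the weight vector α⃗, so one small function covers them; the ballots below are a toy preference profile:

```python
def positional_scores(ballots, weights):
    """Score candidates under a positional scoring rule: each ballot is a
    full ranking (best first); weights is the alpha vector."""
    scores = {}
    for ballot in ballots:
        for pos, cand in enumerate(ballot):
            scores[cand] = scores.get(cand, 0) + weights[pos]
    return scores

ballots = [["a", "b", "c"], ["a", "c", "b"], ["b", "c", "a"]]
borda = positional_scores(ballots, [2, 1, 0])      # Borda:     a: 4, b: 3, c: 2
plurality = positional_scores(ballots, [1, 0, 0])  # Plurality: a: 2, b: 1, c: 0
veto = positional_scores(ballots, [1, 1, 0])       # Veto:      three-way tie at 2
```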
ii. Single Transferable Vote (STV): This is a method to calculate the result of an election with the
guarantee of proportional representation, under reasonable conditions, for the sets of voters who
share a set of most preferred candidates (Geller, 2002). Running through m − 1 rounds, the STV
voting rule is based upon three principles:
• Order of preference - the candidates are listed in ordinal preference by the voters (i.e., in
descending order).
• Quota - the number of votes needed for a candidate to win the election must be calculated in
the following way:
q = ⌊ |V| / (e + 1) ⌋ + 1    (2.27)
In the formula, |V | represents the total number of voters and e the number of seats available
in the election (i.e., the number of candidates to elect). In each round, if a candidate gets a
greater number of votes than the quota, that candidate is automatically elected.
• Transfer - When a candidate c is elected and there are still more seats to be filled, the surplus
of the votes from that newly elected candidate must be redistributed to each voters’ next
ranked candidate. The transfer value fc takes into account the quota q and the number of
votes wc that candidate c has. It is computed as follows:

fc = (wc − q) / wc    (2.28)
When, in each round, the top voted candidate does not have enough votes to be elected (i.e.,
the total number of votes is less than the quota), the last placed candidate is eliminated, and
that candidate’s votes are redistributed to the next highest ranked candidate, for each voter
for whom the recently eliminated candidate was the top preference.
The flowchart in Figure 2.3 depicts the steps taken in each round, in order to conclude the election.
It is based on the additional information provided by an online simulation of the Single Transferable
Vote1 system.
Figure 2.3: Flowchart for the Single Transferable Vote rule.
iii. Plurality Rule with Run-Off: This rule proceeds in two rounds. In the first, all candidates are
eliminated, except the ones with the highest plurality scores, i.e., the candidates with the first
and second highest number of votes in the election. Then, as in the STV Rule, all the votes are
transferred to these two selected candidates. The second round, which is called the run-off, is used
to determine the final winner of the election, from the two remaining candidates. All candidates are
ranked according to their Plurality scores, except the top two, whose relative ranking is determined
according to the results of the second round.
iv. Maximin: Letting N(c1, c2) be the number of votes that show a preference for candidate c1 over
candidate c2, the maximin score (also known as the Simpson Score) assigned to a candidate c1 is
as follows:

s(c1) = min_{c2≠c1} N(c1, c2)    (2.29)

In the formula, s(c1) is the worst score of candidate c1 in a pairwise election. As all candidates are
ranked by their scores, the winner of the election is the candidate with the highest maximin score.
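A sketch of the maximin (Simpson) score over full-ranking ballots; N(c, r) is obtained by counting the ballots that place c above r:

```python
def maximin_scores(ballots):
    """s(c) = min over rivals r of N(c, r), the support for c against r."""
    cands = set(ballots[0])

    def n(c, r):
        # Number of ballots ranking c above r (lower index = better rank).
        return sum(1 for b in ballots if b.index(c) < b.index(r))

    return {c: min(n(c, r) for r in cands if r != c) for c in cands}

ballots = [["a", "b", "c"], ["a", "b", "c"], ["b", "c", "a"]]
mm = maximin_scores(ballots)   # a: 2, b: 1, c: 0 -> a wins
```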
v. Copeland : For any two candidates c1 and c2 we simulate a pairwise election, so we can determine
how many voters prefer c1 over c2 and how many prefer c2 over c1 (Xia et al., 2011). All candidates
are ranked by their score, and they gain or lose a Copeland point for, respectively, each election
they win or lose (Conitzer & Sandholm, 2005). If there is a tie, Copeland points are also assigned
to the candidates. Therefore, for a pairwise election between candidates c1 and c2, a score is
assigned according to the following procedure:

1 http://stv.humancube.com/
C(c1, c2) = 1, if N(c1,c2) > N(c2,c1);  1/2, if N(c1,c2) = N(c2,c1);  0, if N(c1,c2) < N(c2,c1)    (2.30)
Then, the Copeland Score of candidate c1 is given by:

s(c1) = ∑_{c2≠c1} C(c1, c2)    (2.31)
The candidate who has the highest score wins the election.
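To make the pairwise machinery concrete, the following Python sketch computes both the maximin and the Copeland scores over a small, hypothetical preference profile (the votes and candidate names are made up for illustration):

```python
from itertools import combinations

# Hypothetical preference profile: each vote ranks candidates best-first.
votes = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "c", "a"],
    ["c", "b", "a"],
    ["c", "a", "b"],
]
candidates = ["a", "b", "c"]

def n_pref(c1, c2):
    """N(c1, c2): number of votes ranking c1 above c2."""
    return sum(1 for v in votes if v.index(c1) < v.index(c2))

# Maximin (Simpson) score: a candidate's worst pairwise result.
maximin = {c: min(n_pref(c, d) for d in candidates if d != c) for c in candidates}

# Copeland score: one point per pairwise win, half a point per tie.
copeland = {c: 0.0 for c in candidates}
for c1, c2 in combinations(candidates, 2):
    n12, n21 = n_pref(c1, c2), n_pref(c2, c1)
    if n12 > n21:
        copeland[c1] += 1
    elif n21 > n12:
        copeland[c2] += 1
    else:
        copeland[c1] += 0.5
        copeland[c2] += 0.5
```

In this profile, candidate c wins under both rules, illustrating that the two rules can agree even though they aggregate the pairwise matrix differently.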
vi. Bucklin: The Bucklin Score of a candidate c is the smallest number lc such that more than half of
the voters rank c among the top lc positions, i.e., such that B(lc) > n/2, where B(lc) is the number of
voters ranking c within the top lc positions (Xia et al., 2011). The winner is the candidate with the
lowest Bucklin Score. All candidates are ranked in increasing order of lc and, if there is a tie, B(lc) is
used as a tie-breaker.
vii. Slater : In the Slater voting rule, we choose a ranking of candidates that is inconsistent with the
outcomes of as few pairwise elections as possible (Conitzer, 2006b). An inconsistency corresponds
to a pair of candidates c1 and c2 such that c1 is ranked higher than c2, yet c2 defeats c1 in their
pairwise election. Therefore, the intent of the Slater ranking is to minimize such inconsistencies.
viii. Kemeny : Similarly to the Slater Rule, a ranking is a Kemeny ranking if it minimizes a number
of inconsistencies. However, this rule produces a ranking that aims at minimizing the number of
times the aggregate ranking disagrees with an individual vote on the relative order of two candidates.
Therefore, an inconsistency in the terminology of the Kemeny ranking is defined as follows: given the
aggregate ranking r, a pair of candidates (c1, c2) and a vote ra, we have an inconsistency if r ranks
c1 higher than c2, but ra ranks c2 higher than c1.
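Since a Kemeny ranking is defined by a minimization over all possible rankings, it can be found by exhaustive search for tiny elections, as in the sketch below over a hypothetical three-candidate profile. (Computing a Kemeny ranking is NP-hard in general, so real elections require smarter algorithms.)

```python
from itertools import permutations

# Hypothetical votes: each is a full ranking, best candidate first.
votes = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "a", "c"],
]

def disagreements(ranking, vote):
    """Count candidate pairs ordered differently by `ranking` and `vote`."""
    count = 0
    for i in range(len(ranking)):
        for j in range(i + 1, len(ranking)):
            c1, c2 = ranking[i], ranking[j]  # ranking places c1 above c2
            if vote.index(c2) < vote.index(c1):  # the vote disagrees
                count += 1
    return count

# A Kemeny ranking minimizes the total disagreement with all votes.
kemeny = min(permutations(votes[0]),
             key=lambda r: sum(disagreements(r, v) for v in votes))
```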
ix. Cup and its variants: Cup Rule runs a single-elimination contest to decide which candidate wins
the election. It does not produce a full aggregate ranking of the candidates, and it requires an
additional schedule for matching up the remaining candidates. The rule is defined by a balanced
binary tree T , where each candidate is assigned to a leaf through the aforementioned schedule. To
each of the remaining non-leaf nodes is assigned the winner of the pairwise election of that node’s
children. There is a winner whenever a candidate is assigned to the root node.
As for the Cup Rule's variations, we have the regular cup, in which all voters know which leaf each
candidate is assigned to prior to voting, and the randomized cup, in which the assignment of
candidates to leaves is chosen uniformly at random, after the voting. Votes can also be weighted,
with the weight representing the decision power of a voting agent in a setting where not all agents
are considered equal, e.g., a weight of K counting as K votes of weight 1.
2.10 Supervised Learning for Rank Aggregation
In the previous section we presented unsupervised techniques to perform rank aggregation. Nevertheless,
supervised learning techniques can also address this task: Learning to Rank (L2R) has emerged as a
way of applying machine learning to rank aggregation (Li, 2011).
In L2R, there are two general phases, namely learning and ranking. The learning phase takes training
data as input, corresponding to ranked lists of objects, with each object described by a set of
features (i.e., a set of simple ranking measures that we want to combine). In the ranking phase, given a
new set of objects, one aims at predicting the best possible ranking by combining the available
information. Figure 2.4 illustrates the general framework.
Figure 2.4: Learning-To-Rank (L2R) Framework (adapted from Liu (2009)).
Learning-to-Rank methods can be categorized according to three different types of approaches, namely,
pointwise, pairwise, and listwise (Li, 2011; Liu, 2009).
In the pointwise approach, the ranking problem is transformed into a classification, regression or an
ordinal classification problem. The input space has each object’s feature vector, while the output space
contains the ranking order predicted to each object (Liu, 2009). The loss function is said to be pointwise
because it is defined on a single object’s feature vector (Li, 2011) and inspects the ground truth ranking
order for each single object. The hypothesis space on a pointwise approach contains the functions that
take the feature vector of an object as input and predict the ranking order of that same object (Li, 2011).
In the pairwise approach, the ranking problem is transformed into a pairwise classification problem, i.e.,
one classifies whether a given pair of objects is in the correct ranking order or not. In this approach,
the loss function is pairwise, as it is defined on a pair of feature vectors.
The listwise approach takes ranked lists of objects as instances and, unlike the aforementioned ap-
proaches, it maintains the group structure of the ranked lists. This approach also learns a ranking model
from the given training data, which can later assign scores to feature vectors, and then ranks these
feature vectors using those scores.
One particular supervised listwise ranking method is CRanking (Lebanon & Lafferty, 2002) which applies
the following probabilistic model:
P(\pi \mid \theta, \Sigma) = \frac{1}{Z(\theta, \Sigma)} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_j) \right)   (2.32)
In the formula, π is the final ranking, Σ = (σ1, ..., σk) are the basic rankings being combined, d is the
distance between the two rankings (e.g., Kendall’s τ ) and θ is a weighting parameter. Z is a normalization
factor over all the possible rankings, and can be defined as follows:
Z(\theta, \Sigma) = \sum_{\pi} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_j) \right)   (2.33)
When learning, the algorithm is given S = \{(\Sigma_i, \pi_i)\}_{i=1}^{m} as training data, in order to build a model for rank
aggregation. Maximum Likelihood Estimation is used to learn the model's parameters. Considering that
both the final ranking and the basic rankings are all full ranking lists in the training data, the likelihood
function can be computed as follows:
L(\theta) = \sum_{i=1}^{m} \log \frac{ \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi_i, \sigma_{i,j}) \right) }{ \sum_{\pi \in \Pi} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_{i,j}) \right) }   (2.34)
For the final step of prediction, the algorithm is given the learned model and the basic rankings Σ. The
probability distribution P(\pi \mid \theta, \Sigma) over final rankings is then calculated and used to compute the
expected rank of each object. Objects are finally sorted according to their expected rank, the latter
being defined as follows:

E(\pi(i) \mid \theta, \Sigma) = \sum_{r=1}^{n} r \cdot P(\pi(i) = r \mid \theta, \Sigma) = \sum_{r=1}^{n} r \cdot \sum_{\pi \in \Pi,\, \pi(i) = r} P(\pi \mid \theta, \Sigma)   (2.35)
2.11 Summary
In this chapter, the fundamental concepts regarding the tasks of characterizing a network and finding
the network’s most influential nodes were introduced. Broader concepts such as prestige, popularity
or recognition were also explored, distinguishing them from what it means to be an influencer. Other related
network analysis topics were introduced, namely information cascades and information diffusion models,
since the most influential nodes in a network have the capacity to disseminate information through the
network at a much faster pace, reaching a greater number of other nodes. Learning-To-Rank and rank
aggregation techniques were also introduced as ways of combining different ranking lists to produce a
single, global and uniform ranking list.
Chapter 3
Related Work
This chapter presents the most important related work in the context of my MSc thesis. The chapter
starts by presenting the HITS algorithm and Google's PageRank algorithm for ranking web
pages, discussing how the latter evolved from its original implementation to more detailed and specific
approaches, such as the Weighted PageRank algorithm and the Topic-Sensitive PageRank algorithm.
Then, the chapter introduces the IP Algorithm, a recent development that extends the benefits of PageRank
and determines the influence and passivity of network nodes based on their capacity to forward
information. In the specific realm of Twitter, we present TwitterRank, an approach to measure the
influence of a Twitter user based on the principle of homophily regarding the topics that users write about.
Finally, we take a deeper look at the work that has been done in Bibliometrics to find influencers
in citation and co-authorship networks, also describing works that take into account the temporal
evolution of graphs.
3.1 The Hyperlinked Induced Topic Search (HITS) Algorithm
The HITS algorithm, a Web page ranking method developed by Kleinberg (1998), is based
on the notion of authorities and hubs. Authorities, i.e., pages with many inlinks,
have a mutually reinforcing relationship with hubs, i.e., pages with outlinks to many related
authorities: a good hub is a page that points to many good authorities, and a good authority
is a page that is pointed to by many good hubs – see Figure 3.5. This relationship is put into use through
the iterative procedure shown in Algorithm 1, which maintains and updates the weights of each page
(Kleinberg, 1998).
Figure 3.5: A graph with hubs and authorities (adapted from Kleinberg (1998)).
Algorithm 1 The Hyperlinked Induced Topic Search (HITS) Algorithm
G: a graph with n interlinked pages
k: a constant corresponding to the number of iterations
z: the vector (1, 1, 1, ..., 1) ∈ R^n
Set x_0 := z
Set y_0 := z
for i = 1, 2, ..., k do
  Apply x_p = \sum_{q: q \to p} y_q to (x_{i-1}, y_{i-1}), obtaining new x-weights x'_i
  Apply y_p = \sum_{q: p \to q} x_q to (x'_i, y_{i-1}), obtaining new y-weights y'_i
  Normalize x'_i, obtaining new authority scores x_i
  Normalize y'_i, obtaining new hub scores y_i
end for
In order to compute the HITS algorithm, the aforementioned Gephi (http://gephi.org/), NetworkX
(http://networkx.lanl.gov/) and Network Workbench (http://nwb.cns.iu.edu/) software packages can be used.
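As an alternative to these packages, the iterative procedure of Algorithm 1 is simple enough to sketch directly. The fragment below runs the mutual-reinforcement updates on a small, hypothetical graph with two hub-like pages (h1, h2) pointing at two authority-like pages (a1, a2):

```python
import math

# Toy link structure: node -> set of nodes it points to (hypothetical graph).
links = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "a1": set(), "a2": set()}
nodes = list(links)

auth = {n: 1.0 for n in nodes}
hub = {n: 1.0 for n in nodes}

for _ in range(20):  # k iterations of the mutual-reinforcement update
    # Authority score: sum of the hub scores of the pages pointing to the node.
    auth = {p: sum(hub[q] for q in nodes if p in links[q]) for p in nodes}
    # Hub score: sum of the authority scores of the pages the node points to.
    hub = {p: sum(auth[q] for q in links[p]) for p in nodes}
    # Normalize so that the squared scores sum to one.
    for d in (auth, hub):
        norm = math.sqrt(sum(v * v for v in d.values())) or 1.0
        for n in d:
            d[n] /= norm
```

NetworkX also offers a ready-made `hits` function for graphs built with its API.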
3.2 The PageRank algorithm and its Variants
The PageRank algorithm arose in the context of the development of Google’s search engine, at the
time described as a prototype of a large-scale search engine that made heavy use of the hyperlinked
structure of the web (Brin & Page, 1998).
PageRank is based on principles from academic citation analysis, applied to the web. It can be mathe-
matically expressed as follows:
PR(A) = \frac{1-d}{N} + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)}   (3.36)
A page A has pages T1, ..., Tn that point to it (i.e., that cite page A), and C(Ti) is the number
of outlinks of page Ti. The term N corresponds to the total number of pages in
the network. The free parameter d is called the damping factor and is usually set to 0.85. In a
random web surfer scenario, the surfer restarts his search with probability 1 − d, by jumping to a page
chosen uniformly at random, instead of following a random link, which he does with probability d
(Chen et al., 2007). Figure 3.6 depicts the computation of the PageRank score for a three-node network.
Figure 3.6: A graph illustrating the computation of PageRank (adapted from Page et al. (1998)).
From Figure 3.6, one can see that page A has an inlink from page C and two outlinks, to pages
B and C. Therefore, page A splits its PageRank score of 0.4 between its two outlinks, equally
transferring a value of 0.2 to pages B and C. In its turn, page B has a PageRank score of 0.2, which A
transferred to it. Because B only has an outlink to C, it entirely transfers its PageRank score to
page C. Finally, page C, which receives PageRank scores of 0.2 from A and B, accumulates a PageRank
score of 0.4, which it entirely transfers to its only outlink, page A.
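The score passing just described can be reproduced numerically. The sketch below iterates the update of Equation 3.36 on the three-node graph of Figure 3.6, with d set to 1 so that scores are passed along links without teleportation, converging to the values 0.4, 0.2 and 0.4 discussed above:

```python
# The three-node example from Figure 3.6: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
nodes = list(links)
d = 1.0  # damping disabled here, to mirror the pure score passing of the figure
pr = {n: 1.0 / len(nodes) for n in nodes}  # uniform initial scores

for _ in range(100):  # power iteration of Equation 3.36
    pr = {
        n: (1 - d) / len(nodes)
        + d * sum(pr[m] / len(links[m]) for m in nodes if n in links[m])
        for n in nodes
    }
```

With the usual d = 0.85 the teleportation term changes the exact values; NetworkX provides an equivalent `pagerank` function for graphs built with its API.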
A page can achieve a high PageRank score if it has many other pages pointing to it, i.e., if it is highly
cited, or if some of the pages that point to it have themselves a high PageRank score.
Even though PageRank works over networks originally corresponding to directed graphs, the works
of Perra & Fortunato (2008) and of Mihalcea (2004) revealed that PageRank can also be applied to
undirected graphs, hence having vertices with equal indegrees and outdegrees.
In the realm of Bibliometrics, PageRank is used as a complementary method to citation analysis, since
it mitigates citation counting's drawback of ignoring the importance of the citing papers: PageRank
allows us to identify publications that are being referenced by highly cited articles (Ding et al., 2009).
Authors such as Chen et al. (2007) suggested setting d = 0.5, based on the hypothesis that, in the
context of citation networks, the entries in the reference list of a typical paper are collected by following
citation chains of average length 2. Their justification is the empirical observation that about 50%
of the articles in the reference list of a paper A have at least one citation B → C in which the
article C is also part of A's reference list. Thus, the authors assume there is a feed-forward loop
among A, B and C, such that A → B, B → C and, consequently, A → C.
Due to its probabilistic nature, and also to the fact that each node is guaranteed to be visited, PageRank
scores are not comparable across different graphs. To mitigate this, Berberich et al. (2006) proposed a
normalization of the PageRank scores, which eliminates any dependency on the size of the graph.
The normalized PageRank score can be computed as follows:
PR'(v) = \frac{PR(v)}{\frac{1}{|V|}\left((1-d) + d \sum_{u \in D} PR(u)\right)}   (3.37)

In the formula, the denominator represents the lower bound for Equation 3.36, |V| is the total
number of vertices in the graph, and D ⊆ V is the set of dangling nodes.
Alternatively to the random surfer model, and specifically for social phenomena such as epidemics or
word-of-mouth recommendation, Ghosh et al. (2011) proposed a broadcast-based non-conservative
diffusion model, since these phenomena can be modeled as contact processes, in which
an active (infected) node activates its neighbours, via broadcast, with some probability. The difference
between this model and the random surfer model is that, while the latter conserves the amount
of substance being diffused on the network, the former is non-conservative, in the sense that the
amount of information changes as it spreads from an individual to his neighbours. Ghosh et al. (2011) state that
PageRank is a steady state solution of conservative diffusion and, therefore, a conservative metric, while
Alpha-Centrality, a non-conservative metric, which measures the total number of paths from a node ex-
ponentially attenuated by their length, is a steady state solution of linear non-conservative diffusion. In
their study, the authors propose an efficient algorithm for computing the Alpha-Centrality.
To compute the PageRank algorithm, we can use some readily available open-source software
libraries, such as the aforementioned Gephi, NetworkX and Network Workbench packages, or the
LAW Webgraph (http://webgraph.dsi.unimi.it/) Java library for large-scale web graph analysis (Boldi & Vigna, 2004).
3.2.1 Weighted PageRank
In the original PageRank algorithm from Equation 3.36, we have no notion of hyperlink weight, and thus
all hyperlinks express the same degree of relationship between the pages they link (Bollen et al., 2006).
However, in many practical applications, we have that not all links express the same type of relationship.
Acknowledging that some links in a web page may be more important than others, Xing & Ghorbani
(2004) proposed a Weighted PageRank algorithm that assigns higher scores to more important links,
instead of the traditional even division among the outlinks of a page. Each link is assigned with a value
that is proportional to the popularity of the destination node, i.e., proportional to its number of inlinks and
outlinks.
In this approach, there is an inlink weight W^{in}_{(v,u)} and an outlink weight W^{out}_{(v,u)}. The inlink weight of link
(v, u) is based on the number of inlinks of page u and the number of inlinks of all the pages that are
referenced by page v. The outlink weight is analogous. They are calculated as follows:

W^{in}_{(v,u)} = \frac{I_u}{\sum_{p \in R(v)} I_p} \qquad W^{out}_{(v,u)} = \frac{O_u}{\sum_{p \in R(v)} O_p}   (3.38)
In the formulas, Iu and Ip represent, respectively, the number of inlinks of pages u and p, while Ou
and Op, represent the number of outlinks of pages u and p. R(v) is the set of outlinks from page v.
Considering the introduction of these two weights in the computation of a Weighted PageRank algorithm,
the latter can be mathematically expressed as follows:
PR(u) = (1-d) + d \sum_{v \in B(u)} PR(v) \, W^{in}_{(v,u)} \, W^{out}_{(v,u)}   (3.39)
The studies conducted within the work of Xing and Ghorbani revealed that their Weighted PageRank
algorithm has a better performance than the original PageRank.
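The two link weights of Equation 3.38 can be computed from the link structure alone. A minimal sketch over a hypothetical three-page graph:

```python
# Hypothetical link structure: page -> list of pages it links to.
out = {"v": ["u", "w"], "u": ["w"], "w": ["v"]}
nodes = list(out)

def inlinks(p):
    """Number of pages linking to page p."""
    return sum(1 for q in nodes if p in out[q])

def w_in(v, u):
    """Inlink weight of link (v, u): I_u over the inlinks of all pages v references."""
    return inlinks(u) / sum(inlinks(p) for p in out[v])

def w_out(v, u):
    """Outlink weight of link (v, u): O_u over the outlinks of all pages v references."""
    return len(out[u]) / sum(len(out[p]) for p in out[v])
```

Note that, by construction, the inlink weights of a page's outgoing links sum to one, so the weighting replaces the even division of the original PageRank rather than injecting extra mass.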
Fiala et al. (2008) also proposed modifications to the original PageRank algorithm, enabling its
application to bibliographic networks. The authors take citation and co-authorship information into
account: each edge (u, v) ∈ E, where E is the set of edges between the vertices of the graph (with
nodes corresponding to the authors of the papers), is associated with weights w_{u,v}, c_{u,v} and b_{u,v}.
The value w_{u,v} is the number of citations from author u to author v, c_{u,v} is the number of
common publications by u and v, and b_{u,v} can assume different values, depending on the semantics
of the edge weights that we want to stress. The new ranking for authors is defined as follows:
R(u) = \frac{1-d}{|A|} + d \sum_{(v,u)\in E} R(v)\, \frac{\frac{w_{v,u}}{c_{v,u}+1}\cdot\frac{b_{v,u}+1}{\sum_{(v,j)\in E} w_{v,j}}}{\sum_{(v,k)\in E}\frac{w_{v,k}}{c_{v,k}+1}\cdot\frac{b_{v,k}+1}{\sum_{(v,j)\in E} w_{v,j}}}   (3.40)
In the formula, |A| is the number of vertices (e.g., the number of authors of the papers) and d is a
damping factor, empirically set to d = 0.9. In this approach, a plain Weighted PageRank algorithm is
obtained if, in Equation 3.40, the coefficients b and c are set to zero.
Bollen et al. (2006), when applying the Weighted PageRank algorithm to journal citation networks, took
journal citation frequencies into account in the transfer of PageRank values, so that the prestige of a
journal can be accordingly transferred along the iterations of the algorithm. They referred to this
transferred value as the Propagation Proportion between journals and defined it as follows:
w(v_j, v_i) = \frac{W(v_j, v_i)}{\sum_k W(v_j, v_k)}   (3.41)
In the formula, W(v_j, v_i) is the weight of the link between journals v_j and v_i, normalized by the weights
of journal v_j's outlinks. In the application of the Weighted PageRank algorithm described by Bollen et al.
(2006), the number of outlinks C(T_i) from Equation 3.36 has been replaced with the Propagation
Proportion, resulting in the following equation:
PR_w(v_i) = \frac{1-d}{N} + d \sum_j PR_w(v_j) \times w(v_j, v_i)   (3.42)
On the other hand, within the work of Yan & Ding (2011), citation counts are incorporated into the
network topology, resulting in the following integrated Weighted PageRank algorithm, where the sum in
the second term ranges over the k nodes citing p_i:

PR_w(p_i) = (1-d)\,\frac{CC(p_i)}{\sum_{j=1}^{N} CC(p_j)} + d \sum_{j=1}^{k} \frac{PR_w(p_j)}{C(p_j)}   (3.43)
In the formula, CC(p_i) represents the number of citations pointing to an author p_i, \sum_{j=1}^{N} CC(p_j) is the sum
of the citation counts for all the nodes in the network, and the (1-d) term, as in previous PageRank
definitions, ensures that the results sum up to one. Yan & Ding (2011) pointed out two extreme
scenarios regarding the variation of d. If d = 0, then each node's score equals CC(p_i)/\sum_{j=1}^{N} CC(p_j), i.e.,
its normalized citation count. Also, in accordance with Boldi et al. (2005), when d → 1
PageRank becomes unstable and its convergence rate slows.
3.2.2 Topic-Sensitive PageRank
The link-structure of the Web is used in the original PageRank algorithm to pre-compute topic-independent
scores that reflect the importance of web pages. The pre-computed importance scores can afterwards
be combined with other Information Retrieval scores, e.g., term frequency, to produce a ranking of the
pages towards specific user queries (Brin & Page, 1998).
Haveliwala (2002) proposed a Topic-Sensitive PageRank algorithm, in which one computes offline a set
of PageRank vectors, biased towards a set of representative basis topics from the Open
Directory Project (http://www.dmoz.org/). For each page, and regarding the considered set of topics, a
set of importance scores is created and, at query time, the similarity of the query and/or user context
towards each topic is calculated. To achieve the final ranking, one linearly combines the topic-sensitive
vectors, weighted by the similarity of the query towards the topics.

The mathematical approach to Topic-Sensitive PageRank is as follows. Let q be the query and q' its
respective context in page u; we may have a search in context (i.e., the user is viewing a document and
selects a term from it, in order to get more information about the selected term). The context q' consists
of all terms in u if we have a search in context, and otherwise q' consists only of the query q. For each
topic c_j, the following quantity is computed:
P(c_j \mid q') = \frac{P(c_j) \cdot P(q' \mid c_j)}{P(q')} \propto P(c_j) \cdot \prod_i P(q'_i \mid c_j)   (3.44)
In the formula, P(q'_i \mid c_j) can be computed from the class term-vector D_j, which consists of the terms of
the documents below each of the 16 top-level categories of the Open Directory Project (ODP). Finally, a
composite, query-sensitive importance score s_{qd} is computed as follows:

s_{qd} = \sum_j P(c_j \mid q') \cdot r_{jd}   (3.45)
In the formula, r_{jd} is the rank of document d, given the PageRank vector PR(\alpha, v_j), for topic c_j. In its
turn, PR(\alpha, v_j) has as parameters a bias factor \alpha and the non-uniform damping vector v_j, with T_j being
the set of URLs in the ODP category c_j:

v_{ji} = \begin{cases} \frac{1}{|T_j|}, & i \in T_j \\ 0, & i \notin T_j \end{cases}   (3.46)
The bias factor, similarly to PageRank ’s damping factor, can influence the biasing degree of the resulting
vector towards the topic vector that was used. This bias was heuristically set to α = 0.25 by the authors.
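Once the topic posteriors and the per-topic ranks are available, the composite scoring of Equation 3.45 reduces to a weighted sum. A small sketch with hypothetical topic names, posteriors and per-topic scores:

```python
# Hypothetical query-topic posteriors P(c_j | q') and per-topic scores r_jd.
topic_prob = {"sports": 0.7, "science": 0.3}
rank = {
    "sports": {"d1": 0.5, "d2": 0.1},
    "science": {"d1": 0.2, "d2": 0.6},
}

# Composite score (Equation 3.45): topic-sensitive vectors linearly combined,
# weighted by the similarity of the query towards each topic.
score = {
    d: sum(topic_prob[c] * rank[c][d] for c in topic_prob)
    for d in rank["sports"]
}
```

Here the query leans towards "sports", so d1 outranks d2 even though d2 scores higher under "science".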
3.2.3 TwitterRank
In the context of Twitter, the popular microblogging service, there is often the need to determine which
are the influential users.
From the work of Weng et al. (2010) arose TwitterRank, an extension of the PageRank algorithm that
takes both the topic similarity between users and the link structure of the social network into account.
However, the influence of a user may vary in different topics, since a Twitter user can have interests or
expertise in many distinct areas.
In the same way that in Bibliometrics we have that citation count is the simplest method to assess the
influence of an author in an author-publication network, we have that, on Twitter, the follower count, i.e.,
the total number of people who are following a particular user, has been interpreted as a good indicator
of influence. Nevertheless, Weng et al. (2010) observed that 72.4% of the users follow more than 80%
of their followers, and that 80.5% of the users have 80% of their friends (i.e., twitterers whose updates
they follow) following them back. This admits two explanations: either the act of following is so casual
that a twitterer randomly follows other twitterers, who politely just follow back, or the following
relationship reflects a strong similarity among users, rooted in a shared interest in the topics the
twitterers tweet about. The latter denotes the homophily phenomenon.
The general framework proposed for TwitterRank is depicted in Figure 3.7. First, in the topic distillation
phase, the topics twitterers are interested in are extracted based on what they tweet about. Then,
a topic-specific relationship network is built from the previously gathered topics. Finally, the
TwitterRank algorithm is applied to measure the topic-sensitive influence of a twitterer, taking into
account both the distilled topics and the structure of the topic-specific relationship network. Top topics
are identified in order of their probability of presence, as captured in a matrix WT of W unique words in
tweets and T topics, where each entry WT_{it} holds the number of times the unique word w_i has been
assigned to topic t.
Figure 3.7: The general TwitterRank framework (adapted from Weng et al. (2010)).
This approach addresses two important shortcomings of PageRank, namely the fact that it does not
take into account (i) the interests of the nodes of the network, and (ii) the indegree associated with the
follower count in Twitter.
To mathematically describe the topic-specific TwitterRank algorithm, we can see the Twitter network
as a directed graph D(V, E), where the vertices V are the twitterers and the edges E are the following
connections between twitterers, directed from follower to friend. In a random surfer scenario, the surfer
visits each twitterer with a certain topic-specific probability, by following the appropriate edge in D. The
transition matrix for topic t, P_t, from follower s_i to friend s_j, is defined as follows, where |\tau_j| is the
number of tweets published by s_j and the denominator sums the number of tweets published by all of
s_i's friends:

P_t(i, j) = \frac{|\tau_j|}{\sum_{a:\, s_i\ \mathrm{follows}\ s_a} |\tau_a|} \times sim_t(i, j)   (3.47)
The similarity between s_i and s_j in topic t, denoted by sim_t(i, j), is defined as follows:

sim_t(i, j) = 1 - |DT'_{it} - DT'_{jt}|   (3.48)
In the formula, DT' is the row-normalized form of matrix DT, with D being the twitterers and T the topics.
In DT', each row is the probability distribution of twitterer s_i's interest over the T topics. Thus, the
similarity between s_i and s_j in topic t is inversely related to the difference between the probabilities
that each is interested in topic t. The higher their similarity, the higher the transition probability from s_i
to s_j.
There is also the possibility of some twitterers following one another in such a cyclic way that
they do not follow anyone outside that particular circle of following relations, which can lead to an
accumulation of high influence that is never redistributed. To account for this situation, Weng et al. (2010)
introduced a teleportation vector E_t that captures the probability that a random surfer jumps to
some twitterer instead of following the edges of graph D. The teleportation vector is defined as follows:
E_t = DT''_{\cdot t}   (3.49)

In the formula, DT''_{\cdot t} is the t-th column of DT'', the column-normalized form of matrix DT, the latter
being part of the results from the topic distillation phase. Each entry of DT contains the number of times
the words in a twitterer's tweets have been assigned to a specific topic.
Thus, the topic-specific TwitterRank can be calculated as follows:

\overrightarrow{TR}_t = \gamma P_t \times \overrightarrow{TR}_t + (1 - \gamma) E_t   (3.50)

In the formula, \gamma is a parameter analogous to PageRank's damping factor: the surfer follows the edges
of the topic-specific network with probability \gamma and teleports with probability 1 - \gamma. Its value can range
from 0 to 1 and is usually set to \gamma = 0.85.
Equation 3.50 gives the topic-specific TwitterRank vectors that are generated. However, these vectors
only refer to a twitterer's influence in individual topics. To measure the overall influence of a twitterer
across different topics, we need to compute the aggregated TwitterRank vector as follows:

\overrightarrow{TR} = \sum_t r_t \cdot \overrightarrow{TR}_t   (3.51)

In the formula, \overrightarrow{TR}_t is the TwitterRank vector for topic t, and r_t is the weight assigned to topic t and
associated with \overrightarrow{TR}_t.
Weng et al. (2010) observed that the most active twitterers are not necessarily the most influential in
each topic. Also, and due to the consideration of the topical dimension, there is a higher correlation be-
tween TwitterRank and the Topic-Sensitive PageRank (Section 3.2.2) than with the indegree or with the
original PageRank algorithm. The experiments conducted by Weng et al. (2010), which used a Twitter
dataset with messages from Singapore-based twitterers, collected in April 2009, showed that Twitter-
Rank outperforms other related algorithms, including both PageRank and the algorithm that Twitter was
using by the time of their study.
3.3 The Influence-Passivity (IP) Algorithm
Romero et al. (2011) came to the conclusion that, for a user to be considered influential, he must not
only be popular and get attention from his peers, but also overcome passivity,
a state in which a user receives information but does not propagate it through the network. Thus, this
approach determines both the influence and the passivity of a user, based on his information-forwarding
activity.
The algorithm proposed by Romero et al. (2011) is similar to HITS and to PageRank. However, the dif-
ference in this approach is that the diffusion behaviour among the users is also taken into consideration.
This work was conducted on Twitter and assigns to every user both a passivity score and an influence
score, which respectively correspond to the authority and hub scores in the HITS algorithm. The use
of passivity in the algorithm comes from the evidence that users in Twitter are generally passive and
thus, when determining the influence of a user, taking into account the passivity of all the people that
are influenced by him is also very important. The following assumptions are considered by the authors:
1. The influence score of a user depends on the number of people he influences, as well as on their
passivity.
2. The influence score of a user depends on how dedicated the people that he influences are. This
dedication is measured by the amount of attention a user pays to some other user, as compared
to everyone else.
3. The passivity score of a user depends on the influence of those who he is exposed to, but not
influenced by.
4. The passivity score of a user depends on how much he rejects some other user’s influence, com-
pared to everyone else’s influence.
Given these assumptions, one should note that the network graph for this algorithm is a weighted graph
G = (N,E,W ) with N nodes, E edges and W edge weights, where weight wij represents the ratio of
influence that node i has over node j to the total influence that i attempted to have over j. The output
of the IP Algorithm is a function I : N → [0, 1] and a function P : N → [0, 1], which represent each
node’s relative influence and passivity, respectively. For each edge e = (i, j) ∈ E, the authors defined
an acceptance rate that represents the amount of influence accepted by j from all users in the network
and that, thus, can reflect the loyalty user j has to user i. The acceptance rate is defined as follows:
u_{ij} = \frac{w_{ij}}{\sum_{k:(k,j)\in E} w_{kj}}   (3.52)
There is also a rejection rate, the counterpart of the acceptance rate, since 1 - w_{ji} is the amount
of influence that user i rejects from user j. Thus, the rejection rate v_{ji} is the influence that user i
rejected from user j, normalized by the total influence rejected from j by all other users in the network.
It is mathematically expressed as follows:

v_{ji} = \frac{1 - w_{ji}}{\sum_{k:(j,k)\in E} (1 - w_{jk})}   (3.53)
The IP Algorithm is thus based on two operations that relate directly to the aforementioned assumptions.
The operation Ii is related to a user’s influence and is as follows:
I_i \leftarrow \sum_{j:(i,j)\in E} u_{ij} P_j   (3.54)
In the formula, the term Pj corresponds to the passivity referred in Assumption 1, and the term uij to
the amount of dedication referred to in Assumption 2. As for operation Pi, it relates to a user’s passivity
and is as follows:
P_i \leftarrow \sum_{j:(j,i)\in E} v_{ji} I_j   (3.55)
In the formula, the term Ij corresponds to the influence referred in Assumption 3, and vji to the rejection
rate referred in Assumption 4.
The algorithm takes as input a weighted graph and computes the IP scores for each node in m iterations,
as depicted in the pseudo-code of Algorithm 2.
Algorithm 2 The Influence-Passivity (IP) Algorithm
G(N, E, W): an influence graph with N nodes, E edges and W edge weights
I_0 ← (1, 1, ..., 1) ∈ R^{|N|}
P_0 ← (1, 1, ..., 1) ∈ R^{|N|}
for i = 1 → m do
  Update P_i using operation P_i ← \sum_{j:(j,i)\in E} v_{ji} I_j and the values I_{i-1}
  Update I_i using operation I_i ← \sum_{j:(i,j)\in E} u_{ij} P_j and the values P_i
  for j = 1 → |N| do
    I_j = I_j / \sum_{k\in N} I_k
    P_j = P_j / \sum_{k\in N} P_k
  end for
end for
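Algorithm 2 can be sketched directly from Equations 3.52-3.55. The fragment below runs the IP updates on a tiny, hypothetical influence graph whose edge weights are made up for illustration:

```python
# Hypothetical weighted influence graph: w[(i, j)] is the ratio of the
# influence node i had over node j to the influence i attempted to have.
w = {("a", "b"): 0.8, ("a", "c"): 0.6, ("b", "c"): 0.2}
nodes = {"a", "b", "c"}

def u(i, j):
    """Acceptance rate (Equation 3.52)."""
    return w[(i, j)] / sum(wt for (k, m), wt in w.items() if m == j)

def v(j, i):
    """Rejection rate (Equation 3.53)."""
    return (1 - w[(j, i)]) / sum(1 - wt for (m, k), wt in w.items() if m == j)

I = {n: 1.0 for n in nodes}
P = {n: 1.0 for n in nodes}

for _ in range(50):  # m iterations of Algorithm 2
    P = {n: sum(v(j, n) * I[j] for (j, m) in w if m == n) for n in nodes}
    I = {n: sum(u(n, j) * P[j] for (m, j) in w if m == n) for n in nodes}
    # Normalize so that the scores sum to one across the network.
    I = {n: s / sum(I.values()) for n, s in I.items()}
    P = {n: s / sum(P.values()) for n, s in P.items()}
```

In this toy graph, node a, which forwards information to both others, ends up with the highest influence, while node c, which forwards nothing, gets an influence of zero.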
The authors also concluded that there is a weak correlation between popularity and influence. The IP
Algorithm turned out to provide better indicators of popularity than PageRank.
3.4 Citation and Co-Authorship Networks
In Bibliometrics, there are two classes of ranking algorithms. In the class of collection-based ranking
algorithms, a weighted graph is used whose nodes correspond to collections, e.g., journals and
conference proceedings, with the weighted edges representing the total number of citations that point
from one collection to the other. The other class corresponds to publication-based ranking algorithms,
where the nodes of the citation graph are individual publications and the edges represent citations
between papers (Sidiropoulos & Manolopoulos, 2005).
Both PageRank (Brin & Page, 1998) and HITS (Kleinberg, 1998) are part of the second class of ranking
algorithms, while the ISI Impact Factor (Bollen et al., 2006) is part of the first class.
Neither PageRank nor HITS is perfectly suitable for bibliometrics: HITS, because a publication only gets a high authority score if there are good hubs pointing to it; PageRank, because it was designed so that a node's score is mostly affected by the scores of the nodes that point to it, and less by the number of incoming links. Following this assessment, Sidiropoulos & Manolopoulos (2005) introduced SCEAS Rank, a collection-based ranking algorithm in which scores are computed over a weighted graph whose nodes correspond to collections. SCEAS can be defined as follows:
$S_j = \sum_{i \to j} \frac{S_i + b}{N_i} \, a^{-1}, \quad (a \ge 1,\ b > 0)$ (3.56)
In the formula, Ni is the number of outgoing citations of node i, b is the direct citation enforcement factor, used so that citations from zero-scored nodes can also contribute to the score of the publications they cite, and a denotes the speed at which indirect citation enforcement converges to zero: a change in the score of node i affects the score of a node j that is x nodes away by a factor of a^{-x}. The SCEAS approach also has the following advantages over the PageRank and HITS algorithms:
1. A node’s score is affected by the number of incoming citations.
2. The algorithm's computation converges very fast. In the experiment conducted by Sidiropoulos & Manolopoulos (2005) on a DBLP dataset, SCEAS needed half the time required by PageRank, and about 1/10 of the time required by HITS.
3. A node’s score is less affected by the score of distant nodes and, whenever new nodes and ci-
tations are added to the network, the new score’s computation can be performed incrementally,
using the previous score vector as the input vector for the computation.
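As a sketch of how Equation 3.56 can be iterated to a fixed point, consider the following Python fragment. The dict-based representation of the citation graph and the parameter values are illustrative assumptions, not the authors' implementation.

```python
# Sketch of iterating the SCEAS score (Equation 3.56) to a fixed point.
# The citation graph is given as {node: list of nodes it cites}.

def sceas(cites, a=2.0, b=1.0, iters=50):
    nodes = set(cites)
    for out in cites.values():
        nodes.update(out)
    S = {n: 0.0 for n in nodes}
    for _ in range(iters):
        S_new = {n: 0.0 for n in nodes}
        for i, out in cites.items():
            if not out:
                continue
            # Each cited paper receives (S_i + b) / N_i * a^-1 from citer i
            share = (S[i] + b) / (len(out) * a)
            for j in out:
                S_new[j] += share
        S = S_new
    return S
```

Because every contribution is damped by a^{-1} at each hop, the influence of distant nodes decays as a^{-x} and the iteration converges quickly, matching advantage 2 above.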
Specifically for co-authorship networks, where the graph nodes represent authors and edges represent ties between two authors, Liu et al. (2005) proposed AuthorRank, a modification of the PageRank algorithm that is computed over a weighted, directed co-authorship graph.
The co-authorship graph is directed and weighted in order to express the magnitude of the relationship between two authors and, as in Weighted PageRank, is represented by G = (V, E, W), with a set V of authors, a set E of co-authorship relationships, and a set W of normalized weights wij connecting authors vi and vj. The normalized weights wij are such that the outgoing weights of an author sum to one, and they are computed as follows:
$w_{ij} = \frac{c_{ij}}{\sum_{k=1}^{n} c_{ik}}$ (3.57)
In the formula, cij and cik correspond to the co-authorship frequency (Equation 3.58), which is also
correlated with exclusivity.
The idea behind co-authorship frequency is to assign more weight to authors that co-publish more
papers together, and do so exclusively (Liu et al., 2005). For a set of m articles, co-authorship frequency
is defined as follows:
$c_{ij} = \sum_{k=1}^{m} g_{i,j,k}$ (3.58)
In turn, exclusivity, i.e., giving more weight to co-authorship ties in articles with fewer total co-authors than in articles with a large number of co-authors (Liu et al., 2005), is defined, for authors vi and vj who co-author article ak, as follows:
$g_{i,j,k} = \frac{1}{f(a_k) - 1}$ (3.59)
In the formula, f(ak) is the total number of authors of article ak.
The magnitude of the connection between two authors is determined by the following factors:
1. Frequency of co-authorship: Authors that co-author frequently should have a higher co-authorship
weight;
2. Total number of co-authors on articles: Less weight should be assigned to the co-author relationship if the article has many authors.
Therefore, the AuthorRank of author i is expressed as follows:
$AR(i) = (1 - d) + d \sum_{j=1}^{n} AR(j) \times w_{j,i}$ (3.60)
In the formula above, AR(j) is the AuthorRank score of the backlinking node j and wj,i corresponds to
the weight of the edge between node j and node i.
Also, when exclusivity and collaboration frequency are taken into account, one can assess that some
ties are more prestigious than others.
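Equations 3.57 to 3.59 can be combined into a short sketch that derives the normalised co-authorship weights from a list of papers. The input format, each paper as a list of author names, is an assumption made only for illustration.

```python
# Sketch deriving the AuthorRank edge weights (Equations 3.57-3.59) from a
# list of papers, each given as a list of author names (assumed input format).
from collections import defaultdict

def coauthor_weights(papers):
    c = defaultdict(float)                       # co-authorship frequency c_ij
    for authors in papers:
        if len(authors) < 2:
            continue
        g = 1.0 / (len(authors) - 1)             # exclusivity g_ijk = 1/(f(a_k) - 1)
        for i in authors:
            for j in authors:
                if i != j:
                    c[(i, j)] += g
    out_sum = defaultdict(float)                 # sum_k c_ik per author
    for (i, _j), cij in c.items():
        out_sum[i] += cij
    # w_ij = c_ij / sum_k c_ik: each author's outgoing weights sum to one
    return {(i, j): cij / out_sum[i] for (i, j), cij in c.items()}
```

For instance, two papers, one by A and B alone and one by A, B and C, give the A-B tie three times the weight of the A-C tie, reflecting both frequency and exclusivity.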
3.5 Temporal Issues in Ranking Scientific Articles
Citation networks are generally static networks, since a scientific article cannot lose citations over the years, and since articles do not disappear from the network. On the other hand, social networks are
generally characterized as dynamic networks, which change at a very fast pace, due to new users
that make new connections and former users that leave the social network, breaking the ties they have
established. Still, even in the case of citation networks, new articles are also being constantly introduced.
Therefore, time is a key factor in social network analysis.
Sayyadi and Getoor developed FutureRank, which computes the expected PageRank score of a sci-
entific article, based on the citations it will obtain in the future (Sayyadi & Getoor, 2009). This number
of future citations is referred to as the usefulness of the article, and the authors assumed that recent
articles are more useful. Nevertheless, older and highly cited articles still get a good ranking, due to
being cited by recent articles. The algorithm is computed over a network that has two different types of nodes, namely articles and authors, and that can thus be unfolded into two distinct networks: (i) a citation network connecting articles through citation edges, and (ii) an authorship network connecting articles and authors through co-authorship edges. In the second network, articles can be mapped to the authorities and authors to the hubs of the HITS algorithm. As the two networks share the article nodes, information is passed between them.
In short, FutureRank runs one step of PageRank in the first network, in order to transfer authority from
the articles to their references, and one step of HITS in the second network. These results are repeatedly
combined until convergence is reached. The ranking of articles also involves a personalized PageRank vector, which is pre-computed based on the current time and the publication time of the articles, instead of being based on the number of nodes in the network as in the original PageRank algorithm.
The CiteRank algorithm (Walker et al., 2007) makes use of publication time in order to rank articles,
where each researcher, independently of others, is assumed to start his search with recent articles,
proceeding in a chain of citations until full satisfaction. The output of the algorithm can be seen as an
estimate of traffic to an article, i.e., the probability of encountering an article via a path of any length, and
is correlated to the number of citations in a way that the larger the number of citations, the more likely it
will be for the article to be visited via one of its incoming links. CiteRank is otherwise similar to the PageRank algorithm, except that CiteRank initially distributes random surfers exponentially with age, with probability $\rho_i = e^{-age_i/\tau_{dir}}$, where $age_i$ is the age of the i-th article and $\tau_{dir}$ is the characteristic decay time, thus
favoring recent articles.
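The age-biased starting distribution can be sketched as follows. The value of tau_dir is a free parameter of CiteRank; the default below is an arbitrary illustrative choice.

```python
# Sketch of CiteRank's age-biased starting distribution: surfer starting
# probabilities proportional to exp(-age_i / tau_dir), favouring recent
# articles. tau_dir here is an arbitrary illustrative value.
import math

def initial_distribution(ages, tau_dir=2.0):
    rho = [math.exp(-age / tau_dir) for age in ages]
    total = sum(rho)
    return [r / total for r in rho]      # normalised to sum to one
```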
3.6 Summary
This chapter presented previous work on the task of finding influencers in a network, with its main focus on the PageRank algorithm and the different variants that have arisen over the years. The Influence-Passivity (IP) algorithm was also presented, i.e., a novel approach to influence, based on the HITS and PageRank algorithms, that also takes information diffusion into account. Finally, we glanced at a recent trending research topic concerning temporal issues in ranking scientific articles, specifically the prediction of future PageRank scores in a citation network, based on the future citations that an article may receive.
Chapter 4
Finding Influencers in Social Networks
This chapter presents and details the work that was developed in the context of my MSc thesis. I
focused on studying and developing techniques to identify influential nodes in a network so that,
given a network, one can characterize it and assess which are the nodes that exert more influence
over others, i.e., which are the nodes that induce others to have a particular behavior, e.g., forward a
message or visit a renowned monument or concert venue.
Two distinct experiments were conducted, each with a different type of network. In the first experiment, we collected real and up-to-date data from a location-based social networking service, namely FourSquare, and from Twitter, a social networking and microblogging service, building social networks from the collected data. The network built from FourSquare's data is commonly called a location-based social network, due to its inclusion of information from users' interactions with other users, as well as users' interactions with locations as they check in at different places. The second experiment involved data from DBLP, a digital library containing information about academic publications and their citations, from which a citation network was built.
With the work that was developed, we wanted to test the hypothesis that a network's most influential nodes can be identified through network analysis metrics and algorithms. These techniques were applied to different kinds of social networks, in order to explore influence in distinct contexts. In the experiment with location-based social networks, we wanted to test how good these social network analysis metrics and algorithms are at identifying the most relevant nodes. On the other hand, when experimenting with academic social networks, we wanted to identify the most important papers in the dataset and test whether it was possible to predict the future influence scores of the nodes in the network, based on their previous influence scores.
The remainder of this chapter is organized as follows: first, we introduce the main software packages that were used and extended in the course of this research. Then, we describe the metrics used to characterize the social networks of our experiments. In Section 4.2 we thoroughly describe the experiment with location-based social networks, while in Section 4.3 we describe the experiment with the academic social network derived from DBLP, covering the process of data collection, the algorithms that were computed, and the methods used to find influential nodes. We finish this chapter with a brief summary of what has been presented.
4.1 Available Resources for Finding Influencers
To perform our experiments and fulfill the tasks of characterizing a social network and finding its most influential nodes, we used several state-of-the-art algorithms and open-source software packages for network analysis, among which is the LAW Webgraph software package.
LAW Webgraph is an open source project developed by researchers from the Laboratory of Web Algo-
rithms at the University of Milan. It contains a Java library for large-scale web graph analysis, presenting
a novel approach to graph compression that enables the creation and storage of web-scale graphs.
Among other things, the LAW Webgraph package contains an implementation of the PageRank algorithm, which was the first algorithm we used for assessing the influence of nodes in our experiments. Since we intended to extend this software package with the HITS and IP algorithms, the structure of LAW's PageRank implementation served as a template for our algorithmic extensions.
For the implementation of the HITS algorithm, we followed the pseudo-code in Algorithm 1, in which two different scores have to be computed: the hub score and the authority score. The computation of these scores is based, respectively, on the outlinks and inlinks of every node in the graph. Through LAW Webgraph's API, we could only access the successors of a node. To overcome this limitation when computing the HITS algorithm, we built both the graph and its transpose, instead of just the graph, so that we could access both the successors and the predecessors of each node (i.e., the inlinks of a node are its outlinks in the graph's transpose).
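The graph-plus-transpose workaround can be sketched as follows. This is an illustrative Python version only; the actual implementation was done in Java on top of LAW Webgraph, and the adjacency-dict representation is an assumption.

```python
# Sketch of HITS over successor lists only: inlinks of a node are read as the
# outlinks of the same node in the transposed graph (succ_t).

def hits(succ, succ_t, iters=50):
    """succ: node -> list of successors; succ_t: the transposed graph."""
    nodes = set(succ) | set(succ_t)
    for adj in (succ, succ_t):
        for targets in adj.values():
            nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority: sum of hub scores over inlinks (outlinks of the transpose).
        auth = {n: sum(hub[p] for p in succ_t.get(n, [])) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores over the node's own successors.
        hub = {n: sum(auth[s] for s in succ.get(n, [])) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth
```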
Analogously, the Influence-Passivity (IP) algorithm involves the computation of two scores - the influence
score and the passivity score. Thus, two graphs were again built. In this implementation we followed the
pseudo-code in Algorithm 2 from Section 3.3.
4.1.1 Characterizing Networks
To understand aspects such as the dimension of our generated graphs, or how well connected their nodes are, some well-known network analysis metrics were used.
With the average path length, one can assess the average distance between the nodes in our networks,
understanding how tightly connected they are (e.g., a small average path length indicates that all nodes are closely connected, which means that it will be easy to spread information through the network). The clustering coefficient allows us to assess how close the neighbours of a node are to one another, i.e., how nodes tend to form clusters with a large number of ties between them. On the other hand, by studying the degree distribution of the nodes in a network, one can assess whether we are in the presence of a large-scale network characterized by a power-law degree distribution, i.e., a network in which the majority of the nodes have few connections, but where a smaller set of nodes holds an extremely large number of connections. These well-connected nodes are called hubs, and they can also be seen as central points of aggregation in the network.
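For a small undirected graph, the three characterisation metrics can be computed directly, as in the following self-contained sketch (the adjacency-set representation is an illustrative assumption; our experiments relied on dedicated network analysis software).

```python
# Self-contained sketch of the three characterisation metrics, for a small
# connected undirected graph given as {node: set of neighbours}.
from collections import deque, Counter

def avg_path_length(adj):
    """Mean shortest-path distance over all connected node pairs (BFS)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for n, d in dist.items() if n != src)
        pairs += len(dist) - 1
    return total / pairs

def avg_clustering(adj):
    """Mean local clustering coefficient: closed pairs among a node's neighbours."""
    coeffs = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

def degree_distribution(adj):
    """Histogram of node degrees, used to spot power-law-like tails."""
    return Counter(len(nbrs) for nbrs in adj.values())
```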
4.2 Analysis of Location-based Social Networks
A traditional social network comprises a single type of node: the users in the network. The edges between these nodes represent the friendship ties between the users. In turn, a location-based social network has all the properties of a traditional social network; however, there are now two types of nodes instead of one, namely (1) user nodes, which are the users in the network and who can be friends with other users, and (2) location nodes, which are the locations users have visited or mentioned in their personal messages. Therefore, one can say that a location-based social network also has two types of edges or social ties, namely (1) user-user ties, corresponding to the edges between two users and in all respects similar to the edges existing in traditional social networks, and (2) user-location ties, corresponding to the edges between users and locations, which are derived from a user mentioning or visiting a specific location. Location-based social networks yield a great amount of information, because one can look at them as two layers: one where users are connected to their friends, and an underlying layer where users are connected to locations. The latter is an intersecting layer through which one can identify the most visited locations (i.e., locations that are connected to a larger number of users) and, from a location perspective, which locations exert more influence over the users they are connected to - see Figure 4.8.
Most online social networking services have public APIs, which allow the search and extraction of publicly available, real and up-to-date data. In our experiments, all the considered social network platforms provided access to a public API. Thus, the first step to gather information from these social networking services was to request data from the API and store it in a structured way, e.g., an XML file, for subsequent processing. With the raw data organized, it was then filtered to decouple user information from location information, as well as from relationship ties. The different ranking algorithms and network analysis metrics were finally applied to a graph generated from the relationship ties in the previously filtered data.
Figure 4.8: Example of a location-based social network (adapted from Zheng & Zhou (2011)).
Data was collected from two different social network platforms: FourSquare and Twitter. FourSquare is a location-based social network that allows users to check in at different locations which, in its terminology, are called venues, ranging from restaurants to nightclubs, movie theaters, university campuses or a city's most iconic monuments. It was founded in 2009 and is a web application specially intended to be used on mobile devices. With the widespread availability of smartphones and mobile gadgets with an Internet connection, FourSquare's network and service have been growing and evolving throughout the years, reaching the 7 million registered users milestone in 2011.
In FourSquare, registered users can search for other users or venues, e.g., one can search for Indian Restaurant near New York and access an extensive list of restaurants, each one with an address and a geospatial location, user-uploaded photos, reviews by users who have checked in there, as well as a list of venues that are similar to the searched one. Venues can be associated with categories and tags. There is also an underlying game-play concept in this kind of social network, encouraging continuous interaction: (i) users earn points for checking in at venues or adding new venues to FourSquare, (ii) users earn badges if they check in at various different venues or complete tasks, and (iii) a FourSquare user can become mayor of a specific venue if he has checked in at that venue on more days than anyone else, over a period of 60 days.
On the other hand, Twitter is a social networking and microblogging service that allows users to post messages up to 140 characters long - the tweets. Created in 2006, it has grown to be one of the most well-known social networks, with over 500 million active users. Initially, Twitter was only accessible via its website, but today there is a multitude of mobile applications at hand to manage one's account, tweet wherever one pleases, and attach links to tweets. Nowadays, many Twitter users tweet as they arrive (or check in) at a specific location, deliberately attaching the geographical coordinates of that place to their tweet. This way, we can associate Twitter users with locations, building a location-based social network.
4.2.1 Data Collection from Online Services
To extract data about users and venues in FourSquare, we used the FourSquare API1, which returns JSON2 objects that contain the result of each API call. Nevertheless, for simplicity of use, an open-source Java implementation3 of the FourSquare API was used, providing straightforward methods to interact with the FourSquare API. This Java API includes all methods in the official FourSquare API. However, the functionality of the method that searches for venues (i.e., venuesSearch) was not fully implemented, so a simple change to FourSquare's Java API was needed in order to extract reliable data. The official API's venuesSearch method allows one to obtain a set of venues that are near the provided latitude-longitude coordinates and within a specified radius of up to 5 km; this radius functionality was missing from the open-source FourSquare Java API, so we simply added the radius parameter to the venuesSearch API call, thus taking full advantage of that functionality and obtaining more venues per call - see the pseudo-code in Algorithm 3. Also, we defined a bounding box for the New York City-Manhattan area, restricting our data collection to that geographical area, in order to make a more contained study.
Algorithm 3 Pseudocode for the extraction of user and friend data from FourSquare.
latmax: maximum latitude for the New York City - Manhattan bounding box
longmax: maximum longitude for the New York City - Manhattan bounding box
latmin: minimum latitude for the New York City - Manhattan bounding box
longmin: minimum longitude for the New York City - Manhattan bounding box
lat: current latitude
long: current longitude
radius = 1000 (i.e., 1 km)
userSet: set of users from a venue
for all lat ∈ [latmin, latmax] and long ∈ [longmin, longmax] do
    venueSet ← all venues for (lat, long) within radius
    for all venue ∈ venueSet do
        Retrieve and store venue info
        userSet ← all of the venue's visiting users
        for all user ∈ userSet do
            Retrieve the user's friends
            Store friend information
        end for
    end for
end for
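The nested crawling loop of Algorithm 3 can be sketched with the API calls abstracted away. Here, search_venues, get_visitors and get_friends are hypothetical stand-ins injected as parameters, not real method names of the FourSquare Java API.

```python
# Sketch of Algorithm 3's grid crawl over a bounding box, with the FourSquare
# calls abstracted as injected functions (hypothetical stand-ins).

def crawl(bbox, step, radius, search_venues, get_visitors, get_friends):
    lat_min, lat_max, lon_min, lon_max = bbox
    data = {"venues": {}, "friends": {}}
    lat = lat_min
    while lat <= lat_max:
        lon = lon_min
        while lon <= lon_max:
            for venue in search_venues(lat, lon, radius):
                data["venues"][venue] = (lat, lon)       # store venue info
                for user in get_visitors(venue):
                    # store each visiting user's friend list once
                    data["friends"].setdefault(user, get_friends(user))
            lon += step
        lat += step
    return data
```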
As for Twitter, we used the Twitter Public Stream API4, which provides a sample of 1% of all the tweets being published at any given moment. The data collection process had the following phases:
1. From that 1% of tweets, only the ones that had geographical coordinates were selected. Also, for each tweet we collected information such as the user id, the users he is following, and the users that are following him. Afterwards, with the coordinates associated to a user's tweet, we could establish user-location ties and, with the following and follower relationships, we could establish user-user ties.

1 https://developer.foursquare.com
2 http://www.json.org/
3 http://code.google.com/p/foursquare-api-java/
4 https://dev.twitter.com/docs/streaming-apis
2. From the collected user information, the users with the greatest number of connections were selected, and data about their friends and followers was gathered.
3. Afterwards, similarly to what was done for FourSquare, all the collected data was filtered in order to keep only the information about tweets posted within the New York City-Manhattan area.
In order to perform the discretization of geospatial coordinates, we used the Hierarchical Triangular Mesh
(HTM) approach to divide the Earth’s surface into a set of triangular regions, each roughly occupying
an equal area of the Earth (Dutton, 1996; Szalay et al., 2007). In brief, the HTM offers a multi-level recursive decomposition of a spherical approximation to the Earth's surface. It starts at level
zero with an octahedron and, by projecting the edges of the octahedron onto the sphere, it creates 8
spherical triangles, 4 on the Northern and 4 on the Southern hemisphere. Four of these triangles share
a vertex at the pole and the sides opposite to the pole form the equator. Each of the 8 spherical triangles
can be split into four smaller triangles by introducing new vertices at the midpoints of each side, and
adding a great circle arc segment to connect the new vertices with the existing ones - see Figure 4.9.
Figure 4.9: A sequence of subdivisions of the world sphere, starting from the octahedron, down to level 5, corresponding to 8192 spherical triangles. The circular triangles have been plotted as planar ones, for simplicity (adapted from Szalay et al. (2007)).
This sub-division process can be repeated recursively, until we reach the desired level of resolution,
as shown in Figure 4.10. The triangles in this mesh are the regions used in our representation of the
Earth, and every triangle, at any resolution, is represented by a single numeric ID. For each location
given by a pair of coordinates on the surface of the Earth, there is an ID representing the triangle, at
a particular resolution, that contains the corresponding point. Notice that this representation scheme contains a parameter k that controls the resolution, i.e., the area of the triangular regions. With a resolution of k, the number of regions n used to represent the Earth corresponds to n = 8 · 4^k.
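The region count thus grows geometrically with the level, which the following one-liner makes concrete: level 0 gives the 8 faces of the projected octahedron, and level 5 gives the 8192 triangles of Figure 4.9.

```python
# Number of HTM triangular regions at resolution level k: n = 8 * 4^k.
def htm_regions(k):
    return 8 * 4 ** k
```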
Figure 4.10: The HTM recursive division process (adapted from Szalay et al. (2007)).
From the geographical coordinates found in some of the collected tweets, we computed the Hierarchical Triangular Mesh (HTM), so that we could give each geographical coordinate a trixel representation. With a trixel representation, instead of a latitude-longitude representation, one has more freedom in specifying the range of the collected locations. In our case, we established three ranges of trixels according to their resolution, i.e., locations with resolution 25, with resolution 20, and with resolution 10.
Nevertheless, this data collection process had some limitations. The main limitation in the FourSquare API was its rate limit of 500 authenticated calls per hour, which is a very low threshold considering that we performed an extensive crawl and that each request for the listing of a user's friends is a frequent authenticated API call. As for the Twitter API, we had a rate limit of 600 calls per hour and, upon exceeding that limit, we had to wait until the next hour to make more API calls. This made us disregard a large number of tweets during that waiting time.
4.2.2 Adaptation of the Influence-Passivity (IP) Algorithm
A major contribution of this work was the adaptation and implementation of the aforementioned Influence-
Passivity (IP) algorithm. Developed by Romero et al. (2011), the IP algorithm was part of a study on
information propagation in Twitter, where the authors came to the conclusion that most users of this
social network act as passive consumers of information, not forwarding content to the network. This algorithm presents a novel way of quantifying the influence of nodes in a network, by considering that each node has an influence score as well as a passivity score. These scores have a mutually reinforcing relationship, like the hub and authority scores in the HITS algorithm (Kleinberg, 1998).
For our implementation, some changes had to be made to the original IP algorithm, in order to adapt it to location-based social networks and perform an edge weight calculation that was consistent with the datasets we were working with. From the Twitter data collected by Romero et al. (2011), the weight of an edge e = (i, j) was assigned as follows:
$w_e = \frac{S_{ij}}{Q_i}$ (4.61)
In the formula, Qi represents the number of URLs that node i mentioned and Sij is the number of URLs
that were mentioned by node i and retweeted by node j.
In the case of our datasets from FourSquare and Twitter, we wanted to generate a weight exclusively
based on user-location and user-user ties, instead of URLs or retweets, as proposed by the original
authors. Thus, we built a graph that, rather than having two types of nodes (i.e., locations and users), only has user nodes, estimating exclusively the influence of users in the network.
To calculate the weight of the edges between users, we adapted the Qi and Sij parameters, with Qi being the number of locations node i has visited, and Sij the number of locations visited by both i and j, i.e., the number of commonly visited locations between nodes i and j, where i visited each location before j did. In our adaptation of the algorithm, a user's influence is thus always dependent on the popularity of the locations the user has visited.
The original graph built from our datasets is depicted in Figure 4.11: the left-most graph includes two types of nodes, (i) user nodes, represented by U1...U4, and (ii) location nodes, represented by S1...S3, and has undirected user-location ties and directed user-user ties. The right-most graph in Figure 4.11 is the result of our adaptation for the IP algorithm: a network graph that only has directed and weighted user-user ties, with some structural differences, e.g., the original user-user edges no longer exist and new edges arise from common visits to locations. The connection between two nodes is associated with a positive weight if they share a visited location, e.g., U3 and U2 both visited location S2, so there is a new edge from U3 to U2, with weight w2, because U3 visited S2 after U2 had visited it.
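The adapted edge-weight computation can be sketched as follows. The check-in input format, the use of each user's earliest visit per location, and the edge direction from the earlier visitor i to the later visitor j (following the definition of S_ij above) are all illustrative assumptions rather than the exact thesis implementation.

```python
# Sketch of the adapted IP edge weights: given per-user check-in histories as
# {user: [(location, timestamp), ...]}, an edge i -> j gets weight S_ij / Q_i,
# where Q_i is the number of locations i visited and S_ij the number of
# locations i visited before j did (assumed direction convention).

def user_user_edges(checkins):
    first_visit = {u: {} for u in checkins}      # earliest visit per location
    for u, visits in checkins.items():
        for loc, t in visits:
            if loc not in first_visit[u] or t < first_visit[u][loc]:
                first_visit[u][loc] = t
    edges = {}
    for i in checkins:
        q_i = len(first_visit[i])
        for j in checkins:
            if i == j or q_i == 0:
                continue
            s_ij = sum(1 for loc, t in first_visit[i].items()
                       if loc in first_visit[j] and t < first_visit[j][loc])
            if s_ij:
                edges[(i, j)] = s_ij / q_i
    return edges
```

The resulting weighted user-user graph can then be fed directly to the IP iteration described in Section 3.3.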
4.3 Analysis of Academic Social Networks
Alongside social networks, this work also focused on assessing the influence of nodes in an academic social network, i.e., a network where the nodes either refer to authors of scientific papers, connected via co-authorship ties that form a co-authorship network, or to the scientific papers themselves, connected through citation ties that originate a citation network. We wanted to assess which were the most influential papers in the scientific community, i.e., the ones gathering more attention, either due to the importance of their author(s), or due to being about a trending topic or an important breakthrough. To do so, we gathered the already organized data from the DBLP digital library, via the Arnetminer Project1, which contains information about scientific papers from 1935 to 2011, including the abstract and the
1http://arnetminer.org/DBLP_Citation
Figure 4.11: Transformation of the original network graph (left) to our IP algorithm graph (right).
number of citations. From this data, we built a citation network for a set of time-stamps ranging from 2007 to 2011, as depicted in Figure 4.12, in order to have a record of how the network evolved over time.
Figure 4.12: Structure of the citation graph built upon the DBLP data.
Although any other ranking algorithm could have been used, in the case of the DBLP citation network the most influential papers in the dataset were determined through the computation of the PageRank algorithm. The top-10 highest-ranked papers were then selected and their full information was gathered, in order to cross-check the set of authors of each paper against the recipients of renowned computer science and engineering awards, such as the Gerard Salton Award or the Turing Award, identifying which of these authors were distinguished by the scientific community.
4.3.1 Predicting Future Influence Scores and Download Counts
Instead of computing the future PageRank scores of scientific papers based on their future citations, as did Sayyadi & Getoor (2009), we created a framework to predict the future PageRank scores of scientific papers in a citation network for a specific year, based on their previous PageRank scores, among other features. The same principle was also applied to the prediction of download counts for scientific articles downloaded from the ACM Digital Library website in the year 2011.
In the framework depicted in Figure 4.13, in order to predict the future PageRank scores and future download counts, we have three distinct phases:
1. Feature Vector Creation
The first phase prepares the input for the computations related to the prediction of importance scores. Given the dataset, either for paper citations or download counts, one generates the different features, namely the text, age and PageRank scores, and stores them in a relational database, so that feature vectors can then be generated.
2. Prediction
In a second phase, one creates training and test files from the generated feature vector files, in order to proceed with the computation of a machine learning technique intended for predicting the future PageRank scores and the future download counts.
3. Accuracy Assessment
Finally, to assess the quality of the obtained results, one proceeds with the computation of various evaluation metrics.
Figure 4.13: Framework for predicting future PageRank scores and download counts.
Each of the aforementioned phases is a preparation for the following one. To predict the PageRank scores and the download counts, we relied on features that can represent the characteristics of the information in the dataset. The following types of features were considered:
1. Absolute Scores - Includes the PageRank score resulting from the computation of the algorithm over papers that were published until a specific year, inclusive. Regarding the PageRank score of a
52
paper, we defined 5 different cumulative time-stamps, from 2007 to 2011, so we could have access
to the respective PageRank scores in each k previous year.
2. Differential Scores - Includes the Rank Change Rate (Racer), representing the change rate of
PageRank score between two consecutive years, capturing the evolution of PageRank scores.
The Rank Change Rate between two time-stamps t_i and t_{i+1}, for a paper p, is given by the following equation:

racer(p, t_i) = \frac{rank(p, t_{i+1}) - rank(p, t_i)}{rank(p, t_{i+1})}    (4.62)
3. Profile Information - Includes the Average PageRank Score, that represents the average of the
PageRank score of all publications that have an author in common with the paper’s set of au-
thors, and the Maximum PageRank Score, which represents the maximum PageRank score of all
publications that have an author in common with the paper’s set of authors.
4. Age - Includes the difference between the present year and the publication year of a paper, i.e., its
age.
5. Text - Includes the term frequency score for the top 100 most frequent tokens in the abstracts and titles of publications, excluding the terms in the standard English stop-word list.
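To make the absolute and differential scores concrete, the sketch below builds cumulative yearly snapshots of a citation graph, runs a plain power-iteration PageRank on each, and derives the Rank Change Rate of Equation 4.62 from two consecutive snapshots. This is a minimal pure-Python illustration under simplified assumptions, not the thesis implementation; all function names are hypothetical.

```python
def pagerank(edges, nodes, d=0.85, iters=50):
    # plain power-iteration PageRank; edges are (citing, cited) arcs
    pr = {u: 1.0 / len(nodes) for u in nodes}
    out = {u: [] for u in nodes}
    for src, dst in edges:
        out[src].append(dst)
    for _ in range(iters):
        new = {u: (1.0 - d) / len(nodes) for u in nodes}
        for u in nodes:
            # dangling nodes spread their mass uniformly over all nodes
            targets = out[u] if out[u] else nodes
            share = d * pr[u] / len(targets)
            for v in targets:
                new[v] += share
        pr = new
    return pr

def snapshot_scores(pub_year, citations, years):
    # cumulative time-stamps: each snapshot keeps only the papers
    # published up to (and including) that year
    scores = {}
    for y in years:
        nodes = [p for p, py in pub_year.items() if py <= y]
        keep = set(nodes)
        edges = [(a, b) for a, b in citations if a in keep and b in keep]
        scores[y] = pagerank(edges, nodes)
    return scores

def racer(rank_next, rank_curr):
    # Rank Change Rate between time-stamps t_i and t_{i+1} (Equation 4.62)
    return (rank_next - rank_curr) / rank_next
```

For example, a paper cited by every later paper accumulates a higher score in each successive snapshot, and the racer value captures the relative change between two snapshots.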
For each aforementioned type of feature, except age and text, we considered its value for the previous k years, with k ranging from 1 to 3. For example, when predicting the future PageRank score for 2010, one first predicted the score using only the PageRank score of the previous year (k = 1, i.e., 2009), then using information from the two previous years (k = 2, i.e., 2009 and 2008), and finally from the three previous years (k = 3, i.e., 2009, 2008 and 2007).
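The construction of these k-year feature vectors can be sketched with a small helper (the function name and the per-year score store are hypothetical, assuming the yearly scores were already computed):

```python
def lagged_features(scores_by_year, target_year, k):
    # feature vector with the scores of the k years preceding target_year,
    # most recent year first: k=2 and target 2010 -> [score(2009), score(2008)]
    return [scores_by_year[target_year - i] for i in range(1, k + 1)]
```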
In order to enrich the way we made our predictions, we made a structured combination of the previously
enumerated types of features, which fit into three different groups:
• 1 - In this group we used exclusively the PageRank scores of the paper as features.
• 1 + 2 - In this group we used both PageRank and Racer scores of the paper as features.
• 1 + 2 + 3 - In this group we used PageRank scores, Racer scores, Average Author scores and
Maximum Author scores as features.
The remaining text and age features were separately added to the aforementioned combinations of features, enabling the creation of two distinct subsets of results. Thus, alongside the different ranges of k used, one could assess whether, for a particular type of feature or group of features, adding more information about previous years would improve or degrade the accuracy of our results. Also, for a straightforward computation of the Racer, Average PageRank score, Maximum PageRank score and feature vectors, the PageRank scores for each paper at each time-stamp, information about the authors of the papers, and information about download counts were stored in a relational database.
4.3.2 The Learning Approach
To predict future PageRank scores and future download counts, we used an ensemble machine learning technique included in the RT-Rank1 package, an open-source project consisting of implementations of various machine learning algorithms based on regression trees.
The algorithm we used, called Initialized Gradient Boosted Regression Trees (IGBRT), is essentially a point-wise machine learning algorithm developed by a team from Washington University in St. Louis for the 2010 Yahoo! Learning-To-Rank Challenge. The algorithm is shown in Algorithm 4, and it is based
on Gradient Boosting Regression Trees (GBRT) (Mohan et al., 2011). GBRT is a machine learning
technique based on tree averaging, which uses a set of trees to classify a new object, instead of the
single best tree (Oliver & Hand, 1995). It sequentially adds small trees (d ≈ 4), each with high bias, and, in each iteration, the new tree to be added focuses strictly on the objects that are responsible for the current remaining regression error. IGBRT follows the guidelines of SVMlight2, proposed by Joachims (1999, 2002).
Algorithm 4 Initialized Gradient Boosted Regression Trees (Squared Loss)
Input: data set D = {(x1, y1), ..., (xn, yn)}; Parameters: α, M_B, d, K_RF, M_RF
F ← RandomForests(D, K_RF, M_RF)
Initialization: r_i = y_i − F(x_i) for i = 1 → n
for t = 1 → M_B do
  T_t ← Cart({(x1, r1), ..., (xn, rn)}, f, d)   {Build Cart of depth d, with all f features, and targets r_i}
  for i = 1 → n do
    r_i ← r_i − α T_t(x_i)   {Update the residual of each sample x_i}
  end for
end for
return T(·) = F(·) + α Σ_{t=1}^{M_B} T_t(·)   {Combine the Regression Trees T_1, ..., T_{M_B} with the RF F}
With the intention of addressing GBRT's main weakness, i.e., the inherent trade-off between the step size and early stopping, Mohan et al. (2011) proposed an ensemble algorithm that starts off at a point very close to the global minimum and refines the already good predictions. Thus, instead of initializing the algorithm with an all-zero function, as occurs in GBRT, the IGBRT algorithm is initialized with the predictions of Random Forests (Breiman, 2001), the latter being known to be resistant to overfitting, insensitive to parameter settings, and to require no additional parameter tuning.
IGBRT uses GBRT to further refine the results of Random Forests, which are regarded by the authors
1 https://sites.google.com/site/rtranking/
2 http://svmlight.joachims.org/
as a good starting point for the algorithm.
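The overall structure of Algorithm 4 can be sketched in miniature as follows, with one-dimensional regression stumps standing in both for the Random-Forest initialization (approximated here by a bootstrap bag of stumps) and for the depth-d CART weak learners. This is an illustrative toy under those simplifications, not the RT-Rank implementation:

```python
import random

def stump_fit(xs, rs):
    # depth-1 regression tree on a single 1-D feature: pick the split
    # threshold that minimizes the summed squared error of the two leaf means
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for cut in range(1, len(xs)):
        left = [rs[i] for i in order[:cut]]
        right = [rs[i] for i in order[cut:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        thr = (xs[order[cut - 1]] + xs[order[cut]]) / 2.0
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def igbrt_fit(xs, ys, alpha=0.1, m_b=200, n_init=20):
    # phase 1: initialize with a bagged ensemble (stand-in for Random Forests)
    rng = random.Random(0)
    n = len(xs)
    bag = []
    for _ in range(n_init):
        idx = [rng.randrange(n) for _ in range(n)]
        bag.append(stump_fit([xs[i] for i in idx], [ys[i] for i in idx]))
    f0 = lambda x: sum(t(x) for t in bag) / len(bag)
    # phase 2: gradient boosting (squared loss) on the residuals r_i = y_i - F(x_i)
    residuals = [y - f0(x) for x, y in zip(xs, ys)]
    trees = []
    for _ in range(m_b):
        t = stump_fit(xs, residuals)
        trees.append(t)
        residuals = [r - alpha * t(x) for r, x in zip(residuals, xs)]
    # final model: F(x) + alpha * sum_t T_t(x)
    return lambda x: f0(x) + alpha * sum(t(x) for t in trees)
```

The bagged initialization already lands close to the target function, and the boosting stage then chips away at the remaining residual error, mirroring the division of labour between Random Forests and GBRT in IGBRT.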
4.4 Summary
In this chapter I detailed the two types of experiments that were conducted within my MSc thesis. I
began explaining the characteristics of location-based social networks and of academic social networks,
emphasizing their peculiarities. Then, for each experiment, I described the datasets, the data collection
technique, and the methodology for finding the influencers in the network, alongside with the algorithms
that were used. For the particular case of academic social networks, a novel approach to predicting
future PageRank scores and future download counts was also presented.
55
Chapter 5
Validation Experiments
This chapter presents the results of the undertaken experiments and the evaluation methodology
used to assess the veracity of the obtained results. Beginning with a concise characterization
of all the datasets that were used and their respective networks, the evaluation methodology is then
presented, comprising all the metrics that were used to assess the quality and veracity of the results.
Finally, the obtained results for each experiment are presented and further discussed. The results comprise the experiments for finding influencers in FourSquare and Twitter, and in the citation network built upon the DBLP dataset, as well as the experiments for predicting the future PageRank scores of scientific papers from 2010 and 2011 in the DBLP citation network and for predicting the download counts of the scientific papers published in 2011, downloaded from the ACM Digital Library.
5.1 The Considered Datasets
This section includes the dataset and network characterization of all the datasets that we used.
In order to understand the structural differences between a location-based social network and a social network that consists only of relationships between users, and how this structure affects influence estimation, we created two different graphs for both the FourSquare and Twitter datasets. First, we considered a graph consisting of the original location-based network built upon the data that was crawled, which we called the User+Spot Graph. Afterwards, we disregarded all the user-location relationships and built a graph consisting only of user-user ties, which we called the User Graph.
In the case of the DBLP dataset, the distinction between two graphs was not needed, because our focus was on creating a citation network upon which we could estimate the PageRank scores of its nodes and use them as features for the algorithm that predicts the future influence scores of papers and their future download counts. As for FourSquare and Twitter, this structural difference yields interesting results when estimating user influence.
                                              FourSquare | Twitter
Spots
  Total:                                          48,257 | 1,358
  HTM Resolution 10:                                   — | 13
  HTM Resolution 20:                                   — | 1,277
  HTM Resolution 25:                                   — | 1,358
Users
  Total:                                         447,545 | 2,603,505
  Relations:                                     970,587 | 3,218,997
  Visiting Spots:                                 16,960 | 1,017
Arcs
  PageRank & HITS (User+Spot Graph):           2,539,986 | 3,757,555
  PageRank & HITS (User Graph):                1,017,887 | 3,576,157
  IP Algorithm:                                1,017,887 | —
Nodes
  PageRank & HITS (User+Spot Graph):             451,664 | 2,604,863
  PageRank & HITS (User Graph):                  403,407 | 2,603,505
  IP Algorithm:                                  447,545 | —
InDegree
  Minimum (User+Spot Graph):                           0 | 1
  Maximum (User+Spot Graph):                       3,166 | 38,542
  Average (User+Spot Graph):                      2.8626 | 5.6162
  Minimum (User Graph):                                0 | 1
  Maximum (User Graph):                            3,166 | 38,452
  Average (User Graph):                           2.5478 | 5.6256
OutDegree
  Minimum (User+Spot Graph):                           0 | 1
  Maximum (User+Spot Graph):                       1,000 | 460,466
  Average (User+Spot Graph):                     74.8821 | 1.5615
  Minimum (User Graph):                                0 | 1
  Maximum (User Graph):                            1,000 | 460,466
  Average (User Graph):                          60.5829 | 1.5618
Average Degree
  Total (User+Spot Graph):                        5.4640 | 3.8868
  Users (User+Spot Graph):                        5.6714 | 2.8878
  Spots (User+Spot Graph):                        5.7118 | 1.0376
  Total (User Graph):                             5.0488 | 2.8872
Average Path Length
  User+Spot Graph:                                4.7369 | 3.9776
  User Graph:                                     4.7764 | 3.9823
Clustering Coefficient
  User+Spot Graph:                                0.2987 | 0.1156
  User Graph:                                     0.3718 | 0.1152
Table 5.1: Characterization of the FourSquare and Twitter networks.
Regarding the characteristics of both graphs in the FourSquare and Twitter datasets, depicted in Table 5.1, one can acknowledge that while the first dataset is more complete in terms of user-location ties and quantitative spot information, the latter is more complete in terms of user-user ties and user friendship information. This behaviour occurs because FourSquare is a pure location-based network focused on sharing the locations users have visited, while Twitter is a microblogging and social network platform focused on the exchange of messages between users, thus giving priority to the relationships between users and their friends and followers. In what regards the HTM resolution, we used a resolution of 26.
When considering the average path length and the clustering coefficient, one can assess that while the nodes in the FourSquare network are closer to each other, the neighbours of nodes in Twitter are closer to one another than in FourSquare. The latter phenomenon has to do with the fact that we could collect a greater extent of data for friends of users in the Twitter dataset, resulting in a scenario where friends of different users can, themselves, be friends and/or have friends in common. Also, one can observe that the User Graph naturally has a greater average path length and a greater clustering coefficient than the User+Spot Graph, because the User Graph has fewer nodes and no longer contains the spots that previously shortened the distance between users and between neighbourhoods of users.
The academic citation network built upon DBLP data comprises scientific papers from 1935 to 2011 and, from Table 5.2, one can also get an idea of the dimension of the dataset at each of the considered time-stamps, as well as of how complete the information about the scientific papers is.
Regarding the degree distribution in the FourSquare and Twitter networks, in both the User+Spot Graph and the User Graph, one can acknowledge from Figure 5.14 that the degree distribution for these datasets follows a power law, which is a characteristic of large-scale networks, i.e., networks in which the majority of the nodes have very few connections, while very few nodes have a high number of connections. Nevertheless, from the values of average path length and clustering coefficient, one can say that the FourSquare and Twitter networks are not representative of large-scale networks, because in large-scale networks, besides the power-law degree distribution, the average path length must be much smaller than the clustering coefficient, revealing that the nodes are very close to each other and their neighbourhoods are highly clustered.
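For reference, the two statistics discussed above can be computed over a small undirected graph as in the sketch below (adjacency given as a dict of neighbour sets; this is an illustrative toy, since measuring million-node graphs, as done in the thesis, requires sampling or specialized tooling):

```python
from collections import deque

def avg_path_length(adj):
    # mean shortest-path length over all reachable ordered pairs,
    # computed with a BFS from every node
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for n, d in dist.items() if n != s)
        pairs += len(dist) - 1
    return total / pairs

def clustering_coefficient(adj):
    # average local clustering: for each node, the fraction of its
    # neighbour pairs that are themselves connected
    coeffs = []
    for u, nbrs in adj.items():
        nbrs = list(nbrs)
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)
```

A triangle graph yields an average path length of 1 and a clustering coefficient of 1, while a three-node path has clustering 0 and average path length 4/3, matching the intuition behind the FourSquare/Twitter comparison above.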
          Publications | Citations | Authors | Papers with Downloads | Papers with Abstract | Average Terms per Paper
Overall      1,572,277 | 2,084,019 | 601,339 | 17,973 | 529,498 | 104
2007           135,277 | 1,150,195 | 330,001 | 15,516 | 343,837 | 95
2008           146,714 | 1,611,761 | 385,783 | 17,188 | 419,747 | 98
2009           155,299 | 1,958,352 | 448,951 | 17,973 | 504,900 | 101
2010           129,173 | 2,082,864 | 469,719 | 17,973 | 529,201 | 103
2011             8,418 | 2,083,947 | 469,917 | 17,973 | 529,498 | 104
Table 5.2: Characterization of the DBLP dataset.
On the other hand, one can acknowledge from the network characterization in Table 5.3 that the academic social network that was built naturally grows at each time-stamp, although this growth is not as significant in the last two time-stamps as it is in the first two.
Focusing on the average path length and the clustering coefficient, one can conclude that as we include more papers in the network, i.e., at each time-stamp, papers become closer to one another through the existence of more citation relationships between them, even though they tend not to be as clustered together over time.
From the plots in Figure 5.15, one can acknowledge that the number of papers increases through the years. However, these new papers tend to have few citations, and so the tail of the plots gets thicker throughout the years, i.e., new, rarely cited papers are frequently added to the dataset, while the number of highly cited papers remains almost unaltered.
Figure 5.14: Degree distribution for nodes in the User+Spot Graph and the User Graph, from the FourSquare and Twitter datasets.
5.2 Evaluation Methodology
When assessing the quality and veracity of the results for the top-10 highest ranked users and spots in the FourSquare and Twitter datasets, we conducted an empirical analysis and relied on profile information, due to the fact that this research area is still evolving and there are no strict parameters or ground-truth lists with which to truly assess the influence of a node in these networks. On the other hand, when assessing the veracity of the DBLP top-10 highest ranked papers, we empirically analyzed our results against a list of recipients of renowned scientific awards, like the Gerard Salton Award and the Turing Award, and, when the authors were not part of that list, we also checked their academic publication profiles1 in order to assess if they were renowned scientists.
In the case of the experiments on future PageRank and future download count prediction, we used a set of error metrics. One of these metrics is Kendall's Tau, which corresponds to a value ranging between
1http://academic.research.microsoft.com/
       In-Degree (Min / Max / Avg) | Out-Degree (Min / Max / Avg) | Degree (Min / Max / Avg) | Average Path Length | Clustering Coefficient
2007         0 / 1,508 / 2.9153   |        0 / 227 / 2.9153      |    0 / 1,508 / 5.8329    |       6.1800        |        0.1323
2008         0 / 1,875 / 3.5357   |        0 / 266 / 3.5357      |    0 / 1,875 / 7.0790    |       6.1047        |        0.1319
2009         0 / 2,207 / 3.6993   |        0 / 269 / 3.6993      |    0 / 2,207 / 7.4012    |       6.0833        |        0.1314
2010         0 / 2,306 / 3.7670   |        0 / 269 / 3.7670      |    0 / 2,306 / 7.5430    |       6.0665        |        0.1312
2011         0 / 2,311 / 3.7673   |        0 / 269 / 3.7673      |    0 / 2,311 / 7.5367    |       6.0676        |        0.1310
Table 5.3: Characterization of the DBLP network.
[−1, 1] and is defined as follows:

\tau = \frac{2 c_i}{\frac{1}{2} n_i (n_i - 1)} - 1    (5.63)
In the formula, c_i is the number of concordant pairs between the produced ranked list and the ground-truth list, and n_i is the length of the two lists (Li, 2011). The aforementioned LAW-Webgraph software package includes an implementation of this metric.
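Equation 5.63 transcribes directly to code when both lists rank the same items without ties. The helper below is a toy sketch under that assumption, not the LAW-Webgraph implementation:

```python
def kendall_tau(ranked, ground_truth):
    # Equation 5.63: tau = 2*c_i / (n_i*(n_i - 1)/2) - 1, where c_i counts
    # the item pairs ordered the same way in both lists
    n = len(ranked)
    pos = {item: i for i, item in enumerate(ground_truth)}
    c = sum(1 for i in range(n) for j in range(i + 1, n)
            if pos[ranked[i]] < pos[ranked[j]])
    return 2.0 * c / (n * (n - 1) / 2.0) - 1.0
```

Identical orderings give tau = 1, and fully reversed orderings give tau = -1.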
We can also assess the level of correlation between two ranked lists using Spearman's Correlation (i.e., Spearman's ρ), according to the formula below:

\rho = 1 - \frac{6 \sum_{i=1}^{n} (x_i - y_i)^2}{n^3 - n}    (5.64)
In the formula, x1, ..., xn and y1, ..., yn are the two rankings of n objects (Best & Roberts, 1975). This metric was computed via its implementation in the R-Project1, an open-source statistical software package that includes various mathematical and statistical techniques and is also suitable for large amounts of data. Both Kendall's Tau and Spearman's Correlation measure the strength of the association between two ranked lists (Cha et al., 2010). The correlation ranges between [−1, 1] and, hence, if it is close to −1, one can determine that the variables are negatively correlated, whereas if it is close to +1 they are positively correlated.
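Equation 5.64 is likewise a one-liner when both inputs are rank vectors without ties (a toy sketch, not R's implementation):

```python
def spearman_rho(x_ranks, y_ranks):
    # Equation 5.64: rho = 1 - 6 * sum(d_i^2) / (n^3 - n), with d_i = x_i - y_i
    n = len(x_ranks)
    d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1.0 - 6.0 * d2 / (n ** 3 - n)
```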
In order to measure the accuracy of the prediction models, we used the normalized root-mean-squared error (NRMSE) metric between our predictions and the true values, which is given by the formula:

NRMSE = \frac{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{1,i} - x_{2,i})^2}}{x_{max} - x_{min}}    (5.65)
The average absolute error, i.e., the average of the difference between the inferred (predicted) value and the actual value, was also used, and was especially relevant for assessing the quality of the predictions of download counts.
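Both error measures can be sketched as follows (hypothetical helper names; the NRMSE is normalized by the range of the actual values, per Equation 5.65):

```python
import math

def nrmse(predicted, actual):
    # Equation 5.65: RMSE of the predictions, normalized by the range
    # of the actual values
    n = len(actual)
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n
    return math.sqrt(mse) / (max(actual) - min(actual))

def mean_absolute_error(predicted, actual):
    # average absolute difference between predicted and actual values
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```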
1http://www.r-project.org/
Figure 5.15: Degree distribution for the DBLP dataset from 2008 to 2011.
5.3 The Obtained Results
This section exhibits the results obtained from the various conducted experiments, alongside their discussion. First, the results from the experiments for finding influencers in FourSquare and Twitter, as well as in the DBLP citation network, are presented and discussed, where we assess the quality of these results and whether the top-10 highest ranked lists of individuals and spots produced by the different algorithms really correspond to the top-10 influencers and influential spots in the network. The results for the experiment of predicting future PageRank scores and download counts are then presented, alongside their discussion, where we compare the output of the different evaluation metrics computed for the different groups of features, in order to understand whether the task of predicting future PageRank scores and future download counts could be successfully accomplished with the framework that was developed.
5.3.1 Finding Influencers
In the following sections, the results of the computation of the PageRank, HITS and IP algorithms for the FourSquare and Twitter datasets are presented, as well as the results of the computation of the PageRank algorithm for the DBLP dataset. While the first two datasets comprise the top-10 highest ranked users and the top-10 highest ranked spots in the network, the results from DBLP highlight solely the most influential papers in the DBLP digital library dataset.
We begin by exposing and discussing the results from the experiments with, respectively, the FourSquare and Twitter datasets; we then present and discuss the influence estimation for the DBLP dataset, closing this section with the results from the experiment on future PageRank scores and download counts.
In order to identify the most influential users and spots in the FourSquare and Twitter datasets, average anonymous users and spots (e.g., streets) are identified, respectively, by Person-XXXX and Spot-YY:ZZ, where XXXX corresponds to the real user id and YY and ZZ correspond to the latitude and the longitude associated with that spot id in the network, while publicly well-known companies, locations/venues and people are identified by their real names, e.g., Ellen DeGeneres for users and Dunkin' Donuts for spots.
5.3.1.1 Location-based social networks: FourSquare & Twitter
From the user influence scores for PageRank and HITS algorithm depicted in Table 5.4, one can ac-
knowledge that the addition of spots to the network reveals well-known influentials, such as worldwide
celebrities, TV channels or magazines.
PageRank (Name Friends Likes) | HITS - Authority (Name Friends Likes) | HITS - Hub (Name Friends Likes)
TimeOut NY — 122,172 | ZAGAT — 328,189 | ZAGAT — 328,189
Lucky Mag. — 164,323 | TimeOut NY — 122,172 | MTV — 731,067
ZAGAT — 328,189 | MTV — 731,067 | Bravo TV — 375,363
NYPL — 61,132 | Bravo TV — 375,363 | History Chnl — 541,847
MTV — 731,067 | History Chnl — 541,847 | The NY Times — 367,008
Person-12935563 956 20 | Starbucks — 929,915 | Starbucks — 929,915
Bravo TV — 375,363 | The NY Times — 367,008 | VH1 — 380,987
Person-1478079 981 96 | Lucky Mag. — 164,323 | People Mag. — 372,008
NYC Parks — 17,429 | VH1 — 380,987 | TimeOut NY — 122,172
History Chnl — 541,847 | NYPL — 61,132 | The WSJ — 227,894
Table 5.4: User influence scores for the PageRank and HITS algorithms, for the User+Spot Graph, built from the FourSquare dataset.
Meanwhile, with the User Graph, as depicted in Table 5.5, the average users of social platforms are distinguished both in the PageRank and in the HITS algorithm, the latter when ordered by hub scores. In this case, average users are highlighted through their great number of mayorships, checkins, tips about locations and friends. Mostly through their outlinks, they become network users that other users want to follow and listen to.
PageRank (Name Friends Likes) | HITS - Authority (Name Friends Likes) | HITS - Hub (Name Friends Likes)
Person-11890308 794 84 | ZAGAT — 328,189 | Person-2630685 110 817
Person-449480 1,000 374 | MTV — 731,067 | Person-1127366 39 749
Person-1544684 987 144 | Bravo TV — 375,363 | Person-4148169 77 899
Person-619656 823 8 | History Chnl — 541,847 | Person-634270 216 755
Person-4071912 1,004 860 | Starbucks — 929,915 | Person-42695 128 775
NYCHA 807 59 | The NY Times — 367,000 | Person-1011520 39 723
Person-6935835 990 275 | VH1 — 380,987 | Person-3231666 14 713
Person-6004767 958 319 | Ellen DeGeneres — 457,155 | Person-7991820 3 767
Person-10934560 1,001 64 | TimeOut NY — 122,172 | Person-3290360 62 632
Person-10554269 985 4 | People Mag. — 372,008 | Person-6483868 95 765
Table 5.5: User influence scores for the PageRank and HITS algorithms, for the User Graph, built from the FourSquare dataset.
When the location-based network was reshaped to connect only the users that have visited at least one location in common, for the IP algorithm, the average FourSquare user is again distinguished, due to a combination of factors that includes a great number of mayorships, checkins, tips about locations and friends, as one can acknowledge from Table 5.6.
In brief, the fact that worldwide TV channels, magazines and celebrities are highlighted in a network that contains both users and spots reveals a strict connection between these well-known influentials and the spots, through a continuous activity that is intended to gather and retain their followers. When these ties are removed, the connections between real users prevail.
Name Friends Likes
Person-9797197 52 10
Person-9726342 5 —
Person-9615360 25 9
Person-9578554 34 —
Person-9553862 4 —
Person-9450025 47 7
Person-9264407 43 —
Person-8956766 28 —
Person-8916830 47 4
Person-884020 95 32
Table 5.6: User influence scores for the IP algorithm, built from the FourSquare dataset.
As for the most influential spots in the FourSquare dataset, the top-10 highest ranked spots resulting from the computation of both the PageRank and HITS algorithms, whether sorted by authority or by hub score, were the same. Focusing on the types of spots that were highlighted, they mainly include bars, boardwalks and other spots near the New York coastline, due to the fact that the data collection was done during August and early September of 2012.
Name Checkins
Tattoo Shot Lounge 227
Dunkin' Donuts 970
Gargiulo's Restaurant 697
The Freak Bar 540
Ruby's Bar & Grill 2,025
Coney Island Beach & Boardwalk 36,206
Cha Cha's 1,142
Denny's Delight 84
Coney Island Sound 280
Coney Island Polar Bear Club 85
Table 5.7: Spot influence scores for the PageRank and HITS algorithms (which present the exact same top-10), for the User+Spot Graph, built from the FourSquare dataset.
When finding influencers in the Twitter dataset, one must acknowledge that users tweet wherever they are, be it at home, while waiting for a doctor's appointment, etc. Therefore, many of the locations that we could identify are not necessarily venues, i.e., the geographic coordinates associated with a tweet may point to a street or avenue, and not to a theater, museum or restaurant as happened in the FourSquare experiment. Nevertheless, this is only due to the inner characteristics of the Twitter social network, which is content- and user-centered and not location-centered like FourSquare. Because social networks have a dynamic behaviour, i.e., they can change over time with the addition or loss of users and relationship ties, the third highest ranked user for HITS - Authority in Tables 5.8 and 5.9 had a profile on Twitter and was active during our crawl, between July and August of 2012, but no longer has a Twitter profile, and is thus marked with a * after the user id.
In the case of the Twitter dataset, the results from the computation of the IP algorithm are not presented,
because they were not coherent and not remotely comparable with the ones obtained for FourSquare.
From Table 5.8, we can observe that the HITS algorithm, with influence sorted either by authority or by
hub score, reveals Twitter users who are well known to the public and who exert significant influence due
to their roles in society, e.g., as entrepreneurs, journalists or actors. Also, due to their professional
activity and media exposure, one can say that they shape conversations: they are the users other
network users want to listen to. Conversely, the top-10 generated by the PageRank algorithm highlights
friendship ties among users who are anonymous to the public.
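The authority/hub distinction discussed above can be illustrated with a minimal HITS implementation on a toy follower graph. This is a sketch of the standard algorithm, not the thesis's actual code, and the user names are invented: an edge u → v means u follows v, so accounts followed by many good hubs accumulate authority, and accounts that follow many good authorities accumulate hub score.

```python
# Minimal HITS sketch on a toy directed "follows" graph (hypothetical users).

def hits(edges, iterations=50):
    nodes = {n for e in edges for n in e}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority score: sum of hub scores of the followers
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # hub score: sum of authority scores of the accounts followed
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return auth, hub

edges = [("a", "journalist"), ("b", "journalist"), ("c", "journalist"),
         ("a", "celebrity"), ("b", "celebrity")]
auth, hub = hits(edges)
print(max(auth, key=auth.get))  # "journalist" — followed by all three hubs
```

The two normalization steps keep the scores bounded; after convergence, sorting by `auth` or by `hub` yields the two rankings compared in the tables.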
Regarding the User Graph, we can see that the output of the HITS and PageRank algorithms, depicted in
Table 5.9, is exactly the same as for the User+Spot Graph. This reinforces the fact that, in this particular
dataset, there is a greater number of relationships among users than between users and locations, so
when the location ties are disregarded the strong ties between users naturally prevail. One can also
see from Tables 5.8 and 5.9 that, yet again, the total number of followers and friends is not necessarily
correlated with influence on Twitter.
PageRank
Name | Followers | Following
Person-67779865 | 45,702 | 41,870
J. K. Pulver | 469,092 | 38,542
JobsDirectUSA.com | 17,075 | 18,782
Person-479562736 | 16,703 | 16,241
America Hires | 11,824 | 13,006
Person-52306188 | 9,989 | 9,878
Person-35844123 | 10,030 | 9,761
Person-24883913 | 11,191 | 9,583
Person-213105865 | 8,531 | 9,965
Person-30735143 | 7,837 | 8,513

HITS - Authority
Name | Followers | Following
J. Wortham | 463,772 | 3,424
J. K. Pulver | 469,092 | 38,542
Person-325410549* | — | —
B. Thurston | 124,722 | 5,707
StumbleUpon | 72,133 | 10,370
DL Hughley | 73,835 | 886
J. Rampton | 47,593 | 578
Person-51560438 | 103,721 | 14,766
Person-67779865 | 45,699 | 41,868
Person-1536651 | 34,216 | 456

HITS - Hub
Name | Followers | Following
J. Lupton | 301,965 | 276,780
NOH8 Campaign | 426,079 | 251,158
Person-25915690 | 595,404 | 192,241
M. Allen | 144,540 | 55,678
Person-203455506 | 188,527 | 41,190
NY Daily News | 85,821 | 10,681
Person-18704291 | 19,212 | 21,098
J. Calacanis | 151,155 | 112,248
92YTribeca | 13,015 | 10,560
C.C. Chapman | 34,512 | 28,505
Table 5.8: User influence scores for PageRank and HITS algorithms, for the User+Spot Graph, built from the Twitter dataset.
PageRank
Name | Followers | Following
Person-67779865 | 45,702 | 41,870
J. K. Pulver | 469,092 | 38,542
JobsDirectUSA.com | 17,075 | 18,782
Person-479562736 | 16,703 | 16,241
America Hires | 11,824 | 13,006
Person-52306188 | 9,989 | 9,878
Person-35844123 | 10,030 | 9,761
Person-24883913 | 11,191 | 9,583
Person-213105865 | 8,531 | 9,965
Person-30735143 | 7,837 | 8,513

HITS - Authority
Name | Followers | Following
J. Wortham | 463,772 | 3,424
J. K. Pulver | 469,092 | 38,542
Person-325410549* | — | —
B. Thurston | 124,722 | 5,707
StumbleUpon | 72,133 | 10,370
DL Hughley | 73,835 | 886
J. Rampton | 47,593 | 578
Person-51560438 | 103,721 | 14,766
Person-67779865 | 45,699 | 41,868
Person-1536651 | 34,216 | 456

HITS - Hub
Name | Followers | Following
J. Lupton | 301,965 | 276,780
NOH8 Campaign | 426,079 | 251,158
Person-25915690 | 595,404 | 192,241
M. Allen | 144,540 | 55,678
Person-203455506 | 188,527 | 41,190
NY Daily News | 85,821 | 10,681
Person-18704291 | 19,212 | 21,098
J. Calacanis | 151,155 | 112,248
92YTribeca | 13,015 | 10,560
C.C. Chapman | 34,512 | 28,505
Table 5.9: User influence scores for PageRank and HITS algorithms, for the User Graph, built from the Twitter dataset.
As one can observe from Table 5.10, the great majority of the top-10 highest ranked spots are not venues
per se: the geographical locations associated with these tweets correspond to streets or avenues, due
to the use of Twitter in various mobile applications. Nevertheless, some well-known spots like Times
Square and JFK Airport are naturally highlighted. One can also acknowledge that, in this particular case,
the spots with the greatest number of checkins turn out to be the most influential spots in the dataset.
PageRank
Name | Checkins
Broadway - Times Square | 4
JFK Airport | 2
JFK Airport (Subway Station) | 1
Spot40.80567362:-73.91862858 | 1
Spot40.66931554:-74.20359207 | 1
Spot40.73262798:-73.98359375 | 1
Rosa Mexicano (Restaurant) | 1
The Abyssinian Baptist Church | 1
St Luke’s School | 1
Spot40.742727:-73.994372 | 1

HITS - Authority
Name | Checkins
Pace University | 8
Spot40.679254:-73.8632521 | 1
Spot40.67982674:-73.86344992 | 1
Spot40.6792906:-73.8622276 | 1
Park Lane Hotel | 1
Astoria Bowl | 1
Spot40.7166368:-73.9543937 | 1
Columbus Circle | 1
Spot40.86745661:-74.12978901 | 1
Spot40.89064994:-73.89948689 | 1

HITS - Hub
Name | Checkins
Spot40.71498749:-73.95485289 | 2
Spot40.7827699:-73.95211752 | 1
Spot40.76619859:-73.91322359 | 1
Skin Magic Ltd | 1
Spot40.76614592:-73.91323331 | 1
Spot40.76616717:-73.91319381 | 1
Broadway - Times Square | 1
Spot40.75612638:-73.90477465 | 1
Spot40.76113205:-73.97952078 | 1
JFK Airport | 1
Table 5.10: Spot influence scores for PageRank and HITS algorithms, for the User+Spot Graph, built from the Twitter dataset.
5.3.1.2 Academic social network: DBLP
Table 5.11 presents the top-10 highest ranked papers from the citation network built upon DBLP data,
where recipients of scientific awards are highlighted in bold. From this table one can acknowledge that
the top-10 remained unaltered for scientific papers published until 2010 and until 2011, and that the
majority of these publications are authored by recipients of one or more of the renowned awards from
the list in Appendix A.
Focusing on the titles of these scientific papers, one can also verify that this top-10 comprises publications
that can be considered breakthroughs in a specific research area, e.g., Gerard Salton’s pioneering work in
information retrieval, or inevitable textbook references, e.g., Cormen et al.’s Introduction to Algorithms.
Moreover, even when the authors are not recipients of renowned scientific awards, the fact that they
collaborate with many other authors leads them to be cited in a greater number of publications, reinforcing
their PageRank score.
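The citation ranking described above can be sketched with a minimal PageRank implementation on a toy citation graph. The paper identifiers are invented and this is the standard power-iteration formulation, not the thesis's actual implementation: an edge u → v means paper u cites paper v, so heavily cited papers (and papers cited by important papers) accumulate rank.

```python
# Minimal PageRank sketch on a toy citation graph (hypothetical paper IDs).

def pagerank(edges, damping=0.85, iterations=100):
    nodes = {n for e in edges for n in e}
    out_degree = {n: sum(1 for u, _ in edges if u == n) for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            # rank flows in from each citing paper, split over its references
            inflow = sum(rank[u] / out_degree[u] for u, v in edges if v == n)
            new[n] = (1 - damping) / len(nodes) + damping * inflow
        rank = new
    return rank

edges = [("p1", "classic"), ("p2", "classic"), ("p3", "classic"), ("p3", "p2")]
rank = pagerank(edges)
print(max(rank, key=rank.get))  # "classic" — the most-cited paper
```

Note that "classic" cites nothing (a dangling node); in this simple sketch its outgoing mass is simply dropped, which does not affect the ordering on this toy example.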
Paper | Authors | PageRank 2010 | PageRank 2011
A Unified Approach to Functional Dependencies and Relations | Philip A. Bernstein, J. Richard Swenson, Dennis Tsichritzis | 0.000903919 | 0.000903646
On the Semantics of the Relational Data Model | Hans Albrecht Schmid, J. Richard Swenson | 0.000891394 | 0.000891123
Database Abstractions: Aggregation and Generalization | John Miles Smith, Diane C. P. Smith | 0.000860181 | 0.00085993
Smalltalk-80: The Language and Its Implementation | Adele Goldberg, David Robson | 0.000763314 | 0.000763174
A Characterization of Ten Hidden-Surface Algorithms | Ivan E. Sutherland, Robert F. Sproull, Robert A. Schumacker | 0.000716136 | 0.000716507
An algorithm for hidden line elimination | R. Galimberti | 0.000706674 | 0.000707118
Introduction to Modern Information Retrieval | Gerard Salton, Michael McGill | 0.000699671 | 0.000699584
C4.5: Programs for Machine Learning | J. Ross Quinlan | 0.000635416 | 0.000636705
Introduction to Algorithms | Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest | 0.000592198 | 0.000592414
Compilers: Principles, Techniques, and Tools | Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman | 0.000528325 | 0.000528235
Table 5.11: PageRank scores for top-10 highest ranked papers of the DBLP dataset.
5.3.2 Predicting Future PageRank Scores and Download Counts
In this section, the experiment regarding the prediction of future influence scores and future download
counts is detailed and thoroughly discussed. For clarity, we call the model that includes the age of each
article the age model, and the model that additionally includes the term frequencies of the 100 most
frequent words in the abstract and title of each paper the text model - see Table 5.12.
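How the two feature sets could be assembled for one paper is sketched below. The field names and helper functions are illustrative, not the thesis's actual data schema: the age model takes the paper's previous PageRank scores plus its age, and the text model appends term frequencies over a given vocabulary of frequent words.

```python
# Illustrative feature assembly for the age model and the text model.
from collections import Counter

def age_features(paper, ranks_by_year, year, k=3):
    # PageRank scores of the k previous years, plus the paper's age
    prev = [ranks_by_year[year - i].get(paper["id"], 0.0) for i in range(1, k + 1)]
    return prev + [year - paper["year"]]

def text_features(paper, ranks_by_year, year, vocabulary, k=3):
    # age-model features plus term frequencies of the vocabulary words
    # in the paper's title and abstract
    counts = Counter((paper["title"] + " " + paper["abstract"]).lower().split())
    return age_features(paper, ranks_by_year, year, k) + [counts[w] for w in vocabulary]

paper = {"id": "x", "year": 2005, "title": "Graph ranking",
         "abstract": "ranking graphs by citations"}
ranks_by_year = {2008: {"x": 0.1}, 2009: {"x": 0.2}}
print(age_features(paper, ranks_by_year, 2010, k=2))                # [0.2, 0.1, 5]
print(text_features(paper, ranks_by_year, 2010, ["ranking"], k=2))  # [0.2, 0.1, 5, 2]
```

In the thesis's experiments these feature vectors feed an ensemble regressor (IGBRT); the vocabulary would be the 100 most frequent words over all titles and abstracts.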
From Table 5.12, and considering the experiment of predicting the PageRank scores for the year of
2010, both models provided very similar results, both improving as we added more information, i.e.,
when comparing the three groups of features (PageRank scores; PageRank scores with Racer scores; and
PageRank scores with Racer scores plus the author’s average and maximum PageRank scores)
and also within each group, the quality of the results improves
consistently. Only for the set of features that combines the PageRank score of one previous year with its
respective Racer score and the author’s average and maximum PageRank scores is the age model
outperformed by the text model. Comparing the error rates for the same year, one can see that, for both
models, the error rate increases as we add more information, causing the results to deviate.
Nevertheless, for the first two groups of features, the text model has a lower error rate than the age
model, while the opposite happens for the third group of features.
Having computed the absolute error for all groups of features in both models, the results show that,
on average, the text model always has a lower absolute error than the age model.
Features | ρ (2010) | τ (2010) | NRMSE (2010) | ρ (2011) | τ (2011) | NRMSE (2011)

Age model
Rank k = 1 | 0.9725065 | 0.9163994 | 0.0003224 | 0.9929880 | 0.9837121 | 0.0001057
Rank k = 2 | 0.9836493 | 0.9381865 | 0.0006161 | 0.9999050 | 0.9994758 | 0.0000995
Rank k = 3 | 0.9890716 | 0.9506366 | 0.0006391 | 0.9999002 | 0.9993787 | 0.0004768
Racer + Rank k = 1 | 0.9724540 | 0.9173649 | 0.0003469 | 0.9998887 | 0.9994037 | 0.0002322
Racer + Rank k = 2 | 0.9837098 | 0.9387564 | 0.0006520 | 0.9999004 | 0.9992955 | 0.0001634
Racer + Rank k = 3 | 0.9888725 | 0.9493687 | 0.0006605 | 0.9952435 | 0.9866206 | 0.0005492
A + R + Rank k = 1 | 0.9675213 | 0.9098510 | 0.0005354 | 0.9998529 | 0.9994497 | 0.0002530
A + R + Rank k = 2 | 0.9840530 | 0.9355465 | 0.0008336 | 0.9998353 | 0.9993422 | 0.0002962
A + R + Rank k = 3 | 0.9892456 | 0.9468673 | 0.0006986 | 0.9938021 | 0.9828511 | 0.0005317

Text model
Rank k = 1 | 0.9708719 | 0.9101722 | 0.0003608 | 0.9992124 | 0.9979693 | 0.0002479
Rank k = 2 | 0.9831039 | 0.9310399 | 0.0006268 | 0.9997962 | 0.9992362 | 0.0004543
Rank k = 3 | 0.9886945 | 0.9451537 | 0.0006276 | 0.9995012 | 0.9983375 | 0.0005800
Racer + Rank k = 1 | 0.9711170 | 0.9098901 | 0.0005515 | 0.9994290 | 0.9984499 | 0.0001590
Racer + Rank k = 2 | 0.9832037 | 0.9314405 | 0.0006747 | 0.9997300 | 0.9990720 | 0.0001919
Racer + Rank k = 3 | 0.9887959 | 0.9470102 | 0.0006667 | 0.9994104 | 0.9980729 | 0.0006416
A + R + Rank k = 1 | 0.9705230 | 0.9984499 | 0.0001590 | 0.9997019 | 0.9990583 | 0.0002480
A + R + Rank k = 2 | 0.9837012 | 0.9990720 | 0.0001919 | 0.9998617 | 0.9993443 | 0.0002800
A + R + Rank k = 3 | 0.9888386 | 0.9980729 | 0.0006416 | 0.9998793 | 0.9993885 | 0.0006987
Table 5.12: Results for the prediction of impact PageRank scores for papers in the DBLP dataset.
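The three evaluation measures reported in these tables can be computed as follows. This is a minimal pure-Python sketch: Spearman's ρ and Kendall's τ are given in their tie-free forms, and the NRMSE is normalized here by the range of the true values, one common convention (the thesis may use another normalization).

```python
# Evaluation measures: Spearman's rho, Kendall's tau, normalized RMSE.

def _ranks(values):
    # 1-based ranks, no tie handling (for simplicity)
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0
    return ranks

def spearman_rho(x, y):
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)  # concordant minus discordant
    return 2.0 * s / (n * (n - 1))

def nrmse(pred, true):
    mse = sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)
    return mse ** 0.5 / (max(true) - min(true))

true = [0.4, 0.1, 0.3, 0.2]
pred = [0.38, 0.12, 0.31, 0.18]
print(spearman_rho(pred, true), kendall_tau(pred, true))  # 1.0 1.0
```

Here the predictions preserve the true ordering exactly, so both rank correlations are 1.0 even though the NRMSE is nonzero; this is why the tables report both correlation and error measures.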
For the year of 2011, as we add more information to the models, the text model outperforms the age
model, as shown in the last two sets of features from the third group. Also, in the scenario in which
the models only have information about the immediately previous PageRank score, the age model is
again outperformed by the text model. Nevertheless, when considering the error rates of both models for
this year, the text model has an overall higher error rate than the age model, showing that, even though
the quality of the predicted results is lower for the age model, its results are more accurate.
As with the computation of the absolute error for the year 2010, for all groups of features in both
models, the results for the year of 2011 show that, on average, the text model has a lower absolute error
than the age model.
Regarding the prediction of download counts, depicted in Table 5.13, one can acknowledge that using the
text model increases the quality of our results. In the age model, we can verify that adding information
about the Racer scores to the previous PageRank scores affects the results negatively, while combining
previous PageRank scores with Racer scores and the author’s average and maximum PageRank scores
provides better results with a lower error rate. From this we can conclude that the age model provides a
more accurate prediction as it becomes more complete. The opposite happens in all groups of the text
model, i.e., as we add more information to the model within the same group, the quality of the results
decreases, even though they remain far better than the corresponding results in the age model.
We can also verify that the age model, for the groups of features that only include previous PageRank
scores, and for the ones that combine previous PageRank scores with Racer scores and the author’s
average and maximum PageRank scores, has a lower error rate than the corresponding groups in the text
model. Even though the text model has better overall results, its error rate is greater than that of the
age model for download count prediction.
As for the absolute error, the results showed that, generally, the text model has a lower absolute error
than the age model in all groups except the third.
Features | ρ | τ | NRMSE

Age model
Rank k = 1 | 0.3864814 | 0.2742998 | 0.0080585
Rank k = 2 | 0.4221492 | 0.3001470 | 0.0029377
Rank k = 3 | 0.4323201 | 0.3080974 | 0.0028074
Racer + Rank k = 1 | 0.4396605 | 0.3076576 | 0.0076713
Racer + Rank k = 2 | 0.3370149 | 0.4747241 | 0.0078403
Racer + Rank k = 3 | 0.3313412 | 0.4612442 | 0.0088301
A + R + Rank k = 1 | 0.3377553 | 0.2558403 | 0.0147155
A + R + Rank k = 2 | 0.5335481 | 0.3894899 | 0.0088093
A + R + Rank k = 3 | 0.5406937 | 0.3962472 | 0.0078576

Text model
Rank k = 1 | 0.5250188 | 0.3837016 | 0.0086955
Rank k = 2 | 0.5261168 | 0.3849615 | 0.0087775
Rank k = 3 | 0.5060003 | 0.3674801 | 0.0091976
Racer + Rank k = 1 | 0.5325432 | 0.3887987 | 0.0085328
Racer + Rank k = 2 | 0.5224018 | 0.3822982 | 0.0089440
Racer + Rank k = 3 | 0.5087407 | 0.3703400 | 0.0091979
A + R + Rank k = 1 | 0.5709764 | 0.4234845 | 0.0076071
A + R + Rank k = 2 | 0.5651282 | 0.4180070 | 0.0079000
A + R + Rank k = 3 | 0.5608946 | 0.4148554 | 0.0088935
Table 5.13: Results for the prediction of download numbers for papers in the DBLP dataset.
In brief, from the results in Tables 5.12 and 5.13, we can acknowledge that predicting the number of
downloads is a harder task than predicting future PageRank scores. We can also see that, when
predicting future PageRank scores, the more information is added to the model, the more the results
deviate; the opposite happens when predicting the number of downloads.
Comparing the years of 2010 and 2011, we can acknowledge that predicting the PageRank scores of a
more recent year is easier than predicting those of a progressively more distant year.
5.4 Summary
In this chapter I presented and discussed the results obtained from the experiments of finding influ-
encers in FourSquare and Twitter, as well as in the DBLP citation network, and from the experiments for
predicting future PageRank scores and future download counts for scientific papers downloaded from
the ACM Digital Library.
Regarding location-based social networks, one can acknowledge that, most of the time, the most influ-
ential users in a network are not the ones with the most followers. From the results one can see that,
in the User Graph, the relationships between users unknown to the public prevail, while TV channels,
celebrities and worldwide magazines are highlighted, and thus among the most influential users, in the
User+Spot Graph.
As for the experiment with the DBLP citation network, results have shown that the proposed frame-
work, based on an ensemble regression model, offers highly accurate predictions, providing an effective
mechanism to support the future ranking of papers in academic digital libraries.
Chapter 6
Conclusions
In my MSc thesis I proposed to explore the task of finding influential users in a social network, with
the aid of network analysis techniques and algorithms. As I intended to perform experiments with
different types of social networks, I began by collecting real and up-to-date data from both FourSquare
and Twitter, in order to build two distinct social networks based on location, and gathered a dataset from
the DBLP digital library, already structured in the context of the Arnetminer project, so an academic
citation network could be built.
Influence was then estimated through the computation of state-of-the-art ranking algorithms, such as
PageRank, HITS and IP. In the particular case of the IP algorithm, and concerning location-based social
networks, we wanted to estimate user influence exclusively; thus, instead of building a network with
user-user and user-location ties, the original implementation of the IP algorithm was adapted so that the
resulting network graph consisted solely of weighted user-user ties.
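The weighted user-user graph construction described above (edge weights from co-visited spots, as detailed in Section 6.1) can be sketched as follows. The user and spot names are invented and this is only an illustration of the construction step, not the thesis's actual code:

```python
# Sketch: user-user edges weighted by the number of spots both users visited.
from itertools import combinations

def common_spot_edges(visits):
    """visits: dict mapping user -> set of visited spots.
    Returns weighted user-user edges as {(u, v): shared_spot_count}."""
    edges = {}
    for u, v in combinations(sorted(visits), 2):
        shared = len(visits[u] & visits[v])
        if shared:  # only keep pairs with at least one spot in common
            edges[(u, v)] = shared
    return edges

visits = {"ana": {"bar", "beach", "museum"},
          "bruno": {"bar", "beach"},
          "carla": {"museum"}}
print(common_spot_edges(visits))
# {('ana', 'bruno'): 2, ('ana', 'carla'): 1}
```

The resulting weighted graph contains only user-user ties, which is the input the adapted IP algorithm would then iterate over.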
Regarding the academic citation network, besides an influence estimation for all the papers in the
dataset, we also addressed a recent research topic and developed a framework to predict the future
influence scores of scientific papers and the future download counts of papers downloaded from the
ACM digital library for a specific year, based on the previous years’ influence scores. In this experiment
we could test and combine different sets of features, resulting in two different models for the prediction
of future influence scores: (1) a model including the age of the paper, and (2) a model including the 100
most frequent words in all papers’ titles and abstracts.
Rank aggregation was also part of the initial objectives of this work, in order to combine the outputs of
the different algorithms. Nonetheless, due to some difficulties with the completion of the remaining tasks
included in the MSc thesis work, this task could not be addressed in time.
With the results of our experiments we could perform a detailed characterization of the aforementioned
social networks, and verify that social network analysis techniques can be used to assess the most in-
fluential nodes of a network. As for the prediction of future influence scores, we can conclude that the
framework that was developed for academic citation networks provides reliable and accurate estima-
tions, very close to the real values.
A major limitation of this work resides in the evaluation of the results regarding location-based networks.
Unlike academic social networks, where one can assess the validity of the most influential authors or
the most influential articles through an extensive list of renowned scientific awards that have earned
prestige throughout the years, social network analysis and, more specifically, location-based networks
form a recent area of study in which one does not yet have a list of characteristics that indicates
without flaws that a user or a spot is influential, or a series of public prizes awarded to people, companies
or spots for their relevance and influence in a specific context. Therefore, this evaluation had to be done
by comparison to well-known state-of-the-art social network analysis metrics. Also, social networks are
dynamic, so the set of users or spots that can be considered influential or trendy today might be different
if we make the same estimation, under the same conditions, in a couple of months or a year.
6.1 Summary of Results
In brief, the following are the most important contributions of my MSc thesis, ordered by relevance:
Crawling software
I implemented crawlers to extract data from FourSquare and from Twitter, using their respective
APIs. From the data that was collected I built two location-based networks, from which I extracted
their most influential nodes. The source code for the FourSquare crawler was made available as an
open-source project1, so it can be re-used by others researching this topic.
Implementation and adaptation of the Influence-Passivity (IP) algorithm
Having conducted a thorough study of ranking algorithms, with special focus on the PageRank
algorithm and its variants, I implemented the Influence-Passivity (IP) algorithm. The originality of
this implementation of IP resides in the fact that the network is built in such a way that it only con-
tains user-user arcs, and the weights assigned to each edge depend on the number of spots that
the two users have visited in common. This adaptation of IP reflects the fact that in location-based
networks information spreads differently than in typical social networks. The code for the
implementation of the IP algorithm was made available as an open-source project2, so it can be
used and improved by others researching this topic.
1 http://code.google.com/p/fscrawler/
2 http://code.google.com/p/ezgraph/
Academic Citation Network
From the already structured DBLP data, organized in the context of the Arnetminer project,
I built an academic citation network and extracted its most influential papers through the
computation of the PageRank algorithm. The results were validated against an extensive list of
renowned scientific awards, leading to the conclusion that the majority of the top-10 highest ranked
papers in the network are either authored by recipients of the aforementioned awards, represent
breakthroughs or unquestionable textbooks on a specific topic, or are authored by scientists who
have collaborated and co-authored with a great number of other scientists.
Framework for prediction of future PageRank scores and future download counts
I developed a framework to predict the future PageRank score and the future download counts of a
scientific paper for a specific year, using the academic citation network mentioned in the previous
item.
This task was addressed through an ensemble learning regression algorithm, the IGBRT. I also as-
sessed the impact that different features, and combinations of features such as previous PageRank
scores or the age of the paper, have on the accuracy of the results. Our predictions were compared
to the real PageRank scores and the real number of downloads in the ACM Digital Library for each
specific paper and year, and we concluded that, in some cases, depending on the combination of
features used, adding information can negatively deviate the results, while in others, as we combine
more information, the predictions become closer to the real values.
Globally, this approach to future PageRank prediction proved to be accurate, with the predicted
results very close to the real values.
6.2 Future Work
In terms of future work, it would be important to address all the tasks that I initially intended to fulfill,
namely conducting rank aggregation in the aforementioned experiments. It would also be very
interesting to find the most influential users and spots in more complete datasets, which could result in
much richer networks and subsequent analyses.
Taking advantage of the fact that this research area is still in its infancy, we could combine the work of
this MSc thesis with the work of Lima & Musolesi (2012), which adapts well-known local and global social
network analysis metrics, such as degree or clustering coefficient, that are location-agnostic, giving them
a spatial context, e.g., calculating the degree of a node in the network while only considering the friends
of this node that are associated with a specific geographical location, such as a city or a state.
Also, due to the fact that social networks are dynamic networks, i.e., their structure can change over time
with the addition or loss of nodes and relationships, we could integrate state-of-the-art frameworks
and algorithms in order to include the passage of time in the networks we have studied. Even though
dynamic networks have been frequently addressed with regard to network visualization (Demoll & Mcfarland,
2005), works such as that of Berger-Wolf & Saia (2006) break away from conventional network analysis by
proposing a mathematical framework for dynamic network analysis.
On the other hand, we could also extend our work with the implementation of the temporal distance metrics
proposed by Tang et al. (2009), which can be applied to networks that change over time and allow
us to capture the properties of these time-varying graphs, such as the delay, duration and time order of
interactions between nodes.
Bibliography
AGARWAL, N., LIU, H., TANG, L. & YU, P.S. (2008). Identifying the influential bloggers in a community.
In Proceedings of the 2008 International Conference on Web Search and Web Data Mining.
ANAGNOSTOPOULOS, A., KUMAR, R. & MAHDIAN, M. (2008). Influence and correlation in social net-
works. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining.
ANDERSON, L.R. & HOLT, C.A. (1995). Information cascades in the laboratory. American Economic
Review , 87.
ARGUELLO, J., BUTLER, B.S., JOYCE, E., KRAUT, R., LING, K.S., ROSE, C. & WANG, X. (2006). Talk
to me: foundations for successful individual-group interactions in online communities. In Proceedings
of the 2006 SIGCHI Conference on Human Factors in Computing Systems.
BAKSHY, E., HOFMAN, J.M., MASON, W.A. & WATTS, D.J. (2011). Everyone’s an influencer: quantifying
influence on twitter. In Proceedings of the 4th ACM International Conference on Web Search and Data
Mining.
BASTIAN, M., HEYMANN, S. & JACOMY, M. (2009). Gephi: An open source software for exploring and
manipulating networks. In Proceedings of the 3rd International AAAI Conference on Weblogs and
Social Media.
BERBERICH, K., BEDATHUR, S. & WEIKUM, G. (2006). Rank synopses for efficient time travel on the
web graph. In Proceedings of the 15th ACM International Conference on Information and Knowledge
Management .
BERGER-WOLF, T.Y. & SAIA, J. (2006). A framework for analysis of dynamic social networks. In Pro-
ceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.
BEST, D.J. & ROBERTS, D.E. (1975). Algorithm as 89: The upper tail probabilities of spearman’s rho.
Journal of the Royal Statistical Society. Series C (Applied Statistics), 24.
BOLDI, P. & VIGNA, S. (2004). The webgraph framework I: compression techniques. In Proceedings of
the 13th International Conference on World Wide Web.
BOLDI, P., SANTINI, M. & VIGNA, S. (2005). Pagerank as a function of the damping factor. In Proceed-
ings of the 14th International Conference on World Wide Web.
BOLLEN, J., RODRIGUEZ, M.A. & VAN DE SOMPEL, H. (2006). Journal status. Scientometrics, 69.
BOLLEN, J., VAN DE SOMPEL, H., HAGBERG, A. & CHUTE, R. (2009). A principal component analysis
of 39 scientific impact measures. Public Library of Science, 4.
BONACICH, P. (2007). Some unique properties of eigenvector centrality. Social Networks, 29.
BONDY, J.A. & MURTY, U.S.R. (1976). Graph Theory with Applications. Macmillan.
BRAUER, A. (1952). Limits for the characteristic roots of a matrix. IV: Applications to stochastic matrices.
Duke Mathematical Journal , 19.
BREIMAN, L. (2001). Random forests. Machine Learning, 45.
BRIN, S. & PAGE, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proceed-
ings of the 7th International Conference on World Wide Web.
CHA, M., HADDADI, H., BENEVENUTO, F. & GUMMADI, K.P. (2010). Measuring user influence in twitter:
The million follower fallacy. In Proceedings of the 2010 International AAAI Conference on Weblogs
and Social Media.
CHEN, C. (2006). Citespace II: Detecting and visualizing emerging trends and transient patterns in
scientific literature. Journal of the American Society for Information Science, 57.
CHEN, P., XIE, H., MASLOV, S. & REDNER, S. (2007). Finding scientific gems with google’s pagerank
algorithm. Journal of Informetrics, 1.
CLARK, J. & HOLTON, D.A. (1991). A First Look at Graph Theory . World Scientific.
CONITZER, V. (2006a). Computational Aspects of preference aggregation. Ph.D. thesis, Carnegie Mellon
University.
CONITZER, V. (2006b). Computing slater rankings using similarities among candidates. In Proceedings
of the 21st National Conference on Uncertainty in Artificial Intelligence.
CONITZER, V. & SANDHOLM, T. (2005). Common voting rules as maximum likelihood estimators. In
Proceedings of the 2005 National Conference on Uncertainty in Artificial Intelligence.
CORMEN, T.H., LEISERSON, C.E., RIVEST, R.L. & STEIN, C. (2001). Introduction to Algorithms. The
MIT Press, 2nd edn.
DEMOLL, B.S. & MCFARLAND, D. (2005). The Art and Science of Dynamic Network Visualization. Jour-
nal of Social Structure, 7.
DEVEZAS, J., NUNES, S. & RIBEIRO, C. (2011). Using the H-index to Estimate Blog Authority. In Pro-
ceedings of the 5th International AAAI Conference on Weblogs and Social Media.
DIESTEL, R. (2005). Graph Theory , vol. 173. Springer-Verlag, Heidelberg, 3rd edn.
DING, Y. & CRONIN, B. (2011). Popular and/or prestigious? measures of scholarly esteem. Information
Processing and Management , 47.
DING, Y., YAN, E., FRAZHO, A. & CAVERLEE, J. (2009). Pagerank for ranking authors in co-citation
networks. Journal of the American Society for Information Science and Technology , 60.
DUTTON, G. (1996). Improving locational specificity of map data - a multi-resolution, metadata-driven
approach and notation. International Journal of Geographical Information Science, 10.
EASLEY, D. & KLEINBERG, J. (2010). Networks, Crowds, and Markets: Reasoning About a Highly Con-
nected World . Cambridge University Press.
EGGHE, L. (2006). Theory and practise of the g-index. Scientometrics, 69.
EGGHE, L. (2009). Lotkaian informetrics and applications to social networks. The Bulletin of the Belgian
Mathematical Society , 16.
FIALA, D., ROUSSELOT, F. & JEZEK, K. (2008). PageRank for bibliographic networks. Scientometrics,
76.
FRANCK, G. (1999). Essays on Science and Society: Scientific Communication–A Vanity Fair? Science,
286.
FREEMAN, L.C. (1978). Centrality in social networks conceptual clarification. Social Networks, 215.
GELLER, C. (2002). Single transferable vote with Borda elimination: A new vote counting system. Tech.
rep., Deakin University, Faculty of Business and Law, School of Accounting, Economics and Finance.
GHOSH, R., LERMAN, K., SURACHAWALA, T., VOEVODSKI, K. & TENG, S.H. (2011). Non-conservative
diffusion and its application to social network analysis. Arxiv article pre-print.
GIBBONS, A. (1985). Algorithmic Graph Theory . Cambridge University Press.
HAGBERG, A.A., SCHULT, D.A. & SWART, P.J. (2008). Exploring network structure, dynamics, and
function using NetworkX. In Proceedings of the 7th Python in Science Conference.
HARARY, F. (1962). The determinant of the adjacency matrix of a graph. Society for Industrial and Ap-
plied Mathematics, 4.
HAVELIWALA, T.H. (2002). Topic-sensitive pagerank. In Proceedings of the 11th international conference
on World Wide Web.
HEIDEMANN, J., KLIER, M. & PROBST, F. (2010). Identifying key users in online social networks: A
pagerank based approach. In Proceedings of the 31st International Conference on Information Sys-
tems.
HIRSCH, J.E. (2010). An index to quantify an individual’s scientific research output that takes into ac-
count the effect of multiple coauthorship. Scientometrics, 85.
HUBERMAN, B.A., ROMERO, D.M. & WU, F. (2009). Crowdsourcing, attention and productivity. Journal
of Information Science, 35.
JOACHIMS, T. (1999). Making large-scale support vector machine learning practical. In Advances in
Kernel Methods, MIT Press.
JOACHIMS, T. (2002). Learning to classify text using support vector machines. Kluwer, dissertation.
KAISER, M. (2008). Mean clustering coefficients: the role of isolated nodes and leafs on clustering
measures for small-world networks. New Journal of Physics, 10.
KISELEV, V. (2008). On eligibility by the Borda voting rules. International Journal of Game Theory , 37.
KLEINBERG, J.M. (1998). Authoritative sources in a hyperlinked environment. In Proceedings of the 9th
Annual ACM-SIAM Symposium on Discrete Algorithms.
LEAVITT, A., BURCHARD, E., FISHER, D. & GILBERT, S. (2009). The influentials: New approaches for
analyzing influence on twitter. Webecology Project.
LEBANON, G. & LAFFERTY, J.D. (2002). Cranking: Combining rankings using conditional probability
models on permutations. In Proceedings of the 19th International Conference on Machine Learning.
LI, H. (2011). Learning to Rank for Information Retrieval and Natural Language Processing. Morgan &
Claypool Publishers.
LIMA, A. & MUSOLESI, M. (2012). Spatial dissemination metrics for location-based social networks.
In Proceedings of the 4th ACM International Workshop on Location-Based Social Networks (LBSN
2012). Colocated with ACM UbiComp 2012.
LIU, T.Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3.
LIU, X., BOLLEN, J., NELSON, M.L. & VAN DE SOMPEL, H. (2005). Co-authorship networks in the digital library research community. Information Processing and Management, 41.
LOTKA, A.J. (1926). The frequency distribution of scientific productivity. Journal of the Washington
Academy of Science, 16.
LUCIANO, RODRIGUES, F.A., TRAVIESO, G. & BOAS, V.P.R. (2005). Characterization of complex networks: A survey of measurements. Advances in Physics, 56.
MACSKASSY, S.A. & PROVOST, F. (2007). Classification in networked data: A toolkit and a univariate
case study. Journal of Machine Learning Research, 8.
MCPHERSON, M., SMITH-LOVIN, L. & COOK, J.M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27.
MIHALCEA, R. (2004). Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the 2004 Annual Meeting of the Association for Computational Linguistics.
MILLEN, D.R. & PATTERSON, J.F. (2002). Stimulating social engagement in a community network. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work.
MOHAN, A., CHEN, Z. & WEINBERGER, K.Q. (2011). Web-search ranking with initialized gradient boosted regression trees. Journal of Machine Learning Research - Proceedings Track, 14.
NEWMAN, M.E.J. (2003). A measure of betweenness centrality based on random walks. Social Networks, 27.
NEWMAN, M.E.J. (2004). Analysis of weighted networks. Physical Review E, 70.
OLIVER, J.J. & HAND, D.J. (1995). On pruning and averaging decision trees. In Proceedings of the 12th International Conference on Machine Learning, Morgan Kaufmann.
PAGE, L., BRIN, S., MOTWANI, R. & WINOGRAD, T. (1998). The pagerank citation ranking: Bringing
order to the web. In Proceedings of the 7th International World Wide Web Conference.
PAPAGELIS, M., BANSAL, N. & KOUDAS, N. (2009). Information cascades in the blogosphere: A look
behind the curtain. In Proceedings of the 3rd International AAAI Conference on Weblogs and Social
Media.
PERRA, N. & FORTUNATO, S. (2008). Spectral centrality measures in complex networks. Physical Review E, 78.
PROCACCIA, A.D., ZOHAR, A. & ROSENSCHEIN, J.S. (2006). Automated design of voting rules by learning from examples. In Proceedings of the 1st International Workshop on Computational Social Choice.
REKA, A. & BARABASI, A.L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74.
ROMERO, D.M., GALUBA, W., ASUR, S. & HUBERMAN, B.A. (2011). Influence and passivity in social
media. In Proceedings of the 20th International Conference Companion on World Wide Web.
SAYYADI, H. & GETOOR, L. (2009). Futurerank: Ranking scientific articles by predicting their future
pagerank. In Proceedings of the 2009 SIAM International Conference on Data Mining.
SHANNON, P., MARKIEL, A., OZIER, O., BALIGA, N.S., WANG, J.T., RAMAGE, D., AMIN, N., SCHWIKOWSKI, B. & IDEKER, T. (2003). Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research, 13.
SIDIROPOULOS, A. & MANOLOPOULOS, Y. (2005). A citation-based system to assist prize awarding. ACM SIGMOD Record, 34.
SIDIROPOULOS, A., KATSAROS, D. & MANOLOPOULOS, Y. (2007). Generalized hirsch h-index for disclosing latent facts in citation networks. Scientometrics, 72.
SZABO, G. & HUBERMAN, B.A. (2010). Predicting the popularity of online content. Communications of
the ACM, 53.
SZALAY, A.S., GRAY, J., FEKETE, G., KUNSZT, P.Z., KUKOL, P. & THAKAR, A. (2007). Indexing the sphere with the hierarchical triangular mesh. Technical Report.
TANG, J., MUSOLESI, M., MASCOLO, C. & LATORA, V. (2009). Temporal distance metrics for social
network analysis. In Proceedings of the 2nd ACM workshop on Online social networks.
WALKER, D., XIE, H., YAN, K.K. & MASLOV, S. (2007). Ranking scientific publications using a simple
model of network traffic. Journal of Statistical Mechanics.
WATTS, D.J. & DODDS, P.S. (2007). Influentials, networks, and public opinion formation. Journal of
Consumer Research, 34.
WATTS, D.J. & STROGATZ, S.H. (1998). Collective dynamics of ’small-world’ networks. Nature, 393.
WENG, J., LIM, E.P., JIANG, J. & HE, Q. (2010). Twitterrank: finding topic-sensitive influential twitterers.
In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining.
WU, F., WILKINSON, D.M. & HUBERMAN, B.A. (2009). Feedback loops of attention in peer production.
In Proceedings of the 2009 International Conference on Computational Science and Engineering.
XIA, L., LANG, J. & MONNOT, J. (2011). Possible winners when new alternatives join: New results coming up. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems.
XING, W. & GHORBANI, A. (2004). Weighted pagerank algorithm. In Proceedings of the 2004 Annual
Conference on Communication Networks and Services Research.
YAN, E. & DING, Y. (2011). Discovering author impact: A pagerank perspective. Information Processing and Management, 47.
YANG, J. & COUNTS, S. (2010). Predicting the speed, scale, and range of information diffusion in twitter.
In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media.
YOUNG, H.P. (2009). Innovation Diffusion in Heterogeneous Populations: Contagion, Social Influence, and Social Learning. American Economic Review, 99.
ZHANG, C.T. (2009). The e-index, complementing the H-index for excess citations. Public Library of
Science, 4.
ZHENG, Y. & ZHOU, X., eds. (2011). Computing with Spatial Trajectories. Springer.
Appendix A
Important Awards in Computer Science
The following renowned award lists were used as ground-truth lists when assessing the validity of the PageRank scores obtained for the DBLP dataset:
• A. M. Turing Award1
• Knuth Prize2
• IEEE John von Neumann Medal3
• IEEE Emanuel R. Piore Award4
• ACM SIGMOD Edgar F. Codd Innovations Award5
• ACM SIGMOD Best Paper Award6
• ACM SIGMOD Test of Time Award7
• ACM Software System Award8
• ACM Innovation Award9
• National Science Foundation Presidential Young Investigator Award10
1http://amturing.acm.org/
2http://www.sigact.org/Prizes/Knuth/
3http://www.ieee.org/about/awards/medals/vonneumann.html
4http://www.ieee.org/about/awards/tfas/piore.html
5http://www.sigmod.org/sigmod-awards/sigmod-awards#innovations
6http://www.sigmod.org/sigmod-awards/sigmod-awards#bestpaper
7http://www.sigmod.org/sigmod-awards/sigmod-awards#time
8http://awards.acm.org/homepage.cfm?srt=all&awd=149
9http://www.sigkdd.org/awards_innovation.php
10http://www.nsf.gov/awards/presidential.jsp
• SIGIR Gerard Salton Award11
11http://www.sigir.org/awards/awards.html
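The assessment described above — ranking DBLP nodes with PageRank and checking the top-ranked items against ground-truth award lists — can be illustrated with a small toy sketch. The citation graph, the ground-truth set, and the `precision_at_k` helper below are hypothetical illustrations, not data or code from the thesis:

```python
def pagerank(edges, alpha=0.85, iters=50):
    """Plain power-iteration PageRank; an edge (u, v) means 'u cites v'."""
    nodes = {n for edge in edges for n in edge}
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - alpha) / n for v in nodes}
        for u in nodes:
            if out[u]:  # distribute rank over outgoing citations
                share = alpha * rank[u] / len(out[u])
                for v in out[u]:
                    new[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += alpha * rank[u] / n
        rank = new
    return rank

def precision_at_k(ranking, ground_truth, k):
    """Fraction of the top-k ranked items found in the ground-truth set."""
    return len(set(ranking[:k]) & set(ground_truth)) / k

# Toy citation graph: p2 is cited four times, p3 twice.
edges = [("p1", "p2"), ("p3", "p2"), ("p4", "p2"),
         ("p4", "p3"), ("p5", "p2"), ("p5", "p3")]
scores = pagerank(edges)
ranking = sorted(scores, key=scores.get, reverse=True)

# Hypothetical ground truth, e.g., papers by award-winning authors.
ground_truth = {"p2", "p3"}
print(ranking[:2], precision_at_k(ranking, ground_truth, 2))
```

The most-cited paper (p2) ends up ranked first, and both ground-truth papers appear in the top two, so precision at k = 2 is 1.0 in this toy example.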