Finding Influencers in Social Networks
Carolina de Figueiredo Bento
Dissertation submitted to obtain the Master Degree in
Information Systems and Computer Engineering
Jury
President: Prof. Dr. Mário Jorge Costa Gaspar da Silva
Supervisor: Prof. Dr. Bruno Emanuel da Graça Martins
Co-Supervisor: Prof. Dr. Pavel Pereira Calado
Member: Prof. Dr. Alexandre Paulo Lourenço Francisco
November 2012
Abstract
Among the millions of users of social platforms, the activities of a select few are perceived and spread
through the network more rapidly than those of others. These users are the influencers: they generate
trends and shape opinions in social networks, making them crucial in areas such as marketing and
opinion mining.
In my MSc thesis, I studied network analysis methods to identify influencers, experimenting with different
types of networks, namely location-based networks from services like FourSquare or Twitter, that include
relationships between users and between users and the locations they have visited, and academic
citation networks, i.e., networks that relate scientific papers through citations.
Within location-based networks, I estimated the most influential nodes through a set of network analysis
techniques. Since no ground-truth list exists, i.e., a list containing a set of well-known, widely accepted
influencers, the results were validated by comparison to traditional measures (e.g., the number of friends
a user has). The majority of the influencers are not the users with the highest number of friends.
Within academic citation networks, the most influential papers identified were indeed important publications:
they were authored by renowned researchers and recipients of important awards, and constituted either
fundamental reading or recent developments on a topic. I also developed a framework to predict future
influence scores and download counts through a combination of features. Accurate estimates were
obtained with learning methods such as RT-Rank.
Keywords: Social Networks, Network Analysis, Impact Scores, Finding Influencers
Resumo

Social networking services have millions of users; even so, one notices that the activity of a select
group of users is captured and propagated through the network more rapidly than that of others. We
call this group the influencers. They create trends and dominate opinions in social networks, being
crucial in areas such as marketing or opinion mining.

In my thesis, I studied network analysis methods to identify influencers, analyzing two types of networks,
namely location-based networks, originating from services such as FourSquare or Twitter, which include
relations between users and between users and the locations they have visited, and academic citation
networks, i.e., networks relating scientific papers through citations.

In location-based networks, the most influential nodes were estimated through a set of network analysis
techniques. The veracity of these results was assessed by comparison with traditional measures (e.g.,
the number of friends of a user), given that no validation list of influencers exists, i.e., a list containing a
set of unanimously recognized influencers.

In academic citation networks, the papers obtained as most influential are indeed important publications,
due to being authored by renowned, previously awarded scientists, and to being essential publications
or recent developments on a specific topic. I also developed a framework that predicts future influence
scores and future download totals, combining features such as previous influence scores. Through the
use of learning methods such as RT-Rank, accurate estimates are possible.

Keywords (Palavras-chave): Social Networks, Network Analysis, Influence Scores, Finding Influencers
Acknowledgments
First and foremost, I have to thank my parents, sister and brother-in-law for their unconditional support
and selflessness throughout these years, and especially during my MSc thesis.
I must thank my advisors, Prof. Dr. Bruno Martins and Prof. Dr. Pavel Calado, for all the support,
motivation, patience and availability. It is very comforting to be able to share ideas and openly discuss
new ways of addressing a problem with such ease. I must also thank them for giving me the opportunity
of being part of projects such as the European Digital Mathematics Library (EuDML) and the Services for
Intelligent Geographical Information Systems (SInteliGIS), both funded by the Portuguese Foundation for
Science and Technology (FCT) through the project grants with reference 250503 in CIP-ICT-PSP.2009.2.4
and PTDC/EIA-EIA/109840/2009, respectively.
I thank all the colleagues and close friends that have accompanied me throughout the years, and
especially the ones who have filled these last couple of years with so much joy, laughter and camaraderie.
So, to Ana Silva, João Lobato Dias, Luís Santos, João Amaro, Pedro Cruz, Jacqueline Jardim, Maria
Rosa, Luís Luciano, Carlos Simões, Mafalda Abreu, Célia Tavares and, thankfully, many others, I express
my enormous gratitude for keeping me (in)sane.
Last, but definitely not least, I must thank my boyfriend, João Fernandes, for his unconditional love,
support, patience and confidence, for helping me be more creative and sharp during the stressful
times, and for showing me there is always a light at the end of the tunnel.
Contents
Abstract i
Resumo iii
Acknowledgments v
1 Introduction 1
1.1 Hypothesis and Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Fundamental Concepts 5
2.1 Fundamental Concepts in Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Influencers in Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Prestige, Popularity and Attention in Social Networks . . . . . . . . . . . . . . . . . . . . . 9
2.4 Recognition, Novelty, Homophily and Reciprocity . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Active versus Inactive Users, User Retention, Confounding, Social Influence and Social
Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Information Cascades . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.7 Information Diffusion Models and Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.8 Graph Centrality Measures and Bibliographic Indexes . . . . . . . . . . . . . . . . . . . . 14
2.9 Unsupervised Rank Aggregation Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.10 Supervised Learning for Rank Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Related Work 27
3.1 The Hyperlinked Induced Topic Search (HITS) Algorithm . . . . . . . . . . . . . . . . . . . 27
3.2 The PageRank algorithm and its Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Weighted PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.2 Topic-Sensitive PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.3 TwitterRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 The Influence-Passivity (IP) Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Citation and Co-Authorship Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Temporal Issues in Ranking Scientific Articles . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Finding Influencers in Social Networks 43
4.1 Available Resources for Finding Influencers . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1.1 Characterizing Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Analysis of Location-based Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Data Collection from Online Services . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.2 Adaptation of the Influence-Passivity (IP) Algorithm . . . . . . . . . . . . . . . . . . 49
4.3 Analysis of Academic Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.1 Predicting Future Influence Scores and Download Counts . . . . . . . . . . . . . . 51
4.3.2 The Learning Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5 Validation Experiments 57
5.1 The Considered Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 The Obtained Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.1 Finding Influencers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.2 Predicting Future PageRank Scores and Download Counts . . . . . . . . . . . . . 67
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6 Conclusions 71
6.1 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Bibliography 75
Appendices 83
A Important Awards in Computer Science 83
List of Tables
5.1 Characterization of the FourSquare and Twitter networks. . . . . . . . . . . . . . . . . . . 58
5.2 Characterization of the DBLP dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3 Characterization of the DBLP network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4 User influence scores for PageRank and HITS algorithms, for the User+Spot Graph, built
from the FourSquare dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5 User influence scores for PageRank and HITS algorithms, for the User Graph, built from
the FourSquare dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.6 User influence scores for the IP algorithm, built from the FourSquare dataset. . . . . . . . 64
5.7 Spot influence scores for PageRank and HITS algorithms (that present the exact same
top-10), for the User+Spot Graph, built from the FourSquare dataset. . . . . . . . . . . . . 65
5.8 User influence scores for PageRank and HITS algorithms, for the User+Spot Graph, built
from the Twitter dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.9 User influence scores for PageRank and HITS algorithms, for the User Graph, built from
the Twitter dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.10 Spot influence scores for PageRank and HITS algorithms, for the User+Spot Graph, built
from the Twitter dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.11 PageRank scores for top-10 highest ranked papers of the DBLP dataset. . . . . . . . . . 67
5.12 Results for the prediction of impact PageRank scores for papers in the DBLP dataset. . . 68
5.13 Results for the prediction of download numbers for papers in the DBLP dataset. . . . . . . 69
List of Figures
2.1 A graph with the set of vertices V={1, ..., 8}, the set of edges E={(1, 2), (2, 4), (3, 4), ...}
and encoding a path P with length 6 (adapted from (Diestel, 2005)). . . . . . . . . . . . . 7
2.2 Graph with three components and two SCC’s denoted by dashed lines (adapted from
Easley & Kleinberg (2010) and Cormen et al. (2001)). . . . . . . . . . . . . . . . . . . . . 8
2.3 Flowchart for the Single Transferable Vote rule. . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Learning-To-Rank (L2R) Framework (adapted from Liu (2009)). . . . . . . . . . . . . . . . 24
3.5 A graph with hubs and authorities (adapted from Kleinberg (1998)). . . . . . . . . . . . . . 28
3.6 A graph illustrating the computation of PageRank (adapted from Page et al. (1998)). . . . 29
3.7 The general TwitterRank framework (adapted from Weng et al. (2010)). . . . . . . . . . . 34
4.8 Example of a location-based social network (adapted from Zheng & Zhou (2011)). . . . . 46
4.9 A sequence of subdivisions of the world sphere, starting from the octahedron, down to
level 5 corresponding to 8192 spherical triangles. The circular triangles have been plotted
as planar ones, for simplicity (adapted from Szalay et al. (2007)). . . . . . . . . . . . . . . 48
4.10 The HTM recursive division process (adapted from Szalay et al. (2007)). . . . . . . . . . . 49
4.11 Transformation of the original network graph (left) to our IP algorithm graph (right). . . . . 51
4.12 Structure of the citation graph built upon the DBLP data. . . . . . . . . . . . . . . . . . . . 51
4.13 Framework for predicting future PageRank scores and download counts. . . . . . . . . . . 52
5.14 Degree distribution for nodes in the User+Spot Graph and the User Graph, from the
FourSquare and Twitter datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.15 Degree distribution for the DBLP dataset from 2008 to 2011. . . . . . . . . . . . . . . . . 62
Chapter 1
Introduction
The rise of social media platforms such as Twitter1 or Google+2, with their focus on user-generated
content and social networks, has brought the study of authority and influence over social networks
to the forefront of current research. For companies and other public entities, identifying and engaging
with influential authors in social media is critical, since any opinions they express can rapidly spread far
and wide. For users, when presented with a vast amount of content relevant to a topic of interest, sorting
content by the source’s authority or influence can also assist in information retrieval.
There has been a substantial amount of recent work studying influence and the diffusion of informa-
tion in social networks. Moreover, there has also been much work in the field of network analysis that
has focused explicitly on sociometry, including quantitative measures of influence, authority, centrality or
prestige. These measures (e.g., degree centrality or betweenness centrality) are essentially heuristics,
usually based on intuitive notions such as access and control over resources, or brokerage of informa-
tion.
In the context of my MSc thesis I conducted a thorough study on the problem of identifying the most
influential nodes in a social network. With two different types of networks at hand, namely location-based
social networks from services such as FourSquare or Twitter, and academic citation networks encoding
relations between papers, the main focus was to use well-known social network analysis techniques and
algorithms.
One of the most important contributions of this work consisted in adapting the Influence-Passivity (IP)
algorithm, initially strictly intended for Twitter data and relying on re-tweets to capture information flow,
to be used in the context of location-based social networks, where the propagation of information is
done via the locations that users visit, i.e., exploiting patterns in which a user j visits a location l after
one of his friends i has already visited l.
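As a rough sketch of this idea (the tuple formats, function name, and data layout below are illustrative assumptions, not the thesis implementation), the directed influence edges can be derived from check-in timestamps:

```python
from collections import defaultdict

def influence_edges(checkins, friendships):
    """Derive directed influence edges i -> j whenever user j first visits a
    location l after a friend i had already visited l.

    checkins: iterable of (timestamp, user, location) tuples (assumed format).
    friendships: set of frozenset({u, v}) pairs for undirected friendship.
    """
    # Earliest visit of each user at each location.
    first_visit = {}
    for t, u, l in sorted(checkins):
        first_visit.setdefault((u, l), t)

    # Group first visits per location, in time order.
    by_location = defaultdict(list)
    for (u, l), t in first_visit.items():
        by_location[l].append((t, u))

    edges = defaultdict(int)  # (i, j) -> number of propagated locations
    for l, visits in by_location.items():
        visits.sort()
        for idx, (tj, j) in enumerate(visits):
            for ti, i in visits[:idx]:
                if frozenset({i, j}) in friendships:
                    edges[(i, j)] += 1
    return dict(edges)
```

The resulting weighted edge list is the kind of graph over which the IP algorithm's influence and passivity scores can then be iterated.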
1 http://twitter.com/
2 https://plus.google.com/
Regarding the study of influence in academic social networks, I studied techniques for estimating
future influence scores and future download counts. In this context, I specifically developed a
framework to predict the future PageRank scores and future download counts of scientific articles
downloaded from the ACM Digital Library1, for a specific year, through a combination of features that
include the age of the article and previous PageRank scores.
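RT-Rank belongs to the family of regression-tree ensembles. As a self-contained illustration of that family (not the actual RT-Rank code, and with toy feature vectors standing in for the real article features), gradient boosting over one-level regression trees can be sketched as:

```python
def fit_stump(X, y):
    """Best single-feature threshold split minimizing squared error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= thr]
            right = [y[i] for i, row in enumerate(X) if row[f] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if best is None or err < best[0]:
                best = (err, f, thr, lm, rm)
    if best is None:                      # all rows identical: predict the mean
        m = sum(y) / len(y)
        return 0, X[0][0], m, m
    return best[1:]

def gbrt_fit(X, y, rounds=200, lr=0.1):
    """Gradient boosting for squared loss: each stump fits current residuals."""
    base = sum(y) / len(y)
    residual = [v - base for v in y]
    stumps = []
    for _ in range(rounds):
        f, thr, lm, rm = fit_stump(X, residual)
        stumps.append((f, thr, lm, rm))
        for i, row in enumerate(X):
            residual[i] -= lr * (lm if row[f] <= thr else rm)
    return base, lr, stumps

def gbrt_predict(model, row):
    base, lr, stumps = model
    return base + sum(lr * (lm if row[f] <= thr else rm)
                      for f, thr, lm, rm in stumps)
```

In practice each row of X would hold features such as article age and past PageRank scores, and y the score observed one year later.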
1.1 Hypothesis and Methodology
In the context of my MSc thesis, I focused on the task of identifying the most influential users in a social
network, working with two types of networks, namely (1) location-based networks from services like
FourSquare or Twitter, that include relationships between users in the network and between users and
the locations they have visited, and (2) academic citation networks, i.e., networks that relate scientific
papers according to their citation relationships. The main hypothesis I tried to validate was that we
can identify the most influential users through social network analysis techniques and algorithms. More
specifically, with location-based networks, the presence of locations aids in the propagation of influence
scores through the network and, on the other hand, with academic citation networks, one can focus on
the temporal dynamics of networks and use networks from the past to predict future networks, assessing
how influence scores evolve through time.
In order to validate the research hypothesis, we began by collecting real and up-to-date data from two
social networking platforms, namely FourSquare2 and Twitter. For the location-based social networks,
different ranking algorithms were computed and the top-10 highest ranked users and the top-10 highest
ranked spots were extracted and analyzed. To assess the accuracy of these results, we conducted an
empirical analysis of the top-10, looking into the user profiles and spot check-ins, in order to understand
how profile characteristics relate to influence in the network. Regarding academic social networks, a
citation network was built with data from the DBLP3 digital library. The quality of the results from the
academic citation network was assessed by cross-checking the authors of the top-10 highest ranked
scientific papers in the DBLP collection against the recipients of various renowned scientific awards -
see Appendix A. For the experiments on estimating the future influence scores and future download
counts of scientific papers, a set of evaluation metrics, including the normalized root mean squared
error and the Spearman correlation, was used to assess the quality of our predictions against the real
influence scores.
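Both metrics are standard; a small illustrative sketch (simplified, with no tie handling in the rank computation) of how they can be computed:

```python
def nrmse(pred, real):
    """Normalized root mean squared error: RMSE divided by the range of the
    real values."""
    n = len(real)
    rmse = (sum((p - r) ** 2 for p, r in zip(pred, real)) / n) ** 0.5
    return rmse / (max(real) - min(real))

def spearman(pred, real):
    """Spearman rank correlation via the classic 1 - 6*sum(d^2)/(n*(n^2-1))
    formula; assumes no tied values."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rp, rr = ranks(pred), ranks(real)
    n = len(rp)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rr))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

NRMSE measures how far predictions deviate from the true values, while the Spearman correlation checks whether the predicted ranking of papers matches the true ranking, regardless of the scores' magnitudes.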
1 http://dl.acm.org/
2 https://foursquare.com/
3 http://www.informatik.uni-trier.de/~ley/db/
1.2 Main Contributions
The following are the most important contributions of this thesis, according to their relevance:
• I conducted a thorough study regarding ranking algorithms, with special focus on the PageRank
algorithm and its variants. I specifically implemented the HITS and the Influence-Passivity (IP)
algorithms. The IP algorithm was adapted to the context of location-based social networks. I
computed the influence for each node and extracted the highest ranked nodes in each type of
network. The code implementation of HITS and IP algorithms was made available as an open-
source project1, so that it can be re-used by others researching this topic.
• I implemented crawlers to extract data from FourSquare and from Twitter, from which I built net-
works with two types of nodes, namely users and spots (i.e., the locations users have visited).
These networks were used in the context of experiments for finding the most influential nodes
through algorithms such as HITS, PageRank or IP. The source code for the FourSquare crawler
was made available as an open-source project2, so it can be re-used by others researching this
topic.
• I built a citation network with data from the DBLP digital library, from which I extracted the most
influential papers after computing PageRank scores. The accuracy of these results was
assessed by cross-checking the authors of these papers against a list of the recipients of various
renowned scientific awards. From this experiment, we could conclude that the majority of the most
influential papers in this network are authored by recipients of important scientific awards.
• I developed a framework to predict the future PageRank scores and the future download counts of
scientific papers, for a specific year, using the citation network built from DBLP data. This task was
addressed through an ensemble learning regression algorithm. I assessed the impact that different
features have in the accuracy of the results. Our predictions were compared to the real PageRank
scores and the real number of downloads from the ACM Digital Library for each specific paper
and year. We concluded that in some cases, depending on the combination of features that we
used, adding more information can actually degrade the results, while in others, as we combine
more information, predictions become closer to the real values. Globally, this prediction approach
proved to be accurate, with the results being very close to the real values.
1 http://code.google.com/p/ezgraph/
2 http://code.google.com/p/fscrawler/
1.3 Organization of the Dissertation
The structure for the rest of this document is the following: Chapter 2 presents fundamental concepts
in social network analysis. Chapter 3 describes the most significant work related to the task of finding
influencers in social networks, and related to the analysis of location-based social networks. Chapter 4
details the work that was developed in the context of my MSc thesis, namely, the methodology for data
collection, how the networks were built, the specific implementation and adaptation of the IP algorithm,
as well as the methodology to find the influential nodes in the networks. Regarding the experiment on
the prediction of future PageRank scores, Chapter 4 also includes the description of the features and
the learning approach that was used. Chapter 5 describes the validation experiments and the obtained
results, alongside a brief discussion. Finally, Chapter 6 closes this document, highlighting the most
important conclusions of this MSc thesis, and presenting possible paths for improvement and future
work.
Chapter 2
Fundamental Concepts
This chapter introduces the fundamental concepts related to the problem of finding influencers in
social networks. After a brief introduction to graph theory, more specific concepts are then presented,
such as what it means to be an influencer, the distinction between popularity and prestige, what one
means when discussing social gestures, and the social gestures that are most relevant in the context of
this MSc thesis, namely homophily and reciprocity. Finally, this chapter introduces fundamental concepts
behind graph centrality measures, bibliometric indexes and rank aggregation approaches, the latter
concerning the combination of the outputs of various ranking methods to generate a consensual ranked
list.
2.1 Fundamental Concepts in Graph Theory
A graph G can be represented as a pair G = (V,E), where V or V (G) is the set of vertices or nodes
and E or E(G) is the set of edges or links between the nodes (Figure 2.1). The number of vertices of a
graph indicates the graph’s order (Diestel, 2005). Graphs are usually used when representing networks,
either undirected (Figure 2.1) or directed (i.e., digraphs, in which the edges have a direction from a node
A to a node B). A way of representing a directed graph D is with an adjacency matrix, which is a square
matrix A = A(D) where each cell (i, j) has a value equal to 1 if there is an edge from i to j, and a value
equal to 0 otherwise (Harary, 1962).
In what regards graph measures, the degree d_G(i) or valency of a vertex i in an undirected graph G
is the number |E(i)| of edges at i, which is equal to the number of neighbours of i, i.e., the number of
vertices that are adjacent to i. It can be mathematically expressed as follows, where a(i, j) denotes a
cell in the graph's adjacency matrix:
d_G(i) = \sum_j a(i, j) = \sum_j a(j, i)    (2.1)
In what regards directed graphs, we have the same notation as in undirected graphs, with the exception
that, when specifying the set of edges E, all pairs of connected vertices have to be oriented. Besides
the measure of degree, one can also measure the in-degree d_G^{in}(i) and out-degree d_G^{out}(i) of a
vertex i, which are, respectively, the number of incoming edges and outgoing edges of that vertex (Clark &
Holton, 1991). The in-degree and out-degree can also represent the cardinality of, respectively, the set of
predecessors and successors of a node, and can be formally expressed as follows:

d_G^{in}(i) = \sum_j a(j, i)    (2.2)

d_G^{out}(i) = \sum_j a(i, j)    (2.3)
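Equations 2.2 and 2.3 translate directly into code; a minimal sketch over a 0/1 adjacency matrix stored as a list of lists (an assumed representation):

```python
def degrees(A):
    """In-degree and out-degree of every vertex of a directed graph,
    read off the columns and rows of the adjacency matrix A."""
    n = len(A)
    out_deg = [sum(A[i][j] for j in range(n)) for i in range(n)]  # row sums
    in_deg = [sum(A[j][i] for j in range(n)) for i in range(n)]   # column sums
    return in_deg, out_deg
```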
One might also want to represent a weighted network, i.e., a network in which each edge is assigned
a specific weight. A weighted network can be expressed as an adjacency matrix with each entry
indicating the weight of the corresponding edge (w_{ij}), as follows (Newman, 2004):

A_{ij} = w_{ij}    (2.4)

When representing a weighted network with a graph, one just has to add the weights to each edge, thus
defining a weighted graph. For a weighted network, besides the in-degree and out-degree of a vertex
i, one is usually more interested in the strength of i, i.e., the sum of the weights w of the corresponding
edges. The in-strength s^{in}_i and out-strength s^{out}_i of a vertex i are expressed as follows (Luciano
et al., 2005):

s^{in}_i = \sum_j w(j, i)    (2.5)

s^{out}_i = \sum_j w(i, j)    (2.6)
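The in-strength and out-strength of equations 2.5 and 2.6 are the weighted counterparts of the degree sums; a minimal sketch:

```python
def strengths(W):
    """In-strength and out-strength of every vertex from the weighted
    adjacency matrix W; with 0/1 weights these reduce to the degrees."""
    n = len(W)
    s_out = [sum(W[i][j] for j in range(n)) for i in range(n)]  # row sums
    s_in = [sum(W[j][i] for j in range(n)) for i in range(n)]   # column sums
    return s_in, s_out
```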
Also important in graph analysis is the notion of a stochastic matrix. A square matrix A = (a_{k\lambda}) can
only be called stochastic if all its elements are non-negative and if the following condition is verified
(Brauer, 1952):

\sum_{\lambda=1}^{n} a_{k\lambda} = 1,  k = 1, 2, ..., n    (2.7)

Stochastic matrices can be used to encode weighted graphs where the in-degree or the out-degree
correspond to probability distributions.
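For instance, a weight matrix can be made row-stochastic (each row of out-edge weights becomes a probability distribution) by normalizing every row by its sum; in this sketch, dangling rows, which sum to zero, are left untouched:

```python
def row_stochastic(W):
    """Normalize each row of a non-negative weight matrix so it sums to 1,
    turning out-edge weights into a probability distribution."""
    result = []
    for row in W:
        s = sum(row)
        result.append([w / s for w in row] if s > 0 else row[:])
    return result
```

This is exactly the kind of transition matrix used later by random-walk based measures such as PageRank.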
A path within a graph is a non-empty sub-graph P = (V, E) such that V = {x_0, x_1, ..., x_k} and
E = {(x_0, x_1), (x_1, x_2), ..., (x_{k-1}, x_k)}, where the x_i are all distinct from one another – see
Figure 2.1. The nodes x_0 and x_k are called the ends of path P (Bondy & Murty, 1976). For undirected
and unweighted graphs, the number of edges (|E|) in a path is the length of the path.

Figure 2.1: A graph with the set of vertices V = {1, ..., 8}, the set of edges E = {(1, 2), (2, 4), (3, 4), ...}, encoding a path P with length 6 (adapted from Diestel, 2005).
One might also be interested in determining the geodesic path, i.e., the shortest path, between two
vertices. The geodesic path between vertices i and j is the path between them that has the minimum
length (Luciano et al., 2005).
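For unweighted graphs, the geodesic distance can be computed with a breadth-first search; a minimal sketch, with the graph given as an adjacency dictionary (an assumed representation):

```python
from collections import deque

def geodesic_length(adj, src, dst):
    """Length of the geodesic (shortest) path between src and dst in an
    unweighted graph, via breadth-first search; None if dst is unreachable."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None
```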
When describing the structure of a graph, one can parcel it out into components or connected compo-
nents, i.e., subsets of nodes in which every node has a path to every other node, but are not part of
a larger set that is also internally connected (Gibbons, 1985) – see Figure 2.2. A directed graph can
have strongly connected components (SCCs), which are sets of nodes such that, for any nodes i and j
belonging to the set, there is a path from i to j and a path from j to i (Gibbons, 1985). Dangling nodes
are defined as nodes that have no outlinks. Figure 2.2 illustrates both these concepts in a graph.
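Connected components of an undirected graph can be found by repeated flood fill; a minimal sketch (finding the SCCs of a directed graph needs more machinery, e.g., Tarjan's algorithm, and is omitted here):

```python
def components(adj):
    """Connected components of an undirected graph given as an adjacency
    dictionary: repeatedly flood-fill from an unvisited vertex."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj.get(u, []))
        seen |= comp
        comps.append(comp)
    return comps
```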
2.2 Influencers in Social Networks
Influence in social networks is very important, not only from the perspective of information flow, but also
for network analysis applications aimed at business and marketing purposes. As for what it means to be
influential, many authors have their own particular definitions.
Watts & Dodds (2007) define an influential person or an opinion leader as an individual that is part of a
minority and who has influence over a great number of peers. This influential individual belongs to the
top q% of the influential distribution p(n), having as a premise that an individual i, within a population of
size N, influences n_i other randomly chosen individuals, where n_i comes from p(n) and refers to how
many people i influences, regarding a specific topic X.

Figure 2.2: Graph with three components and two SCCs denoted by dashed lines (adapted from Easley & Kleinberg (2010) and Cormen et al. (2001)).
From work developed in the Web Ecology Project, in the context of Twitter, an influential is defined as
a user who, through his actions (i.e., interactions such as replies, retweets, mentions or attributions),
has the potential to initiate an action from another user (Leavitt et al., 2009). These actions are called
markers of influence and should be taken into account when assessing the influence of Twitter users,
instead of the simplistic measure of follower count, which assumes that the user with the greatest
number of followers is the most influential.
Bakshy et al. (2011), also on Twitter, consider that if a person B is following a person A, if person A
posted a URL earlier than person B did, and if person A is the only one of B's friends who has posted
that specific URL, then person A has influenced person B to post that URL. Regarding the computation
of influence, the authors recognize that three different approaches can be considered if person B has
more than one friend who has posted the same URL:
i. First Influence, crediting exclusively the person who first posted the content, thus assuming that
individuals are influenced when they first see novel information, even if they do not act on it imme-
diately;
ii. Last Influence, crediting the last person who posted the content;
iii. Split Influence, crediting equally all friends that posted that specific content before its most recent
post. This last approach assumes that either the likelihood of noticing novel content or the intention
of acting upon it steadily accumulates, as the information is reposted by more and more friends.
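The three credit-assignment schemes can be stated compactly in code; in this sketch (the function name and input format are illustrative), `posters` is the time-ordered list of B's friends who posted the URL before B did:

```python
def influence_credit(posters, mode):
    """Assign influence credit for one repost among the friends who posted
    the URL earlier; posters is a time-ordered list of those friends."""
    if mode == "first":                 # First Influence: earliest poster
        return {posters[0]: 1.0}
    if mode == "last":                  # Last Influence: most recent poster
        return {posters[-1]: 1.0}
    if mode == "split":                 # Split Influence: equal shares
        share = 1.0 / len(posters)
        return {p: share for p in posters}
    raise ValueError(mode)
```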
In turn, still in the realm of Twitter, Cha et al. (2010) defined three types of influence for a user, rather
than just one. These metrics are directly related to interpersonal activities:
i. Indegree Influence, counting the total number of followers to determine the size of the user's
audience in the network;

ii. Retweet Influence, counting the total number of retweets containing a user's name, to measure the
ability of a user to generate content that is spread by others through the network (i.e., his pass-along
value);

iii. Mention Influence, counting the total number of mentions of a user's name, to measure the ability
of engaging other users in a conversation.
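Given a log of interactions, the three counts are straightforward tallies; a sketch over a hypothetical event-tuple format (the tuple layout is an assumption for illustration):

```python
def cha_influence(events, user):
    """The three Cha et al. (2010) influence counts for one user, tallied from
    a hypothetical log of (kind, source, target) interaction tuples."""
    counts = {"indegree": 0, "retweet": 0, "mention": 0}
    kind_map = {"follow": "indegree", "retweet": "retweet", "mention": "mention"}
    for kind, _, target in events:
        if target == user and kind in kind_map:
            counts[kind_map[kind]] += 1
    return counts
```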
Another important aspect of influence in social networks is that influence is determined by the information
flow, i.e., the flow of user-generated content and its propagation through the network (Romero et al.,
2011).
2.3 Prestige, Popularity and Attention in Social Networks
Although popularity and prestige are two distinct concepts, they are commonly mistaken one for the
other. Both these concepts are related to influence, since prestigious and/or popular users are more
likely to be influential.
One can define popularity as a direct quantification of the level of attention someone has in a social
network (Romero et al., 2011). One can, for instance, assess popularity in Digg1 or in YouTube2 by,
respectively, the number of votes (Diggs) and the number of views that the content of a given user has
(Szabo & Huberman, 2010).
As for the notion of prestige, it is most commonly associated with scholarly networks, such as paper
citation and journal citation networks. In this realm, there is also a distinction between journal popularity
and journal prestige: a popular journal may be frequently cited by journals with little prestige, while a
prestigious journal may have few citations, but coming only from highly prestigious journals (Bollen et
al., 2006).
In a scholarly network, the popularity of an author, journal or paper is the number of times it was cited
by other nodes in the network, while its prestige is the number of times it was cited by other highly cited
nodes in the network (Ding & Cronin, 2011).
In the academic realm, attention is seen as a form of payment, as well as the main input to scientific
production (Franck, 1999). Scientific publications earn attention when cited by other authors in their
1 http://digg.com/
2 http://youtube.com/
publications. Also in other social networks, attention is regarded as a form of value and as a catalyst for
more contributions in the social network (Wu et al., 2009).
2.4 Recognition, Novelty, Homophily and Reciprocity
Influential and popular people are recognized by their peers and also by many others outside their
communities. Recognition, be it in blogs, academia or social media, comes from referencing a person’s
work, opinions or ideas, and it can have a bidirectional relationship with influence, since the more
influential the sources a user references, the more influential the user can become (Agarwal et al.,
2008).
Novelty is also correlated with influence, in that novel ideas generally exert more influence. In
the blogosphere, novelty is also correlated with the number of outlinks of a blog post. Nevertheless, this
is a negative correlation, as a greater number of outlinks indicates that the post refers to many other
blog posts, revealing that the post is not likely to be novel (Agarwal et al., 2008).
In the context of human interaction, homophily refers to the observation that people with similar
characteristics, interests and/or preferences tend to be more in contact with each other than with
people with whom they have fewer characteristics and/or preferences in common. As stated in
the work of McPherson et al. (2001), homophily implies that distance in terms of social characteristics
translates into network distance, the number of relationships through which a piece of information must
travel to connect two individuals.
Another important social phenomenon is reciprocity, arising from following relationships in social net-
works, such as Twitter, where a user has the tendency to follow back a user that followed him in the
first place. This is revealed by the high correlation between the number of friends and followers,
meaning that the more friends a user has, the more followers he usually has, and vice-versa (Weng
et al., 2010).
Weng et al. (2010), in the study of TwitterRank, addressed the presence of homophily and reciprocity
on Twitter, considering that these characteristics are behind the following relationships, giving more
meaning to social ties and to the identification of influential people on Twitter.
2.5 Active versus Inactive Users, User Retention, Confounding,
Social Influence and Social Correlation
When a user performs an action for the first time, such as purchasing a product or visiting a website,
one can state that the user has become active. With a number a of already active friends, a user
has an activation probability p(a), which can be modeled with a logistic function expressed as follows:

p(a) = e^(α ln(a+1)+β) / (1 + e^(α ln(a+1)+β))    (2.8)
In the formula, α and β are coefficients, with α measuring social correlation. Both can be estimated
using maximum likelihood logistic regression (Anagnostopoulos et al., 2008).
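A minimal sketch of Equation 2.8 follows; the coefficient values are purely illustrative (in practice, α and β would be estimated by logistic regression, as noted above):

```python
import math

def activation_probability(a, alpha, beta):
    """Equation 2.8: logistic probability of activation, given a active friends."""
    z = alpha * math.log(a + 1) + beta
    return math.exp(z) / (1 + math.exp(z))

# With positive alpha, more active friends raise the activation probability.
p1 = activation_probability(1, alpha=0.8, beta=-2.0)
p10 = activation_probability(10, alpha=0.8, beta=-2.0)
```

With α > 0, p(a) grows with the number of active friends, which is exactly the social correlation effect the coefficient is meant to capture.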
An active user can become a retained user if he stays active in the network, therefore affecting the
retention of other users and keeping them from leaving the network (Heidemann et al., 2010). This can
also be used as an evaluation metric to identify influential users in social networks, as Heidemann et al.
(2010) proposed to do.
Also, one can state that two adjacent nodes u and v in a social network have a social correlation tie if
the events that turned u into an active user are correlated with the events that turned v into an active
user as well. This behavioral correlation can be accounted for by homophily, confounding factors (i.e., the
environment) and social influence (Anagnostopoulos et al., 2008).
Confounding factors are influences from external elements, which end up affecting individuals that are
close in a social network. Mathematically, confounding corresponds to the presence of a confounding
variable X and a set of active individuals W, both in a social network G, such that the set of active
individuals W comes from a distribution that is correlated with X (Anagnostopoulos et al., 2008).
In confounding, the individuals’ choices of becoming friends with others and of becoming active are
exclusively affected by the same unobserved variable X.
The phenomenon of social influence is also one of the causes of social correlation. With social influence,
the actions of individuals can induce their friends to act the same way, which can occur via (i) setting
an example for their friends, (ii) informing friends about the action taken, or (iii) increasing the value of
an action for their friends (Anagnostopoulos et al., 2008).
2.6 Information Cascades
In the theory behind information cascades, we assume that agents observe private signals of some
inherent state and make public decisions. Subsequent decision-makers will face the difficulty of knowing
if their own private signal is significant in the choice of a state that is unlikely, given the public decisions
that were previously observed (Anderson & Holt, 1995).
An information cascade occurs when all decisions (initial and subsequent) coincide, in such a way that
it is optimal for subsequent decision-makers to ignore their private signals and follow a
pattern that has been established. For example, suppose that a worker is not hired by several prospec-
tive employers because of poor interview performances. With this public decision information, a
subsequent prospective employer may not hire the worker, because the worker’s information is
dominated by the negative signals inferred from previous rejections, even if the candidate does well in his
interview (i.e., a positive private signal). Therefore, an information cascade can result from rational
inferences that others’ decisions are based on information that dominates one’s own signal (Anderson &
Holt, 1995).
In the work of Papagelis et al. (2009), in the context of the blogosphere, a cascade is characterized
by its (i) size, i.e., the number of nodes involved in the cascade, excluding
its initiator; (ii) height, i.e., the height of the resulting spanning tree, after a depth first search traversal
on the cascade; (iii) minimum reaction time of all posts in the cascade, excluding its initiator; (iv) mean
reaction time of all posts in the cascade, excluding its initiator; and (v) maximum reaction time of all
posts in the cascade, excluding its initiator.
In social networks, there are many factors that influence information cascades, such as the graphical
interface used to interact with the network (Millen & Patterson, 2002), the fact that an in-topic conversa-
tion/interaction is being maintained (Arguello et al., 2006), or positive attention and feedback (Huberman
et al., 2009).
The analysis of information cascades can provide insight on public opinion over a variety of topics
(Papagelis et al., 2009). Therefore, this is related to the task of finding influential users on a social
network, since those influential users are the ones who tend to shape, i.e., influence, the opinions of
other users in the social network.
2.7 Information Diffusion Models and Measures
Young (2009) presents three information diffusion models, arising, respectively, from the realms of
marketing, sociology and economics: (i) social contagion, (ii) social influence and (iii) social learning.
In social contagion, information spreads like in an epidemic, i.e., people spread information when they
come into contact with others who have already been in contact with that same information (Young, 2009). This
model is, thus, based on exposure. The homogeneous contagion model at time t can be mathematically
described by the following ordinary differential equation:

ṗ(t) = (λp(t) + γ)(1 − p(t))    (2.9)
In the formula, λ and γ are non-negative parameters, not both equal to zero, respectively corresponding
to the instantaneous rates at which a current non-adopter hears about the information from a previous
adopter within the group and from outside the group.
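Equation 2.9 can be integrated numerically to observe the adoption curve. The following sketch uses forward-Euler integration with illustrative parameter values:

```python
def simulate_contagion(lam, gamma, p0=0.0, dt=0.01, steps=2000):
    """Forward-Euler integration of dp/dt = (lam * p + gamma) * (1 - p)."""
    p = p0
    for _ in range(steps):
        p += dt * (lam * p + gamma) * (1 - p)
    return p

# Adoption rises monotonically towards 1 as time passes (parameters illustrative).
p_early = simulate_contagion(lam=0.5, gamma=0.1, steps=100)
p_late = simulate_contagion(lam=0.5, gamma=0.1, steps=2000)
```

The factor (1 − p) makes growth slow down as the pool of non-adopters shrinks, producing the familiar S-shaped diffusion curve.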
In social influence, users spread information when enough other people in their group have already
been in contact with it. In a standard model, it is assumed that users have different social thresholds,
which determine if they will spread that information or not, as a function of the number of others that
have already spread it. Users are, thus, moved by social pressure, in a way that the aforementioned
thresholds refer to their degree of responsiveness to social influence. Also, the threshold of user i is
the minimum proportion ri ≥ 0, such that i will only spread information if, at least, a proportion ri of the
members of the group already have done the same. If ri > 1, it is implied that, for user i to spread the
information, at least, the whole group had to have spread it as well. Therefore, in this latter case, i never
spreads the information. With F (r) being the cumulative distribution function of thresholds in some given
population, at time t, the proportion of people whose thresholds have been crossed is F (p(t)). Having
λ as the instantaneous rate at which people are converted to spread the information, and assuming that
a proportion p(t) of the population has already spread it, the proportion of users whose thresholds have
already been crossed, but who have not yet spread the information, is F(p(t)) − p(t) (Young, 2009). Thus, this model can be expressed as
follows:
ṗ(t) = λ[F(p(t)) − p(t)],  λ > 0    (2.10)
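A discrete-time sketch of these threshold dynamics, in the spirit of Equation 2.10: at each step, everyone whose threshold has been crossed spreads the information. This simplifies the continuous-time model but exhibits the same fixed-point behaviour; the threshold values are illustrative:

```python
def threshold_cascade(thresholds, steps=100):
    """Discrete-time threshold dynamics: p_{t+1} = F(p_t), with F the
    empirical CDF of the population's thresholds."""
    n = len(thresholds)
    p = 0.0
    for _ in range(steps):
        p = sum(1 for r in thresholds if r <= p) / n
    return p

# A few zero-threshold "innovators" trigger a full cascade...
full = threshold_cascade([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
# ...while without them nobody ever starts spreading.
none = threshold_cascade([0.2, 0.4, 0.6, 0.8])
```

The two runs illustrate how the distribution of thresholds, not just their average, determines whether a cascade takes off.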
In a social learning model, users spread information once they have enough empirical evidence to
convince them that the information is worth spreading. Thus, users make rational use of previously
gathered evidence in order to reach a decision (e.g., when a new smartphone is out in the market,
people tend to see how it works for others over some period of time before trying for themselves).
Due to sources of heterogeneity, such as discrepancies in their prior beliefs, the amount of information
they have gathered, or idiosyncratic costs, people may spread information at different times. In this
type of model, which explains why people spread information given that others have already done so,
the adoption decision flows directly from the rational evaluation of evidence. There
are two types of social learning models, namely (i) social learning models with direct observation, where
the evidence comes from other people’s experiences, i.e., people believe that the information is worth
spreading because other people have done it, and their spreading payoff is fully observable, and (ii)
herding models, where only the spreading act is observable (Young, 2009).
In a social learning model with direct observation, one can assume that (Young, 2009):
i. Payoffs are observable;
ii. Payoffs generated by different individuals and/or at different points in time are independent and
equally informative;
iii. Users are risk-neutral and myopic (i.e., they care only about immediate payoffs);
iv. There is no idiosyncratic component to payoffs due to differences in user’s types, although users
may have different costs (not necessarily observable);
v. There are differences in users’ prior beliefs about how good the information is relative to the status
quo;
vi. There are differences in the average number of people users observe, and hence in the amount of
information they have;
vii. The population is fully mixed.
In this case, the system becomes very simple and the various types of heterogeneity are reduced to
a composite index that measures the probability of a given user spreading, conditional on the amount
of information that has been generated so far in the population (Young, 2009). Regarding information
diffusion measures, the most common are (i) speed, which considers whether and when a diffusion
instance will take place, (ii) scale, i.e., the number of instances that were affected at first degree, and
(iii) range, which measures how far the diffusion chain can continue in depth (Yang & Counts, 2010).
2.8 Graph Centrality Measures and Bibliographic Indexes
In graph theory, graph centrality measures provide a way of measuring the varying importance of network
vertices, according to specific criteria and the role played by the nodes of a network. In Bibliometrics,
an area concerned with the analysis of patterns in scientific literature, bibliometric indexes are used to
evaluate the quality, impact and relevance of the work of a particular scientist, usually by analyzing the
citation graph. In the context of this MSc thesis, both these areas are particularly important, because
they can provide robust approaches for estimating influence. Some of the most important network
centrality metrics are as follows:
i. Degree Centrality : Degree centrality is a measure of the popularity of a node in a network (New-
man, 2003). It is defined according to the number of edges connected to a particular vertex in the
network, and is mathematically expressed as follows:
CD(v) = dG(v) / (n − 1)    (2.11)

In the formula, dG(v) is the degree of vertex v and n is the total number of vertices in the network.
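A small sketch of Equation 2.11 over an adjacency-dict representation; the star graph below is a toy example:

```python
def degree_centrality(adj):
    """Equation 2.11: C_D(v) = deg(v) / (n - 1), for an undirected graph
    stored as an adjacency dict {vertex: set(neighbours)}."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

# Star graph on 4 vertices: the hub touches every other vertex.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
cd = degree_centrality(star)   # hub -> 1.0, each leaf -> 1/3
```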
ii. Betweenness Centrality: This measure is based on the number of shortest paths that pass through
a vertex. For instance, the betweenness of a vertex i is the fraction of geodesic paths between
pairs of vertices of the network that happen to be passing through i. In case of more than one
shortest path between a pair of vertices, each path is given an equal weight such that their sum is
equal to one (Newman, 2003). Assuming that g(jk)i is the number of geodesic paths from vertex
j to vertex k that pass through i, that njk is the total number of geodesic paths from vertex j to
vertex k, and that n is the total number of vertices in the network, the betweenness of vertex i is
computed as follows:

bi = ( ∑_{j<k} g(jk)i / njk ) / ( (1/2) n (n − 1) )    (2.12)
With the betweenness measure, the extent to which a node has control over the information that
flows between others can be estimated.
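Equation 2.12 can be computed by enumerating the geodesics between all pairs. The brute-force sketch below is only suitable for toy graphs (production code would use Brandes’ algorithm, as implemented in the libraries listed later in this section):

```python
def all_shortest_paths(adj, s, t):
    """Enumerate all shortest simple paths from s to t (brute force; toy graphs only)."""
    best, paths = None, []
    stack = [(s, [s])]
    while stack:
        v, path = stack.pop()
        if v == t:
            if best is None or len(path) < best:
                best, paths = len(path), [path]
            elif len(path) == best:
                paths.append(path)
            continue
        if best is not None and len(path) >= best:
            continue  # cannot become a shortest path any more
        for w in adj[v]:
            if w not in path:
                stack.append((w, path + [w]))
    return paths

def betweenness(adj):
    """Equation 2.12: fraction of geodesics through i, normalised by (1/2)n(n-1)."""
    vs = sorted(adj)
    n = len(vs)
    b = dict.fromkeys(vs, 0.0)
    for x in range(n):
        for y in range(x + 1, n):
            paths = all_shortest_paths(adj, vs[x], vs[y])
            if not paths:
                continue
            for i in vs:
                if i not in (vs[x], vs[y]):
                    b[i] += sum(1 for p in paths if i in p) / len(paths)
    norm = 0.5 * n * (n - 1)
    return {i: b[i] / norm for i in vs}

# Path graph a - b - c: only the a-c geodesic has an intermediate vertex.
path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
bt = betweenness(path3)   # b -> 1/3, a and c -> 0
```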
iii. Closeness Centrality: This measure is defined as the average geodesic distance, i.e., the average
shortest path, between a vertex and all the other vertices that are reachable from it. By measuring
a vertex’s closeness, we can measure how long it will take to spread information from this par-
ticular vertex to the other vertices in the network (Freeman, 1978). Closeness Centrality can be
mathematically expressed as follows:
CC(i) = 1 / ∑_{j∈V\{i}} g(i,j)    (2.13)

In the formula, V represents the total set of vertices of the network and g(i,j) is the distance of the
geodesic path between vertices i and j.
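A sketch of Equation 2.13, using breadth-first search to obtain the geodesic distances from i:

```python
def closeness(adj, i):
    """Equation 2.13: reciprocal of the sum of geodesic distances from i
    (graph given as an adjacency dict; BFS computes the distances)."""
    dist, frontier = {i: 0}, [i]
    while frontier:
        nxt = []
        for v in frontier:
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    nxt.append(w)
        frontier = nxt
    return 1 / sum(d for v, d in dist.items() if v != i)

path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
cc_b = closeness(path3, "b")   # distances 1 and 1 -> 1/2
```

Note that some libraries normalise closeness differently (e.g., multiplying by n − 1); the sketch follows Equation 2.13 literally.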
iv. Eigenvector Centrality: This measure weights the contacts according to their centralities, taking
into account the whole pattern of the network and computing the weighted sum of both direct and
indirect connections of every length. Therefore, having the graph G(E, V ), the adjacency matrix
A, λ as the largest eigenvalue of A, and n as the number of vertices, the eigenvector centrality xi
of node i can be expressed as follows (Bonacich, 2007):
λxi = ∑_{j=1}^{n} aij xj ,   i = 1, ..., n    (2.14)
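Equation 2.14 can be solved by power iteration. The sketch below iterates on A + I (which has the same eigenvectors as A) so that it also converges on bipartite graphs; the star graph is a toy example:

```python
def eigenvector_centrality(adj, iters=200):
    """Power iteration for Equation 2.14, run on A + I (same eigenvectors
    as A) to avoid oscillation on bipartite graphs."""
    x = {v: 1.0 for v in adj}
    for _ in range(iters):
        y = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = max(y.values())
        x = {v: y[v] / norm for v in y}
    return x

# Star graph: the hub is adjacent to every other vertex and gets the top score.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
ev = eigenvector_centrality(star)   # leaves converge to 1/sqrt(3) of the hub's score
```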
v. Clustering Coefficient: As a measure for transitivity, Watts & Strogatz (1998) introduced the clustering
coefficient. This coefficient measures the degree to which the neighbours of a vertex tend to be
connected to one another, and it can be globally expressed as follows (Kaiser, 2008):

C = ∑_{i∈V} Γi / ∑_{i∈V} dG(i)(dG(i) − 1)    (2.15)
In the formula, i is a vertex of graph G, which has V as its set of vertices, dG(i) is the degree of i, and
Γi is the number of connections among the neighbours of vertex i. The above global definition of the
clustering coefficient is obtained through the computation of a local clustering coefficient which,
for undirected graphs, is defined as in Equation 2.16 and, for directed graphs, as in Equation 2.17:

C(i) = 2|ejk| / ( dG(i)(dG(i) − 1) )    (2.16)

C(i) = |ejk| / ( dG(i)(dG(i) − 1) )    (2.17)
In both formulas, i, j and k are vertices of graph G, dG(i) is the degree of i, and |ejk| represents
the total number of existing edges between the neighbours of vertex i.
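A sketch of the local coefficient in Equation 2.16 for an undirected graph stored as an adjacency dict; the toy example is a triangle with a pendant vertex:

```python
def local_clustering(adj, i):
    """Equation 2.16 (undirected): C(i) = 2*|e_jk| / (deg(i)*(deg(i)-1)),
    where |e_jk| counts edges among i's neighbours."""
    nbrs = adj[i]
    d = len(nbrs)
    if d < 2:
        return 0.0
    # Each undirected edge among the neighbours is seen twice, hence // 2.
    links = sum(1 for j in nbrs for k in adj[j] if k in nbrs) // 2
    return 2 * links / (d * (d - 1))

# Triangle a-b-c plus a pendant vertex d attached to a:
g = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
ca = local_clustering(g, "a")   # one edge among three neighbours -> 1/3
```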
vi. Average Path Length: This network topology measure determines the distance between any pair
of vertices, and it can be used to determine if the graph is characteristic of a social network (Reka
& Barabasi, 2002). It is computed as the average length over all shortest paths between pairs of
vertices (Luciano et al., 2006), and it can be mathematically expressed as follows:
〈L〉 = ( 1 / (n(n − 1)) ) ∑_{i,k∈V} gik    (2.18)

In the formula, V is the set of vertices in the network, gik represents the distance of the geodesic
path between vertices i and k, and the parameter n represents the total number of vertices in the
graph.
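A sketch of Equation 2.18, averaging BFS distances over all ordered pairs of a small connected graph:

```python
def average_path_length(adj):
    """Equation 2.18: mean geodesic distance over all ordered pairs."""
    n = len(adj)
    total = 0
    for src in adj:
        dist, frontier = {src: 0}, [src]
        while frontier:               # breadth-first search from src
            nxt = []
            for v in frontier:
                for w in adj[v]:
                    if w not in dist:
                        dist[w] = dist[v] + 1
                        nxt.append(w)
            frontier = nxt
        total += sum(dist.values())
    return total / (n * (n - 1))

path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
apl = average_path_length(path3)   # distances 1,1,1,1,2,2 -> 8/6 = 4/3
```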
To compute these network centrality measures, some readily available open-source libraries can be
used. These include:
i. Gephi1 (Bastian et al., 2009): A Java library for social network analysis and data visualization;
ii. NetworkX 2 (Hagberg et al., 2008): A Python library to create, manipulate and analyze complex
networks;
iii. Network Workbench3: A Java framework for large-scale network analysis and data visualization;
1 http://gephi.org/developers/
2 http://networkx.lanl.gov/
3 http://nwb.cns.iu.edu/
iv. iGraph1: A C library for graph analysis which integrates with the R package2 for data visualization
and statistical computing, which also provides other methods for social network analysis;
v. CiteSpace3 (Chen, 2006): A Java application for visualizing and analyzing trends and patterns in
scientific literature;
vi. NetKit-SL4 (Macskassy & Provost, 2007): A set of Java packages which provide an implementation
of several graph centrality measures;
vii. CytoScape5 (Shannon et al., 2003): A Java software platform for complex network visualization,
which also provides network analysis via plugins.
As for bibliometric indexes, some of the most widely used are as follows:
i. The h-index and its variants : Proposed by Hirsch (2010) to quantitatively represent the output
of a researcher, this index measures the productivity and total impact of a scientist, supporting
comparisons between scientists of different ages (Hirsch, 2010). A researcher has an h-index of h
if h of his/her Np papers (i.e., the total number of published papers) have at least h citations each,
and the other (Np − h) papers have ≤ h citations each.
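The definition translates directly into a few lines of code; a sketch:

```python
def h_index(citations):
    """Largest h such that h of the papers have at least h citations each."""
    cits = sorted(citations, reverse=True)
    h = 0
    while h < len(cits) and cits[h] >= h + 1:
        h += 1
    return h

# A researcher with papers cited [10, 8, 5, 4, 3] times has h = 4:
# four papers with at least 4 citations each, but not five with at least 5.
```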
Several variants of this metric have been proposed, in order to deal with some of the problems of
the original h-index. One such extension is the contemporary h-index (Sidiropoulos et al., 2007),
which takes into account the age of an article and allows us to acknowledge the work of young
promising scientists and of senior scientists, who happen to still be active. The contemporary
h-index score Sc(i) for article i depends on the value of:
Sc(i) = γ · (Y(now) − Y(i) + 1)^(−δ) · |C(i)|    (2.19)
In the formula, Y(i) represents the year of publication of article i and C(i) represents the set of
articles that cite article i. The parameter δ is set to 1, so that Sc(i) is the total number of citations
received by article i, divided by the age of the article. Since this division makes the score Sc(i)
small, the coefficient γ is set to 4, so that the citations of an article published in the current year
count four times as much, while the citations of an article published 4 years ago count only once.
With this approach, as time goes by, older articles gradually lose their value.
1 http://igraph.sourceforge.net/
2 http://www.r-project.org/
3 http://cluster.cis.drexel.edu/~cchen/citespace/
4 http://netkit-srl.sourceforge.net/
5 http://www.cytoscape.org/
In brief, a researcher has a contemporary h-index of hc if hc of his/her Np articles have a score of
Sc(i) ≥ hc each, and the remaining (Np − hc) articles each have a score of Sc(i) ≤ hc.
Another variant is the trend h-index, addressing the fact that the h-index does not take into account
the age of a citation (Sidiropoulos et al., 2007). Articles that continue to be cited along the years
indicate that the topic/solution is still up to date and that the respective scientist can be an influential
mind, who still has an impact on younger scientists. As an article is continually cited, we can also
be in the presence of a trend-setter, i.e., a scientist whose work is, in some way, pioneering and/or
is currently working on something that is considered trendy. Hence, the trend h-index, with γ, δ
and Y(i) as defined in Equation 2.19, can be expressed with basis in the value of:

St(i) = γ · ∑_{x∈C(i)} (Y(now) − Y(x) + 1)^(−δ)    (2.20)
In brief, a researcher has a trend h-index of ht if ht of his/her Np articles have a score of St(i) ≥ ht
each, and the remaining (Np − ht) articles each have a score of St(i) ≤ ht.
There is also the normalized h-index, which mitigates the fact that scientists from different research
areas do not publish the same number of articles, providing a fairer h-index metric (Sidiropoulos
et al., 2007). A researcher has a normalized h-index of hn = h/Np if h of its Np articles have
received at least h citations each, and the remaining (Np − h) articles have received no more than
h citations.
Recent work developed by Devezas et al. (2011) applied the h-index to the task of ranking web
blogs. Analogously to Bibliometrics, blogs can be seen as the authors and the posts as the papers
published by them. Therefore, a blog has an index h if h of its N posts have at least h inlinks each
and the remaining (N − h) posts have no more than h inlinks each. The h-index turned out to be
a more balanced metric, compared to the use of the indegree, for assessing the importance of a blog.
ii. The g-index: This index is an improvement over the h-index, measuring the global citation performance
of a list of articles (Egghe, 2006). A set of papers has a g-index of g if g is the highest
unique rank such that the top g papers have, together, at least g² citations. This requires the list
of articles to be sorted in decreasing order of the number of citations received by each article, and
implies that the top g + 1 papers have, together, fewer than (g + 1)² citations. Thus, with α > 2 denoting the Lotkaian exponent
(Lotka, 1926) and with T denoting the total number of sources, i.e., articles, the g-index can be
mathematically expressed as follows:
g = ( (α − 1) / (α − 2) )^((α−1)/α) · T^(1/α) ,  with α = 1 + ln(growth rate of sources) / ln(growth rate of items)    (2.21)
In the formula, the sources are the scientific articles and the items are the citations between those
articles (Egghe, 2009).
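A sketch of the g-index computation, scanning the citation counts in decreasing order and keeping the largest rank g whose cumulative citations reach g² (here g is capped at the number of papers):

```python
def g_index(citations):
    """Largest g such that the g most-cited papers together have >= g^2 citations."""
    cits = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(cits, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

# [10, 6, 2, 1]: cumulative sums 10, 16, 18, 19 reach 1, 4, 9 and 16, so g = 4
# (whereas the h-index of the same list is only 2).
```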
iii. The a-index: The a-index is a derived index, dependent on the h-index, whose value typically
ranges between 3 and 5, and which helps us to better understand the relation between the total
number of citations of a scientist (Nc,tot) and the h-index (Zhang, 2009). The a-index allows us to
describe the magnitude of the hit contributions of individual scientists and is defined as follows
(Sidiropoulos et al., 2007):

Nc,tot = a·h²    (2.22)

This index can be used as a secondary metric to rank and evaluate scientists, due to the fact that
h² underestimates the Nc,tot of the h most cited papers, which is usually greater than h², and
disregards the papers that have fewer than h citations (Hirsch, 2010).
iv. The e-index: This metric was proposed by Zhang (2009) to address two specific drawbacks from
the original h-index, namely:
• Loss of citation information - excess citations are ignored, making the comparisons based
only on the h-index misleading;
• Low resolution - the h-index is composed of natural numbers, instead of real numbers, hence
confining the results to a relatively narrow range.
The e-index can formally be defined as follows:

e² = ∑_{j=1..h} citj − h²    (2.23)

In the formula, citj is the number of citations received by the j-th paper and the e² value is
expressed as a real number. This index is also related to the aforementioned a-index in the following
way:

a = h + e²/h    (2.24)
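A sketch combining Equations 2.23 and 2.24, reusing the h-index definition given earlier in this section:

```python
def e_and_a_index(citations):
    """e^2 = (citations of the h-core) - h^2 (Eq. 2.23); a = h + e^2/h (Eq. 2.24)."""
    cits = sorted(citations, reverse=True)
    h = 0
    while h < len(cits) and cits[h] >= h + 1:
        h += 1
    e2 = sum(cits[:h]) - h * h        # excess citations ignored by the h-index
    a = h + e2 / h if h else 0.0
    return h, e2, a

h, e2, a = e_and_a_index([10, 8, 5, 4, 3])   # h = 4, e^2 = 27 - 16 = 11, a = 6.75
```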
v. The ISI Impact Factor: This index measures the popularity of a journal in a specific year. It is
defined as the mean number of citations that occurred in the specified year to articles that were
published in the journal during the prior two years (Bollen et al., 2006).

IF(vi, t) = ∑_j c(vj, vi, t) / n(vi)    (2.25)
In the formula, c(vj, vi, t) is the number of citations from journal vj to journal vi in year t, and n(vi)
corresponds to the number of publications in journal vi during the two years prior to t, which
normalizes the resulting citation count into a mean 2-year citation rate (Bollen et al., 2006).
vi. The Y-Factor: Because there can be discrepancies between the values of the ISI Impact Factor
and of the Weighted PageRank, introduced in Section 3.2.1 (i.e., a journal may have a high ISI
Impact Factor, but a low Weighted PageRank value), this measure results from the multiplication
of both these values. The Y-factor of journal vj can be mathematically expressed as follows:

Y(vj) = ISI IF(vj) × PRw(vj)    (2.26)
When assessing the authority of an individual, it is important to use these measures not only individually,
but also in combination, since scientific impact can be seen as a multi-dimensional construct (Bollen
et al., 2009).
2.9 Unsupervised Rank Aggregation Approaches
Given that each metric introduced in the previous section produces an ordering for the nodes in a graph,
we can leverage rank aggregation methods from social choice theory (i.e., voting protocols) to combine
the individual rankings.
In the realm of voting protocols, we consider that there are voters who submit votes over their favorite
alternatives, i.e., the candidates. Determining the winner, or the best ordering of candidates, requires
the aggregation of the rankings of all voters. This process depends on the voting rule that is used, and
it can be defined as follows: let C be the set of candidates, R(C) the set of all possible rankings of the
candidates, and n the number of voters. A voting rule is a mapping from R(C)^n to C, if one wishes to
produce a winner, and from R(C)^n to R(C), if one wishes to produce an aggregate ranking (Conitzer,
2006a). The most common voting rules are as follows:
i. Scoring Rules - Borda Rule, Plurality Rule and Veto Rule: Let α⃗ = 〈α1, ..., αm〉 be a vector of
integers. For each voter, α1 is the number of points that a candidate gets if the voter ranks him
first, α2 the number of points that candidate gets if the voter ranks him second, and so on.
With the Plurality Rule, candidates are ranked simply in terms of how often voters have ranked
them in first place, thus having a system of scores corresponding to α⃗ = 〈1, 0, ..., 0〉. With this
rule, it is irrelevant how voters rank the candidates below the top candidate.
The Veto Rule is the opposite of the Plurality Rule, because it is based on a system of scores with
α⃗ = 〈1, 1, ..., 1, 0〉, i.e., it only takes into account how often the candidate is not ranked in last place.
As such, each voter vetoes a single candidate and the least vetoed candidate wins the election
(Procaccia et al., 2006).
The Borda Rule is based on a system of scores with α⃗ = 〈m − 1, m − 2, ..., 0〉, which means that a
candidate obtains m − 1 points for the first position in the preference of a voter, m − 2 points for the
second position, and so forth, with m representing the total number of candidates. The candidate
who accumulates the maximum number of points from all voters is the winner (Kiselev, 2008).
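All three scoring rules differ only in the weight vector α⃗, so one small function covers them; the ballots below are a toy preference profile:

```python
def positional_scores(ballots, weights):
    """Score candidates under a positional scoring rule: each ballot is a
    full ranking (best first); weights is the alpha vector."""
    scores = {}
    for ballot in ballots:
        for pos, cand in enumerate(ballot):
            scores[cand] = scores.get(cand, 0) + weights[pos]
    return scores

ballots = [["a", "b", "c"], ["a", "c", "b"], ["b", "c", "a"]]
borda = positional_scores(ballots, [2, 1, 0])      # Borda:     a: 4, b: 3, c: 2
plurality = positional_scores(ballots, [1, 0, 0])  # Plurality: a: 2, b: 1, c: 0
veto = positional_scores(ballots, [1, 1, 0])       # Veto:      three-way tie at 2
```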
ii. Single Transferable Vote (STV): This is a method to calculate the result of an election with the
guarantee of proportional representation, under reasonable conditions, for the sets of voters who
share a set of most preferred candidates (Geller, 2002). Running through m − 1 rounds, the STV
voting rule is based upon three principles:
• Order of preference - the candidates are listed in ordinal preference by the voters (i.e., in
descending order).
• Quota - the number of votes needed for a candidate to win the election must be calculated in
the following way:
q = ⌊ |V| / (e + 1) ⌋ + 1    (2.27)
In the formula, |V | represents the total number of voters and e the number of seats available
in the election (i.e., the number of candidates to elect). In each round, if a candidate gets a
greater number of votes than the quota, that candidate is automatically elected.
• Transfer - When a candidate c is elected and there are still more seats to be filled, the surplus
of the votes from that newly elected candidate must be redistributed to each voters’ next
ranked candidate. The transfer value fc takes into account the quota q and the number of
votes wc that candidate c has. It is computed as follows:

fc = (wc − q) / wc    (2.28)
When, in each round, the top voted candidate does not have enough votes to be elected (i.e.,
the total number of votes is less than the quota), the last placed candidate is eliminated, and
that candidate’s votes are redistributed to the next highest ranked candidate, for each voter
for whom the recently eliminated candidate was the top preference.
The flowchart in Figure 2.3 depicts the steps taken in each round, in order to conclude the election.
It is based on the additional information provided by an online simulation of the Single Transferable
Vote1 system.
Figure 2.3: Flowchart for the Single Transferable Vote rule.
iii. Plurality Rule with Run-Off: This rule proceeds in two rounds. In the first, all candidates are
eliminated, except the ones with the highest plurality scores, i.e., the candidates with the first
and second highest number of votes in the election. Then, as in the STV Rule, all the votes are
transferred to these two selected candidates. The second round, which is called the run-off, is used
to determine the final winner of the election, from the two remaining candidates. All candidates are
ranked according to their Plurality scores, except the top two, whose relative ranking is determined
according to the results of the second round.
iv. Maximin: Letting N(c1, c2) be the number of votes that show a preference for candidate c1 over
candidate c2, the maximin score (also known as the Simpson Score) assigned to a candidate c1 is
as follows:

s(c1) = min_{c2≠c1} N(c1, c2)    (2.29)

In the formula, s(c1) is the worst score of candidate c1 in a pairwise election. As all candidates are
ranked by their scores, the winner of the election is the candidate with the highest maximin score.
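A sketch of the maximin (Simpson) score over full-ranking ballots; N(c, r) is obtained by counting the ballots that place c above r:

```python
def maximin_scores(ballots):
    """s(c) = min over rivals r of N(c, r), the support for c against r."""
    cands = set(ballots[0])

    def n(c, r):
        # Number of ballots ranking c above r (lower index = better rank).
        return sum(1 for b in ballots if b.index(c) < b.index(r))

    return {c: min(n(c, r) for r in cands if r != c) for c in cands}

ballots = [["a", "b", "c"], ["a", "b", "c"], ["b", "c", "a"]]
mm = maximin_scores(ballots)   # a: 2, b: 1, c: 0 -> a wins
```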
v. Copeland : For any two candidates c1 and c2 we simulate a pairwise election, so we can determine
how many voters prefer c1 over c2 and how many prefer c2 over c1 (Xia et al., 2011). All candidates
are ranked by their score, and they gain or lose a Copeland point for, respectively, each election
they win or lose (Conitzer & Sandholm, 2005). If there is a tie, Copeland points are also assigned
to the candidates. Therefore, for a pairwise election between candidates c1 and c2, a score is
assigned according to the following procedure:

1 http://stv.humancube.com/
C(c1, c2) = 1, if N(c1,c2) > N(c2,c1);  1/2, if N(c1,c2) = N(c2,c1);  0, if N(c1,c2) < N(c2,c1)    (2.30)
Then, the Copeland Score of candidate c1 is given by:

s(c1) = ∑_{c2≠c1} C(c1, c2)    (2.31)
The candidate who has the highest score wins the election.
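To make the pairwise machinery concrete, the following Python sketch computes both the maximin and the Copeland scores over a small, hypothetical preference profile (the votes and candidate names are made up for illustration):

```python
from itertools import combinations

# Hypothetical preference profile: each vote ranks candidates best-first.
votes = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "c", "a"],
    ["c", "b", "a"],
    ["c", "a", "b"],
]
candidates = ["a", "b", "c"]

def n_pref(c1, c2):
    """N(c1, c2): number of votes ranking c1 above c2."""
    return sum(1 for v in votes if v.index(c1) < v.index(c2))

# Maximin (Simpson) score: a candidate's worst pairwise result.
maximin = {c: min(n_pref(c, d) for d in candidates if d != c) for c in candidates}

# Copeland score: one point per pairwise win, half a point per tie.
copeland = {c: 0.0 for c in candidates}
for c1, c2 in combinations(candidates, 2):
    n12, n21 = n_pref(c1, c2), n_pref(c2, c1)
    if n12 > n21:
        copeland[c1] += 1
    elif n21 > n12:
        copeland[c2] += 1
    else:
        copeland[c1] += 0.5
        copeland[c2] += 0.5
```

In this profile, candidate c wins under both rules, illustrating that the two rules can agree even though they aggregate the pairwise matrix differently.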
vi. Bucklin: The Bucklin Score of a candidate c is the smallest number lc such that more than half of
the voters rank c among the top lc positions, i.e., such that B(lc) > n/2, where B(lc) is the number of
voters ranking c within the top lc positions (Xia et al., 2011). The winner is the candidate with the
lowest Bucklin Score. All candidates are ranked in increasing order of lc and, if there is a tie, B(lc) is
used as a tie-breaker.
vii. Slater : In the Slater voting rule, we choose a ranking of candidates that is inconsistent with the
outcomes of as few pairwise elections as possible (Conitzer, 2006b). An inconsistency corresponds
to a pair of candidates c1 and c2 such that c1 is ranked higher than c2, yet c2 defeats c1 in their
pairwise election. Therefore, the intent of the Slater ranking is to minimize such inconsistencies.
viii. Kemeny : Similarly to the Slater Rule, a ranking is a Kemeny ranking if it minimizes a number
of inconsistencies. However, this rule produces a ranking that aims at minimizing the number of
times the aggregate ranking disagrees with an individual vote on the relative order of two candidates.
Therefore, an inconsistency in the terminology of the Kemeny ranking is defined as follows: given the
aggregate ranking r, a pair of candidates (c1, c2) and a vote ra, we have an inconsistency if r ranks
c1 higher than c2, but ra ranks c2 higher than c1.
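Since a Kemeny ranking is defined by a minimization over all possible rankings, it can be found by exhaustive search for tiny elections, as in the sketch below over a hypothetical three-candidate profile. (Computing a Kemeny ranking is NP-hard in general, so real elections require smarter algorithms.)

```python
from itertools import permutations

# Hypothetical votes: each is a full ranking, best candidate first.
votes = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "a", "c"],
]

def disagreements(ranking, vote):
    """Count candidate pairs ordered differently by `ranking` and `vote`."""
    count = 0
    for i in range(len(ranking)):
        for j in range(i + 1, len(ranking)):
            c1, c2 = ranking[i], ranking[j]  # ranking places c1 above c2
            if vote.index(c2) < vote.index(c1):  # the vote disagrees
                count += 1
    return count

# A Kemeny ranking minimizes the total disagreement with all votes.
kemeny = min(permutations(votes[0]),
             key=lambda r: sum(disagreements(r, v) for v in votes))
```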
ix. Cup and its variants: Cup Rule runs a single-elimination contest to decide which candidate wins
the election. It does not produce a full aggregate ranking of the candidates, and it requires an
additional schedule for matching up the remaining candidates. The rule is defined by a balanced
binary tree T , where each candidate is assigned to a leaf through the aforementioned schedule. To
each of the remaining non-leaf nodes is assigned the winner of the pairwise election of that node’s
children. There is a winner whenever a candidate is assigned to the root node.
As for the Cup Rule's variations, we have the regular cup, in which all voters know which leaf each
candidate is assigned to prior to voting, and the randomized cup, in which the assignment of
candidates to leaves is chosen uniformly at random, after the voting. Votes can also be weighted,
with the weight representing the decision power of a voting agent in a setting where not all agents
are considered equal, e.g., a weight of K counting as K votes of weight 1.
2.10 Supervised Learning for Rank Aggregation
In the previous section we presented unsupervised techniques to perform rank aggregation. Nevertheless,
supervised learning techniques can also address this task: Learning to Rank (L2R) has emerged as a
way of applying machine learning to rank aggregation (Li, 2011).
In L2R, there are two general phases, namely learning and ranking. The learning phase takes training
data as input, corresponding to ranked lists of objects, with each object described by a set of
features (i.e., a set of simple ranking measures that we want to combine). In the ranking phase, given a
new set of objects, one aims at predicting the best possible ranking by combining the available
information. Figure 2.4 illustrates the general framework.
Figure 2.4: Learning-To-Rank (L2R) Framework (adapted from Liu (2009)).
Learning-to-Rank methods can be categorized according to three different types of approaches, namely,
pointwise, pairwise, and listwise (Li, 2011; Liu, 2009).
In the pointwise approach, the ranking problem is transformed into a classification, regression or an
ordinal classification problem. The input space has each object’s feature vector, while the output space
contains the ranking order predicted to each object (Liu, 2009). The loss function is said to be pointwise
because it is defined on a single object’s feature vector (Li, 2011) and inspects the ground truth ranking
order for each single object. The hypothesis space on a pointwise approach contains the functions that
take the feature vector of an object as input and predict the ranking order of that same object (Li, 2011).
In the pairwise approach, the ranking problem is transformed into a pairwise classification problem, i.e.,
one classifies whether a given pair of objects is in the correct ranking order or not. In this approach,
the loss function is pairwise, as it is defined on a pair of feature vectors.
The listwise approach takes ranked lists of objects as instances and, unlike the aforementioned ap-
proaches, it maintains the group structure of the ranked lists. This approach also learns a ranking model
from the given training data, which can later assign scores to feature vectors, and then ranks these
feature vectors using those scores.
One particular supervised listwise ranking method is CRanking (Lebanon & Lafferty, 2002) which applies
the following probabilistic model:
P(\pi \mid \theta, \Sigma) = \frac{1}{Z(\theta, \Sigma)} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_j) \right)   (2.32)
In the formula, π is the final ranking, Σ = (σ1, ..., σk) are the basic rankings being combined, d is the
distance between the two rankings (e.g., Kendall’s τ ) and θ is a weighting parameter. Z is a normalization
factor over all the possible rankings, and can be defined as follows:
Z(\theta, \Sigma) = \sum_{\pi} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_j) \right)   (2.33)
When learning, the algorithm is given S = \{(\Sigma_i, \pi_i)\}_{i=1}^{m} as training data, in order to build a model for rank
aggregation. Maximum Likelihood Estimation is used to learn the model's parameters. Considering that
both the final ranking and the basic rankings are all full ranking lists in the training data, the likelihood
function can be computed as follows:
L(\theta) = \sum_{i=1}^{m} \log \frac{ \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi_i, \sigma_{i,j}) \right) }{ \sum_{\pi \in \Pi} \exp\left( \sum_{j=1}^{k} \theta_j \cdot d(\pi, \sigma_{i,j}) \right) }   (2.34)
For the final step of prediction, the algorithm is given the learned model and the basic rankings Σ. The
probability distribution P(\pi \mid \theta, \Sigma) over final rankings is then calculated and used to compute the
expected rank of each object. Objects are finally sorted according to their expected rank, the latter
being defined as follows:

E(\pi(i) \mid \theta, \Sigma) = \sum_{r=1}^{n} r \cdot P(\pi(i) = r \mid \theta, \Sigma) = \sum_{r=1}^{n} r \cdot \sum_{\pi \in \Pi,\, \pi(i) = r} P(\pi \mid \theta, \Sigma)   (2.35)
2.11 Summary
In this chapter, the fundamental concepts regarding the tasks of characterizing a network and finding
the network’s most influential nodes were introduced. Broader concepts such as prestige, popularity
or recognition were also explored, distinguishing them from what it means to be an influencer. Other related
network analysis topics were introduced, namely information cascades and information diffusion models,
since the most influential nodes in a network have the capacity to disseminate information through the
network at a much faster pace, reaching a greater number of other nodes. Learning-To-Rank and rank
aggregation techniques were also introduced as ways of combining different ranking lists to produce a
single, global and uniform ranking list.
Chapter 3
Related Work
This chapter presents the most important related work in the context of my MSc thesis. The chapter
starts by presenting the HITS algorithm and Google's PageRank algorithm for ranking web
pages, discussing how the latter evolved from its original implementation to more detailed and specific
approaches, such as the Weighted PageRank algorithm and the Topic-Sensitive PageRank algorithm.
Then, the chapter introduces the IP Algorithm, a recent development that extends the benefits of PageRank
and determines the influence and passivity of network nodes based on their capacity to forward
information. In the specific realm of Twitter, we present TwitterRank, an approach to measure the
influence of a Twitter user based on the principle of homophily regarding the topics that users write about.
Finally, we take a deeper look at the work that has been done in Bibliometrics to find influencers
in citation and co-authorship networks, also describing works that take into account the temporal
evolution of graphs.
3.1 The Hyperlinked Induced Topic Search (HITS) Algorithm
The HITS algorithm, a Web page ranking method developed by Kleinberg (1998), is based
on the notion of authorities and hubs. Authorities, i.e., pages with many inlinks,
have a mutually reinforcing relationship with hubs, i.e., pages with outlinks to many related
authorities: a good hub is a page that points to many good authorities, and a good authority
is a page that is pointed to by many good hubs – see Figure 3.5. This relationship is put into use through
the iterative procedure shown in Algorithm 1, which maintains and updates the weights of each page
(Kleinberg, 1998).
Figure 3.5: A graph with hubs and authorities (adapted from Kleinberg (1998)).
Algorithm 1 The Hyperlinked Induced Topic Search (HITS) Algorithm
G: a graph with n interlinked pages
k: a constant corresponding to the number of iterations
z: the vector (1, 1, 1, ..., 1) ∈ R^n
Set x_0 := z
Set y_0 := z
for i = 1, 2, ..., k do
  Apply x_p = \sum_{q: q \to p} y_q to (x_{i-1}, y_{i-1}), obtaining new x-weights x'_i
  Apply y_p = \sum_{q: p \to q} x_q to (x'_i, y_{i-1}), obtaining new y-weights y'_i
  Normalize x'_i, obtaining new authority scores x_i
  Normalize y'_i, obtaining new hub scores y_i
end for
In order to compute the HITS algorithm, the aforementioned Gephi (http://gephi.org/), NetworkX
(http://networkx.lanl.gov/) and Network Workbench (http://nwb.cns.iu.edu/) software packages can be used.
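As an alternative to these packages, the iterative procedure of Algorithm 1 is simple enough to sketch directly. The fragment below runs the mutual-reinforcement updates on a small, hypothetical graph with two hub-like pages (h1, h2) pointing at two authority-like pages (a1, a2):

```python
import math

# Toy link structure: node -> set of nodes it points to (hypothetical graph).
links = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "a1": set(), "a2": set()}
nodes = list(links)

auth = {n: 1.0 for n in nodes}
hub = {n: 1.0 for n in nodes}

for _ in range(20):  # k iterations of the mutual-reinforcement update
    # Authority score: sum of the hub scores of the pages pointing to the node.
    auth = {p: sum(hub[q] for q in nodes if p in links[q]) for p in nodes}
    # Hub score: sum of the authority scores of the pages the node points to.
    hub = {p: sum(auth[q] for q in links[p]) for p in nodes}
    # Normalize so that the squared scores sum to one.
    for d in (auth, hub):
        norm = math.sqrt(sum(v * v for v in d.values())) or 1.0
        for n in d:
            d[n] /= norm
```

NetworkX also offers a ready-made `hits` function for graphs built with its API.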
3.2 The PageRank algorithm and its Variants
The PageRank algorithm arose in the context of the development of Google’s search engine, at the
time described as a prototype of a large-scale search engine that made heavy use of the hyperlinked
structure of the web (Brin & Page, 1998).
PageRank is based on principles from academic citation analysis, applied to the web. It can be mathe-
matically expressed as follows:
PR(A) = \frac{1-d}{N} + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)}   (3.36)
A page A has pages T1, ..., Tn that point to it (i.e., that cite page A), and C(Ti) is the number
of outlinks of page Ti. The term N corresponds to the total number of pages in
the network. The free parameter d is called the damping factor and is usually set to 0.85. In a
random web surfer scenario, the surfer restarts his search with probability 1 − d, by jumping to a page
chosen uniformly at random, instead of following a random link, which he does with probability d
(Chen et al., 2007). Figure 3.6 depicts the computation of the PageRank score for a three-node network.
Figure 3.6: A graph illustrating the computation of PageRank (adapted from Page et al. (1998)).
From Figure 3.6, one can see that page A has an inlink from page C and two outlinks, to pages
B and C. Therefore, page A splits its PageRank score of 0.4 between its two outlinks, equally
transferring a value of 0.2 to pages B and C. In its turn, page B has a PageRank score of 0.2, which A
transferred to it. Because B only has an outlink to C, it entirely transfers its PageRank score to
page C. Finally, page C, which receives PageRank scores of 0.2 from A and B, accumulates a PageRank
score of 0.4, which it entirely transfers to its only outlink, page A.
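The score passing just described can be reproduced numerically. The sketch below iterates the update of Equation 3.36 on the three-node graph of Figure 3.6, with d set to 1 so that scores are passed along links without teleportation, converging to the values 0.4, 0.2 and 0.4 discussed above:

```python
# The three-node example from Figure 3.6: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
nodes = list(links)
d = 1.0  # damping disabled here, to mirror the pure score passing of the figure
pr = {n: 1.0 / len(nodes) for n in nodes}  # uniform initial scores

for _ in range(100):  # power iteration of Equation 3.36
    pr = {
        n: (1 - d) / len(nodes)
        + d * sum(pr[m] / len(links[m]) for m in nodes if n in links[m])
        for n in nodes
    }
```

With the usual d = 0.85 the teleportation term changes the exact values; NetworkX provides an equivalent `pagerank` function for graphs built with its API.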
A page can achieve a high PageRank score if it has many other pages pointing to it, i.e., if it is highly
cited, or if some of the pages that point to it have themselves a high PageRank score.
Even though PageRank works over networks originally corresponding to directed graphs, the works
of Perra & Fortunato (2008) and of Mihalcea (2004) revealed that PageRank can also be applied to
undirected graphs, hence having vertices with equal indegrees and outdegrees.
In the realm of Bibliometrics, PageRank is used as a complementary method to citation analysis, since
it mitigates citation counting's drawback of ignoring the importance of the citing papers: PageRank
allows us to identify publications that are being referenced by highly cited articles (Ding et al., 2009).
Authors such as Chen et al. (2007) suggested setting d = 0.5, based on the hypothesis that, in the
context of citation networks, the entries in the reference list of a typical paper are collected by following
citation chains of average length 2. Their justification is the empirical observation that about 50%
of the articles in the reference list of a paper A have at least one citation B → C in which the
article C is also part of A's reference list. Thus, the authors assume there is a feed-forward loop
among A, B and C, such that A → B, B → C and, consequently, A → C.
Due to its probabilistic nature, and also to the fact that each node is guaranteed to be visited, PageRank
scores are not comparable across different graphs. To mitigate this, Berberich et al. (2006) proposed a
normalization of the PageRank scores, which eliminates any dependency on the size of the graph.
The normalized PageRank score can be computed as follows:
PR'(v) = \frac{PR(v)}{\frac{1}{|V|}\left((1-d) + d \sum_{u \in D} PR(u)\right)}   (3.37)

In the formula, the denominator represents the lower bound for Equation 3.36, |V| is the total
number of vertices in the graph, and D ⊆ V is the set of dangling nodes.
Alternatively to the random surfer model, and specifically for social phenomena such as epidemics or
word-of-mouth recommendation, Ghosh et al. (2011) proposed a broadcast-based non-conservative
diffusion model, since these phenomena can be modeled as contact processes, in which
an active (infected) node activates its neighbours, via broadcast, with some probability. The difference
between this model and the random surfer model is that, while the latter conserves the amount
of substance being diffused on the network, the former is non-conservative, in the sense that the
amount of information changes as it spreads from an individual to his neighbours. Ghosh et al. (2011) state that
PageRank is a steady state solution of conservative diffusion and, therefore, a conservative metric, while
Alpha-Centrality, a non-conservative metric, which measures the total number of paths from a node ex-
ponentially attenuated by their length, is a steady state solution of linear non-conservative diffusion. In
their study, the authors propose an efficient algorithm for computing the Alpha-Centrality.
To compute the PageRank algorithm, we can use some readily available open-source software
libraries, such as the aforementioned Gephi, NetworkX and Network Workbench packages, or the
LAW Webgraph (http://webgraph.dsi.unimi.it/) Java library for large-scale web graph analysis (Boldi & Vigna, 2004).
3.2.1 Weighted PageRank
In the original PageRank algorithm from Equation 3.36, we have no notion of hyperlink weight, and thus
all hyperlinks express the same degree of relationship between the pages they link (Bollen et al., 2006).
However, in many practical applications, we have that not all links express the same type of relationship.
Acknowledging that some links in a web page may be more important than others, Xing & Ghorbani
(2004) proposed a Weighted PageRank algorithm that assigns higher scores to more important links,
instead of the traditional even division among the outlinks of a page. Each link is assigned with a value
that is proportional to the popularity of the destination node, i.e., proportional to its number of inlinks and
outlinks.
In this approach, there is an inlink weight W^{in}_{(v,u)} and an outlink weight W^{out}_{(v,u)}. The inlink weight of link
(v, u) is based on the number of inlinks of page u and the number of inlinks of all the pages that are
referenced by page v. The outlink weight is analogous. They are calculated as follows:

W^{in}_{(v,u)} = \frac{I_u}{\sum_{p \in R(v)} I_p} \qquad W^{out}_{(v,u)} = \frac{O_u}{\sum_{p \in R(v)} O_p}   (3.38)
In the formulas, Iu and Ip represent, respectively, the number of inlinks of pages u and p, while Ou
and Op, represent the number of outlinks of pages u and p. R(v) is the set of outlinks from page v.
Considering the introduction of these two weights in the computation of a Weighted PageRank algorithm,
the latter can be mathematically expressed as follows:
PR(u) = (1-d) + d \sum_{v \in B(u)} PR(v) \, W^{in}_{(v,u)} \, W^{out}_{(v,u)}   (3.39)
The studies conducted within the work of Xing and Ghorbani revealed that their Weighted PageRank
algorithm has a better performance than the original PageRank.
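The two link weights of Equation 3.38 can be computed from the link structure alone. A minimal sketch over a hypothetical three-page graph:

```python
# Hypothetical link structure: page -> list of pages it links to.
out = {"v": ["u", "w"], "u": ["w"], "w": ["v"]}
nodes = list(out)

def inlinks(p):
    """Number of pages linking to page p."""
    return sum(1 for q in nodes if p in out[q])

def w_in(v, u):
    """Inlink weight of link (v, u): I_u over the inlinks of all pages v references."""
    return inlinks(u) / sum(inlinks(p) for p in out[v])

def w_out(v, u):
    """Outlink weight of link (v, u): O_u over the outlinks of all pages v references."""
    return len(out[u]) / sum(len(out[p]) for p in out[v])
```

Note that, by construction, the inlink weights of a page's outgoing links sum to one, so the weighting replaces the even division of the original PageRank rather than injecting extra mass.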
Fiala et al. (2008) also proposed modifications to the original PageRank algorithm, enabling its
application to bibliographic networks. The authors take citation and co-authorship information into
account: each edge (u, v) ∈ E, where E is the set of edges between the vertices of the graph (with
nodes corresponding to the authors of the papers), is associated with weights w_{u,v}, c_{u,v} and b_{u,v}.
The value w_{u,v} is the number of citations from author u to author v, c_{u,v} is the number of
common publications by u and v, and b_{u,v} can assume different values, depending on the semantics
of the edge weights that we want to stress. The new ranking for authors is defined as follows:
R(u) = \frac{1-d}{|A|} + d \sum_{(v,u)\in E} R(v)\, \frac{\frac{w_{v,u}}{c_{v,u}+1}\cdot\frac{b_{v,u}+1}{\sum_{(v,j)\in E} w_{v,j}}}{\sum_{(v,k)\in E}\frac{w_{v,k}}{c_{v,k}+1}\cdot\frac{b_{v,k}+1}{\sum_{(v,j)\in E} w_{v,j}}}   (3.40)
In the formula, |A| is the number of vertices (e.g., the number of authors of the papers) and d is a
damping factor, empirically set to d = 0.9. In this approach, a plain Weighted PageRank algorithm is
obtained if, in Equation 3.40, the coefficients b and c are set to zero.
Bollen et al. (2006), when applying the Weighted PageRank algorithm to journal citation networks, took
journal citation frequencies into account in the transfer of PageRank values, so that the prestige of a
journal can be accordingly transferred along the iterations of the algorithm. They referred to this
transferred value as the Propagation Proportion between journals and defined it as follows:
w(v_j, v_i) = \frac{W(v_j, v_i)}{\sum_k W(v_j, v_k)}   (3.41)
In the formula, W(v_j, v_i) is the weight of the link between journals v_j and v_i, normalized by the weights
of journal v_j's outlinks. In the application of the Weighted PageRank algorithm described by Bollen et al.
(2006), the number of outlinks C(T_i) from Equation 3.36 has been replaced with the Propagation
Proportion, resulting in the following equation:
PR_w(v_i) = \frac{1-d}{N} + d \sum_j PR_w(v_j) \times w(v_j, v_i)   (3.42)
On the other hand, within the work of Yan & Ding (2011), citation counts are incorporated into the
network topology, resulting in the following integrated Weighted PageRank algorithm, where the sum in
the second term ranges over the k nodes citing p_i:

PR_w(p_i) = (1-d)\,\frac{CC(p_i)}{\sum_{j=1}^{N} CC(p_j)} + d \sum_{j=1}^{k} \frac{PR_w(p_j)}{C(p_j)}   (3.43)
In the formula, CC(p_i) represents the number of citations pointing to an author p_i, \sum_{j=1}^{N} CC(p_j) is the sum
of the citation counts for all the nodes in the network, and the (1-d) term, as in previous PageRank
definitions, ensures that the results sum up to one. Yan & Ding (2011) pointed out two extreme
scenarios regarding the variation of d. If d = 0, then each node's score equals CC(p_i)/\sum_{j=1}^{N} CC(p_j), i.e.,
its normalized citation count. Also, in accordance with Boldi et al. (2005), when d → 1
PageRank becomes unstable and its convergence rate slows.
3.2.2 Topic-Sensitive PageRank
The link-structure of the Web is used in the original PageRank algorithm to pre-compute topic-independent
scores that reflect the importance of web pages. The pre-computed importance scores can afterwards
be combined with other Information Retrieval scores, e.g., term frequency, to produce a ranking of the
pages towards specific user queries (Brin & Page, 1998).
Haveliwala (2002) proposed a Topic-Sensitive PageRank algorithm, in which one computes offline a set
of PageRank vectors, biased towards a set of representative basis topics from the Open
Directory Project (http://www.dmoz.org/). For each page, and regarding the considered set of topics, a
set of importance scores is created and, at query time, the similarity of the query and/or user context
towards each topic is calculated. To achieve the final ranking, one linearly combines the topic-sensitive
vectors, weighted by the similarity of the query towards the topics.

The mathematical approach to Topic-Sensitive PageRank is as follows. Let q be the query and q' its
respective context in page u; we may have a search in context (i.e., the user is viewing a document and
selects a term from it, in order to get more information about the selected term). The context q' consists
of all terms in u if we have a search in context, and otherwise q' consists only of the query q. For each
topic c_j, the following quantity is computed:
P(c_j \mid q') = \frac{P(c_j) \cdot P(q' \mid c_j)}{P(q')} \propto P(c_j) \cdot \prod_i P(q'_i \mid c_j)   (3.44)
In the formula, P(q'_i \mid c_j) can be computed from the class term-vector D_j, which consists of the terms of
the documents below each of the 16 top-level categories of the Open Directory Project (ODP). Finally, a
composite, query-sensitive importance score s_{qd} is computed as follows:

s_{qd} = \sum_j P(c_j \mid q') \cdot r_{jd}   (3.45)
In the formula, r_{jd} is the rank of document d, given the PageRank vector PR(\alpha, v_j), for topic c_j. In its
turn, PR(\alpha, v_j) has as parameters a bias factor \alpha and the non-uniform damping vector v_j, with T_j being
the set of URLs in the ODP category c_j:

v_{ji} = \begin{cases} \frac{1}{|T_j|}, & i \in T_j \\ 0, & i \notin T_j \end{cases}   (3.46)
The bias factor, similarly to PageRank ’s damping factor, can influence the biasing degree of the resulting
vector towards the topic vector that was used. This bias was heuristically set to α = 0.25 by the authors.
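Once the topic posteriors and the per-topic ranks are available, the composite scoring of Equation 3.45 reduces to a weighted sum. A small sketch with hypothetical topic names, posteriors and per-topic scores:

```python
# Hypothetical query-topic posteriors P(c_j | q') and per-topic scores r_jd.
topic_prob = {"sports": 0.7, "science": 0.3}
rank = {
    "sports": {"d1": 0.5, "d2": 0.1},
    "science": {"d1": 0.2, "d2": 0.6},
}

# Composite score (Equation 3.45): topic-sensitive vectors linearly combined,
# weighted by the similarity of the query towards each topic.
score = {
    d: sum(topic_prob[c] * rank[c][d] for c in topic_prob)
    for d in rank["sports"]
}
```

Here the query leans towards "sports", so d1 outranks d2 even though d2 scores higher under "science".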
3.2.3 TwitterRank
In the context of Twitter, the popular microblogging service, there is often the need to determine which
are the influential users.
From the work of Weng et al. (2010) arose TwitterRank, an extension of the PageRank algorithm that
takes both the topic similarity between users and the link structure of the social network into account.
However, the influence of a user may vary in different topics, since a Twitter user can have interests or
expertise in many distinct areas.
In the same way that in Bibliometrics we have that citation count is the simplest method to assess the
influence of an author in an author-publication network, we have that, on Twitter, the follower count, i.e.,
the total number of people who are following a particular user, has been interpreted as a good indicator
of influence. Nevertheless, Weng et al. (2010) observed that 72.4% of the users follow more than 80%
of their followers, and that 80.5% of the users have 80% of their friends (i.e., twitterers whose updates
they follow) following them back. This admits two explanations: either the act of following is so casual
that a twitterer randomly follows other twitterers, who politely just follow back, or the following
relationship reflects a strong similarity among users, rooted in a shared interest in the topics the
twitterers tweet about. The latter denotes the homophily phenomenon.
The general framework proposed for TwitterRank is depicted in Figure 3.7. First, in the topic distillation
phase, the topics twitterers are interested in are extracted based on what they tweet about. Then,
a topic-specific relationship network is built from the previously gathered topics. Finally, the
TwitterRank algorithm is applied to measure the topic-sensitive influence of a twitterer, taking into
account both the distilled topics and the structure of the topic-specific relationship network. Top topics
are identified in order of their probability of presence, as captured in a matrix WT of W unique words in
tweets and T topics, where each entry WT_{it} holds the number of times the unique word w_i has been
assigned to topic t.
Figure 3.7: The general TwitterRank framework (adapted from Weng et al. (2010)).
This approach addresses two important shortcomings of PageRank, namely the fact that it does not
take into account (i) the interests of the nodes of the network, and (ii) the indegree associated with the
follower count in Twitter.
To mathematically describe the topic-specific TwitterRank algorithm, we can see the Twitter network
as a directed graph D(V, E), where the vertices V are the twitterers and the edges E are the following
connections between twitterers, directed from follower to friend. In a random surfer scenario, the surfer
visits each twitterer with a certain topic-specific probability, by following the appropriate edge in D. The
transition matrix for topic t, P_t, from follower s_i to friend s_j, is defined as follows, where |\tau_j| is the
number of tweets published by s_j and the denominator sums the number of tweets published by all of
s_i's friends:

P_t(i, j) = \frac{|\tau_j|}{\sum_{a:\, s_i\ \mathrm{follows}\ s_a} |\tau_a|} \times sim_t(i, j)   (3.47)
The similarity between s_i and s_j in topic t, denoted by sim_t(i, j), is defined as follows:

sim_t(i, j) = 1 - |DT'_{it} - DT'_{jt}|   (3.48)
In the formula, DT' is the row-normalized form of matrix DT, with D being the twitterers and T the topics.
In DT', each row is the probability distribution of twitterer s_i's interest over the T topics. Thus, the
similarity between s_i and s_j in topic t is inversely related to the difference between the probabilities
that each is interested in topic t. The higher their similarity, the higher the transition probability from s_i
to s_j.
There is also the possibility of some twitterers following one another in such a cyclic way that
they do not follow anyone outside that particular circle of following relations, which can lead to an
accumulation of high influence that is never redistributed. To account for this situation, Weng et al. (2010)
introduced a teleportation vector E_t that captures the probability that a random surfer jumps to
some twitterer instead of following the edges of graph D. The teleportation vector is defined as follows:
E_t = DT''_{\cdot t}   (3.49)

In the formula, DT''_{\cdot t} is the t-th column of DT'', the column-normalized form of matrix DT, the latter
being part of the results from the topic distillation phase. Each entry of DT contains the number of times
the words in a twitterer's tweets have been assigned to a specific topic.
Thus, the topic-specific TwitterRank can be calculated as follows:

\overrightarrow{TR}_t = \gamma P_t \times \overrightarrow{TR}_t + (1 - \gamma) E_t   (3.50)

In the formula, \gamma is a parameter analogous to PageRank's damping factor: the surfer follows the edges
of the topic-specific network with probability \gamma and teleports with probability 1 - \gamma. Its value can range
from 0 to 1 and is usually set to \gamma = 0.85.
Equation 3.50 gives the topic-specific TwitterRank vectors that are generated. However, these vectors
only refer to a twitterer's influence in individual topics. To measure the overall influence of a twitterer
across different topics, we need to compute the aggregated TwitterRank vector as follows:

\overrightarrow{TR} = \sum_t r_t \cdot \overrightarrow{TR}_t   (3.51)

In the formula, \overrightarrow{TR}_t is the TwitterRank vector for topic t, and r_t is the weight assigned to topic t and
associated with \overrightarrow{TR}_t.
Weng et al. (2010) observed that the most active twitterers are not necessarily the most influential in
each topic. Also, and due to the consideration of the topical dimension, there is a higher correlation be-
tween TwitterRank and the Topic-Sensitive PageRank (Section 3.2.2) than with the indegree or with the
original PageRank algorithm. The experiments conducted by Weng et al. (2010), which used a Twitter
dataset with messages from Singapore-based twitterers, collected in April 2009, showed that Twitter-
Rank outperforms other related algorithms, including both PageRank and the algorithm that Twitter was
using by the time of their study.
3.3 The Influence-Passivity (IP) Algorithm
Romero et al. (2011) came to the conclusion that, for a user to be considered influential, he must not
only be popular and get attention from his peers, but also overcome passivity,
a state in which a user receives information but does not propagate it through the network. Thus, this
approach determines both the influence and the passivity of a user, based on his information-forwarding
activity.
The algorithm proposed by Romero et al. (2011) is similar to HITS and to PageRank. However, the dif-
ference in this approach is that the diffusion behaviour among the users is also taken into consideration.
This work was conducted on Twitter and assigns to every user both a passivity score and an influence
score, which respectively correspond to the authority and hub scores in the HITS algorithm. The use
of passivity in the algorithm comes from the evidence that users in Twitter are generally passive and
thus, when determining the influence of a user, taking into account the passivity of all the people that
are influenced by him is also very important. The following assumptions are considered by the authors:
1. The influence score of a user depends on the number of people he influences, as well as on their
passivity.
2. The influence score of a user depends on how dedicated the people that he influences are. This
dedication is measured by the amount of attention a user pays to some other user, as compared
to everyone else.
3. The passivity score of a user depends on the influence of those who he is exposed to, but not
influenced by.
4. The passivity score of a user depends on how much he rejects some other user’s influence, com-
pared to everyone else’s influence.
Given these assumptions, one should note that the network graph for this algorithm is a weighted graph
G = (N,E,W ) with N nodes, E edges and W edge weights, where weight wij represents the ratio of
influence that node i has over node j to the total influence that i attempted to have over j. The output
of the IP Algorithm is a function I : N → [0, 1] and a function P : N → [0, 1], which represent each
node’s relative influence and passivity, respectively. For each edge e = (i, j) ∈ E, the authors defined
an acceptance rate that represents the amount of influence accepted by j from all users in the network
and that, thus, can reflect the loyalty user j has to user i. The acceptance rate is defined as follows:
u_{ij} = \frac{w_{ij}}{\sum_{k:(k,j)\in E} w_{kj}}   (3.52)
There is also a rejection rate, the counterpart of the acceptance rate, since 1 - w_{ji} is the amount
of influence that user i rejects from user j. Thus, the rejection rate v_{ji} is the influence that user i
rejected from user j, normalized by the total influence rejected from j by all other users in the network.
It is mathematically expressed as follows:

v_{ji} = \frac{1 - w_{ji}}{\sum_{k:(j,k)\in E} (1 - w_{jk})}   (3.53)
The IP Algorithm is thus based on two operations that relate directly to the aforementioned assumptions.
The operation Ii is related to a user’s influence and is as follows:
I_i \leftarrow \sum_{j:(i,j)\in E} u_{ij} P_j   (3.54)
In the formula, the term Pj corresponds to the passivity referred in Assumption 1, and the term uij to
the amount of dedication referred to in Assumption 2. As for operation Pi, it relates to a user’s passivity
and is as follows:
P_i \leftarrow \sum_{j:(j,i)\in E} v_{ji} I_j   (3.55)
In the formula, the term Ij corresponds to the influence referred in Assumption 3, and vji to the rejection
rate referred in Assumption 4.
The algorithm takes as input a weighted graph and computes the IP scores for each node in m iterations,
as depicted in the pseudo-code of Algorithm 2.
Algorithm 2 The Influence-Passivity (IP) Algorithm
G(N, E, W): an influence graph with N nodes, E edges and W edge weights
I_0 ← (1, 1, ..., 1) ∈ R^{|N|}
P_0 ← (1, 1, ..., 1) ∈ R^{|N|}
for i = 1 → m do
  Update P_i using operation P_i ← \sum_{j:(j,i)\in E} v_{ji} I_j and the values I_{i-1}
  Update I_i using operation I_i ← \sum_{j:(i,j)\in E} u_{ij} P_j and the values P_i
  for j = 1 → |N| do
    I_j = I_j / \sum_{k\in N} I_k
    P_j = P_j / \sum_{k\in N} P_k
  end for
end for
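Algorithm 2 can be sketched directly from Equations 3.52-3.55. The fragment below runs the IP updates on a tiny, hypothetical influence graph whose edge weights are made up for illustration:

```python
# Hypothetical weighted influence graph: w[(i, j)] is the ratio of the
# influence node i had over node j to the influence i attempted to have.
w = {("a", "b"): 0.8, ("a", "c"): 0.6, ("b", "c"): 0.2}
nodes = {"a", "b", "c"}

def u(i, j):
    """Acceptance rate (Equation 3.52)."""
    return w[(i, j)] / sum(wt for (k, m), wt in w.items() if m == j)

def v(j, i):
    """Rejection rate (Equation 3.53)."""
    return (1 - w[(j, i)]) / sum(1 - wt for (m, k), wt in w.items() if m == j)

I = {n: 1.0 for n in nodes}
P = {n: 1.0 for n in nodes}

for _ in range(50):  # m iterations of Algorithm 2
    P = {n: sum(v(j, n) * I[j] for (j, m) in w if m == n) for n in nodes}
    I = {n: sum(u(n, j) * P[j] for (m, j) in w if m == n) for n in nodes}
    # Normalize so that the scores sum to one across the network.
    I = {n: s / sum(I.values()) for n, s in I.items()}
    P = {n: s / sum(P.values()) for n, s in P.items()}
```

In this toy graph, node a, which forwards information to both others, ends up with the highest influence, while node c, which forwards nothing, gets an influence of zero.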
The authors also concluded that there is a weak correlation between popularity and influence. The IP
Algorithm turned out to provide better indicators of popularity than PageRank.
3.4 Citation and Co-Authorship Networks
In Bibliometrics, there are two classes of ranking algorithms. In the class of collection-based ranking
algorithms, a weighted graph is used whose nodes correspond to collections, e.g., journals and
conference proceedings, with the weighted edges representing the total number of citations that point
from one collection to the other. The other class corresponds to publication-based ranking algorithms,
where the nodes of the citation graph are individual publications and the edges represent citations
between papers (Sidiropoulos & Manolopoulos, 2005).
Both PageRank (Brin & Page, 1998) and HITS (Kleinberg, 1998) are part of the second class of ranking
algorithms, while the ISI Impact Factor (Bollen et al., 2006) is part of the first class.
Neither PageRank nor HITS is perfectly suitable for bibliometrics: HITS, because a publication only gets a high authority score if there are good hubs pointing to it; PageRank, because it was designed so that a node's score is mostly affected by the scores of the nodes that point to it, and less by the number of incoming links. Following this assessment, Sidiropoulos & Manolopoulos (2005) introduced SCEAS Rank, a collection-based ranking algorithm in which scores are computed over a weighted graph whose nodes correspond to collections. SCEAS can be defined as follows:
$S_j = \sum_{i \to j} \frac{S_i + b}{N_i} \, a^{-1}, \quad (a \ge 1,\ b > 0)$ (3.56)
In the formula, Ni is the number of outgoing citations of node i, b is the direct citation enforcement factor, used so that citations from zero-scored nodes can also contribute to the score of the publications they cite, and a denotes the speed at which indirect citation enforcement converges to zero: a change in the score of node i affects the score of a node j that is x nodes away by a factor of a^{-x}. The SCEAS approach also has the following advantages over the PageRank and HITS algorithms:
1. A node’s score is affected by the number of incoming citations.
2. The algorithm's computation converges very fast. In the experiment conducted by Sidiropoulos & Manolopoulos (2005) on a DBLP dataset, SCEAS needed half the time required by PageRank, and about 1/10 of the time required by HITS.
3. A node’s score is less affected by the score of distant nodes and, whenever new nodes and ci-
tations are added to the network, the new score’s computation can be performed incrementally,
using the previous score vector as the input vector for the computation.
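As a sketch of how Equation 3.56 can be iterated to a fixed point, consider the following Python fragment. The dict-based representation of the citation graph and the parameter values are illustrative assumptions, not the authors' implementation.

```python
# Sketch of iterating the SCEAS score (Equation 3.56) to a fixed point.
# The citation graph is given as {node: list of nodes it cites}.

def sceas(cites, a=2.0, b=1.0, iters=50):
    nodes = set(cites)
    for out in cites.values():
        nodes.update(out)
    S = {n: 0.0 for n in nodes}
    for _ in range(iters):
        S_new = {n: 0.0 for n in nodes}
        for i, out in cites.items():
            if not out:
                continue
            # Each cited paper receives (S_i + b) / N_i * a^-1 from citer i
            share = (S[i] + b) / (len(out) * a)
            for j in out:
                S_new[j] += share
        S = S_new
    return S
```

Because every contribution is damped by a^{-1} at each hop, the influence of distant nodes decays as a^{-x} and the iteration converges quickly, matching advantage 2 above.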
Specifically for co-authorship networks, where the graph nodes represent authors and edges represent ties between two authors, Liu et al. (2005) proposed AuthorRank, a modification of the PageRank algorithm that is computed over a weighted, directed co-authorship graph.
The co-authorship graph is directed and weighted in order to express the magnitude of the relationship between two authors and, as in Weighted PageRank, is represented by G = (V, E, W), with a set V of authors, a set E of co-authorship relationships, and a set W of normalized weights wij connecting authors vi and vj. The normalized weights wij are such that the outgoing weights of an author sum to one, and they are computed as follows:
$w_{ij} = \frac{c_{ij}}{\sum_{k=1}^{n} c_{ik}}$ (3.57)
In the formula, cij and cik correspond to the co-authorship frequency (Equation 3.58), which is also
correlated with exclusivity.
The idea behind co-authorship frequency is to assign more weight to authors that co-publish more
papers together, and do so exclusively (Liu et al., 2005). For a set of m articles, co-authorship frequency
is defined as follows:
$c_{ij} = \sum_{k=1}^{m} g_{i,j,k}$ (3.58)
In turn, exclusivity, i.e., giving more weight to co-authorship ties in articles with fewer total co-authors than in articles with a large number of co-authors (Liu et al., 2005), is defined, for authors vi and vj who co-author article ak, as follows:
$g_{i,j,k} = \frac{1}{f(a_k) - 1}$ (3.59)
In the formula, f(ak) is the total number of authors of article ak.
The magnitude of the connection between two authors is determined by the following factors:
1. Frequency of co-authorship: Authors that co-author frequently should have a higher co-authorship
weight;
2. Total number of co-authors on articles: Less weight should be assigned to the co-author relationship if the article has many authors.
Therefore, the AuthorRank of author i is expressed as follows:
$AR(i) = (1 - d) + d \sum_{j=1}^{n} AR(j) \times w_{j,i}$ (3.60)
In the formula above, AR(j) is the AuthorRank score of the backlinking node j and wj,i corresponds to
the weight of the edge between node j and node i.
Also, when exclusivity and collaboration frequency are taken into account, one can assess that some
ties are more prestigious than others.
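Equations 3.57 to 3.59 can be combined into a short sketch that derives the normalised co-authorship weights from a list of papers. The input format, each paper as a list of author names, is an assumption made only for illustration.

```python
# Sketch deriving the AuthorRank edge weights (Equations 3.57-3.59) from a
# list of papers, each given as a list of author names (assumed input format).
from collections import defaultdict

def coauthor_weights(papers):
    c = defaultdict(float)                       # co-authorship frequency c_ij
    for authors in papers:
        if len(authors) < 2:
            continue
        g = 1.0 / (len(authors) - 1)             # exclusivity g_ijk = 1/(f(a_k) - 1)
        for i in authors:
            for j in authors:
                if i != j:
                    c[(i, j)] += g
    out_sum = defaultdict(float)                 # sum_k c_ik per author
    for (i, _j), cij in c.items():
        out_sum[i] += cij
    # w_ij = c_ij / sum_k c_ik: each author's outgoing weights sum to one
    return {(i, j): cij / out_sum[i] for (i, j), cij in c.items()}
```

For instance, two papers, one by A and B alone and one by A, B and C, give the A-B tie three times the weight of the A-C tie, reflecting both frequency and exclusivity.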
3.5 Temporal Issues in Ranking Scientific Articles
Citation networks are generally static networks, since a scientific article cannot lose citations over the years, and since articles do not disappear from the network. On the other hand, social networks are
generally characterized as dynamic networks, which change at a very fast pace, due to new users
that make new connections and former users that leave the social network, breaking the ties they have
established. Still, even in the case of citation networks, new articles are also being constantly introduced.
Therefore, time is a key factor in social network analysis.
Sayyadi and Getoor developed FutureRank, which computes the expected PageRank score of a sci-
entific article, based on the citations it will obtain in the future (Sayyadi & Getoor, 2009). This number
of future citations is referred to as the usefulness of the article, and the authors assumed that recent
articles are more useful. Nevertheless, older and highly cited articles still get a good ranking, due to
being cited by recent articles. The algorithm is computed over a network that has two different types of nodes, namely articles and authors, and that can thus be unfolded into two distinct networks: (i) a citation network connecting articles through citation edges, and (ii) an authorship network connecting articles and authors through co-authorship edges. In the second network, articles can be mapped to the authorities and authors to the hubs of the HITS algorithm. As the two networks share the article nodes, information is passed between them.
In short, FutureRank runs one step of PageRank in the first network, in order to transfer authority from
the articles to their references, and one step of HITS in the second network. These results are repeatedly
combined until convergence is reached. The ranking of articles also involves a personalized PageRank vector, which is pre-computed based on the current time and the publication time of the articles, instead of being based on the number of nodes in the network as in the original PageRank algorithm.
The CiteRank algorithm (Walker et al., 2007) makes use of publication time in order to rank articles,
where each researcher, independently of others, is assumed to start his search with recent articles,
proceeding in a chain of citations until full satisfaction. The output of the algorithm can be seen as an
estimate of traffic to an article, i.e., the probability of encountering an article via a path of any length, and
is correlated to the number of citations in a way that the larger the number of citations, the more likely it
will be for the article to be visited via one of its incoming links. CiteRank is otherwise similar to the PageRank algorithm, except that CiteRank initially distributes random surfers exponentially with age, with probability $\rho_i = e^{-age_i/\tau_{dir}}$, where $age_i$ is the age of the i-th article and $\tau_{dir}$ is the characteristic decay time, thus
favoring recent articles.
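The age-biased starting distribution can be sketched as follows. The value of tau_dir is a free parameter of CiteRank; the default below is an arbitrary illustrative choice.

```python
# Sketch of CiteRank's age-biased starting distribution: surfer starting
# probabilities proportional to exp(-age_i / tau_dir), favouring recent
# articles. tau_dir here is an arbitrary illustrative value.
import math

def initial_distribution(ages, tau_dir=2.0):
    rho = [math.exp(-age / tau_dir) for age in ages]
    total = sum(rho)
    return [r / total for r in rho]      # normalised to sum to one
```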
3.6 Summary
This chapter presented previous work on the task of finding influencers in a network, with its main focus on the PageRank algorithm and the different variants that have arisen over the years. The Influence-Passivity (IP) algorithm was also presented, i.e., a novel approach to influence, based on the HITS and PageRank algorithms, that also takes information diffusion into account. Finally, we glanced at a recent trending research topic concerning temporal issues in ranking scientific articles, specifically the prediction of future PageRank scores in a citation network, based on the future citations that an article may receive.
Chapter 4
Finding Influencers in Social Networks
This chapter presents and details the work that was developed in the context of my MSc thesis. I
focused on studying and developing techniques to identify influential nodes in a network so that,
given a network, one can characterize it and assess which are the nodes that exert more influence
over others, i.e., which are the nodes that induce others to have a particular behavior, e.g., forward a
message or visit a renowned monument or concert venue.
Two distinct experiments were conducted, each with a different type of network. In the first experiment, we collected real and up-to-date data from a location-based social networking service, namely FourSquare, and from Twitter, a social networking and microblogging service, building social networks from the collected data. The network built from FourSquare's data is commonly called a location-based social network, due to its inclusion of information from users' interactions with other users, as well as users' interactions with locations as they check in at different places. The second experiment involved data from DBLP, a digital library containing information about academic publications and their citations, from which a citation network was built.
With the work that was developed, we wanted to test the hypothesis that a network's most influential nodes can be identified through network analysis metrics and algorithms. These techniques were applied to different kinds of social networks, in order to explore influence in distinct contexts. In the experiment with location-based social networks, we wanted to test how good these social network analysis metrics and algorithms are at identifying the most relevant nodes. On the other hand, when experimenting with academic social networks, we wanted to identify the most important papers in the dataset and test whether it was possible to predict the future influence scores of the nodes in the network, based on their previous influence scores.
The remainder of this chapter is organized as follows: first, we introduce the main software packages that were used and extended in the course of this research. Then, we describe the metrics used to characterize the social networks of our experiments. In Section 4.2 we thoroughly describe the experiment with location-based social networks, while in Section 4.3 we describe the experiment with the academic social network derived from DBLP, covering the process of data collection, the algorithms that were computed, and the methods used to find influential nodes. We finish this chapter with a brief summary of what has been presented.
4.1 Available Resources for Finding Influencers
To perform our experiments and fulfill the tasks of characterizing a social network and finding its most influential nodes, we used several state-of-the-art algorithms and open-source software packages for network analysis, among which is the LAW Webgraph software package.
LAW Webgraph is an open source project developed by researchers from the Laboratory of Web Algo-
rithms at the University of Milan. It contains a Java library for large-scale web graph analysis, presenting
a novel approach to graph compression that enables the creation and storage of web-scale graphs.
Among other things, the LAW Webgraph package contains an implementation of the PageRank algorithm, which was the first algorithm we used for assessing the influence of nodes in our experiments. Since we intended to extend this software package with the HITS and IP algorithms, the structure of LAW's PageRank implementation served as a template for our algorithmic extensions.
For the implementation of the HITS algorithm, we followed the pseudo-code in Algorithm 1, in which two different scores have to be computed: the hub score and the authority score. The computation of these scores is based, respectively, on the outlinks and inlinks of every node in the graph. Through LAW Webgraph's API, we could only access the successors of a node. To overcome this limitation when computing the HITS algorithm, we built both the graph and its transpose, instead of just the graph, so that we could access both the successors and the predecessors of each node (i.e., the inlinks of a node are its outlinks in the graph's transpose).
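The graph-plus-transpose workaround can be sketched as follows. This is an illustrative Python version only; the actual implementation was done in Java on top of LAW Webgraph, and the adjacency-dict representation is an assumption.

```python
# Sketch of HITS over successor lists only: inlinks of a node are read as the
# outlinks of the same node in the transposed graph (succ_t).

def hits(succ, succ_t, iters=50):
    """succ: node -> list of successors; succ_t: the transposed graph."""
    nodes = set(succ) | set(succ_t)
    for adj in (succ, succ_t):
        for targets in adj.values():
            nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority: sum of hub scores over inlinks (outlinks of the transpose).
        auth = {n: sum(hub[p] for p in succ_t.get(n, [])) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores over the node's own successors.
        hub = {n: sum(auth[s] for s in succ.get(n, [])) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth
```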
Analogously, the Influence-Passivity (IP) algorithm involves the computation of two scores - the influence
score and the passivity score. Thus, two graphs were again built. In this implementation we followed the
pseudo-code in Algorithm 2 from Section 3.3.
4.1.1 Characterizing Networks
To understand aspects such as the dimension of our generated graphs, or how well connected their nodes are, some well-known network analysis metrics were used.
With the average path length, one can assess the average distance between the nodes in our networks,
understanding how tightly connected they are (e.g., a small average path length indicates that all nodes are closely connected, which means that it will be easy to spread information through the network). The clustering coefficient allows us to assess how close the neighbours of a node are to one another, i.e., how nodes tend to form clusters with a large number of ties between them. On the other hand, by studying the degree distribution of the nodes in a network, one can assess whether we are in the presence of a large-scale network characterized by a power-law degree distribution, i.e., a network in which the majority of the nodes have few connections, but where a smaller set of nodes holds an extremely large number of connections. These well-connected nodes are called hubs, and they can also be seen as central points of aggregation in the network.
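For a small undirected graph, the three characterisation metrics can be computed directly, as in the following self-contained sketch (the adjacency-set representation is an illustrative assumption; our experiments relied on dedicated network analysis software).

```python
# Self-contained sketch of the three characterisation metrics, for a small
# connected undirected graph given as {node: set of neighbours}.
from collections import deque, Counter

def avg_path_length(adj):
    """Mean shortest-path distance over all connected node pairs (BFS)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for n, d in dist.items() if n != src)
        pairs += len(dist) - 1
    return total / pairs

def avg_clustering(adj):
    """Mean local clustering coefficient: closed pairs among a node's neighbours."""
    coeffs = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

def degree_distribution(adj):
    """Histogram of node degrees, used to spot power-law-like tails."""
    return Counter(len(nbrs) for nbrs in adj.values())
```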
4.2 Analysis of Location-based Social Networks
A traditional social network comprises a single type of node: the users in the network. The edges between these nodes represent the friendship ties between the users. In turn, a location-based social network has all the properties of a traditional social network; however, there are now two types of nodes instead of one, namely (1) user nodes, which are the users in the network and who can be friends with other users, and (2) location nodes, which are the locations users have visited or mentioned in their personal messages. Therefore, one can say that a location-based social network also has two types of edges or social ties, namely (1) user-user ties, corresponding to the edges between two users and in all respects similar to the edges existing in traditional social networks, and (2) user-location ties, corresponding to the edges between users and locations, which are derived from a user mentioning or visiting a specific location. Location-based social networks yield a great amount of information, because one can look at them as two layers: one where users are connected to their friends, and an underlying layer where users are connected to locations. The latter is an intersecting layer through which one can identify the most visited locations (i.e., locations that are connected to a larger number of users) and, from a location perspective, which locations exert more influence over the users they are connected to - see Figure 4.8.
Most online social networking services have public APIs, which allow the search and extraction of publicly available, real and up-to-date data. In our experiments, all the considered social network platforms provided access to a public API. Thus, the first step to gather information from these social networking services was to request data from the API and store it in a structured way, e.g., an XML file, for subsequent processing. With the raw data organized, it was then filtered to decouple user information from location information, as well as from relationship ties. The different ranking algorithms and network analysis metrics were finally applied to a graph generated from the relationship ties in the previously filtered data.
Figure 4.8: Example of a location-based social network (adapted from Zheng & Zhou (2011)).
Data was collected from two different social network platforms: FourSquare and Twitter. FourSquare is a location-based social network that allows users to check in at different locations which, in its terminology, are called venues, ranging from restaurants to nightclubs, movie theaters, university campuses or a city's most iconic monuments. It was founded in 2009 and is a web application specially intended to be used on mobile devices. With the widespread availability of smartphones and mobile gadgets with an Internet connection, FourSquare's network and service have been growing and evolving throughout the years, reaching the 7 million registered users milestone in 2011.
In FourSquare, registered users can search for other users or venues, e.g., one can search for Indian Restaurant near New York and access an extensive list of restaurants, each one with an address and a geospatial location, user-uploaded photos, reviews by users who have checked in there, as well as a list of venues that are similar to the searched one. Venues can be associated with categories and tags. There is also an underlying game-play concept in this kind of social network, encouraging continuous interaction: (i) users earn points for checking in at venues or adding new venues to FourSquare, (ii) users earn badges if they check in at various different venues or complete tasks, and (iii) a FourSquare user can become mayor of a specific venue if he has checked in at that venue on more days than anyone else, over a period of 60 days.
On the other hand, Twitter is a social networking and microblogging service that allows users to post messages up to 140 characters long - the tweets. Created in 2006, it has grown to be one of the most well-known social networks, with over 500 million active users. Initially, Twitter was only accessible via its website, but today there is a multitude of mobile applications at hand to manage one's account, tweet wherever one pleases, and attach links to tweets. Nowadays, many Twitter users tweet as they arrive (or check in) at a specific location, deliberately attaching the geographical coordinates of that place to their tweet. This way, we can associate Twitter users with locations, building a location-based social network.
4.2.1 Data Collection from Online Services
To extract data about users and venues in FourSquare, we used the FourSquare API1, which returns JSON2 objects that contain the result of each API call. Nevertheless, for simplicity of use, an open-source Java implementation3 of the FourSquare API was used, providing straightforward methods to interact with the FourSquare API. This Java API includes all methods in the official FourSquare API. However, the functionality of the method that searches for venues (i.e., venuesSearch) was not fully implemented, so a simple change to FourSquare's Java API was needed in order to extract reliable data. The official API's venuesSearch method allows one to obtain a set of venues that are near the provided latitude-longitude coordinates and within a specified radius of up to 5 km; this radius functionality was missing from the open-source FourSquare Java API, so we simply added the radius parameter to the venuesSearch API call, thus taking full advantage of that functionality and obtaining more venues per call - see the pseudo-code in Algorithm 3. Also, we defined a bounding box for the New York City-Manhattan area, restricting our data collection to that geographical area, in order to make a more contained study.
Algorithm 3 Pseudocode for the extraction of user and friend data from FourSquare.
latmax: maximum latitude for the New York City - Manhattan bounding box
longmax: maximum longitude for the New York City - Manhattan bounding box
latmin: minimum latitude for the New York City - Manhattan bounding box
longmin: minimum longitude for the New York City - Manhattan bounding box
lat: current latitude
long: current longitude
radius = 1000 (i.e., 1 km)
userSet: set of users from a venue
for all lat ∈ [latmin, latmax] and long ∈ [longmin, longmax] do
    venueSet ← all venues for (lat, long) within radius
    for all venue ∈ venueSet do
        Retrieve and store venue info
        userSet ← all of the venue's visiting users
        for all user ∈ userSet do
            Retrieve the user's friends
            Store friend information
        end for
    end for
end for
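The nested crawling loop of Algorithm 3 can be sketched with the API calls abstracted away. Here, search_venues, get_visitors and get_friends are hypothetical stand-ins injected as parameters, not real method names of the FourSquare Java API.

```python
# Sketch of Algorithm 3's grid crawl over a bounding box, with the FourSquare
# calls abstracted as injected functions (hypothetical stand-ins).

def crawl(bbox, step, radius, search_venues, get_visitors, get_friends):
    lat_min, lat_max, lon_min, lon_max = bbox
    data = {"venues": {}, "friends": {}}
    lat = lat_min
    while lat <= lat_max:
        lon = lon_min
        while lon <= lon_max:
            for venue in search_venues(lat, lon, radius):
                data["venues"][venue] = (lat, lon)       # store venue info
                for user in get_visitors(venue):
                    # store each visiting user's friend list once
                    data["friends"].setdefault(user, get_friends(user))
            lon += step
        lat += step
    return data
```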
As for Twitter, we used the Twitter Public Stream API4, which provides a sample of 1% of all the tweets being published at any given moment. The data collection process had the following phases:
1. From that 1% of tweets, only the ones that had geographical coordinates were selected. Also, for each tweet we collected information such as the user id, the users he is following, and the users that are following him. Afterwards, with the coordinates associated to a user's tweet, we could establish user-location ties and, with the following and follower relationships, we could establish user-user ties.

1 https://developer.foursquare.com
2 http://www.json.org/
3 http://code.google.com/p/foursquare-api-java/
4 https://dev.twitter.com/docs/streaming-apis
2. From the collected user information, the users with the greatest number of connections were selected, and data about their friends and followers was gathered.
3. Afterwards, similarly to what was done for FourSquare, all the collected data was filtered in order to keep only the information about tweets posted within the New York City-Manhattan area.
In order to perform the discretization of geospatial coordinates, we used the Hierarchical Triangular Mesh
(HTM) approach to divide the Earth’s surface into a set of triangular regions, each roughly occupying
an equal area of the Earth (Dutton, 1996; Szalay et al., 2007). In brief, the HTM offers a multi-level recursive decomposition of a spherical approximation to the Earth's surface. It starts at level
zero with an octahedron and, by projecting the edges of the octahedron onto the sphere, it creates 8
spherical triangles, 4 on the Northern and 4 on the Southern hemisphere. Four of these triangles share
a vertex at the pole and the sides opposite to the pole form the equator. Each of the 8 spherical triangles
can be split into four smaller triangles by introducing new vertices at the midpoints of each side, and
adding a great circle arc segment to connect the new vertices with the existing ones - see Figure 4.9.
Figure 4.9: A sequence of subdivisions of the world sphere, starting from the octahedron, down to level 5, corresponding to 8192 spherical triangles. The circular triangles have been plotted as planar ones, for simplicity (adapted from Szalay et al. (2007)).
This sub-division process can be repeated recursively, until we reach the desired level of resolution,
as shown in Figure 4.10. The triangles in this mesh are the regions used in our representation of the
Earth, and every triangle, at any resolution, is represented by a single numeric ID. For each location
given by a pair of coordinates on the surface of the Earth, there is an ID representing the triangle, at
a particular resolution, that contains the corresponding point. Notice that this representation scheme contains a parameter k that controls the resolution, i.e., the area of the triangular regions. With a resolution of k, the number of regions n used to represent the Earth corresponds to n = 8 · 4^k.
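The region count thus grows geometrically with the level, which the following one-liner makes concrete: level 0 gives the 8 faces of the projected octahedron, and level 5 gives the 8192 triangles of Figure 4.9.

```python
# Number of HTM triangular regions at resolution level k: n = 8 * 4^k.
def htm_regions(k):
    return 8 * 4 ** k
```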
Figure 4.10: The HTM recursive division process (adapted from Szalay et al. (2007)).
From the geographical coordinates found in some of the collected tweets, we computed the Hierarchical Triangular Mesh (HTM), so that we could give each geographical coordinate a trixel representation. With a trixel representation, instead of a latitude-longitude representation, one has more freedom in specifying the range of the collected locations. In our case, we established three ranges of trixels according to their resolution, i.e., locations with resolution 25, with resolution 20, and with resolution 10.
Nevertheless, this data collection process had some limitations. The main limitation in the FourSquare API was its rate limit of 500 authenticated calls per hour, which is a very low threshold considering that we performed an extensive crawl and that each request for the listing of a user's friends is a frequent authenticated API call. As for the Twitter API, we had a rate limit of 600 calls per hour and, upon exceeding that limit, we had to wait until the next hour to make more API calls. This made us disregard a large number of tweets during that waiting time.
4.2.2 Adaptation of the Influence-Passivity (IP) Algorithm
A major contribution of this work was the adaptation and implementation of the aforementioned Influence-
Passivity (IP) algorithm. Developed by Romero et al. (2011), the IP algorithm was part of a study on
information propagation in Twitter, where the authors came to the conclusion that most users of this
social network act as passive consumers of information, not forwarding content to the network. This algorithm presents a novel way of quantifying the influence of nodes in a network, by considering that each node has an influence score as well as a passivity score. These scores have a mutually reinforcing relationship, like the hub and authority scores in the HITS algorithm (Kleinberg, 1998).
For our implementation, some changes had to be made to the original IP algorithm, in order to adapt it to location-based social networks and perform an edge weight calculation that was consistent with the datasets we were working with. From the Twitter data collected by Romero et al. (2011), the weight of an edge e = (i, j) was assigned as follows:
$w_e = \frac{S_{ij}}{Q_i}$ (4.61)
In the formula, Qi represents the number of URLs that node i mentioned and Sij is the number of URLs
that were mentioned by node i and retweeted by node j.
In the case of our datasets from FourSquare and Twitter, we wanted to generate a weight exclusively
based on user-location and user-user ties, instead of URLs or retweets, as proposed by the original
authors. Thus, we built a graph that, rather than having two types of nodes (i.e., locations and users), only has user nodes, estimating exclusively the influence of users in the network.
To calculate the weight of the edges between users, we adapted the Qi and Sij parameters, with Qi being the number of locations node i has visited, and Sij the number of locations visited by both i and j, i.e., the number of commonly visited locations between nodes i and j, where i visited each location before j did. In our adaptation of the algorithm, a user's influence is thus always dependent on the popularity of the locations the user has visited.
The original graph built from our datasets is depicted in Figure 4.11: the left-most graph includes two types of nodes, (i) user nodes, represented by U1...U4, and (ii) location nodes, represented by S1...S3, and has undirected user-location ties and directed user-user ties. The right-most graph in Figure 4.11 is the result of our adaptation for the IP algorithm: a network graph that only has directed and weighted user-user ties, with some structural differences, e.g., the original user-user edges no longer exist and new edges arise from common visits to locations. The connection between two nodes is associated with a positive weight if they share a visited location, e.g., U3 and U2 both visited location S2, so there is a new edge from U3 to U2, with weight w2, because U3 visited S2 after U2 had visited it.
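The adapted edge-weight computation can be sketched as follows. The check-in input format, the use of each user's earliest visit per location, and the edge direction from the earlier visitor i to the later visitor j (following the definition of S_ij above) are all illustrative assumptions rather than the exact thesis implementation.

```python
# Sketch of the adapted IP edge weights: given per-user check-in histories as
# {user: [(location, timestamp), ...]}, an edge i -> j gets weight S_ij / Q_i,
# where Q_i is the number of locations i visited and S_ij the number of
# locations i visited before j did (assumed direction convention).

def user_user_edges(checkins):
    first_visit = {u: {} for u in checkins}      # earliest visit per location
    for u, visits in checkins.items():
        for loc, t in visits:
            if loc not in first_visit[u] or t < first_visit[u][loc]:
                first_visit[u][loc] = t
    edges = {}
    for i in checkins:
        q_i = len(first_visit[i])
        for j in checkins:
            if i == j or q_i == 0:
                continue
            s_ij = sum(1 for loc, t in first_visit[i].items()
                       if loc in first_visit[j] and t < first_visit[j][loc])
            if s_ij:
                edges[(i, j)] = s_ij / q_i
    return edges
```

The resulting weighted user-user graph can then be fed directly to the IP iteration described in Section 3.3.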
4.3 Analysis of Academic Social Networks
Alongside social networks, this work also focused on assessing the influence of nodes in an academic social network, i.e., a network where the nodes either refer to authors of scientific papers, connected via co-authorship ties that form a co-authorship network, or to the scientific papers themselves, connected through citation ties that originate a citation network. We wanted to assess which were the most influential papers in the scientific community, i.e., the ones gathering more attention, either due to the importance of their author(s), or due to being about a trending topic or an important breakthrough. To do so, we gathered the already organized data from the DBLP digital library, via the Arnetminer Project1, which contains information about scientific papers from 1935 to 2011, including the abstract and the
1http://arnetminer.org/DBLP_Citation
Figure 4.11: Transformation of the original network graph (left) to our IP algorithm graph (right).
number of citations. From this data, we built a citation network for a set of time-stamps ranging from 2007 to 2011, as depicted in Figure 4.12, in order to have a record of how the network evolved over time.
Figure 4.12: Structure of the citation graph built upon the DBLP data.
Although any other ranking algorithm could have been used, in the case of the DBLP citation network the most influential papers in the dataset were determined through the computation of the PageRank algorithm. The top-10 highest-ranked papers were then selected and their full information was gathered, in order to cross-check the set of authors of each paper against the recipients of renowned computer science and engineering awards, such as the Gerard Salton Award or the Turing Award, identifying which of these authors were distinguished by the scientific community.
4.3.1 Predicting Future Influence Scores and Download Counts
Instead of computing the future PageRank scores of scientific papers based on their future citations, as did Sayyadi & Getoor (2009), we created a framework to predict the future PageRank scores of scientific papers in a citation network for a specific year, based on their previous PageRank scores, among other features. The same principle was also applied to the prediction of download counts for scientific articles downloaded from the ACM Digital Library website in the year 2011.
In the framework depicted in Figure 4.13, in order to predict the future PageRank scores and future download counts, we have three distinct phases:
1. Feature Vector Creation
The first phase prepares the input for the computations related to the prediction of importance scores. Given the dataset, either for paper citations or download counts, one generates the different features, namely the text, age and PageRank scores, and stores them in a relational database, so that feature vectors can then be generated.
2. Prediction
In a second phase, one creates training and test files from the generated feature vector files, in order to proceed with the computation of a machine learning technique intended for predicting the future PageRank scores and the future download counts.
3. Accuracy Assessment
Finally, to assess the quality of the obtained results, one proceeds with the computation of various evaluation metrics.
Figure 4.13: Framework for predicting future PageRank scores and download counts.
Each of the aforementioned phases is a preparation for the following one. To predict the PageRank scores and the download counts, we relied on features that can represent the characteristics of the information in the dataset. The following types of features were considered:
1. Absolute Scores - Includes the PageRank score resulting from the computation of the algorithm over papers that were published until a specific year, inclusive. Regarding the PageRank score of a
52
paper, we defined 5 different cumulative time-stamps, from 2007 to 2011, so we could have access
to the respective PageRank scores in each k previous year.
2. Differential Scores - Includes the Rank Change Rate (Racer), representing the change rate of
PageRank score between two consecutive years, capturing the evolution of PageRank scores.
The Rank Change Rate between two time-stamps t_i and t_{i+1}, for a paper p, is given by the following equation:

racer(p, t_i) = \frac{rank(p, t_{i+1}) - rank(p, t_i)}{rank(p, t_{i+1})}    (4.62)
3. Profile Information - Includes the Average PageRank Score, that represents the average of the
PageRank score of all publications that have an author in common with the paper’s set of au-
thors, and the Maximum PageRank Score, which represents the maximum PageRank score of all
publications that have an author in common with the paper’s set of authors.
4. Age - Includes the difference between the present year and the publication year of a paper, i.e., its
age.
5. Text - Includes the term frequency score for the top 100 most frequent tokens in the abstracts and titles of publications, excluding the terms in the standard English stop-word list.
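To make the absolute and differential scores concrete, the sketch below builds cumulative yearly snapshots of a citation graph, runs a plain power-iteration PageRank on each, and derives the Rank Change Rate of Equation 4.62 from two consecutive snapshots. This is a minimal pure-Python illustration under simplified assumptions, not the thesis implementation; all function names are hypothetical.

```python
def pagerank(edges, nodes, d=0.85, iters=50):
    # plain power-iteration PageRank; edges are (citing, cited) arcs
    pr = {u: 1.0 / len(nodes) for u in nodes}
    out = {u: [] for u in nodes}
    for src, dst in edges:
        out[src].append(dst)
    for _ in range(iters):
        new = {u: (1.0 - d) / len(nodes) for u in nodes}
        for u in nodes:
            # dangling nodes spread their mass uniformly over all nodes
            targets = out[u] if out[u] else nodes
            share = d * pr[u] / len(targets)
            for v in targets:
                new[v] += share
        pr = new
    return pr

def snapshot_scores(pub_year, citations, years):
    # cumulative time-stamps: each snapshot keeps only the papers
    # published up to (and including) that year
    scores = {}
    for y in years:
        nodes = [p for p, py in pub_year.items() if py <= y]
        keep = set(nodes)
        edges = [(a, b) for a, b in citations if a in keep and b in keep]
        scores[y] = pagerank(edges, nodes)
    return scores

def racer(rank_next, rank_curr):
    # Rank Change Rate between time-stamps t_i and t_{i+1} (Equation 4.62)
    return (rank_next - rank_curr) / rank_next
```

For example, a paper cited by every later paper accumulates a higher score in each successive snapshot, and the racer value captures the relative change between two snapshots.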
For each aforementioned type of feature, except age and text, we considered its value for the previous k years, with k ranging from 1 to 3. For example, when predicting the future PageRank score for 2010, one first predicted the score using only the PageRank score of the previous year (k = 1, i.e., 2009), then using information from the two previous years (k = 2, i.e., 2009 and 2008), and finally from the three previous years (k = 3, i.e., 2009, 2008 and 2007).
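The construction of these k-year feature vectors can be sketched with a small helper (the function name and the per-year score store are hypothetical, assuming the yearly scores were already computed):

```python
def lagged_features(scores_by_year, target_year, k):
    # feature vector with the scores of the k years preceding target_year,
    # most recent year first: k=2 and target 2010 -> [score(2009), score(2008)]
    return [scores_by_year[target_year - i] for i in range(1, k + 1)]
```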
In order to enrich the way we made our predictions, we made a structured combination of the previously
enumerated types of features, which fit into three different groups:
• 1 - In this group we used exclusively the PageRank scores of the paper as features.
• 1 + 2 - In this group we used both PageRank and Racer scores of the paper as features.
• 1 + 2 + 3 - In this group we used PageRank scores, Racer scores, Average Author scores and
Maximum Author scores as features.
The remaining text and age features were separately added to the aforementioned combinations of features, enabling the creation of two distinct subsets of results. Thus, alongside the different ranges of k used, one could assess whether, for a particular type of feature or group of features, adding more information about previous years would improve or degrade the accuracy of our results. Also, for a straightforward computation of the Racer, Average PageRank score, Maximum PageRank score and feature vectors, the PageRank scores for each paper at each time-stamp, information about the authors of the papers, and information about download counts were stored in a relational database.
4.3.2 The Learning Approach
To predict future PageRank scores and future download counts, we used an ensemble machine learning technique included in the RT-Rank1 package, an open-source project consisting of implementations of various machine learning algorithms based on regression trees.
The algorithm we used, called Initialized Gradient Boosted Regression Trees (IGBRT), is essentially a point-wise machine learning algorithm developed by a team from Washington University in St. Louis for the 2010 Yahoo! Learning-To-Rank Challenge. The algorithm is shown in Algorithm 4, and it is based
on Gradient Boosting Regression Trees (GBRT) (Mohan et al., 2011). GBRT is a machine learning
technique based on tree averaging, which uses a set of trees to classify a new object, instead of the
single best tree (Oliver & Hand, 1995). It sequentially adds small trees (d ≈ 4), each with high bias, and, in each iteration, the new tree to be added focuses strictly on the objects that are responsible for the current remaining regression error. IGBRT follows the guidelines of SVMlight2, proposed by Joachims (1999, 2002).
Algorithm 4 Initialized Gradient Boosted Regression Trees (Squared Loss)
Input: data set D = {(x1, y1), ..., (xn, yn)}; Parameters: α, M_B, d, K_RF, M_RF
F ← RandomForests(D, K_RF, M_RF)
Initialization: r_i = y_i − F(x_i) for i = 1 → n
for t = 1 → M_B do
  T_t ← Cart({(x1, r1), ..., (xn, rn)}, f, d)   {Build Cart of depth d, with all f features, and targets r_i}
  for i = 1 → n do
    r_i ← r_i − α T_t(x_i)   {Update the residual of each sample x_i}
  end for
end for
return T(·) = F(·) + α Σ_{t=1}^{M_B} T_t(·)   {Combine the Regression Trees T_1, ..., T_{M_B} with the RF F}
With the intention of addressing GBRT's main weakness, i.e., the inherent trade-off between the step size and early stopping, Mohan et al. (2011) proposed an ensemble algorithm that starts off at a point very close to the global minimum and refines the already good predictions. Thus, instead of initializing the algorithm with an all-zero function, as occurs in GBRT, the IGBRT algorithm is initialized with the predictions of Random Forests (Breiman, 2001), the latter being known to be resistant to overfitting, insensitive to parameter settings, and to require no additional parameter tuning.
IGBRT uses GBRT to further refine the results of Random Forests, which are regarded by the authors
1 https://sites.google.com/site/rtranking/
2 http://svmlight.joachims.org/
as a good starting point for the algorithm.
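The overall structure of Algorithm 4 can be sketched in miniature as follows, with one-dimensional regression stumps standing in both for the Random-Forest initialization (approximated here by a bootstrap bag of stumps) and for the depth-d CART weak learners. This is an illustrative toy under those simplifications, not the RT-Rank implementation:

```python
import random

def stump_fit(xs, rs):
    # depth-1 regression tree on a single 1-D feature: pick the split
    # threshold that minimizes the summed squared error of the two leaf means
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for cut in range(1, len(xs)):
        left = [rs[i] for i in order[:cut]]
        right = [rs[i] for i in order[cut:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        thr = (xs[order[cut - 1]] + xs[order[cut]]) / 2.0
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def igbrt_fit(xs, ys, alpha=0.1, m_b=200, n_init=20):
    # phase 1: initialize with a bagged ensemble (stand-in for Random Forests)
    rng = random.Random(0)
    n = len(xs)
    bag = []
    for _ in range(n_init):
        idx = [rng.randrange(n) for _ in range(n)]
        bag.append(stump_fit([xs[i] for i in idx], [ys[i] for i in idx]))
    f0 = lambda x: sum(t(x) for t in bag) / len(bag)
    # phase 2: gradient boosting (squared loss) on the residuals r_i = y_i - F(x_i)
    residuals = [y - f0(x) for x, y in zip(xs, ys)]
    trees = []
    for _ in range(m_b):
        t = stump_fit(xs, residuals)
        trees.append(t)
        residuals = [r - alpha * t(x) for r, x in zip(residuals, xs)]
    # final model: F(x) + alpha * sum_t T_t(x)
    return lambda x: f0(x) + alpha * sum(t(x) for t in trees)
```

The bagged initialization already lands close to the target function, and the boosting stage then chips away at the remaining residual error, mirroring the division of labour between Random Forests and GBRT in IGBRT.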
4.4 Summary
In this chapter I detailed the two types of experiments that were conducted within my MSc thesis. I
began explaining the characteristics of location-based social networks and of academic social networks,
emphasizing their peculiarities. Then, for each experiment, I described the datasets, the data collection
technique, and the methodology for finding the influencers in the network, alongside with the algorithms
that were used. For the particular case of academic social networks, a novel approach to predicting
future PageRank scores and future download counts was also presented.
55
Chapter 5
Validation Experiments
This chapter presents the results of the undertaken experiments and the evaluation methodology
used to assess the veracity of the obtained results. Beginning with a concise characterization
of all the datasets that were used and their respective networks, the evaluation methodology is then
presented, comprising all the metrics that were used to assess the quality and veracity of the results.
Finally, the obtained results for each experiment are presented and further discussed. The results comprise the experiments for finding influencers in FourSquare and Twitter, and in the citation network built upon the DBLP dataset, as well as the experiments for predicting the future PageRank scores of scientific papers from 2010 and 2011 in the DBLP citation network and for predicting the download counts of the scientific papers published in 2011, downloaded from the ACM Digital Library.
5.1 The Considered Datasets
This section includes the dataset and network characterization of all the datasets that we used.
In order to understand the structural differences between a location-based social network and a social network that consists only of relationships between users, and how this structure affects influence estimation, we created two different graphs for both the FourSquare and Twitter datasets. First, we considered a graph consisting of the original location-based network built upon the data that was crawled, which we called the User+Spot Graph. Afterwards, we disregarded all the user-location relationships and built a graph consisting only of user-user ties, which we called the User Graph.
In the case of the DBLP dataset, the distinction between two graphs was not needed, because our focus was on creating a citation network upon which we could estimate the PageRank scores of its nodes and use them as features for the algorithm that predicts the future influence scores of papers and their future download counts. As for FourSquare and Twitter, this structural difference yields interesting results when estimating user influence.
                                              FourSquare | Twitter
Spots
  Total:                                          48,257 | 1,358
  HTM Resolution 10:                                   — | 13
  HTM Resolution 20:                                   — | 1,277
  HTM Resolution 25:                                   — | 1,358
Users
  Total:                                         447,545 | 2,603,505
  Relations:                                     970,587 | 3,218,997
  Visiting Spots:                                 16,960 | 1,017
Arcs
  PageRank & HITS (User+Spot Graph):           2,539,986 | 3,757,555
  PageRank & HITS (User Graph):                1,017,887 | 3,576,157
  IP Algorithm:                                1,017,887 | —
Nodes
  PageRank & HITS (User+Spot Graph):             451,664 | 2,604,863
  PageRank & HITS (User Graph):                  403,407 | 2,603,505
  IP Algorithm:                                  447,545 | —
InDegree
  Minimum (User+Spot Graph):                           0 | 1
  Maximum (User+Spot Graph):                       3,166 | 38,542
  Average (User+Spot Graph):                      2.8626 | 5.6162
  Minimum (User Graph):                                0 | 1
  Maximum (User Graph):                            3,166 | 38,452
  Average (User Graph):                           2.5478 | 5.6256
OutDegree
  Minimum (User+Spot Graph):                           0 | 1
  Maximum (User+Spot Graph):                       1,000 | 460,466
  Average (User+Spot Graph):                     74.8821 | 1.5615
  Minimum (User Graph):                                0 | 1
  Maximum (User Graph):                            1,000 | 460,466
  Average (User Graph):                          60.5829 | 1.5618
Average Degree
  Total (User+Spot Graph):                        5.4640 | 3.8868
  Users (User+Spot Graph):                        5.6714 | 2.8878
  Spots (User+Spot Graph):                        5.7118 | 1.0376
  Total (User Graph):                             5.0488 | 2.8872
Average Path Length
  User+Spot Graph:                                4.7369 | 3.9776
  User Graph:                                     4.7764 | 3.9823
Clustering Coefficient
  User+Spot Graph:                                0.2987 | 0.1156
  User Graph:                                     0.3718 | 0.1152
Table 5.1: Characterization of the FourSquare and Twitter networks.
Regarding the characteristics of both graphs in the FourSquare and Twitter datasets, depicted in Table 5.1, one can acknowledge that while the first dataset is more complete in terms of user-location ties and quantitative spot information, the latter is more complete in terms of user-user ties and user friendship information. This behaviour occurs because FourSquare is a pure location-based network focused on sharing the locations users have visited, while Twitter is a microblogging and social network platform focused on the exchange of messages between users, thus giving priority to the relationships between users and their friends and followers. In what regards the HTM resolution, we used a resolution of 26.
When considering the average path length and the clustering coefficient, one can assess that while the nodes in the FourSquare network are closer to each other, the neighbours of nodes in Twitter are closer to one another than in FourSquare. The latter phenomenon has to do with the fact that we could collect a greater extent of data for friends of users in the Twitter dataset, resulting in a scenario where friends of different users can, themselves, be friends and/or have friends in common. Also, one can observe that the User Graph naturally has a greater average path length and a greater clustering coefficient than the User+Spot Graph, because the User Graph has fewer nodes and no longer contains the spots that previously shortened the distance between users and between neighbourhoods of users.
The academic citation network built upon DBLP data comprises scientific papers from 1935 to 2011 and, from Table 5.2, one can also get an idea of the dimension of the dataset at each of the considered time-stamps, as well as of how complete the information about the scientific papers is.
Regarding the degree distribution in the FourSquare and Twitter networks, in both the User+Spot Graph and the User Graph, one can acknowledge from Figure 5.14 that the degree distribution for these datasets follows a power law, which is a characteristic of large-scale networks, i.e., networks in which the majority of the nodes have very few connections, while very few nodes have a high number of connections. Nevertheless, from the values of average path length and clustering coefficient, one can say that the FourSquare and Twitter networks are not representative of large-scale networks, because in large-scale networks, besides the power-law degree distribution, the average path length must be much smaller than the clustering coefficient, revealing that the nodes are very close to each other and their neighbourhoods are highly clustered.
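For reference, the two statistics discussed above can be computed over a small undirected graph as in the sketch below (adjacency given as a dict of neighbour sets; this is an illustrative toy, since measuring million-node graphs, as done in the thesis, requires sampling or specialized tooling):

```python
from collections import deque

def avg_path_length(adj):
    # mean shortest-path length over all reachable ordered pairs,
    # computed with a BFS from every node
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for n, d in dist.items() if n != s)
        pairs += len(dist) - 1
    return total / pairs

def clustering_coefficient(adj):
    # average local clustering: for each node, the fraction of its
    # neighbour pairs that are themselves connected
    coeffs = []
    for u, nbrs in adj.items():
        nbrs = list(nbrs)
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)
```

A triangle graph yields an average path length of 1 and a clustering coefficient of 1, while a three-node path has clustering 0 and average path length 4/3, matching the intuition behind the FourSquare/Twitter comparison above.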
          Publications | Citations | Authors | Papers with Downloads | Papers with Abstract | Average Terms per Paper
Overall      1,572,277 | 2,084,019 | 601,339 | 17,973 | 529,498 | 104
2007           135,277 | 1,150,195 | 330,001 | 15,516 | 343,837 | 95
2008           146,714 | 1,611,761 | 385,783 | 17,188 | 419,747 | 98
2009           155,299 | 1,958,352 | 448,951 | 17,973 | 504,900 | 101
2010           129,173 | 2,082,864 | 469,719 | 17,973 | 529,201 | 103
2011             8,418 | 2,083,947 | 469,917 | 17,973 | 529,498 | 104
Table 5.2: Characterization of the DBLP dataset.
On the other hand, one can acknowledge from the network characterization in Table 5.3 that the academic social network that was built naturally grows at each time-stamp, although this growth is not as significant in the last two time-stamps as it is in the first two.
Focusing on the average path length and the clustering coefficient, one can conclude that as we include more papers in the network, i.e., at each time-stamp, papers become closer to one another through the existence of more citation relationships between them, even though they tend not to be as clustered together over time.
From the plots in Figure 5.15, one can acknowledge that the number of papers increases through the years. However, these new papers tend to have few citations, and so the tail of the plots gets thicker throughout the years, i.e., new, rarely cited papers are frequently added to the dataset, while the number of highly cited papers remains almost unaltered.
Figure 5.14: Degree distribution for nodes in the User+Spot Graph and the User Graph, from the FourSquare and Twitter datasets.
5.2 Evaluation Methodology
When assessing the quality and veracity of the results for the top-10 highest ranked users and spots in the FourSquare and Twitter datasets, we conducted an empirical analysis and relied on profile information, due to the fact that this research area is still evolving and there are no strict parameters or ground-truth lists with which to truly assess the influence of a node in these networks. On the other hand, when assessing the veracity of the DBLP top-10 highest ranked papers, we empirically analyzed our results against a list of recipients of renowned scientific awards, like the Gerard Salton Award and the Turing Award, and, when the authors were not part of that list, we also checked their academic publication profiles1 in order to assess if they were renowned scientists.
In the case of the experiments on future PageRank and future download count prediction, we used a set of error metrics. One of these metrics is Kendall's Tau, which corresponds to a value ranging between
1http://academic.research.microsoft.com/
       In-Degree (Min / Max / Avg) | Out-Degree (Min / Max / Avg) | Degree (Min / Max / Avg) | Average Path Length | Clustering Coefficient
2007         0 / 1,508 / 2.9153   |        0 / 227 / 2.9153      |    0 / 1,508 / 5.8329    |       6.1800        |        0.1323
2008         0 / 1,875 / 3.5357   |        0 / 266 / 3.5357      |    0 / 1,875 / 7.0790    |       6.1047        |        0.1319
2009         0 / 2,207 / 3.6993   |        0 / 269 / 3.6993      |    0 / 2,207 / 7.4012    |       6.0833        |        0.1314
2010         0 / 2,306 / 3.7670   |        0 / 269 / 3.7670      |    0 / 2,306 / 7.5430    |       6.0665        |        0.1312
2011         0 / 2,311 / 3.7673   |        0 / 269 / 3.7673      |    0 / 2,311 / 7.5367    |       6.0676        |        0.1310
Table 5.3: Characterization of the DBLP network.
[−1, 1] and is defined as follows:

\tau = \frac{2 c_i}{\frac{1}{2} n_i (n_i - 1)} - 1    (5.63)
In the formula, c_i is the number of concordant pairs between the produced ranked list and the ground-truth list, and n_i is the length of the two lists (Li, 2011). The aforementioned LAW-Webgraph software package includes an implementation of this metric.
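Equation 5.63 transcribes directly to code when both lists rank the same items without ties. The helper below is a toy sketch under that assumption, not the LAW-Webgraph implementation:

```python
def kendall_tau(ranked, ground_truth):
    # Equation 5.63: tau = 2*c_i / (n_i*(n_i - 1)/2) - 1, where c_i counts
    # the item pairs ordered the same way in both lists
    n = len(ranked)
    pos = {item: i for i, item in enumerate(ground_truth)}
    c = sum(1 for i in range(n) for j in range(i + 1, n)
            if pos[ranked[i]] < pos[ranked[j]])
    return 2.0 * c / (n * (n - 1) / 2.0) - 1.0
```

Identical orderings give tau = 1, and fully reversed orderings give tau = -1.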
We can also assess the level of correlation between two ranked lists using Spearman's Correlation (i.e., Spearman's ρ), according to the formula below:

\rho = 1 - \frac{6 \sum_{i=1}^{n} (x_i - y_i)^2}{n^3 - n}    (5.64)
In the formula, x1, ..., xn and y1, ..., yn are the two rankings of n objects (Best & Roberts, 1975). This metric was computed via its implementation in the R-Project1, an open-source statistical software package that includes various mathematical and statistical techniques and is also suitable for large amounts of data. Both Kendall's Tau and Spearman's Correlation measure the strength of the association between two ranked lists (Cha et al., 2010). The correlation ranges between [−1, 1] and, hence, if it is close to −1, one can determine that the variables are negatively correlated, whereas if it is close to +1 they are positively correlated.
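Equation 5.64 is likewise a one-liner when both inputs are rank vectors without ties (a toy sketch, not R's implementation):

```python
def spearman_rho(x_ranks, y_ranks):
    # Equation 5.64: rho = 1 - 6 * sum(d_i^2) / (n^3 - n), with d_i = x_i - y_i
    n = len(x_ranks)
    d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1.0 - 6.0 * d2 / (n ** 3 - n)
```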
In order to measure the accuracy of the prediction models, we used the normalized root-mean-squared error (NRMSE) metric between our predictions and the true values, which is given by the formula:

NRMSE = \frac{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{1,i} - x_{2,i})^2}}{x_{max} - x_{min}}    (5.65)
The average absolute error, i.e., the average of the difference between the inferred (predicted) value and the actual value, was also used, and was especially relevant for assessing the quality of the predictions of download counts.
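Both error measures can be sketched as follows (hypothetical helper names; the NRMSE is normalized by the range of the actual values, per Equation 5.65):

```python
import math

def nrmse(predicted, actual):
    # Equation 5.65: RMSE of the predictions, normalized by the range
    # of the actual values
    n = len(actual)
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n
    return math.sqrt(mse) / (max(actual) - min(actual))

def mean_absolute_error(predicted, actual):
    # average absolute difference between predicted and actual values
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```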
1http://www.r-project.org/
Figure 5.15: Degree distribution for the DBLP dataset from 2008 to 2011.
5.3 The Obtained Results
This section exhibits the results obtained from the various conducted experiments, alongside their discussion. First, the results from the experiments for finding influencers in FourSquare and Twitter, as well as in the DBLP citation network, are presented and discussed, where we assess the quality of these results and whether the top-10 highest ranked lists of individuals and spots produced by the different algorithms really correspond to the top-10 influencers and influential spots in the network. The results for the experiment of predicting future PageRank scores and download counts are then presented, alongside their discussion, where we compare the output of the different evaluation metrics computed for the different groups of features, in order to understand whether the task of predicting future PageRank scores and future download counts could be successfully accomplished with the framework that was developed.
5.3.1 Finding Influencers
In the following sections, the results of the computation of the PageRank, HITS and IP algorithms for the FourSquare and Twitter datasets are presented, as well as the results of the computation of the PageRank algorithm for the DBLP dataset. While the first two datasets comprise the top-10 highest ranked users and the top-10 highest ranked spots in the network, the results from DBLP highlight solely the most influential papers in the DBLP digital library dataset.
We begin by exposing and discussing the results from the experiments with, respectively, the FourSquare and Twitter datasets; we then present and discuss the influence estimation for the DBLP dataset, closing this section with the results from the experiment on future PageRank scores and download counts.
In order to identify the most influential users and spots in the FourSquare and Twitter datasets, average anonymous users and spots (e.g., streets) are identified, respectively, by Person-XXXX and Spot-YY:ZZ, where XXXX corresponds to the real user id and YY and ZZ correspond to the latitude and the longitude associated with that spot id in the network, while publicly well-known companies, locations/venues and people are identified by their real names, e.g., Ellen DeGeneres for users and Dunkin' Donuts for spots.
5.3.1.1 Location-based social networks: FourSquare & Twitter
From the user influence scores for PageRank and HITS algorithm depicted in Table 5.4, one can ac-
knowledge that the addition of spots to the network reveals well-known influentials, such as worldwide
celebrities, TV channels or magazines.
PageRank (Name Friends Likes) | HITS - Authority (Name Friends Likes) | HITS - Hub (Name Friends Likes)
TimeOut NY — 122,172 | ZAGAT — 328,189 | ZAGAT — 328,189
Lucky Mag. — 164,323 | TimeOut NY — 122,172 | MTV — 731,067
ZAGAT — 328,189 | MTV — 731,067 | Bravo TV — 375,363
NYPL — 61,132 | Bravo TV — 375,363 | History Chnl — 541,847
MTV — 731,067 | History Chnl — 541,847 | The NY Times — 367,008
Person-12935563 956 20 | Starbucks — 929,915 | Starbucks — 929,915
Bravo TV — 375,363 | The NY Times — 367,008 | VH1 — 380,987
Person-1478079 981 96 | Lucky Mag. — 164,323 | People Mag. — 372,008
NYC Parks — 17,429 | VH1 — 380,987 | TimeOut NY — 122,172
History Chnl — 541,847 | NYPL — 61,132 | The WSJ — 227,894
Table 5.4: User influence scores for the PageRank and HITS algorithms, for the User+Spot Graph, built from the FourSquare dataset.
Meanwhile, with the User Graph, as depicted in Table 5.5, the average users of social platforms are distinguished both in the PageRank and in the HITS algorithm, the latter when ordered by hub scores. In this case, average users are highlighted through their great number of mayorships, checkins, tips about locations and friends. Mostly through their outlinks, they become network users that other users want to follow and listen to.
PageRank (Name Friends Likes) | HITS - Authority (Name Friends Likes) | HITS - Hub (Name Friends Likes)
Person-11890308 794 84 | ZAGAT — 328,189 | Person-2630685 110 817
Person-449480 1,000 374 | MTV — 731,067 | Person-1127366 39 749
Person-1544684 987 144 | Bravo TV — 375,363 | Person-4148169 77 899
Person-619656 823 8 | History Chnl — 541,847 | Person-634270 216 755
Person-4071912 1,004 860 | Starbucks — 929,915 | Person-42695 128 775
NYCHA 807 59 | The NY Times — 367,000 | Person-1011520 39 723
Person-6935835 990 275 | VH1 — 380,987 | Person-3231666 14 713
Person-6004767 958 319 | Ellen DeGeneres — 457,155 | Person-7991820 3 767
Person-10934560 1,001 64 | TimeOut NY — 122,172 | Person-3290360 62 632
Person-10554269 985 4 | People Mag. — 372,008 | Person-6483868 95 765
Table 5.5: User influence scores for the PageRank and HITS algorithms, for the User Graph, built from the FourSquare dataset.
When the location-based network was reshaped to connect only the users that have visited at least one location in common, for the IP algorithm, the average FourSquare user is again distinguished, due to a combination of factors that includes a great number of mayorships, checkins, tips about locations and friends, as one can acknowledge from Table 5.6.
In brief, the fact that worldwide TV channels, magazines and celebrities are highlighted in a network that contains both users and spots reveals a strict connection between these well-known influentials and the spots, through a continuous activity that is intended to gather and retain their followers. When these ties are removed, the connections between real users prevail.
Name Friends Likes
Person-9797197 52 10
Person-9726342 5 —
Person-9615360 25 9
Person-9578554 34 —
Person-9553862 4 —
Person-9450025 47 7
Person-9264407 43 —
Person-8956766 28 —
Person-8916830 47 4
Person-884020 95 32
Table 5.6: User influence scores for the IP algorithm, built from the FourSquare dataset.
As for the most influential spots in the FourSquare dataset, the top-10 highest ranked spots resulting from the computation of both the PageRank and HITS algorithms, whether sorted by authority or by hub score, were the same. Focusing on the types of spots that were highlighted, they mainly include bars, boardwalks and other spots near the New York coastline, due to the fact that the data collection was done during August and early September of 2012.
Name Checkins
Tattoo Shot Lounge 227
Dunkin' Donuts 970
Gargiulo's Restaurant 697
The Freak Bar 540
Ruby's Bar & Grill 2,025
Coney Island Beach & Boardwalk 36,206
Cha Cha's 1,142
Denny's Delight 84
Coney Island Sound 280
Coney Island Polar Bear Club 85
Table 5.7: Spot influence scores for the PageRank and HITS algorithms (which present the exact same top-10), for the User+Spot Graph, built from the FourSquare dataset.
When finding influencers in the Twitter dataset, one must acknowledge that users tweet wherever they are, be it at home, while waiting for a doctor's appointment, etc. Therefore, many of the locations that we could identify are not necessarily venues, i.e., the geographic coordinates associated with a tweet may point to a street or avenue, and not to a theater, museum or restaurant as happened in the FourSquare experiment. Nevertheless, this is only due to the inner characteristics of the Twitter social network, which is content- and user-centered and not location-centered like FourSquare. Because social networks have a dynamic behaviour, i.e., they can change over time with the addition or loss of users and relationship ties, the third highest ranked user for HITS - Authority in Tables 5.8 and 5.9 had a profile on Twitter and was active during our crawl, between July and August of 2012, but no longer has a Twitter profile, and is thus marked with a * after the user id.
In the case of the Twitter dataset, the results from the computation of the IP algorithm are not presented,
because they were not coherent and not remotely comparable with the ones obtained for FourSquare.
From Table 5.8, we can observe that the HITS algorithm, with influence sorted either by authority or by
hub score, reveals Twitter users who are well known to the public and who exert significant influence due
to their roles in society, e.g., as entrepreneurs, journalists or actors. Also, due to their professional
activity and media exposure, one can say that they shape conversations: they are the users other
network users want to listen to. Conversely, the top-10 generated by the PageRank algorithm highlights
friendship ties among users who are anonymous to the public.
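The authority/hub distinction discussed above can be illustrated with a minimal HITS implementation on a toy follower graph. This is a sketch of the standard algorithm, not the thesis's actual code, and the user names are invented: an edge u → v means u follows v, so accounts followed by many good hubs accumulate authority, and accounts that follow many good authorities accumulate hub score.

```python
# Minimal HITS sketch on a toy directed "follows" graph (hypothetical users).

def hits(edges, iterations=50):
    nodes = {n for e in edges for n in e}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority score: sum of hub scores of the followers
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # hub score: sum of authority scores of the accounts followed
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return auth, hub

edges = [("a", "journalist"), ("b", "journalist"), ("c", "journalist"),
         ("a", "celebrity"), ("b", "celebrity")]
auth, hub = hits(edges)
print(max(auth, key=auth.get))  # "journalist" — followed by all three hubs
```

The two normalization steps keep the scores bounded; after convergence, sorting by `auth` or by `hub` yields the two rankings compared in the tables.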
Regarding the User Graph, we can see that the output of the HITS and PageRank algorithms, depicted in
Table 5.9, is exactly the same as for the User+Spot Graph. This reinforces the fact that, in this particular
dataset, there is a greater number of relationships among users than between users and locations, so
when the location ties are disregarded the strong ties between users naturally prevail. One can also
see from Tables 5.8 and 5.9 that, yet again, the total number of followers and friends is not necessarily
correlated with influence on Twitter.
PageRank
Name | Followers | Following
Person-67779865 | 45,702 | 41,870
J. K. Pulver | 469,092 | 38,542
JobsDirectUSA.com | 17,075 | 18,782
Person-479562736 | 16,703 | 16,241
America Hires | 11,824 | 13,006
Person-52306188 | 9,989 | 9,878
Person-35844123 | 10,030 | 9,761
Person-24883913 | 11,191 | 9,583
Person-213105865 | 8,531 | 9,965
Person-30735143 | 7,837 | 8,513

HITS - Authority
Name | Followers | Following
J. Wortham | 463,772 | 3,424
J. K. Pulver | 469,092 | 38,542
Person-325410549* | — | —
B. Thurston | 124,722 | 5,707
StumbleUpon | 72,133 | 10,370
DL Hughley | 73,835 | 886
J. Rampton | 47,593 | 578
Person-51560438 | 103,721 | 14,766
Person-67779865 | 45,699 | 41,868
Person-1536651 | 34,216 | 456

HITS - Hub
Name | Followers | Following
J. Lupton | 301,965 | 276,780
NOH8 Campaign | 426,079 | 251,158
Person-25915690 | 595,404 | 192,241
M. Allen | 144,540 | 55,678
Person-203455506 | 188,527 | 41,190
NY Daily News | 85,821 | 10,681
Person-18704291 | 19,212 | 21,098
J. Calacanis | 151,155 | 112,248
92YTribeca | 13,015 | 10,560
C.C. Chapman | 34,512 | 28,505
Table 5.8: User influence scores for PageRank and HITS algorithms, for the User+Spot Graph, built from the Twitter dataset.
PageRank
Name | Followers | Following
Person-67779865 | 45,702 | 41,870
J. K. Pulver | 469,092 | 38,542
JobsDirectUSA.com | 17,075 | 18,782
Person-479562736 | 16,703 | 16,241
America Hires | 11,824 | 13,006
Person-52306188 | 9,989 | 9,878
Person-35844123 | 10,030 | 9,761
Person-24883913 | 11,191 | 9,583
Person-213105865 | 8,531 | 9,965
Person-30735143 | 7,837 | 8,513

HITS - Authority
Name | Followers | Following
J. Wortham | 463,772 | 3,424
J. K. Pulver | 469,092 | 38,542
Person-325410549* | — | —
B. Thurston | 124,722 | 5,707
StumbleUpon | 72,133 | 10,370
DL Hughley | 73,835 | 886
J. Rampton | 47,593 | 578
Person-51560438 | 103,721 | 14,766
Person-67779865 | 45,699 | 41,868
Person-1536651 | 34,216 | 456

HITS - Hub
Name | Followers | Following
J. Lupton | 301,965 | 276,780
NOH8 Campaign | 426,079 | 251,158
Person-25915690 | 595,404 | 192,241
M. Allen | 144,540 | 55,678
Person-203455506 | 188,527 | 41,190
NY Daily News | 85,821 | 10,681
Person-18704291 | 19,212 | 21,098
J. Calacanis | 151,155 | 112,248
92YTribeca | 13,015 | 10,560
C.C. Chapman | 34,512 | 28,505
Table 5.9: User influence scores for PageRank and HITS algorithms, for the User Graph, built from the Twitter dataset.
As one can observe from Table 5.10, the great majority of the top-10 highest ranked spots are not venues
per se: the geographical locations associated with these tweets correspond to streets or avenues, due
to the use of Twitter in various mobile applications. Nevertheless, some well-known spots like Times
Square and JFK Airport are naturally highlighted. One can also acknowledge that, in this particular case,
the spots with the greatest number of checkins turn out to be the most influential spots in the dataset.
PageRank
Name | Checkins
Broadway - Times Square | 4
JFK Airport | 2
JFK Airport (Subway Station) | 1
Spot40.80567362:-73.91862858 | 1
Spot40.66931554:-74.20359207 | 1
Spot40.73262798:-73.98359375 | 1
Rosa Mexicano (Restaurant) | 1
The Abyssinian Baptist Church | 1
St Luke’s School | 1
Spot40.742727:-73.994372 | 1

HITS - Authority
Name | Checkins
Pace University | 8
Spot40.679254:-73.8632521 | 1
Spot40.67982674:-73.86344992 | 1
Spot40.6792906:-73.8622276 | 1
Park Lane Hotel | 1
Astoria Bowl | 1
Spot40.7166368:-73.9543937 | 1
Columbus Circle | 1
Spot40.86745661:-74.12978901 | 1
Spot40.89064994:-73.89948689 | 1

HITS - Hub
Name | Checkins
Spot40.71498749:-73.95485289 | 2
Spot40.7827699:-73.95211752 | 1
Spot40.76619859:-73.91322359 | 1
Skin Magic Ltd | 1
Spot40.76614592:-73.91323331 | 1
Spot40.76616717:-73.91319381 | 1
Broadway - Times Square | 1
Spot40.75612638:-73.90477465 | 1
Spot40.76113205:-73.97952078 | 1
JFK Airport | 1
Table 5.10: Spot influence scores for PageRank and HITS algorithms, for the User+Spot Graph, built from the Twitter dataset.
5.3.1.2 Academic social network: DBLP
Table 5.11 presents the top-10 highest ranked papers from the citation network built upon DBLP data,
where recipients of scientific awards are highlighted in bold. From this table one can acknowledge that
the top-10 remained unaltered for scientific papers published until 2010 and until 2011, and that the
majority of these publications are authored by recipients of one or more of the renowned awards from
the list in Appendix A.
Focusing on the titles of these scientific papers, one can also verify that this top-10 comprises publications
that can be considered breakthroughs in a specific research area, e.g., Gerard Salton’s pioneering work in
information retrieval, or inevitable textbook references, e.g., Cormen et al.’s Introduction to Algorithms.
Moreover, even when the authors are not recipients of renowned scientific awards, the fact that they
collaborate with many other authors leads them to be cited in a greater number of publications, reinforcing
their PageRank score.
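The citation ranking described above can be sketched with a minimal PageRank implementation on a toy citation graph. The paper identifiers are invented and this is the standard power-iteration formulation, not the thesis's actual implementation: an edge u → v means paper u cites paper v, so heavily cited papers (and papers cited by important papers) accumulate rank.

```python
# Minimal PageRank sketch on a toy citation graph (hypothetical paper IDs).

def pagerank(edges, damping=0.85, iterations=100):
    nodes = {n for e in edges for n in e}
    out_degree = {n: sum(1 for u, _ in edges if u == n) for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            # rank flows in from each citing paper, split over its references
            inflow = sum(rank[u] / out_degree[u] for u, v in edges if v == n)
            new[n] = (1 - damping) / len(nodes) + damping * inflow
        rank = new
    return rank

edges = [("p1", "classic"), ("p2", "classic"), ("p3", "classic"), ("p3", "p2")]
rank = pagerank(edges)
print(max(rank, key=rank.get))  # "classic" — the most-cited paper
```

Note that "classic" cites nothing (a dangling node); in this simple sketch its outgoing mass is simply dropped, which does not affect the ordering on this toy example.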
Paper | Authors | PageRank 2010 | PageRank 2011
A Unified Approach to Functional Dependencies and Relations | Philip A. Bernstein, J. Richard Swenson, Dennis Tsichritzis | 0.000903919 | 0.000903646
On the Semantics of the Relational Data Model | Hans Albrecht Schmid, J. Richard Swenson | 0.000891394 | 0.000891123
Database Abstractions: Aggregation and Generalization | John Miles Smith, Diane C. P. Smith | 0.000860181 | 0.00085993
Smalltalk-80: The Language and Its Implementation | Adele Goldberg, David Robson | 0.000763314 | 0.000763174
A Characterization of Ten Hidden-Surface Algorithms | Ivan E. Sutherland, Robert F. Sproull, Robert A. Schumacker | 0.000716136 | 0.000716507
An algorithm for hidden line elimination | R. Galimberti | 0.000706674 | 0.000707118
Introduction to Modern Information Retrieval | Gerard Salton, Michael McGill | 0.000699671 | 0.000699584
C4.5: Programs for Machine Learning | J. Ross Quinlan | 0.000635416 | 0.000636705
Introduction to Algorithms | Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest | 0.000592198 | 0.000592414
Compilers: Principles, Techniques, and Tools | Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman | 0.000528325 | 0.000528235
Table 5.11: PageRank scores for top-10 highest ranked papers of the DBLP dataset.
5.3.2 Predicting Future PageRank Scores and Download Counts
In this section, the experiment regarding the prediction of future influence scores and future download
counts is detailed and thoroughly discussed. For clarity, we call the model that includes the age of each
article the age model, and the model that additionally includes the term frequencies of the 100 most
frequent words in the abstract and title of each paper the text model - see Table 5.12.
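How the two feature sets could be assembled for one paper is sketched below. The field names and helper functions are illustrative, not the thesis's actual data schema: the age model takes the paper's previous PageRank scores plus its age, and the text model appends term frequencies over a given vocabulary of frequent words.

```python
# Illustrative feature assembly for the age model and the text model.
from collections import Counter

def age_features(paper, ranks_by_year, year, k=3):
    # PageRank scores of the k previous years, plus the paper's age
    prev = [ranks_by_year[year - i].get(paper["id"], 0.0) for i in range(1, k + 1)]
    return prev + [year - paper["year"]]

def text_features(paper, ranks_by_year, year, vocabulary, k=3):
    # age-model features plus term frequencies of the vocabulary words
    # in the paper's title and abstract
    counts = Counter((paper["title"] + " " + paper["abstract"]).lower().split())
    return age_features(paper, ranks_by_year, year, k) + [counts[w] for w in vocabulary]

paper = {"id": "x", "year": 2005, "title": "Graph ranking",
         "abstract": "ranking graphs by citations"}
ranks_by_year = {2008: {"x": 0.1}, 2009: {"x": 0.2}}
print(age_features(paper, ranks_by_year, 2010, k=2))                # [0.2, 0.1, 5]
print(text_features(paper, ranks_by_year, 2010, ["ranking"], k=2))  # [0.2, 0.1, 5, 2]
```

In the thesis's experiments these feature vectors feed an ensemble regressor (IGBRT); the vocabulary would be the 100 most frequent words over all titles and abstracts.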
From Table 5.12, and considering the experiment of predicting the PageRank scores for the year of
2010, both models provided very similar results, both improving as we added more information, i.e.,
when comparing the three groups of features (PageRank scores; PageRank scores with Racer scores; and
PageRank scores with Racer scores plus the author’s average and maximum PageRank scores)
and also within each group, the quality of the results improves
consistently. Only for the set of features that combines the PageRank score of one previous year with its
respective Racer score and the author’s average and maximum PageRank scores is the age model
outperformed by the text model. Comparing the error rates for the same year, one can see that, for both
models, the error rate increases as we add more information, causing the results to deviate.
Nevertheless, for the first two groups of features, the text model has a lower error rate than the age
model, while the opposite happens for the third group of features.
Having computed the absolute error for all groups of features in both models, the results show that,
on average, the text model always has a lower absolute error than the age model.
Features | ρ (2010) | τ (2010) | NRMSE (2010) | ρ (2011) | τ (2011) | NRMSE (2011)

Age model
Rank k = 1 | 0.9725065 | 0.9163994 | 0.0003224 | 0.9929880 | 0.9837121 | 0.0001057
Rank k = 2 | 0.9836493 | 0.9381865 | 0.0006161 | 0.9999050 | 0.9994758 | 0.0000995
Rank k = 3 | 0.9890716 | 0.9506366 | 0.0006391 | 0.9999002 | 0.9993787 | 0.0004768
Racer + Rank k = 1 | 0.9724540 | 0.9173649 | 0.0003469 | 0.9998887 | 0.9994037 | 0.0002322
Racer + Rank k = 2 | 0.9837098 | 0.9387564 | 0.0006520 | 0.9999004 | 0.9992955 | 0.0001634
Racer + Rank k = 3 | 0.9888725 | 0.9493687 | 0.0006605 | 0.9952435 | 0.9866206 | 0.0005492
A + R + Rank k = 1 | 0.9675213 | 0.9098510 | 0.0005354 | 0.9998529 | 0.9994497 | 0.0002530
A + R + Rank k = 2 | 0.9840530 | 0.9355465 | 0.0008336 | 0.9998353 | 0.9993422 | 0.0002962
A + R + Rank k = 3 | 0.9892456 | 0.9468673 | 0.0006986 | 0.9938021 | 0.9828511 | 0.0005317

Text model
Rank k = 1 | 0.9708719 | 0.9101722 | 0.0003608 | 0.9992124 | 0.9979693 | 0.0002479
Rank k = 2 | 0.9831039 | 0.9310399 | 0.0006268 | 0.9997962 | 0.9992362 | 0.0004543
Rank k = 3 | 0.9886945 | 0.9451537 | 0.0006276 | 0.9995012 | 0.9983375 | 0.0005800
Racer + Rank k = 1 | 0.9711170 | 0.9098901 | 0.0005515 | 0.9994290 | 0.9984499 | 0.0001590
Racer + Rank k = 2 | 0.9832037 | 0.9314405 | 0.0006747 | 0.9997300 | 0.9990720 | 0.0001919
Racer + Rank k = 3 | 0.9887959 | 0.9470102 | 0.0006667 | 0.9994104 | 0.9980729 | 0.0006416
A + R + Rank k = 1 | 0.9705230 | 0.9984499 | 0.0001590 | 0.9997019 | 0.9990583 | 0.0002480
A + R + Rank k = 2 | 0.9837012 | 0.9990720 | 0.0001919 | 0.9998617 | 0.9993443 | 0.0002800
A + R + Rank k = 3 | 0.9888386 | 0.9980729 | 0.0006416 | 0.9998793 | 0.9993885 | 0.0006987
Table 5.12: Results for the prediction of impact PageRank scores for papers in the DBLP dataset.
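The three evaluation measures reported in these tables can be computed as follows. This is a minimal pure-Python sketch: Spearman's ρ and Kendall's τ are given in their tie-free forms, and the NRMSE is normalized here by the range of the true values, one common convention (the thesis may use another normalization).

```python
# Evaluation measures: Spearman's rho, Kendall's tau, normalized RMSE.

def _ranks(values):
    # 1-based ranks, no tie handling (for simplicity)
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0
    return ranks

def spearman_rho(x, y):
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)  # concordant minus discordant
    return 2.0 * s / (n * (n - 1))

def nrmse(pred, true):
    mse = sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)
    return mse ** 0.5 / (max(true) - min(true))

true = [0.4, 0.1, 0.3, 0.2]
pred = [0.38, 0.12, 0.31, 0.18]
print(spearman_rho(pred, true), kendall_tau(pred, true))  # 1.0 1.0
```

Here the predictions preserve the true ordering exactly, so both rank correlations are 1.0 even though the NRMSE is nonzero; this is why the tables report both correlation and error measures.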
For the year of 2011, as we add more information to the models, the text model outperforms the age
model, as shown in the last two sets of features from the third group. Also, in the scenario in which
the models only have information about the immediately previous PageRank score, the age model is
again outperformed by the text model. Nevertheless, when considering the error rates of both models for
this year, the text model has an overall higher error rate than the age model, showing that, even though
the quality of the predicted results is lower for the age model, its results are more accurate.
As with the computation of the absolute error for the year 2010, for all groups of features in both
models, the results for the year of 2011 show that, on average, the text model has a lower absolute error
than the age model.
Regarding the prediction of download counts, depicted in Table 5.13, one can acknowledge that using the
text model increases the quality of our results. In the age model, we can verify that adding information
about the Racer scores to the previous PageRank scores affects the results negatively, while combining
previous PageRank scores with Racer scores and the author’s average and maximum PageRank scores
provides better results with a lower error rate. From this we can conclude that the age model provides a
more accurate prediction as it becomes more complete. The opposite happens in all groups of the text
model, i.e., as we add more information to the model within the same group, the quality of the results
decreases, even though they remain far better than the corresponding results in the age model.
We can also verify that the age model, for the groups of features that only include previous PageRank
scores, and for the ones that combine previous PageRank scores with Racer scores and the author’s
average and maximum PageRank scores, has a lower error rate than the corresponding groups in the text
model. Even though the text model has better overall results, its error rate is greater than that of the
age model for download count prediction.
As for the absolute error, the results showed that, generally, the text model has a lower absolute error
than the age model in all groups except the third.
Features | ρ | τ | NRMSE

Age model
Rank k = 1 | 0.3864814 | 0.2742998 | 0.0080585
Rank k = 2 | 0.4221492 | 0.3001470 | 0.0029377
Rank k = 3 | 0.4323201 | 0.3080974 | 0.0028074
Racer + Rank k = 1 | 0.4396605 | 0.3076576 | 0.0076713
Racer + Rank k = 2 | 0.3370149 | 0.4747241 | 0.0078403
Racer + Rank k = 3 | 0.3313412 | 0.4612442 | 0.0088301
A + R + Rank k = 1 | 0.3377553 | 0.2558403 | 0.0147155
A + R + Rank k = 2 | 0.5335481 | 0.3894899 | 0.0088093
A + R + Rank k = 3 | 0.5406937 | 0.3962472 | 0.0078576

Text model
Rank k = 1 | 0.5250188 | 0.3837016 | 0.0086955
Rank k = 2 | 0.5261168 | 0.3849615 | 0.0087775
Rank k = 3 | 0.5060003 | 0.3674801 | 0.0091976
Racer + Rank k = 1 | 0.5325432 | 0.3887987 | 0.0085328
Racer + Rank k = 2 | 0.5224018 | 0.3822982 | 0.0089440
Racer + Rank k = 3 | 0.5087407 | 0.3703400 | 0.0091979
A + R + Rank k = 1 | 0.5709764 | 0.4234845 | 0.0076071
A + R + Rank k = 2 | 0.5651282 | 0.4180070 | 0.0079000
A + R + Rank k = 3 | 0.5608946 | 0.4148554 | 0.0088935
Table 5.13: Results for the prediction of download numbers for papers in the DBLP dataset.
In brief, from the results in Tables 5.12 and 5.13, we can acknowledge that predicting the number of
downloads is a harder task than predicting future PageRank scores. We can also see that, when
predicting future PageRank scores, the more information is added to the model, the more the results
deviate; the opposite happens when predicting the number of downloads.
Comparing the years of 2010 and 2011, we can acknowledge that predicting the PageRank scores of a
more recent year is easier than predicting those of a progressively more distant year.
5.4 Summary
In this chapter I presented and discussed the results obtained from the experiments of finding influ-
encers in FourSquare and Twitter, as well as in the DBLP citation network, and from the experiments for
predicting future PageRank scores and future download counts for scientific papers downloaded from
the ACM Digital Library.
Regarding location-based social networks, one can acknowledge that, most of the time, the most influ-
ential users in a network are not the ones with the most followers. From the results one can see that,
in the User Graph, the relationships between users unknown to the public prevail, while TV channels,
celebrities and worldwide magazines are highlighted, and thus among the most influential users, in the
User+Spot Graph.
As for the experiment with the DBLP citation network, results have shown that the proposed frame-
work, based on an ensemble regression model, offers highly accurate predictions, providing an effective
mechanism to support the future ranking of papers in academic digital libraries.
Chapter 6
Conclusions
In my MSc thesis I proposed to explore the task of finding influential users in a social network, with
the aid of network analysis techniques and algorithms. As I intended to perform experiments with
different types of social networks, I began by collecting real and up-to-date data from both FourSquare
and Twitter, in order to build two distinct social networks based on location, and gathered a dataset from
the DBLP digital library, already structured in the context of the Arnetminer project, so an academic
citation network could be built.
Influence was then estimated through the computation of state-of-the-art ranking algorithms, such as
PageRank, HITS and IP. In the particular case of the IP algorithm, and concerning location-based social
networks, we wanted to estimate user influence exclusively; thus, instead of building a network with
user-user and user-location ties, the original implementation of the IP algorithm was adapted so that the
resulting network graph consisted solely of weighted user-user ties.
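The weighted user-user graph construction described above (edge weights from co-visited spots, as detailed in Section 6.1) can be sketched as follows. The user and spot names are invented and this is only an illustration of the construction step, not the thesis's actual code:

```python
# Sketch: user-user edges weighted by the number of spots both users visited.
from itertools import combinations

def common_spot_edges(visits):
    """visits: dict mapping user -> set of visited spots.
    Returns weighted user-user edges as {(u, v): shared_spot_count}."""
    edges = {}
    for u, v in combinations(sorted(visits), 2):
        shared = len(visits[u] & visits[v])
        if shared:  # only keep pairs with at least one spot in common
            edges[(u, v)] = shared
    return edges

visits = {"ana": {"bar", "beach", "museum"},
          "bruno": {"bar", "beach"},
          "carla": {"museum"}}
print(common_spot_edges(visits))
# {('ana', 'bruno'): 2, ('ana', 'carla'): 1}
```

The resulting weighted graph contains only user-user ties, which is the input the adapted IP algorithm would then iterate over.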
Regarding the academic citation network, besides an influence estimation for all the papers in the
dataset, we also addressed a recent research topic and developed a framework to predict the future
influence scores of scientific papers and the future download counts of papers downloaded from the
ACM digital library for a specific year, based on the previous years’ influence scores. In this experiment
we could test and combine different sets of features, resulting in two different models for the prediction
of future influence scores: (1) a model including the age of the paper, and (2) a model including the 100
most frequent words in all papers’ titles and abstracts.
Rank aggregation was also part of the initial objectives of this work, in order to combine the outputs of
the different algorithms. Nonetheless, due to some difficulties with the completion of the remaining tasks
included in the MSc thesis work, this task could not be addressed in time.
With the results of our experiments we could perform a detailed characterization of the aforementioned
social networks, and verify that social network analysis techniques can be used to assess the most in-
fluential nodes of a network. As for the prediction of future influence scores, we can conclude that the
framework that was developed for academic citation networks provides reliable and accurate estima-
tions, very close to the real values.
A major limitation of this work resides in the evaluation of the results regarding location-based networks.
Unlike academic social networks, where one can assess the validity of the most influential authors or
the most influential articles through an extensive list of renowned scientific awards that have earned
prestige throughout the years, social network analysis and, more specifically, location-based networks
form a recent area of study in which one does not yet have a list of characteristics that indicates
without flaws that a user or a spot is influential, or a series of public prizes awarded to people, companies
or spots for their relevance and influence in a specific context. Therefore, this evaluation had to be done
by comparison to well-known state-of-the-art social network analysis metrics. Also, social networks are
dynamic, so the set of users or spots that can be considered influential or trendy today might be different
if we make the same estimation, under the same conditions, in a couple of months or a year.
6.1 Summary of Results
In brief, the following are the most important contributions of my MSc thesis, ordered by relevance:
Crawling software
I implemented crawlers to extract data from FourSquare and from Twitter, using their respective
APIs. From the data that was collected I built two location-based networks, from which I extracted
their most influential nodes. The source code for the FourSquare crawler was made available as an
open-source project1, so it can be re-used by others researching this topic.
Implementation and adaptation of the Influence-Passivity (IP) algorithm
Having conducted a thorough study of ranking algorithms, with special focus on the PageRank
algorithm and its variants, I implemented the Influence-Passivity (IP) algorithm. The originality of
this implementation of IP resides in the fact that the network is built in such a way that it only con-
tains user-user arcs, and the weights assigned to each edge depend on the number of spots that
the two users have visited in common. This adaptation of IP reflects the fact that in location-based
networks information spreads differently than in typical social networks. The code for the
implementation of the IP algorithm was made available as an open-source project2, so it can be
used and improved by others researching this topic.
1 http://code.google.com/p/fscrawler/
2 http://code.google.com/p/ezgraph/
Academic Citation Network
From the already structured DBLP data, organized in the context of the Arnetminer project,
I built an academic citation network and extracted its most influential papers through the
computation of the PageRank algorithm. The results were validated against an extensive list of
renowned scientific awards, leading to the conclusion that the majority of the top-10 highest ranked
papers in the network are either authored by recipients of the aforementioned awards, represent
breakthroughs or unquestionable textbooks on a specific topic, or are authored by scientists who
have collaborated and co-authored with a great number of other scientists.
Framework for prediction of future PageRank scores and future download counts
I developed a framework to predict the future PageRank score and the future download counts of a
scientific paper for a specific year, using the academic citation network mentioned in the previous
item.
This task was addressed through an ensemble learning regression algorithm, the IGBRT. I also as-
sessed the impact that different features, and combinations of features such as previous PageRank
scores or the age of the paper, have on the accuracy of the results. Our predictions were compared
to the real PageRank scores and the real number of downloads in the ACM Digital Library for each
specific paper and year, and we concluded that, in some cases, depending on the combination of
features used, adding information can negatively deviate the results, while in others, as we combine
more information, the predictions become closer to the real values.
Globally, this approach to future PageRank prediction proved to be accurate, with the predicted
results very close to the real values.
6.2 Future Work
In terms of future work, it would be important to address all the tasks that I initially intended to fulfill,
namely conducting rank aggregation in the aforementioned experiments. It would also be very
interesting to find the most influential users and spots in more complete datasets, which could result in
much richer networks and subsequent analyses.
Taking advantage of the fact that this research area is still in its infancy, we could combine the work of
this MSc thesis with the work of Lima & Musolesi (2012), which adapts well-known local and global social
network analysis metrics, such as degree or clustering coefficient, that are location-agnostic, giving them
a spatial context, e.g., calculating the degree of a node in the network while only considering the friends
of this node that are associated with a specific geographical location, such as a city or a state.
Also, due to the fact that social networks are dynamic networks, i.e., their structure can change over time
with the addition or loss of nodes and relationships, we could integrate state-of-the-art frameworks
and algorithms in order to include the passage of time in the networks we have studied. Even though
dynamic networks have been frequently addressed with regard to network visualization (Demoll & Mcfarland,
2005), works such as that of Berger-Wolf & Saia (2006) break away from conventional network analysis by
proposing a mathematical framework for dynamic network analysis.
On the other hand, we could also extend our work with the implementation of the temporal distance metrics
proposed by Tang et al. (2009), which can be applied to networks that change over time and allow
us to capture the properties of these time-varying graphs, such as the delay, duration and time order of
interactions between nodes.
Bibliography
AGARWAL, N., LIU, H., TANG, L. & YU, P.S. (2008). Identifying the influential bloggers in a community.
In Proceedings of the 2008 International Conference on Web Search and Web Data Mining.
ANAGNOSTOPOULOS, A., KUMAR, R. & MAHDIAN, M. (2008). Influence and correlation in social net-
works. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining.
ANDERSON, L.R. & HOLT, C.A. (1995). Information cascades in the laboratory. American Economic
Review , 87.
ARGUELLO, J., BUTLER, B.S., JOYCE, E., KRAUT, R., LING, K.S., ROSE, C. & WANG, X. (2006). Talk
to me: foundations for successful individual-group interactions in online communities. In Proceedings
of the 2006 SIGCHI Conference on Human Factors in Computing Systems.
BAKSHY, E., HOFMAN, J.M., MASON, W.A. & WATTS, D.J. (2011). Everyone’s an influencer: quantifying
influence on twitter. In Proceedings of the 4th ACM International Conference on Web Search and Data
Mining.
BASTIAN, M., HEYMANN, S. & JACOMY, M. (2009). Gephi: An open source software for exploring and
manipulating networks. In Proceedings of the 3rd International AAAI Conference on Weblogs and
Social Media.
BERBERICH, K., BEDATHUR, S. & WEIKUM, G. (2006). Rank synopses for efficient time travel on the
web graph. In Proceedings of the 15th ACM International Conference on Information and Knowledge
Management .
BERGER-WOLF, T.Y. & SAIA, J. (2006). A framework for analysis of dynamic social networks. In Pro-
ceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.
BEST, D.J. & ROBERTS, D.E. (1975). Algorithm as 89: The upper tail probabilities of spearman’s rho.
Journal of the Royal Statistical Society. Series C (Applied Statistics), 24.
BOLDI, P. & VIGNA, S. (2004). The webgraph framework I: compression techniques. In Proceedings of
the 13th International Conference on World Wide Web.
BOLDI, P., SANTINI, M. & VIGNA, S. (2005). Pagerank as a function of the damping factor. In Proceed-
ings of the 14th International Conference on World Wide Web.
BOLLEN, J., RODRIGUEZ, M.A. & VAN DE SOMPEL, H. (2006). Journal status. Scientometrics, 69.
BOLLEN, J., VAN DE SOMPEL, H., HAGBERG, A. & CHUTE, R. (2009). A principal component analysis
of 39 scientific impact measures. Public Library of Science, 4.
BONACICH, P. (2007). Some unique properties of eigenvector centrality. Social Networks, 29.
BONDY, J.A. & MURTY, U.S.R. (1976). Graph Theory with Applications. Macmillan.
BRAUER, A. (1952). Limits for the characteristic roots of a matrix. IV: Applications to stochastic matrices.
Duke Mathematical Journal , 19.
BREIMAN, L. (2001). Random forests. Machine Learning, 45.
BRIN, S. & PAGE, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proceed-
ings of the 7th International Conference on World Wide Web.
CHA, M., HADDADI, H., BENEVENUTO, F. & GUMMADI, K.P. (2010). Measuring user influence in twitter:
The million follower fallacy. In Proceedings of the 2010 International AAAI Conference on Weblogs
and Social Media.
CHEN, C. (2006). Citespace II: Detecting and visualizing emerging trends and transient patterns in
scientific literature. Journal of the American Society for Information Science, 57.
CHEN, P., XIE, H., MASLOV, S. & REDNER, S. (2007). Finding scientific gems with google’s pagerank
algorithm. Journal of Informetrics, 1.
CLARK, J. & HOLTON, D.A. (1991). A First Look at Graph Theory . World Scientific.
CONITZER, V. (2006a). Computational Aspects of preference aggregation. Ph.D. thesis, Carnegie Mellon
University.
CONITZER, V. (2006b). Computing slater rankings using similarities among candidates. In Proceedings
of the 21st National Conference on Uncertainty in Artificial Intelligence.
CONITZER, V. & SANDHOLM, T. (2005). Common voting rules as maximum likelihood estimators. In
Proceedings of the 2005 National Conference on Uncertainty in Artificial Intelligence.
CORMEN, T.H., LEISERSON, C.E., RIVEST, R.L. & STEIN, C. (2001). Introduction to Algorithms. The
MIT Press, 2nd edn.
DEMOLL, B.S. & MCFARLAND, D. (2005). The Art and Science of Dynamic Network Visualization. Jour-
nal of Social Structure, 7.
DEVEZAS, J., NUNES, S. & RIBEIRO, C. (2011). Using the H-index to Estimate Blog Authority. In Pro-
ceedings of the 5th International AAAI Conference on Weblogs and Social Media.
DIESTEL, R. (2005). Graph Theory , vol. 173. Springer-Verlag, Heidelberg, 3rd edn.
DING, Y. & CRONIN, B. (2011). Popular and/or prestigious? measures of scholarly esteem. Information
Processing and Management , 47.
DING, Y., YAN, E., FRAZHO, A. & CAVERLEE, J. (2009). Pagerank for ranking authors in co-citation
networks. Journal of the American Society for Information Science and Technology , 60.
DUTTON, G. (1996). Improving locational specificity of map data - a multi-resolution, metadata-driven
approach and notation. International Journal of Geographical Information Science, 10.
EASLEY, D. & KLEINBERG, J. (2010). Networks, Crowds, and Markets: Reasoning About a Highly Con-
nected World . Cambridge University Press.
EGGHE, L. (2006). Theory and practise of the g-index. Scientometrics, 69.
EGGHE, L. (2009). Lotkaian informetrics and applications to social networks. The Bulletin of the Belgian
Mathematical Society , 16.
FIALA, D., ROUSSELOT, F. & JEZEK, K. (2008). PageRank for bibliographic networks. Scientometrics,
76.
FRANCK, G. (1999). Essays on Science and Society: Scientific Communication–A Vanity Fair? Science,
286.
FREEMAN, L.C. (1978). Centrality in social networks conceptual clarification. Social Networks, 215.
GELLER, C. (2002). Single transferable vote with Borda elimination: A new vote counting system. Tech.
rep., Deakin University, Faculty of Business and Law, School of Accounting, Economics and Finance.
GHOSH, R., LERMAN, K., SURACHAWALA, T., VOEVODSKI, K. & TENG, S.H. (2011). Non-conservative
diffusion and its application to social network analysis. Arxiv article pre-print.
GIBBONS, A. (1985). Algorithmic Graph Theory . Cambridge University Press.
HAGBERG, A.A., SCHULT, D.A. & SWART, P.J. (2008). Exploring network structure, dynamics, and
function using NetworkX. In Proceedings of the 7th Python in Science Conference.
HARARY, F. (1962). The determinant of the adjacency matrix of a graph. Society for Industrial and Ap-
plied Mathematics, 4.
HAVELIWALA, T.H. (2002). Topic-sensitive pagerank. In Proceedings of the 11th international conference
on World Wide Web.
HEIDEMANN, J., KLIER, M. & PROBST, F. (2010). Identifying key users in online social networks: A
pagerank based approach. In Proceedings of the 31st International Conference on Information Sys-
tems.
HIRSCH, J.E. (2010). An index to quantify an individual’s scientific research output that takes into ac-
count the effect of multiple coauthorship. Scientometrics, 85.
HUBERMAN, B.A., ROMERO, D.M. & WU, F. (2009). Crowdsourcing, attention and productivity. Journal
of Information Science, 35.
JOACHIMS, T. (1999). Making large-scale support vector machine learning practical. In Advances in
Kernel Methods, MIT Press.
JOACHIMS, T. (2002). Learning to classify text using support vector machines. Kluwer, dissertation.
KAISER, M. (2008). Mean clustering coefficients: the role of isolated nodes and leafs on clustering
measures for small-world networks. New Journal of Physics, 10.
KISELEV, V. (2008). On eligibility by the Borda voting rules. International Journal of Game Theory , 37.
KLEINBERG, J.M. (1998). Authoritative sources in a hyperlinked environment. In Proceedings of the 9th
Annual ACM-SIAM Symposium on Discrete Algorithms.
LEAVITT, A., BURCHARD, E., FISHER, D. & GILBERT, S. (2009). The influentials: New approaches for
analyzing influence on twitter. Webecology Project.
LEBANON, G. & LAFFERTY, J.D. (2002). Cranking: Combining rankings using conditional probability
models on permutations. In Proceedings of the 19th International Conference on Machine Learning.
LI, H. (2011). Learning to Rank for Information Retrieval and Natural Language Processing. Morgan &
Claypool Publishers.
LIMA, A. & MUSOLESI, M. (2012). Spatial dissemination metrics for location-based social networks.
In Proceedings of the 4th ACM International Workshop on Location-Based Social Networks (LBSN
2012). Colocated with ACM UbiComp 2012.
LIU, T.Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3.
LIU, X., BOLLEN, J., NELSON, M.L. & VAN DE SOMPEL, H. (2005). Co-authorship networks in the digital library research community. Information Processing and Management, 41.
LOTKA, A.J. (1926). The frequency distribution of scientific productivity. Journal of the Washington
Academy of Science, 16.
LUCIANO, RODRIGUES, F.A., TRAVIESO, G. & BOAS, V.P.R. (2005). Characterization of complex networks: A survey of measurements. Advances in Physics, 56.
MACSKASSY, S.A. & PROVOST, F. (2007). Classification in networked data: A toolkit and a univariate
case study. Journal of Machine Learning Research, 8.
MCPHERSON, M., SMITH-LOVIN, L. & COOK, J.M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27.
MIHALCEA, R. (2004). Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the 2004 Annual Meeting of the Association for Computational Linguistics.
MILLEN, D.R. & PATTERSON, J.F. (2002). Stimulating social engagement in a community network. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work.
MOHAN, A., CHEN, Z. & WEINBERGER, K.Q. (2011). Web-search ranking with initialized gradient boosted regression trees. Journal of Machine Learning Research - Proceedings Track, 14.
NEWMAN, M.E.J. (2003). A measure of betweenness centrality based on random walks. Social Networks, 27.
NEWMAN, M.E.J. (2004). Analysis of weighted networks. Physical Review E, 70.
OLIVER, J.J. & HAND, D.J. (1995). On pruning and averaging decision trees. In Proceedings of the 12th International Conference on Machine Learning, Morgan Kaufmann.
PAGE, L., BRIN, S., MOTWANI, R. & WINOGRAD, T. (1998). The pagerank citation ranking: Bringing
order to the web. In Proceedings of the 7th International World Wide Web Conference.
PAPAGELIS, M., BANSAL, N. & KOUDAS, N. (2009). Information cascades in the blogosphere: A look
behind the curtain. In Proceedings of the 3rd International AAAI Conference on Weblogs and Social
Media.
PERRA, N. & FORTUNATO, S. (2008). Spectral centrality measures in complex networks. Physical Review E, 78.
PROCACCIA, A.D., ZOHAR, A. & ROSENSCHEIN, J.S. (2006). Automated design of voting rules by learning from examples. In Proceedings of the 1st International Workshop on Computational Social Choice.
REKA, A. & BARABASI, A.L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74.
ROMERO, D.M., GALUBA, W., ASUR, S. & HUBERMAN, B.A. (2011). Influence and passivity in social
media. In Proceedings of the 20th International Conference Companion on World Wide Web.
SAYYADI, H. & GETOOR, L. (2009). Futurerank: Ranking scientific articles by predicting their future
pagerank. In Proceedings of the 2009 SIAM International Conference on Data Mining.
SHANNON, P., MARKIEL, A., OZIER, O., BALIGA, N.S., WANG, J.T., RAMAGE, D., AMIN, N., SCHWIKOWSKI, B. & IDEKER, T. (2003). Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research, 13.
SIDIROPOULOS, A. & MANOLOPOULOS, Y. (2005). A citation-based system to assist prize awarding. ACM SIGMOD Record, 34.
SIDIROPOULOS, A., KATSAROS, D. & MANOLOPOULOS, Y. (2007). Generalized hirsch h-index for disclosing latent facts in citation networks. Scientometrics, 72.
SZABO, G. & HUBERMAN, B.A. (2010). Predicting the popularity of online content. Communications of
the ACM, 53.
SZALAY, A.S., GRAY, J., FEKETE, G., KUNSZT, P.Z., KUKOL, P. & THAKAR, A. (2007). Indexing the sphere with the hierarchical triangular mesh. Technical Report.
TANG, J., MUSOLESI, M., MASCOLO, C. & LATORA, V. (2009). Temporal distance metrics for social
network analysis. In Proceedings of the 2nd ACM workshop on Online social networks.
WALKER, D., XIE, H., YAN, K.K. & MASLOV, S. (2007). Ranking scientific publications using a simple
model of network traffic. Journal of Statistical Mechanics.
WATTS, D.J. & DODDS, P.S. (2007). Influentials, networks, and public opinion formation. Journal of
Consumer Research, 34.
WATTS, D.J. & STROGATZ, S.H. (1998). Collective dynamics of ’small-world’ networks. Nature, 393.
WENG, J., LIM, E.P., JIANG, J. & HE, Q. (2010). Twitterrank: finding topic-sensitive influential twitterers.
In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining.
WU, F., WILKINSON, D.M. & HUBERMAN, B.A. (2009). Feedback loops of attention in peer production.
In Proceedings of the 2009 International Conference on Computational Science and Engineering.
XIA, L., LANG, J. & MONNOT, J. (2011). Possible winners when new alternatives join: New results coming up. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems.
XING, W. & GHORBANI, A. (2004). Weighted pagerank algorithm. In Proceedings of the 2004 Annual
Conference on Communication Networks and Services Research.
YAN, E. & DING, Y. (2011). Discovering author impact: A pagerank perspective. Information Processing and Management, 47.
YANG, J. & COUNTS, S. (2010). Predicting the speed, scale, and range of information diffusion in twitter.
In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media.
YOUNG, H.P. (2009). Innovation Diffusion in Heterogeneous Populations: Contagion, Social Influence, and Social Learning. American Economic Review, 99.
ZHANG, C.T. (2009). The e-index, complementing the H-index for excess citations. Public Library of
Science, 4.
ZHENG, Y. & ZHOU, X., eds. (2011). Computing with Spatial Trajectories. Springer.
Appendix A
Important Awards in Computer Science
The following renowned award lists were used as ground-truth lists when assessing the validity of the PageRank scores obtained for the DBLP dataset:
• A. M. Turing Award1
• Knuth Prize2
• IEEE John von Neumann Medal3
• IEEE Emanuel R. Piore Award4
• ACM SIGMOD Edgar F. Codd Innovations Award5
• ACM SIGMOD Best Paper Award6
• ACM SIGMOD Test of Time Award7
• ACM Software System Award8
• ACM Innovation Award9
• National Science Foundation Presidential Young Investigator Award10
1http://amturing.acm.org/
2http://www.sigact.org/Prizes/Knuth/
3http://www.ieee.org/about/awards/medals/vonneumann.html
4http://www.ieee.org/about/awards/tfas/piore.html
5http://www.sigmod.org/sigmod-awards/sigmod-awards#innovations
6http://www.sigmod.org/sigmod-awards/sigmod-awards#bestpaper
7http://www.sigmod.org/sigmod-awards/sigmod-awards#time
8http://awards.acm.org/homepage.cfm?srt=all&awd=149
9http://www.sigkdd.org/awards_innovation.php
10http://www.nsf.gov/awards/presidential.jsp
• SIGIR Gerard Salton Award11
11http://www.sigir.org/awards/awards.html
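The assessment described above — ranking DBLP nodes with PageRank and checking the top-ranked items against ground-truth award lists — can be illustrated with a small toy sketch. The citation graph, the ground-truth set, and the `precision_at_k` helper below are hypothetical illustrations, not data or code from the thesis:

```python
def pagerank(edges, alpha=0.85, iters=50):
    """Plain power-iteration PageRank; an edge (u, v) means 'u cites v'."""
    nodes = {n for edge in edges for n in edge}
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - alpha) / n for v in nodes}
        for u in nodes:
            if out[u]:  # distribute rank over outgoing citations
                share = alpha * rank[u] / len(out[u])
                for v in out[u]:
                    new[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += alpha * rank[u] / n
        rank = new
    return rank

def precision_at_k(ranking, ground_truth, k):
    """Fraction of the top-k ranked items found in the ground-truth set."""
    return len(set(ranking[:k]) & set(ground_truth)) / k

# Toy citation graph: p2 is cited four times, p3 twice.
edges = [("p1", "p2"), ("p3", "p2"), ("p4", "p2"),
         ("p4", "p3"), ("p5", "p2"), ("p5", "p3")]
scores = pagerank(edges)
ranking = sorted(scores, key=scores.get, reverse=True)

# Hypothetical ground truth, e.g., papers by award-winning authors.
ground_truth = {"p2", "p3"}
print(ranking[:2], precision_at_k(ranking, ground_truth, 2))
```

The most-cited paper (p2) ends up ranked first, and both ground-truth papers appear in the top two, so precision at k = 2 is 1.0 in this toy example.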