The Pennsylvania State University
The Graduate School
College of Engineering
SIMPLIFYING LARGE SCALE-FREE NETWORKS BY
OPTIMIZING THE INFORMATION CONTENT AND
INCOMPREHENSIBILITY
A Thesis in Industrial Engineering
by Juxihong Julaiti
Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science
December 2016
The thesis of Juxihong Julaiti was reviewed and approved* by the following:
Soundar R. T. Kumara, Allen E. Pearce/Allen M. Pearce Professor of Industrial and Manufacturing Engineering, Thesis Adviser
Conrad Tucker, Assistant Professor of Industrial and Manufacturing Engineering
Janis Terpenny, Professor of Industrial and Manufacturing Engineering, Head of the Department of Industrial and Manufacturing Engineering
*Signatures are on file in the Graduate School.
Abstract
Visualizations of networks are critical for gaining insight into complex systems; however, a complex network containing billions of nodes and edges is incomprehensible. In this thesis, we focus on the simplification of large scale-free networks by maximizing the information content and minimizing the incomprehensibility of networks. An algorithm is developed to increase information content further by letting the sizes of retained nodes represent the information lost. Our proposed method is verified through an experimental study; the results reveal that the optimal filtering strategy for large scale-free networks is to retain roughly the top 9% of vertices based on their degree centrality. This method can be applied to visualize the main information of large scale-free networks or used as a pre-processing step before running other algorithms on networks.
Contents
List of Figures
List of Tables
Chapter 1: Background
1.1 Introduction
1.2 Network Concepts
1.3 Problem description
Chapter 2: Literature Review
Chapter 3: Methodology
3.1 Incomprehensibility
3.2 Information function
3.3 Neighbors-filtering simplification algorithm
Chapter 4: Experimentation
Chapter 5: Conclusions
Bibliography
List of Figures
1 A large Facebook network
2 The incomprehensibility changes based on the size of networks
3 An example of how incomprehensibility changes based on the size of networks
4 I(G) and IC(G) in a filtering process of a connected network
5 Filtering process of a connected network
List of Tables
1 Results of simplification on 10 simulated scale-free networks
2 Original networks and simplified networks based on optimal filter strategy
Chapter 1
Background
1.1 Introduction
Networks are increasingly being used in different fields, including biology, physics, social science, and transportation. However, extracting information from or applying algorithms to vast and complex networks, such as the one shown in Figure 1, is sometimes challenging. The difficulties are significant when time and computational resources are limited, and this necessitates intelligent analysis. Research has identified classes of properties that many real-world networks obey, such as degree power laws, which show that the set of node degrees has a heavy-tailed distribution [15]. Such degree distributions have been identified in phone-call networks [1], the Internet [9], the Web [3], click-stream data [4], and trust in social networks [5]. This property helps in simplifying the analysis of large networks and leads to focusing on certain nodes of the complex network. Even though simplifying a network decreases the information of the network, the Shannon entropy measure [22] can be used to quantify the information lost. With additional visualization quality measurements, the problem can be treated as an optimization problem that minimizes the information lost and maximizes the comprehensibility of the network visualizations.
Figure 1: A large Facebook network
In this thesis we propose an incomprehensibility measure to indicate the visualization quality of the network. The objective in the optimization of network visualization is to maximize information content and at the same time minimize the incomprehensibility index. This process offers a way to find an optimal strategy, which finds the break-even point between information content and the incomprehensibility index. In other words, the process shrinks the network considerably but keeps most of the information.
Our contribution is three-fold.
1. Using Shannon entropy to evaluate the information content of a network. It is well known that both fully connected and sparsely connected networks have very low information content compared to networks with a stochastic degree distribution. By assigning probabilities to nodes based on their degree centrality, the Shannon entropy is able to capture this phenomenon. It provides a way to measure the changes in information content during the simplification.
2. We define the incomprehensibility of a given network, which provides a way to quantify how heavy the burden is when a human is viewing the network. By modeling the simplification process as an optimization problem, our method provides an optimal strategy that shows which nodes should be retained.
3. We develop a parametric algorithm for the network filtering process to further reduce information losses. The algorithm records the information loss by increasing the sizes of vertices based on the degrees of their filtered neighbors, so that the sizes give a hint of which nodes were connected with larger, less important, filtered hubs.
The remaining parts of the thesis are organized as follows: in the next subsection, we introduce concepts of networks and degree centrality. In the second section, we review recent work and applications of network visualization. In the third section, we introduce the incomprehensibility, the information function, and our simplification algorithm. In the fourth section, the experimentation is conducted by simplifying three large networks with power-law degree distributions. Conclusions, limitations, and future work are discussed in the fifth section.
1.2 Network Concepts
In this work we use the terms graph and network interchangeably. A network is, in its simplest form, a collection of points joined by lines [20]. The points are referred to as nodes, and the lines are referred to as edges. Many objects of interest in the physical, biological, and social sciences can be thought of as networks, leading to new and useful insights [20]. Formally, a network G is a pair of sets (N, E), where N is the set of nodes and E is the set of edges, formed by pairs of nodes [26]. Specifically, when the edges of a network do not have a direction, the network is called an undirected network; when the edges have a direction associated with them, the network is called a directed network.
A large volume of research on networks has been devoted to the concept of centrality. The corresponding research addresses the question: which nodes are the most important or central to a network? There are many possible definitions of importance and, correspondingly, many centrality measures for networks. In this section, we focus on degree centrality, which is used in this work.
The number of edges connected to a node is known as its degree. The degree is sometimes called degree centrality [20]. In directed networks, nodes have both an in-degree and an out-degree, corresponding to the number of edges pointing to the node and the number of edges pointing from the node. If a node has high degree centrality, then the node might have more influence, more access to information, more prestige, or be more widely known [20]. For an undirected network of n nodes, the degree of node i can be written in terms of the adjacency matrix as C_D(i) = Σ_{j=1}^{n} A_{ij}, which is the number of edges connected to node i [20]. For a directed network of n nodes, the in-degree and out-degree of node i can be written in terms of the adjacency matrix as C_D^in(i) = Σ_{j=1}^{n} A_{ij} and C_D^out(i) = Σ_{j=1}^{n} A_{ji} [20].
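These degree computations can be sketched directly in Python (a minimal illustration; the function names are ours, not from the thesis):

```python
# Adjacency-matrix degree computations. Convention: A[i][j] = 1 when an
# edge connects j to i, so rows index edge heads.

def degree_undirected(A):
    """C_D(i) = sum_j A[i][j]: row sums of the adjacency matrix."""
    return [sum(row) for row in A]

def degrees_directed(A):
    """Returns (in-degrees, out-degrees): row sums and column sums."""
    return [sum(row) for row in A], [sum(col) for col in zip(*A)]

# Undirected triangle plus one pendant node:
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
print(degree_undirected(A))  # → [2, 2, 3, 1]
```

For an undirected network the row and column sums coincide; for a directed network the two functions differ exactly as the two formulas above do.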
1.3 Problem description
We identify the following issues when exploring existing work in visualizing large
networks:
1. Congested or incomprehensible visuals: as the number of nodes and edges increases, so does the space required for visualizing the network. When the network size reaches thousands, millions, or billions of nodes, as is common in social networks, extracting useful information and patterns from visuals becomes challenging even with the best layout settings [7].
2. Uncertainty of the degree of simplification: over-simplification decreases the readability of networks, as does under-simplification. Over-simplification removes too much information from the original network by keeping just the nodes with the highest centrality, which can be done by looking at the list of centralities. By contrast, under-simplification usually keeps too much information for comprehension. Hence, it is hard to find a suitable degree of simplification that keeps enough information to study the network.
3. Neglect of neighbors in resizing: using the sizes of nodes is equivalent to adding an additional dimension to the visualization; it allows us to extract more information from networks by sizing nodes based on the features of interest. Existing resizing algorithms allow the user to set the size of a given node based on its attributes. In the simplification process, we want to use the information lost by filtering neighbors as the size of a given node, so that we can observe most of the information of a network and see how much information loss is associated with each node. However, no such algorithm exists.
Chapter 2
Literature Review
The mathematical study of networks, known as network theory, began with Euler's solution to the 'Bridges of Königsberg' problem. In the past decades, as more and more data have become available, the scientific study of networks has attracted researchers from physics, computer science, mathematics, social science, and chemistry. Related work has been introduced and discussed in books such as Linked: The New Science of Networks [3] and Network Science by Albert-László Barabási, Networks: An Introduction [20] by Mark Newman, as well as Six Degrees: The Science of a Connected Age [25] by Duncan Watts.
As the study of networks has become popular, networks have become an important model for complex systems. The formalism is crucial to increasing our understanding of natural and social systems, as well as to building the ability to engineer and effectively use complex networks. For example, it can help us to control the spread of diseases and gossip, build robust communication structures, and establish better Web search engines.
Vilfredo Pareto made contributions to economics at the end of the 19th century. He was the first person to discover that income follows a power-law probability distribution. It shows that roughly 80% of the money is earned by 20% of the population, which is also known as the 80/20 rule. The rule works in many other domains: in project management, it is common that 20% of the work consumes 80% of the time and resources; in inventory management, it can often be seen that 80 percent of profits are produced by only 20 percent of the products. This rule is helpful since it reveals that not only doing things right matters but also doing the right things. Therefore, it is critical to identify those 20% of important things in the first place when time and resources are limited.
A similar phenomenon can be captured in networks: 80 percent of links on the Web point to only 15 percent of web pages; 80 percent of citations go to only 38 percent of scientists; 80 percent of links in Hollywood are connected to 30 percent of actors [3]. We encounter scale-free networks in most real networks. They represent a signature of a deeper organizing principle that is called the scale-free property [3]. In scale-free networks, the degree distribution follows a power law as (1) shows, where n is a node in the node set N of a given scale-free network and k represents the degree centrality of node n. γ ranges from 2 to 3 based on the network. From (1), we can observe that if we randomly pick one node from a scale-free network, the chance of choosing a node with relatively high degree centrality is much less than the chance of getting one with a relatively low degree.

P(degree(n) ≥ k) ~ k^(−γ),  2 ≤ γ ≤ 3,  n ∈ N    (1)
The power-law distribution indicates that no matter how big a scale-free network is, there is always a fixed percentage of nodes having great degree centrality. Usually, these types of nodes play an important role in many complex systems. For example, in social networks, the higher the number of connections a person has, the faster a message spreads through him or her; the more transient population a city has, the faster a disease spreads through that city. Therefore, when we study a scale-free network and time and resources are limited, it is crucial to identify these critical nodes.
As it is colloquially stated that a picture is worth a thousand words, the human brain responds faster to a picture than to text because the eye and the visual cortex of the brain form a parallel processor that provides the highest-bandwidth channel into human cognitive centers [23]. By augmenting cognition with the human visual system's highly tuned ability to see patterns and trends, visualization tools are expected to aid comprehension of the dynamics, intricacies, and properties [21].
However, a network visualization is useful only when it has good visualization quality. There are two types of measurements to estimate the quality. One is computational measurements, called readability metrics: readers can estimate how readable the network is by calculating metrics such as edge crossings, edge crossing angles, average edge length, node occlusion, etc. The other is empirical measurements: readers estimate how useful the network is by performing tasks or asking for expert opinion [12].
It seems true that with a suitable layout and settings we could always make a network comprehensible; however, if the network contains billions of nodes and edges, even if it has great readability metrics, it is still incomprehensible. The experimental studies herein give many examples of the challenges of viewing thousands of nodes, because comprehending interactions in networks with thousands of nodes or more is a challenging visualization task. These studies are considered small by today's standards because social network analyses are being done on networks with millions or billions of nodes [17]. Hence, generally speaking, a network containing billions of nodes and edges is less readable than a network containing fewer than a hundred nodes and edges. With limited computational resources, a simplification is necessary to show the crucial features of the systems and give us insight into them, so that a further study can be conducted to understand the details that matter most.
Network simplification has been studied to address different problems. For instance, maximum-flow problems are one practical application. [27] studied the simplification of flow networks to reduce the time of finding the maximum flow; it identifies useful and useless edges and achieves network simplification by deleting useless edges without changing the maximum flow. [18] simplified networks by decomposing the networks and identifying bi-connected components to reduce the time of finding the maximum flow. Similarly, simplification is also used to find a minimum spanning tree in a shorter time [6].
Grouping nodes using different criteria is another way to reduce visualization complexity. The NetLens interface was designed around the abstract Content-Actor network data model to allow users to pose a series of simple queries and iteratively refine visual overviews and sorted lists. This enables the support of complex queries that are traditionally hard to specify [14]. PivotGraph is a software tool that uses a new technique for visualizing and analyzing network structures. It can explore and analyze multivariate networks by combining a grid-based visualization with straightforward data reduction techniques [24].
Alternatively, a hierarchical topological clustering shows a network of meta-nodes, as in ASK-GraphView [2]; van Ham and van Wijk [11] used semantic fisheye views to show clusters as merging spheres. Other approaches to creating overview networks include network summarization [19] and aggregating nodes by shared neighbor sets [16]. In each of these techniques, it can be difficult to understand the topology of the individual aggregates, often because of the ambiguous nature of clustering algorithms [8].
Even though many studies have been reported on simplifying networks for visualization purposes, the resulting simplified networks with a large number of nodes are still incomprehensible. Therefore, which nodes should be retained, and why, with quantitative measurements of comprehensibility and information content, is the gap we are trying to fill. Since scale-free networks have a power-law degree distribution, which provides an elegant way to model the networks by connecting probability and importance, this thesis is focused on simplifying large scale-free networks by optimizing the information content and incomprehensibility.
Chapter 3
Methodology
3.1 Incomprehensibility
Logarithms are often used in models of human perception [13]. Hick's law proposes a logarithmic relation between the time individuals take to choose an alternative and the number of choices they have [15]. Since the number of nodes and the number of edges affect the choices individuals face when doing tasks on a network, we assume the following axiom is true: the general readability decreases logarithmically as the size of the network increases. We define the incomprehensibility of a network as function (2) shows:

IC(G) = log2(|E|)    (2)

where G is the network and |E| is the number of edges the network contains.
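Equation (2) is a one-liner in Python; the sketch below (our illustration, not thesis code) also shows the logarithmic behavior the axiom assumes:

```python
import math

def incomprehensibility(num_edges):
    """IC(G) = log2(|E|) for a network with |E| edges."""
    return math.log2(num_edges)

# Doubling the number of edges always adds exactly 1 to IC(G), so the
# measure grows ever more slowly in absolute edge count -- the
# logarithmic-perception assumption behind the axiom:
print(incomprehensibility(100), incomprehensibility(200))
```

Note that for a directed and an undirected network with the same node count and density, the directed one has twice the edges, and hence an IC(G) exactly 1 higher, which is the content of Lemma 2 below.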
Based on the axiom, we can prove the following lemmas:

Lemma 1: The incomprehensibility increases logarithmically as the density of a network increases.

Proof. Let G be an undirected network such that |N| > 1 and 0 < |E| ≤ PotentialConnections(G).
Since G is undirected, PotentialConnections(G) = |N|(|N|−1)/2.
Since Density(G) = |E| / PotentialConnections(G),
|E| = Density(G) · |N|(|N|−1)/2,
so IC(G) = log2(|E|) = log2(Density(G) · |N|(|N|−1)/2).
Then, when |N| is fixed, IC(G) increases logarithmically as Density(G) increases.

Proof. Let G be a directed network such that |N| > 1 and 0 < |E| ≤ PotentialConnections(G).
Since G is directed, PotentialConnections(G) = |N|(|N|−1).
Since Density(G) = |E| / PotentialConnections(G),
|E| = Density(G) · |N|(|N|−1),
so IC(G) = log2(|E|) = log2(Density(G) · |N|(|N|−1)).
Then, when |N| is fixed, IC(G) increases logarithmically as Density(G) increases.

Lemma 2: With the same number of nodes and density, directed networks always have higher incomprehensibility than undirected networks.

Proof. Let G1 be an undirected network such that |N1| > 1 and 0 < |E1| ≤ PotentialConnections(G1), and let G2 be a directed network such that |N2| > 1 and 0 < |E2| ≤ PotentialConnections(G2).
Then IC(G1) = log2(Density(G1) · |N1|(|N1|−1)/2) and IC(G2) = log2(Density(G2) · |N2|(|N2|−1)).
Then, when |N1| = |N2| and Density(G1) = Density(G2), IC(G2) > IC(G1).
Figure 2 shows that the incomprehensibility increases at a slower rate as |E| increases. As the example in Figure 3 shows, there are two groups of networks, G1 and G2; each network in each group contains 50 nodes. The growth of edges from G11 to G13 and from G21 to G23 is the same; however, the increase in the number of edges can be visually captured more easily in G1 than in G2. By comparing the incomprehensibility changes, we can see IC(G12) − IC(G11) = 0.36 > IC(G22) − IC(G21) = 0.06 and IC(G13) − IC(G12) = 0.29 > IC(G23) − IC(G22) = 0.06, which shows that the same amount of change in |E| has a greater effect on IC(G) when |E| is smaller. The result indicates that 1) changes in the size of a network are easier to perceive when it contains fewer edges and harder to perceive when it contains more edges, and 2) the incomprehensibility measure is able to capture this phenomenon.

Figure 2: The incomprehensibility changes based on the size of networks
Figure 3: An example of how incomprehensibility changes based on the size of networks
3.2 Information function
An important question in network analysis is: are all of the nodes in a network equally important for readers? The Pareto principle states that, for many events, roughly 80% of the effects come from 20% of the causes. Like the Pareto principle, in scale-free networks only a small number of nodes have a high degree; most of the nodes have relatively low degree [10][13].
In such networks, if we randomly pick one node, the probability of choosing a node that has relatively high centrality is lower than the probability of choosing a node that has relatively low centrality. Based on maximum likelihood estimation, we can estimate the probability of the occurrence of a node with centrality x by P(x) ≈ freq(x)/|N|, where freq(x) is the count of the nodes that have centrality x. Based on this probability, we can calculate how much information is contained in a network. Hence, by comparing the information content of the network before and after simplification, we can estimate how much information is lost when we simplify the network.
Named after Boltzmann's H-theorem, Shannon defined the entropy [22] H of a discrete random variable X with possible values {x1, x2, ..., xn} and probability mass function P(X) as H(X) = Σ P(X = x) · I(x), where I(x) = log2(1/P(X = x)) is the information function; it represents the information content of X. For example, if there are two events X1 and X2, and X1 is less likely to happen than X2, then P(X1) < P(X2) and therefore I(X1) > I(X2), which means the occurrence of event X1 provides more information than that of X2.
We use I(G) = Σ_{n∈N} I(n) = Σ_{n∈N} log2(1/P(X = centrality(n))) to measure the information of a network, where G is the network, N is the set of nodes, n is a node in N, and P(X = centrality(n)) is estimated by freq(X = centrality(n))/|N|. I(G) measures the total information content of the network. The representation is intuitive: it shows that filtering one node with relatively high centrality will decrease I(G) dramatically, while filtering a node with relatively low centrality does not have a considerable impact on I(G).

An assumption similar to the one made for the incomprehensibility of a given network can be applied here: humans respond to information changes logarithmically. Therefore, in this thesis, we use I(G) = log2(Σ_{n∈N} I(n)) = log2(Σ_{n∈N} log2(1/P(X = centrality(n)))).
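This measure can be sketched in Python; `information_content` below is our hypothetical helper, estimating P(X = centrality(n)) from the degree histogram via the maximum-likelihood estimate above:

```python
import math
from collections import Counter

def information_content(degrees):
    """I(G) = log2( sum_n log2(1 / P(X = centrality(n))) ),
    with P(X = c) estimated as freq(c) / |N|."""
    n = len(degrees)
    freq = Counter(degrees)
    total = sum(math.log2(n / freq[d]) for d in degrees)
    return math.log2(total)

# A rare high-degree hub contributes far more information than any
# single node from the common low-degree mass:
degrees = [1, 1, 1, 1, 1, 1, 1, 2, 2, 9]
print(information_content(degrees))
```

Removing the degree-9 hub from this list lowers the sum inside the outer log by log2(10) bits, while removing one degree-1 node lowers it by only log2(10/7) bits, matching the intuition in the text.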
Figure 4: I(G) and IC(G) in a filtering process of a connected network
Given a connected undirected network G with a power-law degree distribution, when we filter it, the amount of information I(G), as well as the incomprehensibility IC(G), decreases as Figure 4 shows, where both I(G) and IC(G) are scaled to be between 0 and 1.

Since the objective is to maximize I(G) and minimize IC(G), the optimal simplification strategy can be obtained by maximizing I(G) − IC(G). Figure 5 illustrates the impact of removing nodes from a given network G on I(G) and IC(G). In the simplifying process: 1) |N|, I(G), and IC(G) are scaled to be between 0 and 1; 2) the vertices are sorted in non-decreasing order with respect to their degree d_i, where i ∈ N and N is the node set of the given network G, and nodes with smaller degree are always removed earlier than nodes with larger degree.
Figure 5: Filtering process of a connected network
Figure 5 shows that when no node is removed (|N| = 1), G has the maximum information content (I(G) = 1) and the maximum incomprehensibility measure (IC(G) = 1). As |N| decreases, both IC(G) and I(G) decrease. When |N| > 0.21, IC(G) decreases faster than I(G); therefore I(G) − IC(G) increases. Once |N| ≤ 0.21, I(G) − IC(G) starts to decrease because I(G) decreases faster than IC(G). Since we want to maximize I(G) − IC(G), the optimal filtering strategy should be to retain the top 21% of nodes.
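The search for this break-even point can be sketched as a simple sweep over cut positions (our illustrative reimplementation, not the thesis code; for brevity it divides by the maxima rather than applying the full min-max normalization of equation (3), and it uses each node's degree as a rough proxy for the edges its removal deletes):

```python
import math
from collections import Counter

def best_retention(degrees):
    """Drop nodes from lowest degree upward; return the retained
    fraction of nodes that maximizes scaled I(G) - IC(G)."""
    degrees = sorted(degrees)
    n = len(degrees)
    freq = Counter(degrees)          # probabilities fixed by the original network

    def info(kept):
        s = sum(math.log2(n / freq[d]) for d in kept)
        return math.log2(s) if s > 1 else 0.0

    def ic(kept):
        e = sum(kept) / 2            # rough edge count for an undirected network
        return math.log2(e) if e > 1 else 0.0

    i_max, ic_max = info(degrees), ic(degrees)
    best_score, best_frac = 0.0, 1.0
    for cut in range(n):             # remove the `cut` lowest-degree nodes
        kept = degrees[cut:]
        score = info(kept) / i_max - ic(kept) / ic_max
        if score > best_score:
            best_score, best_frac = score, len(kept) / n
    return best_frac

# A scale-free-like degree sequence: many leaves, a few hubs.
degrees = [1] * 70 + [2] * 15 + [3] * 8 + [5] * 4 + [12] * 2 + [30]
print(best_retention(degrees))       # fraction of top-degree nodes to keep
```

The returned fraction plays the role of the 21% retention point in Figure 5; its exact value depends on the degree sequence.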
The problem can be formulated as a binary integer programming problem as shown below:

maximize_v   α·log2[ Σ_{i=1}^{n} v_i·log2(1/p_i) ] − β·log2( d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}·v_i )
subject to   0 ≤ v_i ≤ 1,  i = 1, ..., n,
             v_i ∈ Z.    (3)

where v_i is the binary decision variable that represents whether vertex i should be retained: if v_i = 1, vertex i is retained, and v_i = 0 otherwise. a_{i,j} is an element of the given adjacency matrix A_{n,n}. When the network is directed, d = 1; when it is undirected, d = 1/2.
α and β are used to scale I(G) and IC(G) to be between 0 and 1. α scales I(G) using the maximum and minimum values of I(G). It is easy to show that the maximum of I(G) is log2(Σ_{i=1}^{n} log2(1/p_i)), obtained when the strategy retains all nodes. The minimum of I(G) is obtained when the strategy retains only the nodes with the largest degree: since there are very few such nodes in a scale-free network, even when the size of the network is very large, the probability of the occurrence of such a node when we arbitrarily pick one node from the network can be approximated as 1/|N|, and the minimum of I(G) is log2(log2(|N|)). Therefore, the scaled

I(G) = ( log2[ Σ_{i=1}^{n} v_i·log2(1/p_i) ] − log2(log2(|N|)) ) / ( log2(Σ_{i=1}^{n} log2(1/p_i)) − log2(log2(|N|)) )
     = α·log2[ Σ_{i=1}^{n} v_i·log2(1/p_i) ] − α·log2(log2(|N|)),

where α = 1 / ( log2(Σ_{i=1}^{n} log2(1/p_i)) − log2(log2(|N|)) ). Since |N| is a constant for a given network, the constant term α·log2(log2(|N|)) can be removed from the objective function.
Similarly, the maximum of IC(G) is log2(d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}), obtained when no vertex is removed; the minimum, however, is set to 0, since log2(0) → −∞. Therefore, the scaled

IC(G) = ( log2(d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}·v_i) − 0 ) / ( log2(d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}) − 0 )
      = β·log2(d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}·v_i),

where β = 1 / log2(d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}).
3.3 Neighbors-filtering simplification algorithm
In this sub-section, we propose a Neighbors-filtering Simplification Algorithm (NFSA), implemented using the Gephi toolkit in Java. It uses the optimal strategy from equation (3) as part of its inputs to further decrease the information lost, without affecting the incomprehensibility, by using the sizes of nodes. NFSA is shown as Algorithm 1.
Algorithm 1: Neighbors-filtering simplification algorithm
Input: target percentile t; width w and length l of the output image; output path path
Data: G = (N, E) in GEXF format, with degree centrality C_D
Result: G' = (N', E') with N' ⊆ N, E' ⊆ E
Initialize all node sizes to 1; maxsize = 1; α = 1/ln|E|; G' = G
for p = 1 : t do
    filterList ← { i | i ∈ N', C_D(i) < p-th percentile of C_D }
    for node n ∈ filterList do
        for node m ∈ neighbors(n), m ∉ filterList do
            size = [1 + α · (C_D(n) / p-th percentile of C_D) · n.getSize()] · m.getSize()
            if size > maxsize then maxsize = size
            m.size = min(w/3, l/3) · size / maxsize
            m.addEdge(m1 ∈ neighbors(n), s.t. m1.index() < m.index() and m1 ∉ filterList)
        G' = G'.removeNode(n)
Output.image("path/simplifiedGraph.png", width = w, length = l)
Output.gexf(G', "path/simplifiedGraph.gexf")
In NFSA, the i-th percentile of the degree centrality is used as the threshold, with i = 1, 2, 3, ..., t, where the t-th percentile is set based on the optimal filtering strategy, and any node whose centrality is lower than the threshold is filtered.

In NFSA, the size of a given node changes during the filtering process; it depends on the number of filtered neighbors, as well as on their degree centrality.
As NFSA shows, the size of a node increases faster if 1) it has more neighbors, 2) its filtered neighbors have higher centrality, or 3) those neighbors have bigger sizes. The retained nodes in the simplified scale-free networks have high degree centrality, and the size of a given node represents the cumulative information loss caused by its filtered neighbors, where information loss is defined as the difference in the information content of the network before and after one node is removed.
To apply NFSA, the user needs to input 1) the desired network G in GEXF format, which should contain a column named degree that can easily be generated using Gephi; 2) the optimal percentile threshold t, which can be found by solving equation (3) and checking the percentile of the degree of the retained node with the lowest value; 3) the width and length of the output image, which contains the simplified network; and 4) the output path, which is used to output images as well as the GEXF (Graph Exchange XML Format) file.
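Since the thesis implements NFSA on the Gephi toolkit in Java, the following is only a rough, dependency-free Python sketch of the same filtering-and-resizing idea; the graph representation (an adjacency dict) and the percentile handling are our simplifications of Algorithm 1:

```python
import math

def nfsa_sketch(adj, t):
    """Neighbors-filtering sketch on an adjacency dict {node: set(neighbors)}.
    For p = 1..t, nodes below the p-th degree percentile are filtered, and each
    surviving neighbor's size grows with the filtered node's degree and size
    (alpha = 1/ln|E|, as in Algorithm 1). Returns (simplified adj, sizes)."""
    adj = {n: set(ns) for n, ns in adj.items()}
    num_edges = sum(len(ns) for ns in adj.values()) // 2
    alpha = 1.0 / math.log(num_edges)
    degree0 = {n: len(ns) for n, ns in adj.items()}  # degrees in the original network
    ranked = sorted(degree0.values())
    size = {n: 1.0 for n in adj}
    for p in range(1, t + 1):
        threshold = ranked[min(len(ranked) - 1, len(ranked) * p // 100)]
        to_filter = {n for n in adj if degree0[n] < threshold}
        for n in to_filter:                    # grow surviving neighbors first
            for m in adj[n] - to_filter:
                size[m] *= 1 + alpha * degree0[n] / max(threshold, 1) * size[n]
        for n in to_filter:                    # then remove the filtered nodes
            for m in adj.pop(n):
                if m in adj:
                    adj[m].discard(n)
    return adj, size

# A hub with five leaves, one leaf chained to an extra node:
G = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0, 6}, 6: {5}}
simplified, sizes = nfsa_sketch(G, t=80)
print(sorted(simplified))  # → [0, 5]: only the hubs survive
```

The big hub (node 0) absorbs four filtered leaves and so ends up larger than node 5, which absorbs only one, mirroring how NFSA lets node sizes record cumulative information loss.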
Chapter 4
Experimentation
To test how the simplification algorithm works on different sizes of scale-free networks, ten random scale-free networks were generated using NetworkX in Python. The results are shown in Tables 1 and 2.
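The thesis generates its test networks with NetworkX's scale-free generators; as a dependency-free illustration of the same idea, a minimal preferential-attachment (Barabási-Albert-style) generator looks like this (our sketch, not the thesis code):

```python
import random

def barabasi_albert(n, m, seed=None):
    """Grow a scale-free network: each new node attaches to m existing
    nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    edges = [(i, j) for i in range(m) for j in range(i + 1, m + 1)][:max(1, m - 1)]
    edges = [(i, i + 1) for i in range(m - 1)] or [(0, 1)]  # seed path on m nodes
    targets = [v for e in edges for v in e]        # degree-weighted sampling pool
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:                     # m distinct attachment targets
            chosen.add(rng.choice(targets))
        for tgt in chosen:
            edges.append((new, tgt))
            targets += [new, tgt]                  # both endpoints gain degree
    return edges

edges = barabasi_albert(1000, 2, seed=42)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
# Heavy tail: a few hubs, many low-degree nodes
print(max(degree.values()), sorted(degree.values())[len(degree) // 2])
```

The resulting degree sequence shows the expected heavy tail: the median degree stays near m while the maximum is far larger, the property the experiments here rely on.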
Table 1: Results of simplification on 10 simulated scale-free networks
The results indicate that, on average, the optimal strategy of simplification is to filter out the 91% of vertices that have a relatively small degree. The result is intuitive: even though the power law states that roughly 20% of vertices matter, considering the readability of a large scale-free network, retaining about 9% of vertices gives most of the information content with a lower incomprehensibility. By applying the result to large scale-free networks that contain millions or billions of nodes, the simplified networks would theoretically provide the optimal balance between information content and the incomprehensibility measure.

Table 2: Original networks and simplified networks based on optimal filter strategy
Simplified networks in Table 2 show that the algorithm captures the nodes that play the role of hubs. From the perspective of understanding the structure of networks, the simplified networks provide clearer structures, which might not be clear in the original network. Moreover, since the number of nodes and edges decreased, the simplified network shows the important nodes and their connections more clearly than the original networks. In other words, by balancing the size of networks and the information lost based on Hick's law, the results provide an optimal filtering strategy that offers a way of understanding the structure and important nodes of a given network faster and more easily. In a real-life scenario, we might be interested in finding the hubs and their connections, so that we can spread information, or break down the system spreading an epidemic, faster and more efficiently.
Chapter 5
Conclusions
A simplification technique to increase the readability of network visualizations is presented in this thesis. To quantify the process of simplification, the Shannon entropy is used to measure the information content of networks and the information loss during the simplification. In addition, an incomprehensibility measure is proposed based on Hick's law; it measures the visualization quality of networks. By measuring the information content of the network during the simplification, a filtering strategy can be found that balances the information content and the incomprehensibility measure of a given network. By formulating the simplifying process as a binary optimization problem, the optimal filtering strategy can be found easily. Moreover, an algorithm is developed to indicate cumulative information loss by setting the sizes of nodes based on the information loss caused by their filtered neighbors. Overall, this methodology addresses the problems described above. Specifically:

1. Congested or Incomprehensible Visuals: our approach reduces the number of nodes and outputs a network that is theoretically optimal with respect to balancing information content and incomprehensibility measures. The results can be used as an input to a further visualization process.
2. Neglect of Neighbors in Resizing: our approach sets a given node's
size based on the information loss caused by its filtered neighbors. The greater
the size of a node, the more accessibility the node has.
3. Uncertainty in Simplification: users can find the optimal filtering strategy
by maximizing I(G) - IC(G).
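As a sketch of point 3, one way to find the filtering strategy is to scan degree-rank cutoffs and keep the one maximizing the entropy term minus the Hick's-law size term. The function name `best_cutoff`, the degree-proportional probability p_i = d_i / Σ_j d_j, the rough edge-count estimate, and α = β = 1 are all assumptions made for this illustration.

```python
import math

def best_cutoff(degrees, alpha=1.0, beta=1.0):
    """Scan cutoffs k over degree-ranked nodes; return the k maximizing
    alpha*log2(information content) - beta*log2(network size)."""
    total = sum(degrees.values())
    ranked = sorted(degrees.values(), reverse=True)
    best_k, best_obj = 1, float("-inf")
    for k in range(1, len(ranked) + 1):
        kept = ranked[:k]
        # Information content of retained nodes: sum of log2(1/p_i)
        # with p_i proportional to degree (an assumption of this sketch).
        info = sum(math.log2(total / d) for d in kept)
        # Rough size term: half the retained degree sum approximates edges.
        edges_kept = sum(kept) / 2
        obj = (alpha * math.log2(max(info, 1e-12))
               - beta * math.log2(max(edges_kept, 1.0)))
        if obj > best_obj:
            best_k, best_obj = k, obj
    return best_k
```

Note that in dense networks the size penalty grows with each retained hub, so the scan typically stops well short of the full vertex set.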
The work can be applied to different problems that deal with scale-free networks.
For example, consider the spread of epidemic diseases through social networks. By
looking at the degree distribution, one can easily find potential hubs that would
cause a faster spread of the disease if any of those people were infected. However,
given hubs with the same degree, their accessibility to other parts of the network
might differ, and the method proposed in this thesis can find hubs with
higher accessibility by considering the information lost during the simplification. In
other words, among hubs with the same degree, the bigger the size of a node, the more
accessibility it has, and therefore the higher the priority it should be given.
Similarly, given a limited chance to spread information through a network, the best
approach is to spread it through hubs. Spreading information
through hubs that are not connected to one another and that have higher accessibility provides
a faster spread, and such hubs can be identified from the simplified network.
The limitations of the work are as follows:
1. Our method only considers scale-free networks, but many
networks are not necessarily scale-free; generalizations to different
types of networks are needed in future work.
2. Modeling the probability based on degree might be an intuitive way to
measure the information content of scale-free networks; however, it is not
general enough to capture all the information in a network, and it will not be
sufficient when the method is applied to non-scale-free networks.
3. The simplified network produced by our method might not have the same properties
as the original network; for example, the distributions of degree, betweenness
centrality, closeness centrality, etc. might differ, which causes additional
information loss that is not considered in our method.
The problem can be extended to simplify a given connected network while preserving
its connectivity by simply adding an additional constraint, as equation (4) shows.
\[
\begin{aligned}
\max_{v}\quad & \alpha \log_2\!\left[\sum_{i}^{n} v_i \log_2\!\left(\frac{1}{p_i}\right)\right] - \beta \log_2\!\left(d \sum_{i}^{n} \sum_{j}^{n} a_{i,j} v_i\right) \\
\text{subject to}\quad & 0 \le v_i \le 1, \quad i = 1, \dots, n, \\
& \sum_{j}^{n} a_{i,j} v_i \ge 2, \quad i = 1, \dots, n, \\
& v_i \in \mathbb{Z}.
\end{aligned}
\tag{4}
\]
where α and β are used to scale the size and the information content of the given
network; v_i is the binary decision variable representing whether vertex i should
be removed: if v_i = 1, vertex i is retained, and v_i = 0 otherwise; a_{i,j} is an
element of the given adjacency matrix A_{n,n}. When the network is directed, d = 1;
when it is undirected, d = 1/2. The second constraint ensures that the network remains
connected after the simplification.
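For intuition, the binary program in equation (4) can be solved by brute-force enumeration on a very small network. This is a hedged sketch under stated assumptions: the helper name `solve_small`, the choice d = 1/2 (undirected), α = β = 1, and the reading of the connectivity constraint as "every retained node keeps at least two retained neighbors" are all illustrative; a realistic instance would need an integer-programming solver rather than enumeration.

```python
import itertools
import math

def solve_small(adj, p, alpha=1.0, beta=1.0):
    """Enumerate all v in {0,1}^n and return the best feasible selection.

    adj: n x n 0/1 adjacency matrix of an undirected network (d = 1/2).
    p:   selection probability p_i of each node.
    """
    n = len(adj)
    best_v, best_obj = None, float("-inf")
    for bits in itertools.product((0, 1), repeat=n):
        # Objective terms: retained information content and network size.
        info = sum(bits[i] * math.log2(1 / p[i]) for i in range(n))
        size = 0.5 * sum(adj[i][j] * bits[i] for i in range(n) for j in range(n))
        if info <= 0 or size <= 0:
            continue  # log2 undefined, e.g. for the empty selection
        # Connectivity-style constraint, interpreted here as: every
        # retained node must keep at least two retained neighbors.
        if any(bits[i] and sum(adj[i][j] * bits[j] for j in range(n)) < 2
               for i in range(n)):
            continue
        obj = alpha * math.log2(info) - beta * math.log2(size)
        if obj > best_obj:
            best_v, best_obj = bits, obj
    return best_v
```

On a 4-cycle with uniform probabilities the only feasible selection is the full vertex set, since dropping any node leaves a retained neighbor with fewer than two retained neighbors.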
In the future, we will construct the probability in the \(\log_2(1/p_i)\) term using a weighted
combination of other centrality measures chosen according to user interests, so that the method can
capture more features of networks. Specifically,
\[
p_i = \frac{\sum_{cen \in C} w_{cen} \cdot freq(cen_i)}{|C|\,|N|},
\]
where \(p_i\) is the probability of picking node \(i\) from a given network \(G\) if we randomly select
one node from the node set \(N\) of the network; \(C\) is the set of centralities of interest, which
contains centrality measures such as degree centrality, closeness centrality,
betweenness centrality, or any other centrality measure of interest; \(cen \in C\),
and \(w_{cen}\) is the weight users give to the centrality \(cen\); \(freq(cen_i)\) is the
number of nodes that have the same value of centrality \(cen\) as node \(i\); \(|C|\) is the cardinality of the
set of centralities of interest; and \(|N|\) is the number of nodes in network \(G\). The weights
can be learned from preferred simplified networks produced by other, more complex methods,
or they can be set by users.
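The proposed weighted-centrality probability could be computed as in the sketch below. The function name `weighted_probability` and the dictionary-based representation of centrality values are illustrative assumptions, not part of the thesis.

```python
def weighted_probability(node, centralities, weights, n_nodes):
    """Compute p_i = sum_{cen in C} w_cen * freq(cen_i) / (|C| * |N|).

    centralities: {centrality_name: {node: value}} for each cen in C.
    weights:      {centrality_name: w_cen} set by the user.
    n_nodes:      |N|, the number of nodes in the network.
    """
    total = 0.0
    for name, values in centralities.items():
        # freq(cen_i): how many nodes share node i's value of this centrality.
        freq = sum(1 for v in values.values() if v == values[node])
        total += weights[name] * freq
    return total / (len(centralities) * n_nodes)
```

For example, with a single centrality (degree) weighted 1.0 on a three-node path, the two endpoints share a degree value and so receive a higher p_i than the unique middle node.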