The Pennsylvania State University
The Graduate School
College of Engineering
SIMPLIFYING LARGE SCALE-FREE NETWORKS BY
OPTIMIZING THE INFORMATION CONTENT AND
INCOMPREHENSIBILITY
A Thesis in Industrial Engineering
by Juxihong Julaiti
Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science
December 2016
The thesis of Juxihong Julaiti was reviewed and approved* by the following:
Soundar R. T. Kumara, Allen E. Pearce/Allen M. Pearce Professor of Industrial and Manufacturing Engineering, Thesis Adviser
Conrad Tucker, Assistant Professor of Industrial and Manufacturing Engineering
Janis Terpenny, Professor of Industrial and Manufacturing Engineering, Head of the Department of Industrial and Manufacturing Engineering
*Signatures are on file in the Graduate School.
Abstract
Visualizations of networks are critical for gaining insight into complex systems; however, a complex network containing billions of nodes and edges is incomprehensible. In this thesis, we focus on the simplification of large scale-free networks by maximizing the information content and minimizing the incomprehensibility of networks. An algorithm is developed to increase information content further by letting the sizes of retained nodes represent the information lost. Our proposed method is verified through an experimental study; the results reveal that the optimal filtering strategy for large scale-free networks is to retain roughly the top 9% of vertices based on their degree centrality. This method can be applied to visualize the main information of large scale-free networks or used as a pre-processing step before running other algorithms on networks.
Contents
List of Figures
List of Tables
Chapter 1: Background
1.1 Introduction
1.2 Network Concepts
1.3 Problem description
Chapter 2: Literature Review
Chapter 3: Methodology
3.1 Incomprehensibility
3.2 Information function
3.3 Neighbors-filtering simplification algorithm
Chapter 4: Experimentation
Chapter 5: Conclusions
Bibliography
List of Figures
1 A large Facebook network
2 The incomprehensibility changes based on the size of networks
3 An example of how incomprehensibility changes based on the size of networks
4 I(G) and IC(G) in a filtering process of a connected network
5 Filtering process of a connected network
List of Tables
1 Results of simplification on 10 simulated scale-free networks
2 Original networks and simplified networks based on optimal filter strategy
Chapter 1
Background
1.1 Introduction
Networks are increasingly being used in different fields, including biology, physics, social science, and transportation. However, extracting information from or applying algorithms to vast and complex networks, such as the one shown in Figure 1, is sometimes challenging. The difficulties are significant when time and computational resources are limited, and this necessitates intelligent analysis. Research has identified classes of properties that many real-world networks obey, such as degree power laws, which show that the set of node degrees has a heavy-tailed distribution [15]. Such degree distributions have been identified in phone-call networks [1], the Internet [9], the Web [3], click-stream data [4], and trust in social networks [5]. This property helps in simplifying the analysis of large networks and leads to focusing on certain nodes of the complex network. Even though simplifying a network decreases the information of the network, the Shannon entropy measure [22] can be used to quantify the information lost. With additional visualization quality measurements, the problem can be treated as an optimization problem that minimizes the information lost and maximizes the comprehensibility of the network visualizations.
Figure 1: A large Facebook network
In this thesis we propose an incomprehensibility measure to indicate the visualization quality of the network. The objective in the optimization of network visualization is to maximize information content and at the same time minimize the incomprehensibility index. This process offers a way to find an optimal strategy, which finds the break-even point between information content and the incomprehensibility index. In other words, the process shrinks the network considerably but keeps most of the information.
Our contribution is three-fold.
1. Using Shannon entropy to evaluate the information content of a network. It is well known that both fully connected and sparsely connected networks have very low information content compared to networks with a stochastic degree distribution. By assigning probabilities to nodes based on their degree centrality, the Shannon entropy is able to capture this phenomenon. It provides a way to measure the changes in information content during the simplification.
2. We define the incomprehensibility of a given network, which provides a way to quantify how heavy the burden is when a human is viewing the network. By modeling the simplification process as an optimization problem, our method provides an optimal strategy that shows which nodes should be retained.
3. We develop a parametric algorithm for the network filtering process to further reduce information losses. The algorithm records the information loss by increasing the sizes of vertices based on the degrees of their filtered neighbors, so that the sizes give a hint of which nodes were connected with larger, less important, filtered hubs.
The remaining parts of the thesis are organized as follows: in the next subsection, we introduce concepts of networks and degree centrality. In the second section, we review recent work and applications of network visualization. In the third section, we introduce the incomprehensibility, the information function, and our simplification algorithm. In the fourth section, the experimentation is conducted by simplifying three large networks with power-law degree distributions. Conclusions, limitations, and future work are discussed in the fifth section.
1.2 Network Concepts
In this work we use the terms graph and network interchangeably. A network is, in its simplest form, a collection of points joined by lines [20]. The points are referred to as nodes, and the lines are referred to as edges. Many objects of interest in the physical, biological, and social sciences can be thought of as networks, leading to new and useful insights [20]. Formally, a network G is a pair of sets (N, E), where N is the set of nodes and E is the set of edges, formed by pairs of nodes [26]. Specifically, when the edges of a network do not have a direction, the network is called an undirected network; when the edges have a direction associated with them, the network is called a directed network.
A large volume of research on networks has been devoted to the concept of centrality. The corresponding research addresses the question: which nodes are the most important or central to a network? There are many possible definitions of importance and, correspondingly, many centrality measures for networks. In this section, we focus on degree centrality, which is used in this work.
The number of edges connected to a node is known as its degree. The degree is sometimes called degree centrality [20]. In directed networks, nodes have both an in-degree and an out-degree, corresponding to the number of edges pointing to the node and the number of edges pointing from the node. If a node has high degree centrality, then the node might have more influence, more access to information, more prestige, or be more widely known [20]. For an undirected network of n nodes, the degree of node i can be written in terms of the adjacency matrix as C_D(i) = Σ_{j=1}^{n} A_{ij}, which is the number of edges connected to node i [20]. For a directed network of n nodes, the in-degree and out-degree of node i can be written in terms of the adjacency matrix as C_D^in(i) = Σ_{j=1}^{n} A_{ij} and C_D^out(i) = Σ_{j=1}^{n} A_{ji} [20].
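These degree computations can be sketched directly in Python (a minimal illustration; the function names are ours, not from the thesis):

```python
# Adjacency-matrix degree computations. Convention: A[i][j] = 1 when an
# edge connects j to i, so rows index edge heads.

def degree_undirected(A):
    """C_D(i) = sum_j A[i][j]: row sums of the adjacency matrix."""
    return [sum(row) for row in A]

def degrees_directed(A):
    """Returns (in-degrees, out-degrees): row sums and column sums."""
    return [sum(row) for row in A], [sum(col) for col in zip(*A)]

# Undirected triangle plus one pendant node:
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
print(degree_undirected(A))  # → [2, 2, 3, 1]
```

For an undirected network the row and column sums coincide; for a directed network the two functions differ exactly as the two formulas above do.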
1.3 Problem description
We identify the following issues when exploring existing work in visualizing large
networks:
1. Congested or incomprehensible visuals: as the number of nodes and edges increases, so does the space required for visualizing the network. When the network size reaches thousands, millions, or billions of nodes, as is common in social networks, extracting useful information and patterns from visuals becomes challenging even with the best layout settings [7].
2. Uncertainty of the degree of simplification: over-simplification decreases the readability of networks, as does under-simplification. Over-simplification removes too much information from the original network by keeping just the nodes with the highest centrality, which can be done by looking at the list of centralities. By contrast, under-simplification usually keeps too much information for comprehension. Hence, it is hard to find a suitable degree of simplification that keeps enough information to study the network.
3. Neglect of neighbors in resizing: using the sizes of nodes is equivalent to adding an additional dimension to the visualization; it allows us to extract more information from networks by sizing nodes based on the features of interest. Existing resizing algorithms allow the user to set the size of a given node based on its attributes. In the simplification process, we want to use the information lost by filtering neighbors as the size of a given node, so that we can observe most of the information of a network and see how much information loss is associated with each node. However, no such algorithm exists.
Chapter 2
Literature Review
The mathematical study of networks, known as network theory, began with Euler's solution to the 'Bridges of Königsberg' problem. In the past decades, as more and more data have become available, the scientific study of networks has attracted researchers from physics, computer science, mathematics, social science, and chemistry. Related work has been introduced and discussed in books such as Linked: The New Science of Networks [3] and Network Science by Albert-László Barabási, Networks: An Introduction [20] by Mark Newman, as well as Six Degrees: The Science of a Connected Age [25] by Duncan Watts.
As the study of networks has become popular, networks have become an important model for complex systems. The formalism is crucial to increasing our understanding of natural and social systems, as well as to building the ability to engineer and effectively use complex networks. For example, it can help us to control the spread of diseases and gossip, build robust communication structures, and establish better Web search engines.
Vilfredo Pareto made contributions to economics at the end of the 19th century. He was the first person to discover that income follows a power-law probability distribution. It shows that roughly 80% of the money is earned by 20% of the population, which is also known as the 80/20 rule. The rule works in many other domains: in project management, it is common that 20% of the work consumes 80% of the time and resources; in inventory management, it can often be seen that 80 percent of profits are produced by only 20 percent of the products. This rule is helpful since it reveals that not only doing things right matters but also doing the right things. Therefore, it is critical to identify those 20% of important things in the first place when time and resources are limited.
A similar phenomenon can be captured in networks: 80 percent of links on the Web point to only 15 percent of web pages; 80 percent of citations go to only 38 percent of scientists; 80 percent of links in Hollywood are connected to 30 percent of actors [3]. We encounter scale-free networks in most real networks. They represent a signature of a deeper organizing principle that is called the scale-free property [3]. In scale-free networks, the degree distribution follows a power law as (1) shows, where n is a node in the node set N of a given scale-free network and k represents the degree centrality of node n. γ ranges from 2 to 3 based on the network. From (1), we can observe that if we randomly pick one node from a scale-free network, the chance of choosing a node with relatively high degree centrality is much less than the chance of getting one with a relatively low degree.

P(degree(n) ≥ k) ~ k^(−γ),  2 ≤ γ ≤ 3,  n ∈ N    (1)
The power-law distribution indicates that no matter how big a scale-free network is, there is always a fixed percentage of nodes having great degree centrality. Usually, these types of nodes play an important role in many complex systems. For example, in social networks, the higher the number of connections a person has, the faster a message spreads through him or her; the more transient population a city has, the faster a disease spreads through that city. Therefore, when we study a scale-free network and time and resources are limited, it is crucial to identify these critical nodes.
As it is colloquially stated that a picture is worth a thousand words, the human brain responds faster to a picture than to text because the eye and the visual cortex of the brain form a parallel processor that provides the highest-bandwidth channel into human cognitive centers [23]. By augmenting cognition with the human visual system's highly tuned ability to see patterns and trends, visualization tools are expected to aid comprehension of the dynamics, intricacies, and properties [21].
However, a network visualization is useful only when it has good visualization quality. There are two types of measurements to estimate the quality. One is computational measurements, called readability metrics: readers can estimate how readable the network is by calculating metrics such as edge crossings, edge crossing angles, average edge length, node occlusion, etc. The other is empirical measurements: readers estimate how useful the network is by performing tasks or asking for expert opinion [12].
It seems true that with a suitable layout and settings we could always make a network comprehensible; however, if the network contains billions of nodes and edges, even if it has great readability metrics, it is still incomprehensible. The experimental studies herein give many examples of the challenges of viewing thousands of nodes, because comprehending interactions in networks with thousands of nodes or more is a challenging visualization task. These studies are considered small by today's standards because social network analyses are being done on networks with millions or billions of nodes [17]. Hence, generally speaking, a network containing billions of nodes and edges is less readable than a network containing fewer than a hundred nodes and edges. With limited computational resources, a simplification is necessary to show the crucial features of the systems and give us insight into them, so that a further study can be conducted to understand the details that matter most.
Network simplification has been studied to address different problems. For instance, maximum-flow problems are one practical application. [27] studied the simplification of flow networks to reduce the time of finding the maximum flow; it identifies useful and useless edges and achieves network simplification by deleting useless edges without changing the maximum flow. [18] simplified networks by decomposing the networks and identifying bi-connected components to reduce the time of finding the maximum flow. Similarly, simplification is also used to find a minimum spanning tree in a shorter time [6].
Grouping nodes using different criteria is another way to reduce visualization complexity. The NetLens interface was designed around the abstract Content-Actor network data model to allow users to pose a series of simple queries and iteratively refine visual overviews and sorted lists. This enables the support of complex queries that are traditionally hard to specify [14]. PivotGraph is a software tool that uses a new technique for visualizing and analyzing network structures. It can explore and analyze multivariate networks by combining a grid-based visualization with straightforward data reduction techniques [24].
Alternatively, a hierarchical topological clustering shows a network of meta-nodes, as in ASK-GraphView [2]; van Ham and van Wijk [11] used semantic fisheye views to show clusters as merging spheres. Other approaches to creating overview networks include network summarization [19] and aggregating nodes by shared neighbor sets [16]. In each of these techniques, it can be difficult to understand the topology of the individual aggregates, often because of the ambiguous nature of clustering algorithms [8].
Even though many studies have been reported on simplifying networks for visualization purposes, the resulting simplified networks with a large number of nodes are still incomprehensible. Therefore, which nodes should be retained, and why, with quantitative measurements of comprehensibility and information content, is the gap we are trying to fill. Since scale-free networks have a power-law degree distribution, which provides an elegant way to model the networks by connecting probability and importance, this thesis is focused on simplifying large scale-free networks by optimizing the information content and incomprehensibility.
Chapter 3
Methodology
3.1 Incomprehensibility
Logarithms are often used in models of human perception [13]. Hick's law proposes a logarithmic relation between the time individuals take to choose an alternative and the number of choices they have [15]. Since the number of nodes and the number of edges affect the choices individuals face when doing tasks on a network, we assume the following axiom is true: the general readability decreases logarithmically as the size of the network increases. We define the incomprehensibility of a network as function (2) shows:

IC(G) = log2(|E|)    (2)

where G is the network and |E| is the number of edges the network contains.
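Equation (2) is a one-liner in Python; the sketch below (our illustration, not thesis code) also shows the logarithmic behavior the axiom assumes:

```python
import math

def incomprehensibility(num_edges):
    """IC(G) = log2(|E|) for a network with |E| edges."""
    return math.log2(num_edges)

# Doubling the number of edges always adds exactly 1 to IC(G), so the
# measure grows ever more slowly in absolute edge count -- the
# logarithmic-perception assumption behind the axiom:
print(incomprehensibility(100), incomprehensibility(200))
```

Note that for a directed and an undirected network with the same node count and density, the directed one has twice the edges, and hence an IC(G) exactly 1 higher, which is the content of Lemma 2 below.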
Based on the axiom, we can prove the following lemmas:

Lemma 1: The incomprehensibility increases logarithmically as the density of a network increases.

Proof. Let G be an undirected network such that |N| > 1 and 0 < |E| ≤ PotentialConnections(G).
Since G is undirected, PotentialConnections(G) = |N|(|N|−1)/2.
Since Density(G) = |E| / PotentialConnections(G),
|E| = Density(G) · |N|(|N|−1)/2,
so IC(G) = log2(|E|) = log2(Density(G) · |N|(|N|−1)/2).
Then, when |N| is fixed, IC(G) increases logarithmically as Density(G) increases.

Proof. Let G be a directed network such that |N| > 1 and 0 < |E| ≤ PotentialConnections(G).
Since G is directed, PotentialConnections(G) = |N|(|N|−1).
Since Density(G) = |E| / PotentialConnections(G),
|E| = Density(G) · |N|(|N|−1),
so IC(G) = log2(|E|) = log2(Density(G) · |N|(|N|−1)).
Then, when |N| is fixed, IC(G) increases logarithmically as Density(G) increases.

Lemma 2: With the same number of nodes and density, directed networks always have higher incomprehensibility than undirected networks.

Proof. Let G1 be an undirected network such that |N1| > 1 and 0 < |E1| ≤ PotentialConnections(G1), and let G2 be a directed network such that |N2| > 1 and 0 < |E2| ≤ PotentialConnections(G2).
Then IC(G1) = log2(Density(G1) · |N1|(|N1|−1)/2) and IC(G2) = log2(Density(G2) · |N2|(|N2|−1)).
Then, when |N1| = |N2| and Density(G1) = Density(G2), IC(G2) > IC(G1).
Figure 2 shows that the incomprehensibility increases at a slower rate as |E| increases. As the example in Figure 3 shows, there are two groups of networks, G1 and G2; each network in each group contains 50 nodes. The growth of edges from G11 to G13 and from G21 to G23 is the same; however, the increase in the number of edges can be visually captured more easily in G1 than in G2. By comparing the incomprehensibility changes, we can see IC(G12) − IC(G11) = 0.36 > IC(G22) − IC(G21) = 0.06 and IC(G13) − IC(G12) = 0.29 > IC(G23) − IC(G22) = 0.06, which shows that the same amount of change in |E| has a greater effect on IC(G) when |E| is smaller. The result indicates that 1) changes in the size of a network are easier to perceive when it contains fewer edges and harder to perceive when it contains more edges, and 2) the incomprehensibility measure is able to capture this phenomenon.

Figure 2: The incomprehensibility changes based on the size of networks
Figure 3: An example of how incomprehensibility changes based on the size of networks
3.2 Information function
An important question in network analysis is: are all of the nodes in a network equally important for readers? The Pareto principle states that, for many events, roughly 80% of the effects come from 20% of the causes. Like the Pareto principle, in scale-free networks only a small number of nodes have a high degree; most of the nodes have relatively low degree [10][13].
In such networks, if we randomly pick one node, the probability of choosing a node that has relatively high centrality is lower than the probability of choosing a node that has relatively low centrality. Based on maximum likelihood estimation, we can estimate the probability of the occurrence of a node with centrality x by P(x) ≈ freq(x)/|N|, where freq(x) is the count of the nodes that have centrality x. Based on this probability, we can calculate how much information is contained in a network. Hence, by comparing the information content of the network before and after simplification, we can estimate how much information is lost when we simplify the network.
Named after Boltzmann's H-theorem, Shannon defined the entropy [22] H of a discrete random variable X with possible values {x1, x2, ..., xn} and probability mass function P(X) as H(X) = Σ P(X = x) · I(x), where I(x) = log2(1/P(X = x)) is the information function; it represents the information content of X. For example, if there are two events X1 and X2, and X1 is less likely to happen than X2, then P(X1) < P(X2) and therefore I(X1) > I(X2), which means the occurrence of event X1 provides more information than that of X2.
We use I(G) = Σ_{n∈N} I(n) = Σ_{n∈N} log2(1/P(X = centrality(n))) to measure the information of a network, where G is the network, N is the set of nodes, n is a node in N, and P(X = centrality(n)) is estimated by freq(X = centrality(n))/|N|. I(G) measures the total information content of the network. The representation is intuitive: it shows that filtering one node with relatively high centrality will decrease I(G) dramatically, while filtering a node with relatively low centrality does not have a considerable impact on I(G).

An assumption similar to the one made for the incomprehensibility of a given network can be applied here: humans respond to information changes logarithmically. Therefore, in this thesis, we use I(G) = log2(Σ_{n∈N} I(n)) = log2(Σ_{n∈N} log2(1/P(X = centrality(n)))).
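This measure can be sketched in Python; `information_content` below is our hypothetical helper, estimating P(X = centrality(n)) from the degree histogram via the maximum-likelihood estimate above:

```python
import math
from collections import Counter

def information_content(degrees):
    """I(G) = log2( sum_n log2(1 / P(X = centrality(n))) ),
    with P(X = c) estimated as freq(c) / |N|."""
    n = len(degrees)
    freq = Counter(degrees)
    total = sum(math.log2(n / freq[d]) for d in degrees)
    return math.log2(total)

# A rare high-degree hub contributes far more information than any
# single node from the common low-degree mass:
degrees = [1, 1, 1, 1, 1, 1, 1, 2, 2, 9]
print(information_content(degrees))
```

Removing the degree-9 hub from this list lowers the sum inside the outer log by log2(10) bits, while removing one degree-1 node lowers it by only log2(10/7) bits, matching the intuition in the text.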
Figure 4: I(G) and IC(G) in a filtering process of a connected network
Given a connected undirected network G with a power-law degree distribution, when we filter it, the amount of information I(G), as well as the incomprehensibility IC(G), decreases as Figure 4 shows, where both I(G) and IC(G) are scaled to be between 0 and 1.

Since the objective is to maximize I(G) and minimize IC(G), the optimal simplification strategy can be obtained by maximizing I(G) − IC(G). Figure 5 illustrates the impact of removing nodes from a given network G on I(G) and IC(G). In the simplifying process: 1) |N|, I(G), and IC(G) are scaled to be between 0 and 1; 2) the vertices are sorted in non-decreasing order with respect to their degree d_i, where i ∈ N and N is the node set of the given network G, and nodes with smaller degree are always removed earlier than nodes with larger degree.
Figure 5: Filtering process of a connected network
Figure 5 shows that when no node is removed (|N| = 1), G has the maximum information content (I(G) = 1) and the maximum incomprehensibility measure (IC(G) = 1). As |N| decreases, both IC(G) and I(G) decrease. When |N| > 0.21, IC(G) decreases faster than I(G); therefore I(G) − IC(G) increases. Once |N| ≤ 0.21, I(G) − IC(G) starts to decrease because I(G) decreases faster than IC(G). Since we want to maximize I(G) − IC(G), the optimal filtering strategy should be to retain the top 21% of nodes.
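The search for this break-even point can be sketched as a simple sweep over cut positions (our illustrative reimplementation, not the thesis code; for brevity it divides by the maxima rather than applying the full min-max normalization of equation (3), and it uses each node's degree as a rough proxy for the edges its removal deletes):

```python
import math
from collections import Counter

def best_retention(degrees):
    """Drop nodes from lowest degree upward; return the retained
    fraction of nodes that maximizes scaled I(G) - IC(G)."""
    degrees = sorted(degrees)
    n = len(degrees)
    freq = Counter(degrees)          # probabilities fixed by the original network

    def info(kept):
        s = sum(math.log2(n / freq[d]) for d in kept)
        return math.log2(s) if s > 1 else 0.0

    def ic(kept):
        e = sum(kept) / 2            # rough edge count for an undirected network
        return math.log2(e) if e > 1 else 0.0

    i_max, ic_max = info(degrees), ic(degrees)
    best_score, best_frac = 0.0, 1.0
    for cut in range(n):             # remove the `cut` lowest-degree nodes
        kept = degrees[cut:]
        score = info(kept) / i_max - ic(kept) / ic_max
        if score > best_score:
            best_score, best_frac = score, len(kept) / n
    return best_frac

# A scale-free-like degree sequence: many leaves, a few hubs.
degrees = [1] * 70 + [2] * 15 + [3] * 8 + [5] * 4 + [12] * 2 + [30]
print(best_retention(degrees))       # fraction of top-degree nodes to keep
```

The returned fraction plays the role of the 21% retention point in Figure 5; its exact value depends on the degree sequence.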
The problem can be formulated as a binary integer programming problem as shown below:

maximize_v   α·log2[ Σ_{i=1}^{n} v_i·log2(1/p_i) ] − β·log2( d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}·v_i )
subject to   0 ≤ v_i ≤ 1,  i = 1, ..., n,
             v_i ∈ Z.    (3)

where v_i is the binary decision variable that represents whether vertex i should be retained: if v_i = 1, vertex i is retained, and v_i = 0 otherwise. a_{i,j} is an element of the given adjacency matrix A_{n,n}. When the network is directed, d = 1; when it is undirected, d = 1/2.
α and β are used to scale I(G) and IC(G) to be between 0 and 1. α scales I(G) using the maximum and minimum values of I(G). It is easy to show that the maximum of I(G) is log2(Σ_{i=1}^{n} log2(1/p_i)), obtained when the strategy retains all nodes. The minimum of I(G) is obtained when the strategy retains only the nodes with the largest degree: since there are very few such nodes in a scale-free network, even when the size of the network is very large, the probability of the occurrence of such a node when we arbitrarily pick one node from the network can be approximated as 1/|N|, and the minimum of I(G) is log2(log2(|N|)). Therefore, the scaled

I(G) = ( log2[ Σ_{i=1}^{n} v_i·log2(1/p_i) ] − log2(log2(|N|)) ) / ( log2(Σ_{i=1}^{n} log2(1/p_i)) − log2(log2(|N|)) )
     = α·log2[ Σ_{i=1}^{n} v_i·log2(1/p_i) ] − α·log2(log2(|N|)),

where α = 1 / ( log2(Σ_{i=1}^{n} log2(1/p_i)) − log2(log2(|N|)) ). Since |N| is a constant for a given network, the constant term α·log2(log2(|N|)) can be removed from the objective function.
Similarly, the maximum of IC(G) is log2(d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}), obtained when no vertex is removed; the minimum, however, is set to 0, since log2(0) → −∞. Therefore, the scaled

IC(G) = ( log2(d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}·v_i) − 0 ) / ( log2(d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}) − 0 )
      = β·log2(d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}·v_i),

where β = 1 / log2(d·Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j}).
3.3 Neighbors-filtering simplification algorithm
In this sub-section, we propose a Neighbors-filtering Simplification Algorithm (NFSA), implemented using the Gephi toolkit in Java. It uses the optimal strategy from equation (3) as part of its inputs to further decrease the information lost, without affecting the incomprehensibility, by using the sizes of nodes. NFSA is shown as Algorithm 1.
Algorithm 1: Neighbors-filtering simplification algorithm
Input: target percentile t; width w and length l of the output image; output path path
Data: G = (N, E) in GEXF format, with degree centrality C_D
Result: G' = (N', E') with N' ⊆ N, E' ⊆ E
Initialize all node sizes to 1; maxsize = 1; α = 1/ln|E|; G' = G
for p = 1 : t do
    filterList ← { i | i ∈ N', C_D(i) < p-th percentile of C_D }
    for node n ∈ filterList do
        for node m ∈ neighbors(n), m ∉ filterList do
            size = [1 + α · (C_D(n) / p-th percentile of C_D) · n.getSize()] · m.getSize()
            if size > maxsize then maxsize = size
            m.size = min(w/3, l/3) · size / maxsize
            m.addEdge(m1 ∈ neighbors(n), s.t. m1.index() < m.index() and m1 ∉ filterList)
        G' = G'.removeNode(n)
Output.image("path/simplifiedGraph.png", width = w, length = l)
Output.gexf(G', "path/simplifiedGraph.gexf")
In NFSA, the i-th percentile of the degree centrality is used as the threshold, with i = 1, 2, 3, ..., t, where the t-th percentile is set based on the optimal filtering strategy, and any node whose centrality is lower than the threshold is filtered.

In NFSA, the size of a given node changes during the filtering process; it depends on the number of filtered neighbors, as well as on their degree centrality.
As NFSA shows, the size of a node increases faster if 1) it has more neighbors, 2) its filtered neighbors have higher centrality, or 3) those neighbors have bigger sizes. The retained nodes in the simplified scale-free networks have high degree centrality, and the size of a given node represents the cumulative information loss caused by its filtered neighbors, where information loss is defined as the difference in the information content of the network before and after one node is removed.
To apply NFSA, the user needs to input 1) the desired network G in GEXF format, which should contain a column named degree that can easily be generated using Gephi; 2) the optimal percentile threshold t, which can be found by solving equation (3) and checking the percentile of the degree of the retained node with the lowest value; 3) the width and length of the output image, which contains the simplified network; and 4) the output path, which is used to output images as well as the GEXF (Graph Exchange XML Format) file.
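Since the thesis implements NFSA on the Gephi toolkit in Java, the following is only a rough, dependency-free Python sketch of the same filtering-and-resizing idea; the graph representation (an adjacency dict) and the percentile handling are our simplifications of Algorithm 1:

```python
import math

def nfsa_sketch(adj, t):
    """Neighbors-filtering sketch on an adjacency dict {node: set(neighbors)}.
    For p = 1..t, nodes below the p-th degree percentile are filtered, and each
    surviving neighbor's size grows with the filtered node's degree and size
    (alpha = 1/ln|E|, as in Algorithm 1). Returns (simplified adj, sizes)."""
    adj = {n: set(ns) for n, ns in adj.items()}
    num_edges = sum(len(ns) for ns in adj.values()) // 2
    alpha = 1.0 / math.log(num_edges)
    degree0 = {n: len(ns) for n, ns in adj.items()}  # degrees in the original network
    ranked = sorted(degree0.values())
    size = {n: 1.0 for n in adj}
    for p in range(1, t + 1):
        threshold = ranked[min(len(ranked) - 1, len(ranked) * p // 100)]
        to_filter = {n for n in adj if degree0[n] < threshold}
        for n in to_filter:                    # grow surviving neighbors first
            for m in adj[n] - to_filter:
                size[m] *= 1 + alpha * degree0[n] / max(threshold, 1) * size[n]
        for n in to_filter:                    # then remove the filtered nodes
            for m in adj.pop(n):
                if m in adj:
                    adj[m].discard(n)
    return adj, size

# A hub with five leaves, one leaf chained to an extra node:
G = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0, 6}, 6: {5}}
simplified, sizes = nfsa_sketch(G, t=80)
print(sorted(simplified))  # → [0, 5]: only the hubs survive
```

The big hub (node 0) absorbs four filtered leaves and so ends up larger than node 5, which absorbs only one, mirroring how NFSA lets node sizes record cumulative information loss.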
Chapter 4
Experimentation
To test how the simplification algorithm works on different sizes of scale-free networks, ten random scale-free networks were generated using NetworkX in Python. The results are shown in Tables 1 and 2.
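The thesis generates its test networks with NetworkX's scale-free generators; as a dependency-free illustration of the same idea, a minimal preferential-attachment (Barabási-Albert-style) generator looks like this (our sketch, not the thesis code):

```python
import random

def barabasi_albert(n, m, seed=None):
    """Grow a scale-free network: each new node attaches to m existing
    nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    edges = [(i, j) for i in range(m) for j in range(i + 1, m + 1)][:max(1, m - 1)]
    edges = [(i, i + 1) for i in range(m - 1)] or [(0, 1)]  # seed path on m nodes
    targets = [v for e in edges for v in e]        # degree-weighted sampling pool
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:                     # m distinct attachment targets
            chosen.add(rng.choice(targets))
        for tgt in chosen:
            edges.append((new, tgt))
            targets += [new, tgt]                  # both endpoints gain degree
    return edges

edges = barabasi_albert(1000, 2, seed=42)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
# Heavy tail: a few hubs, many low-degree nodes
print(max(degree.values()), sorted(degree.values())[len(degree) // 2])
```

The resulting degree sequence shows the expected heavy tail: the median degree stays near m while the maximum is far larger, the property the experiments here rely on.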
Table 1: Results of simplification on 10 simulated scale-free networks
The results indicate that, on average, the optimal strategy of simplification is to filter out the 91% of vertices that have a relatively small degree. The result is intuitive: even though the power law states that roughly 20% of vertices matter, considering the readability of a large scale-free network, retaining about 9% of vertices gives most of the information content with a lower incomprehensibility. By applying the result to large scale-free networks that contain millions or billions of nodes, the simplified networks would theoretically provide the optimal balance between information content and the incomprehensibility measure.

Table 2: Original networks and simplified networks based on optimal filter strategy
Simplified networks in Table 2 show that the algorithm captures the nodes that play the role of hubs. From the perspective of understanding the structure of networks, the simplified networks provide clearer structures, which might not be clear in the original network. Moreover, since the number of nodes and edges decreased, the simplified network shows the important nodes and their connections more clearly than the original networks. In other words, by balancing the size of networks and the information lost based on Hick's law, the results provide an optimal filtering strategy that offers a way of understanding the structure and important nodes of a given network faster and more easily. In a real-life scenario, we might be interested in finding the hubs and their connections, so that we can spread information, or break down the system spreading an epidemic, faster and more efficiently.
Chapter 5
Conclusions
A simplification technique to increase the readability of network visualizations is presented in this thesis. To quantify the process of simplification, the Shannon entropy is used to measure the information content of networks and the information loss during the simplification. In addition, an incomprehensibility measure is proposed based on Hick's law; it measures the visualization quality of networks. By measuring the information content of the network during the simplification, a filtering strategy can be found that balances the information content and the incomprehensibility measure of a given network. By formulating the simplifying process as a binary optimization problem, the optimal filtering strategy can be found easily. Moreover, an algorithm is developed to indicate cumulative information loss by setting the sizes of nodes based on the information loss caused by their filtered neighbors. Overall, this methodology addresses the problems described above. Specifically:

1. Congested or Incomprehensible Visuals: our approach reduces the number of nodes and outputs a network that is theoretically optimal with respect to balancing information content and incomprehensibility measures. The results can be used as an input to a further visualization process.
2. Neglect of Neighbors in Resizing: our approach sets a given node's
size based on the information loss caused by its filtered neighbors. The greater
the size of a node, the more accessibility the node has.
3. Uncertainty in Simplification: users can find the optimal filtering strategy
by maximizing I(G) - IC(G).
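As a sketch of point 3, one way to find the filtering strategy is to scan degree-rank cutoffs and keep the one maximizing the entropy term minus the Hick's-law size term. The function name `best_cutoff`, the degree-proportional probability p_i = d_i / Σ_j d_j, the rough edge-count estimate, and α = β = 1 are all assumptions made for this illustration.

```python
import math

def best_cutoff(degrees, alpha=1.0, beta=1.0):
    """Scan cutoffs k over degree-ranked nodes; return the k maximizing
    alpha*log2(information content) - beta*log2(network size)."""
    total = sum(degrees.values())
    ranked = sorted(degrees.values(), reverse=True)
    best_k, best_obj = 1, float("-inf")
    for k in range(1, len(ranked) + 1):
        kept = ranked[:k]
        # Information content of retained nodes: sum of log2(1/p_i)
        # with p_i proportional to degree (an assumption of this sketch).
        info = sum(math.log2(total / d) for d in kept)
        # Rough size term: half the retained degree sum approximates edges.
        edges_kept = sum(kept) / 2
        obj = (alpha * math.log2(max(info, 1e-12))
               - beta * math.log2(max(edges_kept, 1.0)))
        if obj > best_obj:
            best_k, best_obj = k, obj
    return best_k
```

Note that in dense networks the size penalty grows with each retained hub, so the scan typically stops well short of the full vertex set.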
The work can be applied to different problems that deal with scale-free networks.
For example, consider the spread of epidemic diseases through social networks. By
looking at the degree distribution, one can easily find potential hubs that would
cause a faster spread of the disease if any of those people were infected. However,
given hubs with the same degree, their accessibility to other parts of the network
might differ, and the method proposed in this thesis can find hubs with
higher accessibility by considering the information lost during the simplification. In
other words, among hubs with the same degree, the bigger the size of a node, the more
accessibility it has, and therefore the higher the priority it should be given.
Similarly, given a limited chance to spread information through a network, the best
approach is to spread it through hubs. Spreading information
through hubs that are not connected to one another and that have higher accessibility provides
a faster spread, and such hubs can be identified from the simplified network.
The limitations of the work are as follows:
1. Our method only considers scale-free networks, but many
networks are not necessarily scale-free; generalizations to different
types of networks are needed in future work.
2. Modeling the probability based on degree might be an intuitive way to
measure the information content of scale-free networks; however, it is not
general enough to capture all the information in a network, and it will not be
sufficient when the method is applied to non-scale-free networks.
3. The simplified network produced by our method might not have the same properties
as the original network; for example, the distributions of degree, betweenness
centrality, closeness centrality, etc. might differ, which causes additional
information loss that is not considered in our method.
The problem can be extended to simplify a given connected network while preserving
its connectivity by simply adding an additional constraint, as equation (4) shows.
\[
\begin{aligned}
\max_{v}\quad & \alpha \log_2\!\left[\sum_{i}^{n} v_i \log_2\!\left(\frac{1}{p_i}\right)\right] - \beta \log_2\!\left(d \sum_{i}^{n} \sum_{j}^{n} a_{i,j} v_i\right) \\
\text{subject to}\quad & 0 \le v_i \le 1, \quad i = 1, \dots, n, \\
& \sum_{j}^{n} a_{i,j} v_i \ge 2, \quad i = 1, \dots, n, \\
& v_i \in \mathbb{Z}.
\end{aligned}
\tag{4}
\]
where α and β are used to scale the size and the information content of the given
network; v_i is the binary decision variable representing whether vertex i should
be removed: if v_i = 1, vertex i is retained, and v_i = 0 otherwise; a_{i,j} is an
element of the given adjacency matrix A_{n,n}. When the network is directed, d = 1;
when it is undirected, d = 1/2. The second constraint ensures that the network remains
connected after the simplification.
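For intuition, the binary program in equation (4) can be solved by brute-force enumeration on a very small network. This is a hedged sketch under stated assumptions: the helper name `solve_small`, the choice d = 1/2 (undirected), α = β = 1, and the reading of the connectivity constraint as "every retained node keeps at least two retained neighbors" are all illustrative; a realistic instance would need an integer-programming solver rather than enumeration.

```python
import itertools
import math

def solve_small(adj, p, alpha=1.0, beta=1.0):
    """Enumerate all v in {0,1}^n and return the best feasible selection.

    adj: n x n 0/1 adjacency matrix of an undirected network (d = 1/2).
    p:   selection probability p_i of each node.
    """
    n = len(adj)
    best_v, best_obj = None, float("-inf")
    for bits in itertools.product((0, 1), repeat=n):
        # Objective terms: retained information content and network size.
        info = sum(bits[i] * math.log2(1 / p[i]) for i in range(n))
        size = 0.5 * sum(adj[i][j] * bits[i] for i in range(n) for j in range(n))
        if info <= 0 or size <= 0:
            continue  # log2 undefined, e.g. for the empty selection
        # Connectivity-style constraint, interpreted here as: every
        # retained node must keep at least two retained neighbors.
        if any(bits[i] and sum(adj[i][j] * bits[j] for j in range(n)) < 2
               for i in range(n)):
            continue
        obj = alpha * math.log2(info) - beta * math.log2(size)
        if obj > best_obj:
            best_v, best_obj = bits, obj
    return best_v
```

On a 4-cycle with uniform probabilities the only feasible selection is the full vertex set, since dropping any node leaves a retained neighbor with fewer than two retained neighbors.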
In the future, we will construct the probability in the \(\log_2(1/p_i)\) term using a weighted
combination of other centrality measures chosen according to user interests, so that the method can
capture more features of networks. Specifically,
\[
p_i = \frac{\sum_{cen \in C} w_{cen} \cdot freq(cen_i)}{|C|\,|N|},
\]
where \(p_i\) is the probability of picking node \(i\) from a given network \(G\) if we randomly select
one node from the node set \(N\) of the network; \(C\) is the set of centralities of interest, which
contains centrality measures such as degree centrality, closeness centrality,
betweenness centrality, or any other centrality measure of interest; \(cen \in C\),
and \(w_{cen}\) is the weight users give to the centrality \(cen\); \(freq(cen_i)\) is the
number of nodes that have the same value of centrality \(cen\) as node \(i\); \(|C|\) is the cardinality of the
set of centralities of interest; and \(|N|\) is the number of nodes in network \(G\). The weights
can be learned from preferred simplified networks produced by other, more complex methods,
or they can be set by users.
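The proposed weighted-centrality probability could be computed as in the sketch below. The function name `weighted_probability` and the dictionary-based representation of centrality values are illustrative assumptions, not part of the thesis.

```python
def weighted_probability(node, centralities, weights, n_nodes):
    """Compute p_i = sum_{cen in C} w_cen * freq(cen_i) / (|C| * |N|).

    centralities: {centrality_name: {node: value}} for each cen in C.
    weights:      {centrality_name: w_cen} set by the user.
    n_nodes:      |N|, the number of nodes in the network.
    """
    total = 0.0
    for name, values in centralities.items():
        # freq(cen_i): how many nodes share node i's value of this centrality.
        freq = sum(1 for v in values.values() if v == values[node])
        total += weights[name] * freq
    return total / (len(centralities) * n_nodes)
```

For example, with a single centrality (degree) weighted 1.0 on a three-node path, the two endpoints share a degree value and so receive a higher p_i than the unique middle node.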