Modelling and mining complex networks
Kaushalya Madhawa
Source: Facebook Engineering blog
What is a graph?
● Graph theory started with Euler’s solution to the problem of the Königsberg bridges in 1736
● In simple terms, a graph is a set of vertices (V) connected by a set of edges (E)
○ Vertices: entities
○ Edges: pairwise relations among vertices
○ Edges can optionally have a direction and a weight
● Graphs can be used to model many real-world datasets
Analysis of graph datasets
● Graph data sets have been studied in the past
○ they were small
○ visual inspection could reveal a lot of information
Now:
● Networks are increasingly large, with millions or billions of nodes
○ impossible to visualize
Types of networks
● Social networks
○ Phone call networks, email networks
● Knowledge and information networks
○ The web, peer-to-peer networks, blog networks
● Technology networks
○ Power grid, transportation networks
● Biological networks
○ Protein-protein interaction networks, gene regulation networks
Network science
● The field which studies complex networks
● Draws theories and methods from many fields
○ Mathematics: graph theory
○ Physics: statistical mechanics
○ Computer science: data mining, information visualization
○ Sociology: social structure
● Understanding networks
○ Understand their topology and measure their properties
○ Study their evolution and dynamics
○ Create realistic models
○ Create algorithms that make use of the network structure
Frieze, Gionis, and Tsourakakis, Algorithmic Techniques for Modeling and Mining Large Graphs
Describing a network: network properties
● Density: ratio of the number of edges (E) to the number of possible edges
● Size: number of nodes
● Average degree: average number of edges per node
● Average path length: average number of steps it takes to get from one member of the network to another
● Network diameter: the longest of all the shortest paths between pairs of nodes in a network
● Clustering coefficient: measures the "all-my-friends-know-each-other" property
● Connectedness: the way in which the network is connected
● Node centrality: a set of measures to identify the most important nodes
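Several of the properties above can be computed directly from an adjacency structure. The following is a minimal sketch, assuming a small undirected, connected toy graph (the edge list and node names are hypothetical, not taken from the dataset discussed in these slides). It uses only the Python standard library; degree centrality is shown as one example of a centrality measure.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy network: an undirected graph given as an edge list.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]

# Build an adjacency dict: node -> set of neighbours.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

n = len(adj)                       # size: number of nodes
m = len(edges)                     # number of edges
density = m / (n * (n - 1) / 2)    # edges / possible edges (undirected, no self-loops)
avg_degree = 2 * m / n             # every edge adds one to two node degrees


def bfs_distances(source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


all_dist = {u: bfs_distances(u) for u in adj}

# Connectedness: every node reaches every other node.
connected = all(len(d) == n for d in all_dist.values())

# The two measures below assume the graph is connected.
pair_lengths = [all_dist[u][v] for u, v in combinations(sorted(adj), 2)]
avg_path_length = sum(pair_lengths) / len(pair_lengths)
diameter = max(pair_lengths)       # longest shortest path

# Degree centrality: degree normalised by the maximum possible degree.
degree_centrality = {u: len(adj[u]) / (n - 1) for u in adj}

print(n, density, avg_degree, avg_path_length, diameter, connected)
```

On this toy graph the density is 0.5, the average path length is 1.7 and the diameter is 3. For networks with millions of nodes, all-pairs BFS is too expensive, and sampling or a dedicated graph library would be the more realistic choice.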
A sample dataset
• Multiple mobile operators in Sri Lanka have provided four different types of metadata
– Call Detail Records (CDRs)
    • Records of calls
    • SMS
    • Internet access
– Airtime recharge records
• Data sets do not include any Personally Identifiable Information
– All phone numbers are pseudonymized
– LIRNEasia does not maintain any mappings of identifiers to original phone numbers
• Covers 50-60% of users
CDR: What is the underlying graph?
● Vertices: users, base stations
● Edges: calls, texts
● Edge weights: number of calls between 2 vertices, number of texts between 2 vertices
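The mapping from records to a weighted graph can be sketched as follows. The record layout `(caller_id, callee_id, record_type)` and the identifiers are hypothetical and only illustrate the idea; real CDRs carry more fields. The sketch builds an undirected user-to-user graph; keying on base station identifiers instead would give a tower-to-tower graph in the same way.

```python
from collections import Counter

# Hypothetical pseudonymized records: (caller_id, callee_id, record_type).
records = [
    ("u1", "u2", "call"),
    ("u2", "u1", "call"),
    ("u1", "u2", "sms"),
    ("u1", "u3", "call"),
    ("u3", "u2", "sms"),
    ("u1", "u2", "call"),
]

call_weight = Counter()  # edge -> number of calls between the two vertices
sms_weight = Counter()   # edge -> number of texts between the two vertices

for caller, callee, kind in records:
    # Sort the pair so that (u1, u2) and (u2, u1) map to the same undirected edge.
    edge = tuple(sorted((caller, callee)))
    if kind == "call":
        call_weight[edge] += 1
    elif kind == "sms":
        sms_weight[edge] += 1

# Vertices are all users that appear on at least one edge.
vertices = {user for edge in list(call_weight) + list(sms_weight) for user in edge}

print(sorted(vertices))
print(dict(call_weight))
print(dict(sms_weight))
```

Dropping the `sorted` call would keep the direction of each call, giving a directed graph instead.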
Properties of real-world networks
● Power law degree distribution
○ p_k: the fraction of vertices in the network that have degree k
○ The degree distribution of a network can be visualized by making a histogram of the p_k values
● Heavy-tailed distribution: existence of nodes that have very high degree
● Scale-free: the average degree is not informative
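The p_k values described above can be computed by counting how many vertices have each degree. This is a minimal sketch on a hypothetical toy edge list with one hub node; it is far too small to show a real power law, but it illustrates why the mean degree says little when a few nodes have very high degree.

```python
from collections import Counter

# Hypothetical toy edge list: one hub connected to six nodes, plus two extra edges.
edges = [("hub", x) for x in "abcdef"] + [("a", "b"), ("c", "d")]

# Degree of each vertex.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

n = len(degree)

# p[k] = fraction of vertices that have degree k.
count_by_degree = Counter(degree.values())
p = {k: count / n for k, count in sorted(count_by_degree.items())}

mean_degree = sum(degree.values()) / n
max_degree = max(degree.values())

# A simple text histogram of the p_k values.
for k, fraction in p.items():
    print(f"k={k}: {'#' * round(fraction * 20)} {fraction:.2f}")
print(f"mean degree = {mean_degree:.2f}, max degree = {max_degree}")
```

Here the mean degree is about 2.3 while the hub has degree 6. In real heavy-tailed networks the gap between the typical and the maximum degree is much larger.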
Properties of real-world networks...
● Transitivity (Clustering)
○ “The friend of your friend is likely also to be your friend.”
○ If vertex A is connected to vertex B, and vertex B is connected to vertex C, then there is a heightened probability that vertex A is also connected to vertex C.
○ Measured by the clustering coefficient
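Two common ways of measuring this are the local clustering coefficient of a vertex (the fraction of its neighbour pairs that are themselves connected) and the global transitivity (the fraction of connected triples that are closed into triangles). The sketch below computes both on a hypothetical four-vertex graph; the graph and names are illustrative only.

```python
from itertools import combinations

# Hypothetical toy graph: a triangle A-B-C with a pendant vertex D attached to C.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def local_clustering(node):
    """Fraction of pairs of neighbours of `node` that are directly connected."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no neighbour pairs; 0 by convention
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)


# Global transitivity: closed triples / all connected triples.
triples = 0
closed = 0
for centre in adj:
    for u, v in combinations(adj[centre], 2):
        triples += 1          # u - centre - v is a connected triple
        if v in adj[u]:
            closed += 1       # u and v are also connected: the triple is closed

transitivity = closed / triples
average_clustering = sum(local_clustering(v) for v in adj) / len(adj)

print({v: round(local_clustering(v), 3) for v in sorted(adj)})
print(round(average_clustering, 3), transitivity)
```

On this graph A and B have local coefficient 1, C has 1/3 and D has 0; the global transitivity is 3/5.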
Properties of real-world networks...
● Communities
○ “A set of vertices densely connected to each other and sparsely connected to the rest of the graph”
○ Real-world insights can be gained from community structure
■ Metabolic networks have communities based on functional groupings
■ Communities in social networks can be formed based on common location, interests, occupation, etc.
Finding the community structure
● There are multiple approaches to finding community structure in a network
● Modularity maximization is one of the most widely used methods
○ Modularity Q = (edges inside the community) - (expected number of edges inside the community)
○ The goal of such algorithms is to find the community structure with the highest modularity
○ Since modularity maximization is NP-hard, heuristic methods are used
M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks”, Physical Review E, APS, Vol. 69, No. 2, p. 1-16, 2004.
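As a worked illustration, modularity can be computed for a given partition in its usual normalized form, Q = Σ_c [ L_c / m − (d_c / 2m)² ], where m is the total number of edges, L_c the number of edges inside community c and d_c the sum of degrees of its vertices. The sketch below assumes an undirected, unweighted graph and a hypothetical toy example of two triangles joined by a single edge. It only evaluates partitions; it does not search for the best one.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; communities: list of sets covering all nodes.
    """
    m = len(edges)
    label = {node: i for i, community in enumerate(communities) for node in community}
    inside = [0] * len(communities)      # L_c: edges with both ends in community c
    degree_sum = [0] * len(communities)  # d_c: total degree of community c
    for u, v in edges:
        degree_sum[label[u]] += 1
        degree_sum[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(
        inside[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

q_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])   # the two triangles
q_single = modularity(edges, [{0, 1, 2, 3, 4, 5}])    # everything in one community
q_bad = modularity(edges, [{0, 1, 3}, {2, 4, 5}])     # a partition that cuts the triangles

print(round(q_split, 4), round(q_single, 4), round(q_bad, 4))
```

The partition into the two triangles gives Q = 5/14 ≈ 0.357, a single community gives Q = 0, and the partition that cuts across the triangles gives a negative value. Heuristic methods search the space of partitions for high values of this score.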
Properties of real-world networks...
● Small-world phenomenon
○ Most pairs of vertices are connected by a short path through the network.
○ S. Milgram’s famous experiment demonstrated the small-world effect
○ Each Facebook user is connected to every other user by an average of three and a half other people. With more interconnections, the degree of separation on Facebook is shrinking over time https://research.facebook.com/blog/three-and-a-half-degrees-of-separation/
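The effect of extra interconnections on separation can be illustrated on a toy graph. The sketch below is an assumption-laden illustration, not a model of Facebook: it takes a ring of 20 nodes, measures the average shortest-path length with breadth-first search, then adds two hypothetical long-range links and measures again.

```python
from collections import deque


def average_separation(adj):
    """Average shortest-path length over all ordered pairs of connected nodes."""
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs


def ring(n):
    """Ring of n nodes, each linked to its two immediate neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_edge(adj, u, v):
    adj[u].add(v)
    adj[v].add(u)


before_graph = ring(20)
before = average_separation(before_graph)

after_graph = ring(20)
for u, v in [(0, 10), (5, 15)]:  # two hypothetical long-range shortcuts
    add_edge(after_graph, u, v)
after = average_separation(after_graph)

print(f"ring only: {before:.2f}, with shortcuts: {after:.2f}")
```

Adding edges can never lengthen a shortest path, so the average can only stay the same or shrink; here a couple of shortcuts already reduce it noticeably.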
Communities in a mobile network
● Louvain method [1] applied to tower-to-tower call network
● The community structure of Sri Lanka having the highest modularity consists of 11 clusters.
[1] V. Blondel and J. Guillaume, “Fast unfolding of communities in large networks,” J. Stat. …, pp. 1–12, 2008.
Limitations of modularity
● Often fails to detect communities smaller than a certain size (resolution limit)
● In real-world networks, nodes can belong to more than one community
Machine Learning in network analysis
● Finding low-dimensional feature representations of large networks
○ DeepWalk:
■ uses deep learning based word-embedding techniques developed for natural language modelling.
■ Set of vertices considered as the vocabulary
○ GraRep
○ LINE
Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "DeepWalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.
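The idea of treating vertices as a vocabulary can be sketched by generating short random walks, which then play the role of sentences. The code below covers only this corpus-building step on a hypothetical toy graph; the embedding step, in which a skip-gram style word-embedding model is trained on the walks, is omitted. The parameter names (`walks_per_node`, `walk_length`, `seed`) are illustrative, not taken from the paper.

```python
import random

# Hypothetical toy graph as an adjacency list: node -> list of neighbours.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}


def random_walk(start, length, rng):
    """Uniform random walk of `length` vertices starting at `start`."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


def build_corpus(walks_per_node, walk_length, seed=0):
    """One 'sentence' per walk; every vertex starts `walks_per_node` walks."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(adj)
        rng.shuffle(nodes)  # visit start vertices in random order on each pass
        for node in nodes:
            corpus.append(random_walk(node, walk_length, rng))
    return corpus


corpus = build_corpus(walks_per_node=3, walk_length=6)

# The vocabulary is the set of vertices that appear in the walks.
vocabulary = sorted({node for walk in corpus for node in walk})

print(len(corpus), "walks; vocabulary:", vocabulary)
print(" ".join(corpus[0]))
```

Vertices that co-occur often in walks end up with similar embeddings once a word-embedding model is trained on this corpus, in the same way that words sharing contexts do in text.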
Machine Learning in network analysis...
● Anomaly detection
○ Bayesian anomaly detection methods are used to detect anomalies in large dynamic networks
● Link prediction
● Maximizing information diffusion in networks
○ Bayesian networks are used to model belief propagation in networks.
Network analysis tools and software
● igraph
● networkx
● Giraph
● Pajek
Free and open data sets
● Stanford Large Network Dataset Collection https://snap.stanford.edu/data/
● UCI Network Data Repository https://networkdata.ics.uci.edu/resources.php
● http://www-personal.umich.edu/~mejn/netdata/