statistical analysis of network data€¦ · statistical analysis of network data with r is book is...
TRANSCRIPT
Statistical Analysis of Network Data
Lecture 1: Intro, Background, & Descriptive Statistics
Eric D. Kolaczyk
Dept of Mathematics and Statistics, Boston University
Les Houches, avril 2014
Welcome
Outline
1 Welcome2 Background & Motivation
Network Analysis & StatisticsPrelude
3 Review of GraphsDefinitions and ConceptsGraphs & Matrix AlgebraGraphs Data Structures & Algorithms
4 Descriptive Statistics for NetworksNetwork MappingNetwork Characterization
Vertex DegreeCentralityComponentsConnectivity / CutsDynamic Networks
5 Wrapping UpLes Houches, avril 2014
Welcome
Topics for my lectures
L1 Introduction, Background, and Descriptive Statistics (1.5hrs)
L2 Network Sampling (1hr)
L3 Network Modeling (1.5hrs)
L4 Additional Topics in Modeling/Analysis (1.5hr)
Les Houches, avril 2014
Welcome
Resources
Organization and presentation of material in these lectures will largelyparallel that in
1
USE R !
ISBN 978-1-4939-0982-7
Use R ! Use R !
Eric KolaczykGábor Csárdi
Statistical Analysis of Network Data with R
Statistical Analysis of Network Data
with R
Kolaczyk · Csárdi
Eric Kolaczyk · Gábor Csárdi
Statistical Analysis of Network Data with R
Th is book is the fi rst of its kind in network research. It can be used as a stand-alone resource in which multiple R packages are used to illustrate how to use the base code for many tasks. igraph is the central package and has created a standard for developing and manipulating network graphs in R. Measurement and analysis are integral components of network research. As a result, there is a critical need for all sorts of statistics for network analysis, ranging from applications to methodology and theory. Networks have permeated everyday life through everyday realities like the Internet, social networks, and viral marketing, and as such, network analysis is an important growth area in the quantitative sciences. Th eir roots are in social network analysis going back to the 1930s and graph theory going back centuries. Th is text also builds on Eric Kolaczyk’s book Statistical Analysis of Network Data (Springer, 2009).
Eric Kolaczyk is a professor of statistics, and Director of the Program in Statistics, in the Department of Mathematics and Statistics at Boston University, where he also is an affi liated faculty member in the Bioinformatics Program, the Division of Systems Engineering, and the Program in Computational Neuroscience. His publications on network-based topics, beyond the development of statistical methodology and theory, include work on applications ranging from the detection of anomalous traffi c patterns in computer networks to the prediction of biological function in networks of interacting proteins to the characterization of infl uence of groups of actors in social networks. He is an elected fellow of the American Statistical Association (ASA) and an elected senior member of the Institute of Electrical and Electronics Engineers (IEEE).
Gábor Csárdi is a research associate at the Department of Statistics at Harvard University, Cambridge, Mass. He holds a PhD in Computer Science from Eötvös University, Hungary. His research includes applications of network analysis in biology and social sciences, bioinformatics and computational biology, and graph algorithms. He created the igraph soft ware package in 2005 and has been one of the lead developers since then.
Statistics
9 781493 909827
Les Houches, avril 2014
Welcome
Topics for this lecture
1 Background & Motivation
2 (Brief!) review of graph-related concepts
3 Descriptive Statistics for Networks
Network MappingNetwork Characterization
Les Houches, avril 2014
Background & Motivation
Outline
1 Welcome2 Background & Motivation
Network Analysis & StatisticsPrelude
3 Review of GraphsDefinitions and ConceptsGraphs & Matrix AlgebraGraphs Data Structures & Algorithms
4 Descriptive Statistics for NetworksNetwork MappingNetwork Characterization
Vertex DegreeCentralityComponentsConnectivity / CutsDynamic Networks
5 Wrapping UpLes Houches, avril 2014
Background & Motivation Network Analysis & Statistics
Why Networks?
Pdp
dCLK Cyc
Tim
VriPer
Relatively small ‘field’ of study until ∼ 15 years ago
Epidemic-like spread of interest in networks since mid-90s
Arguably due to various factors, such as
Increasingly systems-level perspective in science,away from reductionism;Flood of high-throughput data;Globalization, the Internet, etc.
Les Houches, avril 2014
Background & Motivation Network Analysis & Statistics
What Do We Mean by ‘Network’?
Definition (OED): A collection of inter-connected things.
Caveat emptor: The term ‘network’ is used in the literature to meanvarious things.
Two extremes are
1 a system of inter-connected things
2 a graph representing such a system1
Often is not even clear what is meant when an author refers to ‘the’network!
1I’ll use the slightly redundant term ‘network graph’.
Les Houches, avril 2014
Background & Motivation Network Analysis & Statistics
Our Focus . . .
The statistical analysis of network data
i.e., analysis of measurements either of or from a systemconceptualized as a network.
Challenges:
relational aspect to the data;
complex statistical dependencies (often the focus!);
high-dimensional and often massive in quantity.
Les Houches, avril 2014
Background & Motivation Network Analysis & Statistics
Examples of Networks
Network-based perspective has been brought to bear on problems fromacross the sciences, humanities, and arts.
A convenient (but nevertheless only rough/approximate) grouping ofnetworks is the following:
Technological
Biological
Social
Informational
Les Houches, avril 2014
Background & Motivation Network Analysis & Statistics
Examples of Networks (cont)
a1
a2
a3
a4a5
a6
a7
a8
a9
a10
a11
a12
a13
a14
a15
a16
a17
a18
a19
a20
a21
a22
a23a24a25
a26
a27
a28a29
a30
a31
a32
a33a34
“Know thy data!” (W. Willinger)Les Houches, avril 2014
Background & Motivation Network Analysis & Statistics
Statistics and Network Analysis
The (still emerging?) field of ‘network science’ started very ‘horizontal’,although it is increasingly filling out with greater ‘vertical’ depth.
Lots of ‘players’ . . . uneven depth across the ‘field’ . . . mixed levels ofcommunication/cross-fertilization.
Note: Statisticians arguably a (growing) minority in this area!
But from a statistical perspective, there are certain canonical tasks andproblems faced in the questions addressed across the different areas ofspecialty.
Better vertical depth can achieved in this area by viewing problems – andpursuing solutions – from this perspective.
Les Houches, avril 2014
Background & Motivation Prelude
Statistical Analysis of Network Data: Prelude
The unique relational nature of network data means that we frequentlyencounter challenges in canonical tasks that differ in important ways fromthose typically faced in the statistical analysis of ‘standard’ data.
We will examine how such challenges arise in relation to
visualization;
summary & description;
sampling and inference;
and modeling.
Les Houches, avril 2014
Background & Motivation Prelude
Statistical Analysis of Network Data: Prelude (cont.)
Let’s look at some examples of how the unique relational nature of networkdata makes seemingly familiar statistical problems change in character.
1 Mapping Science
2 Understanding Epilepsy
3 Monitoring Social Media
4 Predicting Protein Function
Les Houches, avril 2014
Background & Motivation Prelude
Mapping Science
How does one go about‘mapping’ the ‘land-scape’ of ‘Science’?
Statistical challenges:
Defining thepopulation ofinterest.
Representativenessof our data.
Appropriate notionsof units (i.e.‘vertex’and ‘edge’).
How to visualize?
Les Houches, avril 2014
Background & Motivation Prelude
Understanding Epilepsy
How can we effectivelysummarize/describe thecomplex interactionstaking place during anepileptic seizure?
Statistical challenges:
Criterion fordefining ‘brainnetworks’.
Choice of networksummary statistics.
Assessing‘significance’ ofchange/differences.
Les Houches, avril 2014
Background & Motivation Prelude
Monitoring Social Media
Can we monitor basic charac-teristics of (typically massive!)social media networks based onsamples?
Statistical challenges:
Computer protocolscorrespond to what typesof sampling designs?
What sort of bias(es) areinherent in the sampling?
Can we correct for suchbiases?
Les Houches, avril 2014
Background & Motivation Prelude
Predicting Protein Function
Is it possible to use knowledge ofprotein-protein interactions to predictbiological function of proteins?
Statistical challenges:
To what extent do interactingproteins share commonfunction?
How do we incorporate anetwork as an‘explanatory/predictor variable’?
Can we account for uncertaintyin training data and/or network?
Les Houches, avril 2014
Background & Motivation Prelude
Dramatic Pause!
Lot’s of interesting issues . . .
. . . shall we get started?
Les Houches, avril 2014
Review of Graphs
Outline
1 Welcome2 Background & Motivation
Network Analysis & StatisticsPrelude
3 Review of GraphsDefinitions and ConceptsGraphs & Matrix AlgebraGraphs Data Structures & Algorithms
4 Descriptive Statistics for NetworksNetwork MappingNetwork Characterization
Vertex DegreeCentralityComponentsConnectivity / CutsDynamic Networks
5 Wrapping UpLes Houches, avril 2014
Review of Graphs
Networks & Graphs
The language of graphs typically is adopted in talking about networks.
Will assume that everyone is fluent in the basics . . .. . . but a (very) quick review perhaps won’t hurt!
Les Houches, avril 2014
Review of Graphs Definitions and Concepts
Graphs
Formally2, a graph G = (V ,E ) is a mathematical structure consisting ofsets
V of vertices(also commonly called nodes)
E of edges(also commonly called links),
where elements of E are unordered pairs {u, v} of distinct verticesu, v ∈ V .
The values Nv = |V | and Ne = |E | are the order and size of the graph,respectively.
2Even more formally, graphs are defined uniquely only up to isomorphisms, i.e.,relabeling of vertices and edges that leave the structure unchanged.Les Houches, avril 2014
Review of Graphs Definitions and Concepts
Some nuances . . .
More generally, graphs may have
loops, and/or
multi-edges
A graph with either is called a multi-graph.
We will typically assume the absence of loops and multi-edges, i.e., thatour graphs are simple.
Les Houches, avril 2014
Review of Graphs Definitions and Concepts
Directed Graphs
A graph G for which each edge in E has an ordering to its vertices,
i.e., {u, v} is distinct from {v , u}, for u, v ∈ V
is called a directed graph or digraph.
Such edges are called directed edges or arcs, with the direction of an arc{u, v} read from left to right, from the tail u to the head v .
Arcs are said to be mutual if they share the same vertex pair, but with thevertices playing opposite roles of head and tail for each arc.
Les Houches, avril 2014
Review of Graphs Definitions and Concepts
Connectivity
It is necessary to have a language for discussing the connectivity of agraph. One of the most basic notions of connectivity is that of adjacency.
1 Two vertices u, v ∈ V are said to be adjacent if joined by an edge inE .
2 Similarly, two edges e1, e2 ∈ E are adjacent if joined by a commonendpoint in V .
In addition, a vertex v ∈ V is incident on an edge e ∈ E if v is anendpoint of e.
Les Houches, avril 2014
Review of Graphs Definitions and Concepts
Degree
The notion of ‘degree’ is used to summarize the connectivity of a vertex.
The degree of a vertex v , say dv , is the number of edges incident on v .
The degree sequence of a graph G is the sequence formed by arranging thevertex degrees dv in non-decreasing order, i.e.,
d(1) ≤ d(2) ≤ · · · ≤ d(Nv−1) ≤ d(Nv ) .
Les Houches, avril 2014
Review of Graphs Definitions and Concepts
Degree (cont.)
Note that
1 the sum of the elements of the degree sequence is equal to twice thenumber of edges in the graph (i.e., twice the size of the graph).
2 for digraphs, vertex degree is replaced by in-degree (i.e., d inv ) and
out-degree (i.e., doutv ), which count the number of edges pointing in
towards and out from a vertex, respectively.
Hence, digraphs have both an in-degree sequence and an out-degreesequence.
Les Houches, avril 2014
Review of Graphs Definitions and Concepts
Additional Concepts
Beyond these basics, it is useful to be able to discuss concepts of
MovementE.g., walks, trails, and paths; circuits and cycles.
Reachability / Connectivity / ComponentsE.g., Giant connected component; strong and weakly connectedcomponents.
DistanceE.g., shortest path / geodesic distance; diameter.
Les Houches, avril 2014
Review of Graphs Definitions and Concepts
Other ‘Flavors’ of Networks
There are a number of additional types of networks – beyond the basic(un)directed varieties – now commonly studied:
weighted networks
bipartite networks
multi-relational networks
dynamic/temporal networks
We will encounter a few of these here and there.
Les Houches, avril 2014
Review of Graphs Graphs & Matrix Algebra
Graphs and Matrix Algebra
Useful in the modeling and analysis of network data to be able tocharacterize a graph G and certain aspects of its structure using matricesand matrix algebra3.
The fundamental connectivity of a graph G may be captured in theNv × Nv binary, symmetric adjacency matrix A, where
Aij =
{1, if {i , j} ∈ E ,
0, otherwise ,
In words, A is non-zero for entries whose row-column indices correspond tovertices in G joined by an edge, and zero, for those that are not.
3Formal blending of graph theory with matrix algebrais the focus of algebraic graph theory.Les Houches, avril 2014
Review of Graphs Graphs & Matrix Algebra
Some Properties of A
More than just a storage mechanism, various operations applied to A yieldinformation on G .
Examples include
Degree
di = Ai+ =∑j
Aij
Walks
Arij = # of walks of length r between i and j on G
Eigen-structure
E.g., G is a regular graph if and only if the maximum degree dmax ofG is an eigenvalue of A.
Les Houches, avril 2014
Review of Graphs Graphs Data Structures & Algorithms
Graph Data Structures and Algorithms
The study of graph
data structures, and
algorithms
facilitates the transition from
graphs as purely mathematical objects to
graphs as practical tools for network modeling and analysis.
Contributions in this area primarily due to computer science.
Les Houches, avril 2014
Review of Graphs Graphs Data Structures & Algorithms
Graph Data Stuctures
Two common data structures for representing graphs:
1 adjacency matrixNv × Nv binary matrix (seen previously)
2 adjacency listAn array of size Nv , each element of which is a list, where the i-th listcontains the labels of the di vertices incident to vertex vi .
Also, a variation on adjacency lists is the idea of an edge list, which issimply a two-column list of all vertex pairs that are joined by an edge.
Les Houches, avril 2014
Review of Graphs Graphs Data Structures & Algorithms
Graph Algorithms
Suppose you have a graph stored on your computer. What questionsmight you like to ask?
For those used to working with ‘regular’ data (e.g., signals, images, surveysamples, regression analyses, etc.), it may come as a surprise that thenature of the question can be important!
Roughly speaking, questions can be broken down into categories:
1 Directly answerable from the adjacency matrix / list
2 Answerable, with a bit of work, in a ‘reasonable’ amount of time4
3 (Expected to be) unanswerable in any ‘reasonable’ amount of time
4Means polynomial in Nv and/or Ne .Les Houches, avril 2014
Descriptive Statistics for Networks
Outline
1 Welcome2 Background & Motivation
Network Analysis & StatisticsPrelude
3 Review of GraphsDefinitions and ConceptsGraphs & Matrix AlgebraGraphs Data Structures & Algorithms
4 Descriptive Statistics for NetworksNetwork MappingNetwork Characterization
Vertex DegreeCentralityComponentsConnectivity / CutsDynamic Networks
5 Wrapping UpLes Houches, avril 2014
Descriptive Statistics for Networks
Descriptive Statistics for Networks
Statisticians typically distinguish between descriptive and inferentialstatistics.
We will spend the rest of this lecture looking at the network analogue ofdescriptive statistics5, in the form of
network mapping
characterization of network graphs
May seem ‘soft’ . . . but it’s important!
This is basically descriptive statistics for networks.
Probably constitutes at least 2/3 of the work done in this area.
Note: It’s sufficiently different from standard descriptive statistics that it’ssomething unto itself.
5. . . and the remaining lectures looking at inferential statistics.Les Houches, avril 2014
Descriptive Statistics for Networks Network Mapping
Network Mapping
What is ‘network mapping’?
Production of a network-based visualization of a complex system.
What is ‘the’ network?
Network as a ‘system’ of interest;
Network as a graph representing the system;
Network as a visual object.
Analogue: Geography and the production of cartographic maps.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Mapping
Example: Mapping Belgium
Which of these is ‘the’ Belgium?
Les Houches, avril 2014
Descriptive Statistics for Networks Network Mapping
Three Stages of Network Mapping
Continuing our geography analogue . . . a fourth stage might be‘validation’.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Mapping
Stage 1: Collecting Relational Network Data
Begin with measurements on system ‘elements’ and ‘relations’.
Note that choice of ‘elements’ and ‘relations’ can produce very differentrepresentations of same system.
Sgg
Tim Dbt
Per
dCLKCyc
Pdp
dCLK Cyc
Tim
VriPer
Les Houches, avril 2014
Descriptive Statistics for Networks Network Mapping
Standard Statistical Issues Present Too!
Type of measurements (e.g., cont., binary, etc.) can influence qualityof information they contain on underlying ‘relation’.
Full or partial view of the system?(Analogues in spatial statistics . . .)
Sampling, missingness, etc.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Mapping
Stage 2: Constructing Network Graphs
Sometimes measurements are direct declaration of edge/non-edge status.
More commonly, edges dictated after processing measurements
comparison of ‘similarity’ metric to threshold
voting among multiple views (e.g., router I-net)
Frequently ad hoc . . .. . . sometimes formal (e.g., network inference).
Even with direct and error free observation of edges, decisions may bemade to thin edges, adjust topology to match additional variables, etc.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Mapping
Stage 3: Visualization
Goal is to embed a combinatorial object G = (V ,E ) intotwo- or three-dimensional Euclidean space.
Non-unique . . . not even well-defined!
Common to better define / constrain this problem by incorporating
conventions (e.g., straight line segs)
aesthetics (e.g., minimal edge crossing)
constraints (e.g., on relative placement of vertices, subgraphs, etc.)
Les Houches, avril 2014
Descriptive Statistics for Networks Network Mapping
Graph Layout: Art and Science.
Software for network visualization (aka graph layout) largely use a handfulof standard classes of methodsE.g., Circular, radial, analogies to physical systems, etc.
Many visualization packages, some general and some area-specific.
Examples in these talks primarily produced using
Pajek
R tools in igraph package
Better packages also allow for user interaction to manipulate further.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Mapping
Layout ... Does it Matter?
Yes!
Layered, circular, and h-v layouts of the same tree.Les Houches, avril 2014
Descriptive Statistics for Networks Network Mapping
Visualization and Scale: Attention Needed!
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●●●●
●●
●
●
●
●
●●●●●
●●●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●●
●●
●
●
●
●
●●●●●●●
●●●●●●●
●●
●●●●
●
●●
●●●●●●●
●●
●
●
●●●●●●●●●●
●●●●●●●●●●●
●
●●●●●●●●●●●●
●●●
●●●
●
●●●●●●●
●
●●●●
●
●
●
●●●
●●
●
●●●●
●●●●●●
●●
●
●●●●●●●
●
●
●
●●●●●
●
●
●
●
●
●
●●●●●●●●●●●●●●●
●
●
●
Cap21
Commentateurs Analystes
Les Verts
liberaux
Parti Radical de Gauche
PCF − LCR
PS UDF
UMP
Top: Visualizations at the level of blogs(left: energy-based placement;right: projection)
Bottom: Visualization at the level of
political parties.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Mapping
Mapping Dynamic Network Data
For dynamic networks the challenges are greater still, particularly forvisualization.
Options include use of weighted networks (left), slices (center), andedge-timelines (right).
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
0 to 12 hrs
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
12 to 24 hrs
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
24 to 36 hrs
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
36 to 48 hrs
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
48 to 60 hrs
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
60 to 72 hrs
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
72 to 84 hrs
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
84 to 96 hrs
0 20 40 60 80 100
020
040
060
080
010
00
Time (hours)
Inte
ract
ing
Pai
r (O
rder
ed b
y F
irst I
nter
actio
n)
ADM−ADMMED−MEDNUR−NURPAT−PATADM−MEDADM−NURADM−PATMED−NURMED−PATNUR−PAT
(Data represent person-to-person contacts over 96hrs within a hospital environment. (Src: Vanhems et al.))
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Characterization of Network Graphs: Intro
Given a network graph representation of a system (i.e., perhaps a result ofnetwork mapping), often questions of interest can be phrased in terms ofstructural properties of the graph.
social dynamics can be connected to patterns of edges among vertextriples;
routes for movement of information can be approximated by shortestpaths between vertices;
‘importance’ of vertices can be captured through so-called centralitymeasures;
natural groups/communities of vertices can be approached throughgraph partitioning.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Characterization Intro (cont.)
Structural analysis of network graphs ≈ descriptive analysis; this is astandard first (and sometimes only!) step in statistical analysis ofnetworks.
Main contributors of tools are
social network analysis,
mathematics & computer science,
statistical physics
Many tools out there . . . two rough classes based on scale:
characterization of vertices/edges, and
characterization of network cohesion.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Characterization of Vertices/Edges
Examples include
Degree distribution
Vertex/edge centrality
Role/positional analysis
We’ll look briefly at degree and centrality.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Vertex Degree
The degree6 dv of a vertex v is the number of vertices in V incident to v .
i.e., dv = The number of neighbors of v
Define
fd = fraction of vertices v ∈ V with degree dv = d .
The degree distribution{fd}d≥0
is a summary of local connectivity across the graph G .
6For weighted networks there is the analogous notion of vertex strength.Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Degree Distributions
The degree distribution is an object of fundamental importance in graphtheory.
As such, it has been a source of considerable interest in empirical studies.
Much attention has been focused on distinguishing between two generalflavors of distributions i.e.,
homogeneous and heterogeneous.
Degree Distribution for Random Graph
Degree
Fre
quen
cy
0 2 4 6 8
010
2030
40
Degree Distribution for Power−Law Graph
Degree
Fre
quen
cy
0 5 10 15 20
020
4060
80
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Fitting Heterogeneous Degree Distributions
Natural to want to fit these data.
But . . . in general (i.e., not just fornetworks) . . . the fitting of heavy-tailed densities is a non-trivial exer-cise.
Even in the case of a true power-lawdistribution, fitting must be handledwith some care.
Arguably best treated as a descriptivetask, rather than inferential.
(Shown: Data for yeast PPI network.)
Degree Distribution
Degree
Fre
quen
cy
0 20 40 60 80 100 120
050
010
0015
0020
00
1 2 5 10 20 50 100
5e-0
42e
-03
5e-0
32e
-02
5e-0
22e
-01
Log-Log Degree Distribution
Log-Degree
Log-
Inte
nsity
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Centrality: Motivation
Many questions related to ‘importance’ of vertices.
Which actors hold the ‘reins of power’?
How authoritative is a WWW page considered by peers?
The deletions of which genes is more likely to be lethal?
How critical to traffic flow is a given Internet router?
Researchers have sought to capture the notion of vertex importancethrough so-called centrality measures.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Centrality: An Illustration
Clockwise from top left:(i) toy graph, with (ii)closeness, (iii) between-ness, and (iv) eigenvectorcentralities.
(Example and figures courtesy of Ulrik
Brandes.)
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Higher-order Centrality
Indianapolis
Houston
Los Angeles
Sunnyvale
Seattle
DenverChicago New York
Wash. DC
Atlanta
Kansas City
Indianapolis
Houston
Los Angeles
Sunnyvale
Seattle
DenverChicago New York
Wash. DC
Atlanta
Kansas City
Conceptuallya, centrality general-izes naturally to higher orders.
For betweenness, two logical ex-tensions are
1 group betweenness (i.e.,‘union’/OR), and
2 co-betwenness (i.e.,‘intersection’/AND).
(See Kolaczyk, Chua, and Barthelemy (2009).)
aComputationally, such generalizations may notbe so straightforward!
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Network Cohesion: Motivation
Many questions involve scales coarser than just individual vertices/edges.More properly considered questions regarding ‘cohesion’ of network.
Do friends of actors tend to be friends themselves?
Which proteins are most similar to each other?
Does the WWW tend to separate according to page content?
What proportion of the Internet is constituted by the ‘backbone’?
These questions go beyond individual vertices/edges.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Network Cohesion: Various Notions!
Various notions of ‘cohesion’.
density
clustering
connectivity
flow
partitioning
. . . and more . . .
We’ll look quickly at just examples relating to connectivity.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Components
Not uncommon in practice that a graph be unconnected!
A (connected) component of a graph G is a maximally connectedsub-graph.
Common to decompose graph into components. Often find this results in
giant component
smaller components
isolates
Frequently, reported analyses are for the giant component.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Components in Directed Graphs Become Interesting!
Tendrils
Strongly
Connected
Component
In−Component Out−Component
Tubes
(Due to Broder et al. ’00.)
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Example: AIDS Blog Network
Left: Original network. Right: Network with vertices annotated bycomponent membership i.e.,
strongly connected component (yellow)
in-component (blue)
out-component (red)
tendrils (pink)Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Vertex/Edge-Connectivity
Question: “If an arbitrary subset of k vertices (edges) is removed from agraph G , is the remaining subgraph connected?”
The notions of k-vertex(edge)-connected seek to make this question andits answer precise.
E.g., A graph G is said to be k-vertex-connected if
Nv > k , and
the removal of an X ⊂ V , such that |X | < k leaves a subgraphG − X that is still connected.
Note: Connectivity closely related to results on ‘flow’.
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Illustration: Detecting Malicious Internet Sources
Ding et al.a use the idea of cut-vertices to detect Internet IPaddresses associated withmalicious behavior.
Corresponds to a type of (anti)socialbehavior.
aDing, Q., Katenka, N., Barford, P., Kolaczyk, E.D.,
and Crovella, M. (2012). Intrusion as (Anti)social Communication:Characterization and Detection. Proceedings ofthe 2012 ACM SIGKDD Conference on KnowledgeDiscovery and Data Mining.
Source/Destination Network
Projected Source Network
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Characterization of Dynamic Networks
Development of a comprehensive body of tools for characterizing dynamicnetworks lags far behind what we have for static networks.
increased demand relatively recent
graph-theoretic infrastructure less developed
potential for severe magnification of computational challenges
Les Houches, avril 2014
Descriptive Statistics for Networks Network Characterization
Characterization of Dynamic Networks (cont).
0102030
0102030
0102030
0102030
0102030
0102030
0102030
0102030
0−12hrs
12−24hrs
24−36hrs
36−48hrs
48−60hrs
60−72hrs
72−84hrs
84−96hrs
0 1 2 3 4 5 6 7 8 9 10111213141516171819202122232425262728293031323335Degree
Cou
nt
Status
ADM
MED
NUR
PAT
Arguably still the most com-mon approach is to applymethods for static networksto consecutive time slices.
Left: Degree distributionsfor hospital contact data,every 12hrs.
Les Houches, avril 2014
Wrapping Up
Outline
1 Welcome2 Background & Motivation
Network Analysis & StatisticsPrelude
3 Review of GraphsDefinitions and ConceptsGraphs & Matrix AlgebraGraphs Data Structures & Algorithms
4 Descriptive Statistics for NetworksNetwork MappingNetwork Characterization
Vertex DegreeCentralityComponentsConnectivity / CutsDynamic Networks
5 Wrapping UpLes Houches, avril 2014
Wrapping Up
Final Thoughts
This first lecture only scratches the surface of mapping andcharacterization of networks.
More details will surely emerge in lectures of various other speakers thesenext two weeks.
Remaining lectures:
L1 Introduction, Background, and Descriptive Statistics (1.5hrs)
L2 Network Sampling (1hr)
L3 Network Modeling (1.5hrs)
L4 Additional Topics in Modeling/Analysis (1.5hr)
Les Houches, avril 2014