Introduction to
Social Network and Link Analysis
David Loshin
Knowledge Integrity, Inc.
TDWI 2007 Spring Conference
Boston, MA
© 2006 Knowledge Integrity Incorporated www.knowledge-integrity.com
(301) 754-6350
Half-Day Agenda
• Introduction to Networks, social and otherwise
• Network Connectivity Basics
• Link and Network Analysis
• Issues and Considerations for BI
In this talk, we will discuss the notion of connectivity, and why models for analyzing connections can add value to a business intelligence initiative. By reviewing the ways that objects interact through networks, we will explore whether the results of this analysis can enhance profiles, predictive analytics, and general business intelligence.
Objectives:
• Understand network connectivity basics
• Explore ways to represent networks
• Understand the types of analysis that can be performed
• Envision network analytics, data extraction, and preparation
Introduction to Networks, Social and Otherwise
Networks, Links, and Coincidences?
• How many people do you know?
  - Family, friends, co-workers, conference attendees
  - Dozens? Hundreds? Thousands?
• How well do you know them?
  - Very close, know them well, acquaintances, “just met”
• Concepts of Connectivity?
  - You know 1,000 people
  - They each know 1,000 people
  - Therefore, you are potentially connected to 1,000,000 people through just 1 intermediate link
  - Through 2 intermediate links, the network could potentially extend to 1,000,000,000 people!
• “Small World Theory”: we are all connected through a very small number of links (see Milgram, Bacon)
The notion of connectivity is intriguing, especially when considering individuals, other types of parties, and the knowledge that can be derived through the analysis of connections.
Consider the example in this slide. Let’s assume that we each know about 1,000 people. Is it really true that each individual is therefore linked to 1,000,000 people? Conceptually, that would be true only if our circles of acquaintances did not overlap, that is, if the 1,000 people I know were completely different from the 1,000 people that you know. But in reality, we all seem to run around in similar circles, so there is a great likelihood that many of the people that I know are the same people that you know.
The consequence of this is the effective “self-organization” of communities (as well as sub-communities, and sub-sub-communities). By examining the relationships that exist among groups of people, we can learn who are the influencers, who are the influenced, who spans critical communication boundaries, and how information (or commerce, or viruses, etc.) flows through the selected community.
Euler’s Insight
Bridges of Konigsberg
One pastime of the residents of Konigsberg was to walk around town over the bridges between the different land areas of town. One game was to see if one could start at one location, walk over every bridge just once, and end up at the starting point. Mathematician Leonhard Euler abstracted the problem into a “graph”: a collection of nodes and links between them. By examining the graph, he was able to determine that, based on the degree of each node (the number of links attached to it), the challenge of the bridges of Konigsberg was actually impossible. This insight created the branch of mathematics referred to as graph theory, which is the fundamental basis of network (and consequently, social network) analysis.
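Euler's degree argument can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the land-mass labels (N, S, K, E) and the function names are assumptions, and the check presumes the graph is connected.

```python
from collections import Counter

# The seven bridges as pairs of land masses. The labels are assumptions:
# N = north bank, S = south bank, K = central island, E = eastern land mass.
BRIDGES = [
    ("N", "K"), ("N", "K"),
    ("S", "K"), ("S", "K"),
    ("N", "E"), ("S", "E"), ("K", "E"),
]

def node_degrees(edges):
    """Count how many link ends touch each node."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)

def has_closed_euler_walk(edges):
    """For a connected graph, a closed walk that uses every link exactly
    once exists only when every node has even degree."""
    return all(d % 2 == 0 for d in node_degrees(edges).values())

print(node_degrees(BRIDGES))
print(has_closed_euler_walk(BRIDGES))
```

Every land mass has an odd number of bridges, so the walk the residents were looking for cannot exist.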
Network and Link Analysis
• Linkages exist everywhere
  - Between individuals (“MCI Friends and Family”)
  - Between locations (“Bridges of Konigsberg”)
  - Between other types of objects (“Telephone network”)
  - Between individuals and other kinds of objects (“Purchasing Preferences”)
  - Between businesses (“D&B corporate hierarchies”)
• There are different kinds of links
• Each link has some sort of attribution
• Analyzing networks can provide insight for evaluating behavior patterns for different intelligence activities
There are many applications that rely on the power of the network. Each of these networks represents some attempt to exploit the different kinds of connections that exist among small groups of individuals and larger groups of individuals, as well as how the groups themselves interact. Applications may be designed to seek out some interesting pattern within the network or to exploit the communication and information exchanges provided by the network.
Every node and each of its corresponding links carries certain characteristics. Each node represents an entity, while each link carries attributes that describe the nature of the relationship.
Applications of Network/Link Analysis
• Enforcement: Criminal analysis, money laundering
• Fraud detection: spambot detection, call pattern analysis
• Marketing: Customer Behavior analysis, Segmentation, collaborative filtering
• Community analyses: Account proxy (account used by more than one individual, many accounts used by one individual), research collaboration, communities of interest
More Applications…
• Health care: Contagion, disease control
• Physical: Supply chain analysis
• Transfer/Communications: Spheres of Influence, Information flows, business partnerships
• Formal Relationships: Working relationships, Influential individuals, ownership, accountability, corporate structure
• Informal Relationships: Friendship networks, extended families, social interactions, insider networks
Evidence is All Around…
• Databases
  - Transaction systems, logs, data warehouses
• Semi-structured data
  - Email, web pages, public records, filings
• Unstructured data
  - News items, prospectuses, filings
Typical business intelligence applications focus on the ability to organize information for reporting and analysis across one or more dimensions, but are not usually configured to enable network analysis. Yet data warehouses contain significant amounts of connectivity information that is suitable for network and link analysis. Other sources of information provide connectivity data: transaction systems, database logs, software activity logs, as well as less structured systems such as emails, web logs, electronic public data filings, and other public records (e.g., real estate transactions, Uniform Commercial Code filings, etc.). In addition, text analysis applications can extract individual data items out of unstructured data to establish connections.
Example: Death Notice
• Richard J. Palaima, Of Mattapan, Suddenly, Nov. 18, 2002. Beloved husband of Jonadee (Badayos) Palaima. Devoted son of Madelyn L. (George) Palaima of Braintree, and the late Richard A. Palaima. Devoted brother of John A. Palaima of Braintree, nephew of Catherine Cunningham of Rockland, Cousin & Godson of Robert Cunningham of Rockland. Funeral from the Mortimer N. Peck-Russell Peck Funeral Home, 516 Washington St., BRAINTREE, on Saturday at 9 a.m. Funeral Mass in St. Gregory's Church, Dorchester, at 10 a.m. Relatives and friends invited. Visiting hours Friday 2-4 & 7-9 p.m. Memorial donations may be sent to St. Gregory's Church, 2215 Dorchester Ave., Dorchester 02124.
This example was taken from the Boston Globe, and was available online from 11/20/2002 - 11/21/2002. Death, birth, engagement, and wedding notices are good examples of publicly available information (published in the newspaper) configured in semi-structured form that provides a lot of data about connections. In this example, we have a description of one individual and his immediate family, his location, and his religious affiliation.
Example: Extracted Entities and Their Links
[Figure: graph of the entities extracted from the death notice and the links among them]
Entities: Richard J. Palaima; Jonadee (Badayos) Palaima; Madelyn L. (George) Palaima; Richard A. Palaima; John A. Palaima; Catherine Cunningham; Robert Cunningham; St. Gregory's Church; Mattapan, MA; Braintree, MA; Rockland, MA; Dorchester, MA; Deceased
Link labels: married to; mother of; father of; brother of; sister of; is cousin of; is godson/godfather of; lives in; has living state of; is religiously affiliated with; located in
Good SNA Resource
• Many examples taken from Robert A. Hanneman and Mark Riddle’s online text
  - Introduction to social network methods
  - http://faculty.ucr.edu/~hanneman/nettext/
Integrating Network/Link Analysis with BI
• Network information may be embedded within the data warehouse
• However:
  - Representations may not be appropriate for analysis
  - Data may need to be transformed and managed using non-relational data structures
  - Analysis lends itself to visual representation
  - Must understand concepts associated with networks, connectivity, and qualification of linkage
• Objective: Gain a conceptual understanding of
  - Network data and its representation
  - Characteristics of network relationships
  - Types of analysis performed
Network Connectivity Basics
Representing Network Data
• What is network data?
• Two types of objects:
  - Actors (entities that participate in the network)
  - Links (established relationships between the actors)
• Analysis focuses on:
  - Who the actors are
  - What their relationship is to a holistic view of the community
  - How the actors organize within the framework
• Different approaches to representation:
  - Rectangular Data
  - Adjacency Matrices
  - Graph representation
The next set of slides introduces basic concepts of social networks:
- Actors
- Links
- Representations
Rectangular Data
• Relational structure provides a “rectangular” view of the “who knows who” relationship

ID  Name     Sex  Age  Degree
1   John     M    23   4
2   Betsy    F    25   2
3   George   M    31   2
4   Sam      M    34   3
5   Abigail  F    32   2
6   Martha   F    27   1

From  To
1     2
1     3
1     6
1     5
2     1
2     3
3     1
3     2
3     4
4     3
5     1
5     6
6     1
6     5
Standard “rectangular” data, as it appears in most databases, can be used to
manage network links, but may be problematic for analysis.
Links are represented via an associated table, and while this provides information about the individual, the rectangular format makes it difficult to assess “holistic”
information, either about the modeled community as a whole, or about segments or patterns that emerge.
Adjacency Matrix
• This matrix represents the undirected, “who knows who” network among our actors using an adjacency matrix

Rows are the chooser; columns are the choice.

Chooser   John  Betsy  George  Martha  Sam  Abigail
John       -     1      1       0       1    1
Betsy      1     -      1       0       0    0
George     1     1      -       1       0    0
Martha     0     0      1       -       0    0
Sam        1     0      0       0       -    1
Abigail    1     0      0       0       1    -
In an adjacency matrix:
A ‘0’ means that there is no link between the chooser and choice
A ‘1’ means that there is a link between the chooser and choice
An adjacency matrix lends itself to certain types of analysis that cannot be done through standard rectangular representations.
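As a rough sketch of moving from the rectangular form to the matrix form, the snippet below turns a list of “who knows who” pairs into an adjacency matrix. The pairs are one reading of this slide's matrix and the function names are illustrative assumptions; the diagonal is stored as 0 rather than “-”.

```python
ACTORS = ["John", "Betsy", "George", "Martha", "Sam", "Abigail"]

# Undirected "who knows who" pairs (an assumed reading of the slide's matrix).
KNOWS = [
    ("John", "Betsy"), ("John", "George"), ("John", "Sam"), ("John", "Abigail"),
    ("Betsy", "George"), ("George", "Martha"), ("Sam", "Abigail"),
]

def to_adjacency(actors, pairs):
    """Build a square 0/1 matrix from a list of undirected pairs."""
    index = {name: i for i, name in enumerate(actors)}
    matrix = [[0] * len(actors) for _ in actors]
    for a, b in pairs:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1  # undirected: mirror the link
    return matrix

def is_symmetric(matrix):
    """An undirected network yields a matrix symmetric about the diagonal."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))

matrix = to_adjacency(ACTORS, KNOWS)
for name, row in zip(ACTORS, matrix):
    print(f"{name:8s}", row)
print(is_symmetric(matrix))
```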
Graphs
• Graphs contain vertices (nodes) and edges (links)
[Figure: graph of the six actors John, Sam, George, Betsy, Martha, and Abigail]
A graph is an abstract representation of the same set of information contained within the adjacency matrix.
In a graph:
A vertex represents the actor
An edge represents a link between actors
Graph representations can feed visualization front ends, which support different analytical processes and methods.
Connectivity Concepts
• Relationships and links
  - Binary, directed, signed
• Measures of relationships
  - Grouped ordinal, ranked ordinal, categorical, interval measures
The link between two individuals can be simple, such as the “who knows who” relationship, or can be much more complex. The different classifications of connections may be based on how much information they carry. The next slides describe different levels of information and provide some examples.
Binary Connectivity
• Represents a relationship determined by the answer to a true/false question
• Examples:
  - Person A and person B know each other
  - Organization X and organization Y contribute to the same charity
  - Person A and person B have purchased product P
A binary connection essentially represents the positive response to a true/false question, while the absence of the link reflects a negative response. Note that in the examples provided here, the connection is symmetric, or undirected. In other words, if A knows B, then B knows A.
Directed Connectivity
• The link is established based on how a question relates two entities in a directed way
• Examples:
  - Person A has emailed person B
  - Person A has visited web site W
  - Organization X has purchased services from organization Y
A directed link differs from an undirected link in that there is no assumption of reciprocity. For example, A may have emailed B, but that does not mean that B emailed A. If the relationship exists in both directions, then there will be two directed arcs: one from A to B and one from B to A.
Signed Connectivity
• The link describes the nature of the relationship (e.g. positive, neutral, or negative)
• Examples:
  - Person A hates product P
  - Organization X has been categorized as an approved vendor
  - Person A has provided a neutral score for Airline U’s service
This is the first characteristic of the connection that carries a value, although the
values indicate gross level data about the relationship. In this case, the sign indicates the nature of the relationship. For example, a +1 is a positive connection,
0 is a neutral connection, and -1 is a negative connection.
This notion suggests that descriptive metadata about links can embed more
complex knowledge about the network and how information flows through the network.
Grouped and Ranked Ordinal
• The links have magnitude, such as “dislikes” = -1, “strongly dislikes” = -2, “vehemently dislikes” = -3
• This provides a more meaningful description of the connective characteristics
• Examples:
  - Individuals and vacation destinations (desire to visit)
  - Job references (references)
• In ranked ordinal, links are ranked in order of magnitude
  - Actor X ranks the other actors in terms of who is liked most to the least
In a grouped ordinal model, the links carry both sign (to indicate the positive/negative nature) and magnitude of the relationship. The greater the absolute magnitude, the greater the connective characteristic. Grouped ordinal characterizes different quantitative measures:
- the “strength” of the connections
- the frequency of the interaction
- the intensity of the relationship
In the ranked ordinal model, instead of gauging the links based on a quantitative measure, the links are ordered based on their associative rank. While this is not common in network analysis, the information is often easy to assemble. For example, one can calculate the order of “email connectivity” by counting the number of emails that are exchanged and putting together the ranking based on the counts.
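The email-counting idea might look like the sketch below. The log format (sender, recipient pairs), the sample data, and the function name are all hypothetical; ties are broken alphabetically purely for repeatability.

```python
from collections import Counter

def rank_contacts(email_log, actor):
    """Rank the other actors by how many emails they exchanged with `actor`.

    `email_log` is assumed to be a list of (sender, recipient) pairs.
    """
    counts = Counter()
    for sender, recipient in email_log:
        if sender == actor and recipient != actor:
            counts[recipient] += 1
        elif recipient == actor and sender != actor:
            counts[sender] += 1
    # Highest count first; alphabetical order breaks ties.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ordered]

# Hypothetical log of seven emails among four people.
LOG = [("A", "B"), ("B", "A"), ("A", "C"), ("D", "A"),
       ("A", "B"), ("C", "A"), ("B", "C")]

print(rank_contacts(LOG, "A"))
```

The result is a ranked ordinal view of A's links: only the order is kept, not the counts themselves.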
Categorical
• Relationships are defined by category
  - A business relationship is type “1,” personal relationship is type “2,” etc.
In a categorical linkage model, the attribution of the link reflects a segmentation of the relationships based on their qualitative type.
Interval Measures
• Rankings reflect scaling
  - Difference between 2 and 3 is same as difference between 20 and 21
Interval measures not only provide a ranked order, they also capture the relative
difference in “intensity” based on the interval order. This approach is the most sophisticated measurement framework. Interval measures can always be reduced
in complexity to one of the other measurement types.
Matrix Analysis
        Col 1   Col 2   Col 3   Col 4
Row 1   (1,1)   (1,2)   (1,3)   (1,4)
Row 2   (2,1)   (2,2)   (2,3)   (2,4)
Row 3   (3,1)   (3,2)   (3,3)   (3,4)
A matrix is a rectangular arrangement of link data. Each row and column is
assigned an identifier, and each cell within the matrix is indexed by its (row, column) address. For example, the shaded cell in the second row and the third column is
addressed “(2,3).” Matrices are one way to capture a representation of a network. Each cell contains the information associated with the link between the entity
represented by the row and the entity represented by the column.
Example: Adjacency Matrices
Directed graph
[Figure: directed graph over the four nodes A, B, C, and D]
Adjacency matrix
   A  B  C  D
A  -  1  1  0
B  0  -  1  0
C  1  1  -  1
D  0  0  1  -
An adjacency matrix presents the network connectivity using labeled rows and columns, with values within each cell representing the nature of the link. In this example, there is a directed graph representing the relationships among a set of four people. The adjacency matrix shows the same set of relationships: if there is a directed arc from one person to another, then the corresponding labeled cell has a “1,” and has a “0” otherwise. In this case, there is no concept of the relationship existing from an entity to itself, so the diagonal, which represents the self-directed arc, is left with a “-.”
For an undirected graph, the relationships are essentially reciprocal, and the matrix would be symmetric about the diagonal.
More About Matrices
• Multiple relationships can be captured in matrices of higher dimensionality

Layer 1          Layer 2          Layer 3          Layer 4
   A  B  C  D       A  B  C  D       A  B  C  D       A  B  C  D
A  -  1  1  0    A  -  1  1  0    A  -  0  1  0    A  -  1  0  1
B  0  -  1  0    B  0  -  1  1    B  1  -  0  0    B  0  -  0  0
C  1  1  -  1    C  0  1  -  1    C  1  0  -  1    C  1  1  -  1
D  0  0  1  -    D  0  1  1  -    D  1  0  1  -    D  1  1  0  -

A matrix captures one set of relationships, but the analysis may capture multiple relationships that exist between the same set of actors. In this case, we can layer a set of two-dimensional matrices into a third dimension to capture the complete set. In this slide, each matrix is at its own layer, and we can examine the associations between all actors for one relationship by looking at one layer, or we can look at all the links between any pair of actors by looking across the third dimension. In this slide, the highlighted cells for position (C, B) show the links between C and B.
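One way to picture the layering is a mapping from relationship name to matrix, read either one layer at a time or across layers for a single pair. The relationship names and cell values below are purely hypothetical, as is the helper function.

```python
ACTORS = ["A", "B", "C", "D"]

# Hypothetical layers: relation names and values are illustrative only.
LAYERS = {
    "emails": [
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    "reports_to": [
        [0, 0, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 0],
    ],
}

def links_between(layers, actors, source, target):
    """Look across the third dimension: one cell from every layer."""
    i, j = actors.index(source), actors.index(target)
    return {relation: matrix[i][j] for relation, matrix in layers.items()}

# One layer shows every pair for a single relationship...
print(LAYERS["emails"])
# ...while one cell position shows every relationship for a single pair.
print(links_between(LAYERS, ACTORS, "C", "B"))
```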
Matrix Operations
• Matrix transpose
• Matrix addition and subtraction
• Matrix multiplication

Transpose (matrix on the left, its transpose on the right):
   A  B  C  D         A  B  C  D
A  -  1  1  0      A  -  0  1  0
B  0  -  1  0      B  1  -  1  0
C  1  1  -  1      C  1  1  -  1
D  0  0  1  -      D  0  0  1  -

Addition:
   A  B  C  D         A  B  C  D         A  B  C  D
A  -  1  1  0      A  -  0  1  0      A  -  1  2  0
B  0  -  1  0   +  B  1  -  1  0   =  B  1  -  2  0
C  1  1  -  1      C  1  1  -  1      C  2  2  -  2
D  0  0  1  -      D  0  0  1  -      D  0  0  2  -

[Figure: worked example of a matrix product A * B = C]

Matrix A is the transpose of matrix B if, for each cell (i,j), the value of B(j,i) is the same as A(i,j).
Matrix addition and subtraction are simplest: If we are adding matrices A and B into matrix C, the value of C(i,j) is equal to A(i,j) + B(i,j). Subtraction is the same, except we subtract B(i,j) from A(i,j).
Matrix multiplication is more complex: the value of C(i,j) is equal to the sum of A(i,k) multiplied by B(k,j), for k=1 to the number of elements in each row of matrix A (which must equal the number of elements in each column of matrix B). In the example here, C(2,2) = (A(2,1) * B(1,2)) + (A(2,2) * B(2,2)) + (A(2,3) * B(3,2)), which is equal to (1*1) + (2*3) + (1*1) = 8.
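The three operations can be sketched with plain nested lists. The small 2x2 matrices here are illustrative and are not the ones on the slide; the function names are assumptions.

```python
def transpose(a):
    """Cell (i, j) of the result holds a[j][i]."""
    return [list(column) for column in zip(*a)]

def add(a, b):
    """Cell-by-cell sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def multiply(a, b):
    """C(i, j) is the sum over k of A(i, k) * B(k, j)."""
    if len(a[0]) != len(b):
        raise ValueError("columns of A must match rows of B")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

print(transpose(A))
print(add(A, B))
print(multiply(A, B))
```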
Adjacency Matrices and Multiplication
• Multiplying an adjacency matrix by itself once results in a matrix that counts the number of paths of length 2 between nodes

   A  B  C  D         A  B  C  D         A  B  C  D
A  0  1  1  0      A  0  1  1  0      A  1  1  1  1
B  0  0  1  0   *  B  0  0  1  0   =  B  1  1  0  1
C  1  1  0  1      C  1  1  0  1      C  0  1  3  0
D  0  0  1  0      D  0  0  1  0      D  1  1  0  1

[Figure: the corresponding directed graph over nodes A, B, C, and D]
Raising an adjacency matrix to the power X results in a matrix counting the number of paths of length X between nodes (paths here may revisit nodes, which is why a node can have paths back to itself). In this example, we see that from node B to A there is one path of length 2, and from C to itself there are 3 paths of length 2. In turn, computing the Boolean square of the adjacency matrix tells us whether a path of length 2 exists between any two nodes.
Why is this relevant? Because the nature of network analysis is to explore connectivity, it is valuable to review not only the direct links between actors, but also how connected the actors are, including the strength/weakness of connections, the distances, “influence,” and the robustness of the connections, among other properties.
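A small sketch of the squaring step follows. The matrix is one reading of this slide's four-node example (rows and columns in the order A, B, C, D), and the helper names are assumptions.

```python
# Directed links, as read from the slide's example (an assumed reading).
M = [
    [0, 1, 1, 0],  # A -> B, A -> C
    [0, 0, 1, 0],  # B -> C
    [1, 1, 0, 1],  # C -> A, C -> B, C -> D
    [0, 0, 1, 0],  # D -> C
]

def multiply(a, b):
    """Standard matrix product for square matrices of the same size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def boolean(matrix):
    """Collapse counts to 1/0: does at least one such path exist?"""
    return [[1 if value else 0 for value in row] for row in matrix]

M2 = multiply(M, M)
for label, row in zip("ABCD", M2):
    print(label, row)
print(boolean(M2))
```

Row B, column A of the square is 1 (the single path B to C to A), and row C, column C is 3 (C can go out to A, B, or D and come straight back).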
Example: Knoke Information Exchange Network
• Map of exchange of information between 10 organizations involved in the local political economy of social welfare services in a Midwestern city
Example: Knoke Information Network
• Graph and Adjacency matrix for the Knoke Information Network
Visualization Techniques
• Characterizing attribution of actors by shape, size of node, color
• Characterizing relationship/link by line size, type, thickness, decorations
• Example:
  - Blue for non-government, red for government
  - Square for generalists, circle for specialists
Graph Analysis
• Simple properties tell a lot about interaction and connectivity
• Social structures reflect aspects of both
  - Global properties: the way the entire population interacts
  - Local properties: the ways that individuals within small groups interact
• Analyze both the physical structure and the patterns of structure
Global properties are those that describe aspects of the entire community, while
local properties describe how small groups and individuals interact together within the communities. These properties are analyzed based on the kinds of structures
that exist inside the graph (subgraphs, cliques, components) as well as the patterns of the structures. An example of this latter point might be the recognition of a
common linkage pattern that carries some sociological meaning, such as the relationships between a pair of parents and their children.
Locality – Dyads and Triads
The most common subsets to review are
Dyads: groups of two actors
Triads: groups of three actors
With directed data, there are 4 possible relationships between 2 actors
With directed data, there are 64 relationships possible among 3 actors
Relationships exhibit hierarchy, equality, exclusion, “social standing,” patterns of behavior
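The counts of 4 and 64 follow from treating each ordered pair of actors as a link that is either present or absent. A brief sketch, with an assumed function name, that enumerates the configurations:

```python
from itertools import product

def directed_configurations(actor_count):
    """Enumerate every present/absent assignment over ordered actor pairs."""
    ordered_pairs = [(i, j) for i in range(actor_count)
                     for j in range(actor_count) if i != j]
    return [dict(zip(ordered_pairs, states))
            for states in product((0, 1), repeat=len(ordered_pairs))]

print(len(directed_configurations(2)))  # dyad: 2 ordered pairs
print(len(directed_configurations(3)))  # triad: 6 ordered pairs
```

For a dyad the four cases are: no link, A to B only, B to A only, and links in both directions.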
Simple Graph Properties
• Counts
  - Number of actors
  - Number of possible connections
  - Number of connections present
• Characteristics of network
  - Size of population
  - How small groups differ from large groups: “cohesion, solidarity, moral density”
• Characteristics of individuals
  - Number of connections
  - Density of the network
  - Source vs. sink
The first place to begin in network analysis is at the global level, looking at properties that describe the entire network. For example, the counts associated with the graph provide some insight into its “density”: the number of connections actually present vs. the number of possible connections. Next come characteristics of the network, as reflected by the size of the population and a gross-level review of groupings within the network. At a more granular level, examining the direct relationships among the individual entities and their relative connectivity gives some insight into the population itself.
Size and Density
• Network size
  - The count of the number of nodes
• Potential links
  - There are (k*(k-1)) unique ordered pairs of actors
  - The number of possible relationships grows quadratically with the number of actors
• Density is the ratio of links to the number of possible links
  - The proportion of all possible links that are present
  - Equal to the sum of links/number of possible links
  - Provides insight into movement across the network, qualifications of specific actors within the network
Many network analyses focus on the flow of “information” across the network, and that becomes a recurring theme. In the next few slides, let’s look at network properties and consider their “information exchange” features. For example, the ability to propagate information across the network is related both to its size and its density. The larger the network is, the more connections are needed to effectively propagate information, which is why we explore its size and its density.
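Size and density reduce to simple counts over an adjacency matrix. The sketch below assumes a 0/1 matrix with zeros on the diagonal; the sample matrix and function names are illustrative.

```python
def possible_links(actor_count, directed=True):
    """k*(k-1) ordered pairs; half that many when links are undirected."""
    ordered = actor_count * (actor_count - 1)
    return ordered if directed else ordered // 2

def density(matrix, directed=True):
    """Links present divided by links possible (diagonal ignored)."""
    n = len(matrix)
    present = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    if not directed:
        present //= 2  # a symmetric matrix records each link twice
    return present / possible_links(n, directed)

# Illustrative directed network over four actors.
M = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

print(possible_links(4))  # 12 ordered pairs
print(density(M))         # 7 of 12 possible links are present
```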
Degree
• The number of edges/links attached to the actor
• The upper limit on number of connections each actor may have is (k-1)
• In-degree is the number of directed links into the node
• The out-degree is the number of directed links out of the node
  - High in-degree indicates a “sink”
  - High out-degree indicates a “source”
The degree of a node describes its level of connectedness in the network. Here we
distinguish between undirected and directed graphs. In undirected graphs, we just look at the degree, but in directed graphs, which have arcs that travel from one
node to another, we discuss in-degree, which is the number of directed arcs into a node, and out-degree, which is the number of directed arcs that leave a node.
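With a directed adjacency matrix, out-degree is a row sum and in-degree is a column sum. A minimal sketch, assuming a 0/1 matrix with zeros on the diagonal; the sample matrix is illustrative.

```python
def out_degrees(matrix):
    """Row sums: arcs leaving each node."""
    return [sum(row) for row in matrix]

def in_degrees(matrix):
    """Column sums: arcs arriving at each node."""
    return [sum(column) for column in zip(*matrix)]

# Illustrative directed network; rows and columns are A, B, C, D.
M = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

for label, out_d, in_d in zip("ABCD", out_degrees(M), in_degrees(M)):
    print(label, "out:", out_d, "in:", in_d)
```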
Reachability and Point Connectivity
• Reachability:
  - An actor is reachable by another if there are connections that can trace between them
  - Provides insight into communication capability, robustness
• Connectivity:
  - The number of nodes that have to be removed so that one actor could no longer reach another
  - Provides insight into “redundancy” and robustness
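Reachability can be checked with a breadth-first traversal, sketched below on a hypothetical directed network. This covers reachability only; point connectivity (how many nodes must be removed) needs more machinery and is not attempted here.

```python
from collections import deque

def reachable_from(links, start):
    """Return every actor that can be reached by following directed links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen

# Hypothetical directed links: A -> B -> C -> A, D -> A, and E isolated.
LINKS = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"], "E": []}

print(sorted(reachable_from(LINKS, "A")))
print(sorted(reachable_from(LINKS, "D")))
print(sorted(reachable_from(LINKS, "E")))
```

Note the asymmetry in directed data: D can reach A, but A cannot reach D.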
Distance
• How far is it between nodes?
  - Conceptually, the number of nodes that must be traversed to establish the connection between two actors
• Walk: A sequence that shows a traversal in the graph between two actors (ABC, ABDC, ABEBC)
• Cycle: A closed walk of distinct actors except for the originating node, which is also the destination (BCDB)
• Path: A walk in which each other actor and each other relation in the graph may be used at most one time
There may be different walks of different lengths that connect two actors
Distance can be assessed based on the characteristics of different kinds of walks:
The total number of walks of a particular length between any two actors
Distances can be scaled based on the size of the network
To assess lengths in instances where links have value, sum the values along
shortest path, or the minimum of the sums across all paths of size n or smaller
Geodesic Distance
• Geodesic distance is the number of relations in the shortest walk from one actor to another
• Flow: similar to bandwidth capacity, that is, how many different paths there are that connect two actors
• Diameter: Largest geodesic distance in the graph
Geodesic distance is used to characterize the most efficient or optimal connection
between two actors. In any network, two actors may be connected via numerous paths, but depending on the nature of the connectivity, it might be assumed that
communication between any pair of actors would be performed along the shortest path.
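For unweighted links, geodesic distances fall out of a breadth-first search, and the diameter is the largest of them. The sketch below uses a hypothetical four-actor line network; unreachable pairs are simply left out, which is an assumption of this sketch.

```python
from collections import deque

def geodesic_distances(links, start):
    """Shortest number of links from `start` to every reachable actor."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

def diameter(links):
    """Largest geodesic distance over all reachable pairs."""
    return max(max(geodesic_distances(links, node).values()) for node in links)

# Hypothetical undirected line A - B - C - D, stored in both directions.
LINE = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

print(geodesic_distances(LINE, "A"))
print(diameter(LINE))
```

The largest value in each actor's distance table is the per-actor maximum that the diameter is taken over.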
Dense networks have mostly short geodesic distances. The largest geodesic distance for each actor is called its “eccentricity.” In graphs that are not completely
connected,
A Simple Network
[Figure: network of eleven actors: David, Bob, Helen, Joe, Rene, Kate, Jack, Len, Ted, Jill, and Frank]
Univariate Statistics and Inferencing
• An examination of the statistics across rows (i.e., “sources”) or columns (“sinks”)
  - Mean: percentage of the remaining nodes to which this node sends a link
  - Sum: number of links to the remaining nodes
  - Variance: how variable or “predictable” the actor is with respect to others
The statistics here review the differences between the “roles” each node takes on
as sources of information. In this example, there are some significant sources – 2, 3, 5, and 8 are similar in terms of their role as information providers. Reviewing
these statistics allows us to make some inferences about the population. For example:
Actors that have high out-degree may be “communicators” or “influencers”
Actors with low out-degree are less likely to be influencers unless they are connected to the “right” other actors
Actors with high in-degree may be “powerful” in terms of information gathering
Actors with high in- and out-degree may be “facilitators” – they receive information
and pass it along
Actors with high out-degree and low in-degree may be “wannabes” or “outsiders”
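The row statistics can be sketched as below for a 0/1 directed matrix. The sample matrix is illustrative, and the use of population variance (dividing by the number of other actors) is an assumption of this sketch. Applying the same function to the transposed matrix would give the column (“sink”) view.

```python
def source_statistics(matrix):
    """Sum, mean, and variance of each row, ignoring the diagonal cell."""
    stats = []
    for i, row in enumerate(matrix):
        others = [value for j, value in enumerate(row) if j != i]
        total = sum(others)
        mean = total / len(others)
        # Population variance over the other actors (an assumption).
        variance = sum((value - mean) ** 2 for value in others) / len(others)
        stats.append({"sum": total, "mean": mean, "variance": variance})
    return stats

# Illustrative directed network; rows and columns are A, B, C, D.
M = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

for label, row_stats in zip("ABCD", source_statistics(M)):
    print(label, row_stats)
```

An actor that sends to everyone (or to no one) has zero variance; its behavior toward the others is entirely predictable.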
Centrality
• The concept of centrality is bound to the concept of “power” or “influence” within the network
• The way that actors are embedded within networks imposes constraints on, or offers opportunities for, the actor and the network
• Provides qualification of favorable position, control, essentialness within the network
• Three types of measures:
  - Degree
  - Closeness
  - Betweenness
Examples
[Figure: three example graphs: Star, Circle, and Line]
These three graphs demonstrate different kinds of centrality characteristics. In the star graph, node A has a high measure of centrality, but in the circle graph no node has any greater centrality than any other. In the line graph, there are differing approaches to looking at centrality. In one, the end nodes (A and G) have less centrality than the others, but a different approach may incorporate position in the line into the centrality measures also (e.g., D is more central than C or E, etc.).
Degree Centrality
• Degree centrality is based on the number of links in and out of each node
  - Actors with high in-degree are “prominent”
  - Actors with high out-degree are “influential”
• Normalized degrees are based on percentage of remaining actors
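Normalized degree can be sketched on star and circle networks. The seven node labels A through G and the undirected link lists are assumptions made for illustration.

```python
def normalized_degree(links):
    """Each actor's degree as a fraction of the remaining actors."""
    others = len(links) - 1
    return {node: len(neighbors) / others for node, neighbors in links.items()}

NODES = "ABCDEFG"

# Star: A is linked to everyone else (hypothetical labels).
STAR = {"A": list("BCDEFG"), **{leaf: ["A"] for leaf in "BCDEFG"}}

# Circle: each node is linked to its two neighbors around the ring.
CIRCLE = {node: [NODES[i - 1], NODES[(i + 1) % len(NODES)]]
          for i, node in enumerate(NODES)}

print(normalized_degree(STAR))
print(normalized_degree(CIRCLE))
```

In the star, A scores 1.0 and every other node scores 1/6; in the circle every node scores the same 2/6.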
47
Closeness and Betweenness
• Closeness measures how close each node is to the others in the network
• Betweenness characterizes the degree to which any individual node lies between other nodes
Actors in favorable positions have short geodesic (shortest-path) distances to the other actors, which enables them to communicate faster. Closeness examines the shortest distances between a node and each of the other nodes. For example, in the star network, node A is closer to the other nodes than any other node is. In the circle network, the closeness measure is identical for each of the nodes. In the line network, however, the node in the center of the line is “closer” in total to the other nodes than any of the others.
Betweenness is a measure of how often a node in the network lies on the shortest paths between other nodes. For example, in the star network, node A has a high level of betweenness, since it is on the path between every other pair of nodes.
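The two measures can be sketched in plain Python. Closeness here is (n - 1) divided by the total geodesic distance, which is one common normalization and assumes a connected graph; betweenness uses Brandes' algorithm for unweighted, undirected graphs and is left unnormalized. The seven-node star and line graphs are assumed stand-ins for the example figures.

```python
from collections import deque

def undirected(edges):
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def bfs_distances(adj, source):
    """Geodesic (shortest-path) distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def closeness(adj):
    """(n - 1) / total distance to all other nodes; assumes a connected graph."""
    n = len(adj)
    return {v: (n - 1) / sum(bfs_distances(adj, v).values()) for v in adj}

def betweenness(adj):
    """Count of shortest paths through each node (fractional when paths tie)."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                    # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints, so halve the totals.
    return {v: b / 2 for v, b in score.items()}

nodes = "ABCDEFG"
star = undirected([("A", n) for n in nodes[1:]])
line = undirected([(nodes[i], nodes[i + 1]) for i in range(6)])

close_star, close_line = closeness(star), closeness(line)
between_star, between_line = betweenness(star), betweenness(line)
print(close_line["D"], close_line["C"], between_star["A"], between_line["D"])
```

Unlike degree, closeness separates D from C and E in the line graph, which matches the "position in the line" reading of centrality.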
48
Grouping and Graph Basics
• Grouping describes the structure of connectivity: an actor is embedded within some structure, which is then embedded within a larger structure, etc.
• Components
• Clique
• Blocks and Cutpoints
• Lots of other stuff!!!
Two vertices are in the same connected component if there exists a path between them. Connected components can be drawn as subgraphs with empty space between them.
A clique is a subgraph where every node is connected to every other node.
A cutpoint is a node whose removal would leave parts of the graph disconnected; the parts that it separates are called blocks. (A link whose removal has the same effect is usually called a bridge rather than a cutpoint.)
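To make the component and clique definitions concrete, here is a small sketch that finds connected components with a depth-first search and checks whether a set of nodes forms a clique. The six-node graph is a made-up example.

```python
def connected_components(adj):
    """Return a list of node sets, one per connected component."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components

def is_clique(adj, members):
    """True if every member is directly linked to every other member."""
    return all(b in adj[a] for a in members for b in members if a != b)

# Hypothetical undirected graph: a triangle, a separate pair, and an isolate.
graph = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"},
    "D": {"E"}, "E": {"D"},
    "F": set(),
}

components = connected_components(graph)
print(components)
print(is_clique(graph, {"A", "B", "C"}), is_clique(graph, {"A", "B", "D"}))
```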
49
Link and Network Analysis
50
Structural Analysis
• Evaluating substructures that exist and interact within the network
• Dyads and Triads compose many graphs
• Looking at position and structure within graph provides insight into social structure and embeddedness
• Dyads and Triads form into larger graph substructures
• Evaluating overlapping membership in graph substructures exposes “influential” entities within the network
51
Cliques
• A clique is a maximal complete subgraph
• In a clique, each node is connected to every other node
Cliques indicate a close relationship among the members, and often exist within
communities based on entity profile similarities. For example, individuals with the same racial, ethnic, or religious identities may form smaller, tighter group
relationships.
The smallest cliques are dyads, followed by triads. These building blocks may be
combined into larger cliques as well.
By looking at the types of relationships between nodes in the cliques, you can see
behavior patterns emerge as well. For example, in a triad, the link between two of the members may be much stronger than between either of those two and the third
member.
Another interesting aspect is to evaluate overlapping membership. Actors that are
members of multiple cliques may have greater influence, while sets of actors that share clique memberships may be particularly close also.
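The overlapping-membership idea can be sketched by enumerating maximal cliques and counting how often each actor, and each pair of actors, appears. The code below uses the basic Bron-Kerbosch recursion (no pivoting) on a made-up five-node graph; it is an illustration only and would be slow on large networks.

```python
from collections import Counter
from itertools import combinations

def maximal_cliques(adj, r=frozenset(), p=None, x=frozenset()):
    """Yield every maximal clique (basic Bron-Kerbosch, without pivoting)."""
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r
    for v in list(p):
        yield from maximal_cliques(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}
        x = x | {v}

# Hypothetical graph: triads A-B-C and B-C-D, plus the dyad D-E.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}

cliques = list(maximal_cliques(graph))
memberships = Counter(node for clique in cliques for node in clique)
shared = Counter(pair for clique in cliques
                 for pair in combinations(sorted(clique), 2))
print(sorted(sorted(c) for c in cliques))
print(memberships.most_common(), shared[("B", "C")])
```

Here B, C, and D each belong to two cliques, and the pair B and C shares both of its memberships.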
52
Example
• In our Knoke example, the nodes for COMM and MAYOR share 5 clique memberships
53
N-Cliques and N-Clans
• The definition of a clique is very restrictive
• There are subgraph relationships that relax the clique requirements
• An n-clique is a subgraph where every node is connected to every other node by a path of length n or less
• Members of an n-clique may be connected through nodes that are not members of the clique, suggesting a stricter definition: the n-clan
• An n-clan has the additional constraint that the connections must be made through other members of the n-clique
For n-cliques, the most frequent value used is 2. This is conceptually equivalent to “a friend of a friend.”
There are other approaches to modifying the constraints associated with clique-style connectivity, such as:
• K-plex, which admits a node to a group of N nodes if it has a connection to all but K of them
• K-core, which admits a node to the group if it is connected to at least K other members of the group
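Two of these relaxations are easy to sketch. The k-core can be found by repeatedly dropping nodes that have fewer than k neighbors left, and the n-clique distance condition can be checked with a breadth-first search over the whole graph. The check below tests only the distance condition for a given set of nodes, not maximality, and the example graph is made up.

```python
from collections import deque

def k_core(adj, k):
    """Largest node set in which every node has at least k neighbors in the set."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            if len(adj[node] & alive) < k:
                alive.discard(node)
                changed = True
    return alive

def within_n_steps(adj, members, n):
    """True if every pair of members is at most n steps apart.

    Paths may pass through non-members, which is the n-clique condition
    (an n-clan would additionally restrict paths to members only).
    """
    for source in members:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        if any(dist.get(m, n + 1) > n for m in members):
            return False
    return True

# Hypothetical graph: triads A-B-C and B-C-D, plus the dyad D-E.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}

two_core = k_core(graph, 2)
three_core = k_core(graph, 3)
print(sorted(two_core), sorted(three_core))
print(within_n_steps(graph, {"A", "B", "C", "D"}, 2),
      within_n_steps(graph, {"A", "B", "C", "D", "E"}, 2))
```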
54
Components
• Components are subgraphs that are connected within the network, but are disjoint from other subgraphs
55
Blocks and Cutpoints
• If a node or a link were removed, how would that affect the connectivity structure?
• A node that, when removed, subdivides the graph into disconnected parts is called a “cutpoint”
• Those divisions are called “blocks”
By removing a node from the graph and seeing whether the network splits into blocks, we identify nodes that are key to the communication or information exchange patterns.
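The removal test described here can be sketched directly: count the components, remove one node at a time, and see whether the count goes up. This brute-force version is for illustration on small graphs (a linear-time depth-first approach exists for large ones), and the two-triangle graph is a made-up example.

```python
def undirected(edges):
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def count_components(adj, removed=None):
    """Number of connected components, optionally ignoring one removed node."""
    seen, count = {removed}, 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adj[node] - seen)
    return count

def cutpoints(adj):
    """Nodes whose removal increases the number of components."""
    base = count_components(adj)
    return {n for n in adj if count_components(adj, removed=n) > base}

# Hypothetical network: triangles A-B-C and D-E-F joined by the link C-D.
graph = undirected([("A", "B"), ("A", "C"), ("B", "C"),
                    ("C", "D"),
                    ("D", "E"), ("D", "F"), ("E", "F")])

key_nodes = cutpoints(graph)
print(sorted(key_nodes))
```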
56
Summary: Link Analysis
• Explores the relationships between objects, and how objects are linked
• Identify nodes that are “key” within a network
• Determine which links are critical to the operations of the network
• Assess the existence of relevant sub-networks
• Evaluate “spheres of influence”
• Cluster entities into communities of interest
  • Geographic
  • Intellectual
  • Psychographic
57
Example – Collaborative Filtering
Business Intelligence: The Savvy Manager's Guide (The Savvy Manager's Guides) (Paperback) by David Loshin
"Imagine that you are the sales manager for a large retail organization and that you were able, within some probability, to predict how much money..."
Key Phrases: productivity analytics, business rules system, business rules approach, Postal Service, Data Warehousing Institute, United States
(5 customer reviews)
Customers who bought this item also bought:
• Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications by Larissa T. Moss
• Performance Dashboards: Measuring, Monitoring, and Managing Your Business by Wayne W. Eckerson
• Enterprise Dashboards: Design and Best Practices for IT by Shadan Malik
• The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition) by Ralph Kimball
• Business Intelligence for the Enterprise by Mike Biere
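One simple way to think about “also bought” lists is as a network in which items are linked whenever they appear in the same purchase. The sketch below counts those co-purchase links and ranks the neighbors of one item. It is not a description of how any retailer actually computes recommendations, and the baskets and shortened titles are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets; titles are shortened placeholders.
baskets = [
    {"Savvy Manager's Guide", "BI Roadmap", "DW Toolkit"},
    {"Savvy Manager's Guide", "BI Roadmap"},
    {"Savvy Manager's Guide", "Performance Dashboards"},
    {"DW Toolkit", "Enterprise Dashboards"},
]

def co_purchase_counts(baskets):
    """Weight of the link between two items = number of shared baskets."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def also_bought(item, counts, top=3):
    """Items most often bought with 'item', strongest link first."""
    related = [(other, n) for (first, other), n in counts.items() if first == item]
    related.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in related[:top]]

counts = co_purchase_counts(baskets)
suggestions = also_bought("Savvy Manager's Guide", counts)
print(suggestions)
```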
58
Example – Organizational Behavior
• Evaluation of communication patterns between individuals within an organization
• Attribution of actors with demographic data (age, seniority, gender, department, education, etc.)
• Assess:
  • Is work being performed within or outside of the formal structure of the organization?
  • Are there self-organized “invisible barriers” emerging based on individual characteristics?
  • Are there informal communities of interest that may seed innovation across division boundaries?
  • How effective are the different communication channels?
59
Example - Terrorist Network
• Connectivity map by Valdis Krebs, based on data available from news sources on the World Wide Web
60
Other Examples/Resources
• Analyzing degree and centrality to assess risk of infection in a population (http://intl-aje.oxfordjournals.org/cgi/content/abstract/162/10/1024)
• Integrating raw data from multiple sources, including phone records, bank transactions, surveillance reports, and vehicle sales, to create a criminal network analysis program (http://ai.bpa.arizona.edu/COPLINK/publications/crimenet/Xu_CACM.doc)
• Enabling targeted marketing within self-organizing networks (www.linkedin.com, www.myspace.com)
• Krebs’ article on Identifying Terrorist Networks (http://firstmonday.org/issues/issue7_4/krebs/)
• Corporate board members and influencers (http://www.theyrule.net/)
61
Issues and Considerations for Business Intelligence
62
Challenges
• Establishing a business justification
• Network Models and Metadata
• Data warehouse data extraction and transformation
• Semantic Analysis, Entity Extraction, and Establishing Linkage
• Graph Management
63
Business Justification
• Are all of these examples proving known concepts after the fact?
• Where can this technique add value to existing activities?
• What are the costs related to implementation and socialization?
One of the criticisms of network and link analysis is the belief that the analysis
exposes interesting notions that are actually already known. In evaluating the terrorist data, for example, are we determining (after the fact) that certain parties
were connected, even though that fact was already known? Alternatively, one might ask the same question a different way: Today, would this kind of analysis contribute
to the determination of suspicious group behavior that would warrant action?
At a more concrete level, we can evaluate the types of applications to which SNA is
applied and review what benefits are expected out of the process and how those benefits are measured. For example, consider Amazon’s use of collaborative
filtering – the cost is not significant, but there is a perception of high value, especially as it facilitates self-organized predictive modeling.
64
Data Extraction and Transformation
• Linearized or rectangular data is not necessarily suitable for network or link analysis
• Synchronization of warehouse data with network data
• Challenge:
  • Develop models for representation of network data
  • Provide services for transformation into and out of graph structures
  • Provide front-end applications for visualization
65
Networks and Data Models
• SNA applications require data configured to represent nodes, links, and associated link weights
• Example:
  • Nodes: Ken Dave Jill Bob
  • Arcs:
    • Ken Dave 10
    • Ken Jill 1
    • Dave Bob 6
Different applications expect the input data representing the network to be configured in a way that can be parsed easily and loaded into the tool. In the example here, the node names are enumerated, followed by an enumeration of arcs consisting of the source node, the target node, and the weight of the link. In turn, the application will maintain an internal representation of the network (perhaps in an adjacency matrix), as well as additional metadata related to actions taken by the analyst, which means that the internal representation of the network may be modified and written out during the analysis.
The challenge is to create the transformation mechanisms that extract rectangular data from traditional data sets, organize the data into the appropriate network representation, and transform the output of the SNA application back into rectangular format for use in traditional BI activities. This requires understanding the SNA application’s model as well as how the connectivity is represented within the source data sets.
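As a sketch of that round trip, the code below parses the Ken/Dave/Jill/Bob example into an adjacency matrix and then flattens the matrix back into rectangular (source, target, weight) rows. The exact text layout with "Nodes:" and "Arcs:" headers is an assumption for the example; real SNA tools each define their own input formats.

```python
# Assumed plain-text layout for the example; real tools use their own formats.
network_text = """
Nodes: Ken Dave Jill Bob
Arcs:
Ken Dave 10
Ken Jill 1
Dave Bob 6
"""

def parse_network(text):
    """Return (node names, list of (source, target, weight) arcs)."""
    nodes, arcs, in_arcs = [], [], False
    for raw in text.strip().splitlines():
        line = raw.strip()
        if line.startswith("Nodes:"):
            nodes = line[len("Nodes:"):].split()
        elif line.startswith("Arcs:"):
            in_arcs = True
        elif in_arcs and line:
            source, target, weight = line.split()
            arcs.append((source, target, int(weight)))
    return nodes, arcs

def adjacency_matrix(nodes, arcs):
    """Rows are source nodes, columns are target nodes, cells are weights."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target, weight in arcs:
        matrix[index[source]][index[target]] = weight
    return matrix

def to_rows(nodes, matrix):
    """Flatten the matrix back into rectangular rows for BI tools."""
    return [(nodes[i], nodes[j], w)
            for i, row in enumerate(matrix)
            for j, w in enumerate(row) if w]

nodes, arcs = parse_network(network_text)
matrix = adjacency_matrix(nodes, arcs)
rows = to_rows(nodes, matrix)
print(matrix)
print(rows)
```

A dense matrix is fine at this size; for large, sparse networks an adjacency list or edge table is usually the more practical internal form.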
66
Social Network Metadata
• Actors
  • Label or name
  • Identifier
  • Actor Demographics
• Links
  • Relational characteristic
  • Connectivity measure
  • Weight/value
• Other issues to consider:
  • “Link distance”
  • Positioning
  • Graph structure
The network embeds relationships, but the nature of how those relationships correspond to the modeled objects must be maintained. This means that the metadata associated with each of the objects should be captured, especially if it can be visualized along with the network itself. For example, in the Knoke network, the shapes and colors used indicate different aspects of the modeled actors. The same is true for the links – different types may be represented using different line types, while the weights may be indicated using line thickness or even distance between the nodes. Because visualization is critical, the positioning of nodes on the “template” of the map may be relevant, as well as the shape that the graph should take (e.g., circle, hub and spokes, random node placement, etc.).
Eventually, most analyses will need to be reintegrated with the original source data sets, so an identifier needs to be assigned to each node, and each identifier needs to be linked back to the original data.
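One possible shape for this metadata is sketched below: actors and links are small records, and a computed measure is written back out keyed by the actor identifier so it can be joined to the source data. The field names, the "emails" relation, and the department value are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Actor:
    actor_id: int                 # key that links back to the source data set
    label: str
    demographics: dict = field(default_factory=dict)

@dataclass
class Link:
    source_id: int
    target_id: int
    relation: str                 # relational characteristic, e.g. "emails"
    weight: float = 1.0

# Hypothetical records for illustration only.
actors = [Actor(1, "Ken", {"department": "Sales"}), Actor(2, "Dave"), Actor(3, "Jill")]
links = [Link(1, 2, "emails", 10), Link(1, 3, "emails", 1)]

# Compute a simple measure, then emit rectangular rows keyed by identifier.
out_degree = {actor.actor_id: 0 for actor in actors}
for link in links:
    out_degree[link.source_id] += 1

rows = [(actor.actor_id, actor.label, out_degree[actor.actor_id]) for actor in actors]
print(rows)
```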
67
Semantic Analysis and Entity Extraction
• Semi-structured and unstructured text contains references to entities and their relationships
• Challenge:
  • Provide the ability to identify entities within a document
  • Characterize entity types based on context
  • Establish connectivity on more than a naïve basis
  • Transform extracted information into a format suitable for integration and analysis
68
Entity Extraction
• Recall our earlier example:
  • Richard J. Palaima, Of Mattapan, Suddenly, Nov. 18, 2002. Beloved husband of Jonadee (Badayos) Palaima. Devoted son of Madelyn L. (George) Palaima of Braintree, and the late Richard A. Palaima. Devoted brother of John A. Palaima of Braintree, nephew of Catherine Cunningham of Rockland, Cousin & Godson of Robert Cunningham of Rockland. Funeral from the Mortimer N. Peck-Russell Peck Funeral Home, 516 Washington St., BRAINTREE, on Saturday at 9 a.m. Funeral Mass in St. Gregory's Church, Dorchester, at 10 a.m. Relatives and friends invited. Visiting hours Friday 2-4 & 7-9 p.m. Memorial donations may be sent to St. Gregory's Church, 2215 Dorchester Ave., Dorchester 02124.
• Text mining applications can identify names, locations, roles, and organizations within semi-structured and unstructured data
• Often the relationship may be as simple as “appearing in the same context”
One approach for evaluating connections is to analyze text, extract the entities and related metadata (locations, roles, etc.), insert that data into a database, then extract it for the purposes of network analysis. By analyzing many similar document collections (e.g., public records, filings, directories, news articles) and inserting the results into a data set, linkages can emerge from the data. For a good example, consult www.theyrule.net.
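The “appearing in the same context” relationship can be sketched as a co-occurrence count. The code assumes the entity extraction step has already run and produced a list of entity names per document; the document identifiers and entity names are placeholders, not real extraction output.

```python
from collections import Counter
from itertools import combinations

# Placeholder output of an entity extraction step (one entity list per document).
documents = {
    "doc-1": ["Person A", "Person B", "Funeral Home X"],
    "doc-2": ["Person B", "Person C"],
    "doc-3": ["Person A", "Person B"],
}

def cooccurrence_links(documents):
    """Link two entities each time they appear in the same document."""
    links = Counter()
    for entities in documents.values():
        for a, b in combinations(sorted(set(entities)), 2):
            links[(a, b)] += 1
    return links

links = cooccurrence_links(documents)
for (a, b), weight in sorted(links.items()):
    print(a, b, weight)
```

The resulting weighted pairs are already in the source, target, weight form that SNA tools expect. Co-occurrence is the naïve baseline; typed relationships (spouse, sibling, employer) need more context than this.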
69
Establishing Linkage
• Quality of data used to refer to entities may be variable and suspect
• Challenge:
  • Effectively characterize how attribution contributes to similarity scoring
  • Exploit data quality tools for standardization and linkage from different kinds of source data
Data cleansing, matching, and linkage tools have been used for many years for
identifying duplicates among data records, as well as using similarity scoring mechanisms for householding and establishing hierarchies. The same techniques
can be used to examine attributes associated with each record, parse and standardize the data, then apply the similarity scoring and linkage capabilities to the
candidate entities.
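A minimal sketch of the parse, standardize, and score sequence, using Python's standard difflib module as a stand-in for a commercial matching tool. The 0.85 threshold and the sample names are assumptions; production linkage would typically weigh several attributes (name, address, dates), not a single string.

```python
import re
from difflib import SequenceMatcher

def standardize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())

def similarity(a, b):
    """Similarity score between 0 and 1 on the standardized strings."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()

def link_candidates(names, threshold=0.85):
    """Pairs of names whose similarity score meets the (assumed) threshold."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = similarity(a, b)
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs

# Hypothetical entity names from different sources.
names = ["Acme Widgets, Inc.", "ACME  Widgets Inc", "Zenith Labs"]
matches = link_candidates(names)
print(matches)
```

Character-based scoring like this is sensitive to word order, so “Last, First” versus “First Last” would need token-level handling before scoring.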
70
Graph Management
• Graph data structures are different from relational databases
• Graphs do not inherently provide persistence
• Large networks take up large amounts of memory
• Issues:
  • Managing large graphs in a performance-efficient manner
  • Manipulating graph models in real time
  • Persistence of graph models
  • Drawing conclusions about what is demonstrated in the graph
Some of the interesting challenges that need to be addressed include:
• Providing a usable graph management utility that provides reasonable performance
• Providing persistence for graphs
• Enabling transformation into and out of graph format
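One simple way to get persistence is to store the arcs in a relational table and rebuild the in-memory graph on load. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table and column names are assumptions, and this approach does nothing by itself for the performance issues of very large graphs.

```python
import sqlite3

def save_graph(conn, arcs):
    """Persist (source, target, weight) arcs in a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS arcs (source TEXT, target TEXT, weight REAL)"
    )
    conn.executemany("INSERT INTO arcs VALUES (?, ?, ?)", arcs)
    conn.commit()

def load_graph(conn):
    """Rebuild a weighted adjacency dictionary {source: {target: weight}}."""
    adjacency = {}
    query = "SELECT source, target, weight FROM arcs"
    for source, target, weight in conn.execute(query):
        adjacency.setdefault(source, {})[target] = weight
        adjacency.setdefault(target, {})   # keep nodes with no outgoing arcs
    return adjacency

conn = sqlite3.connect(":memory:")         # use a file path for real persistence
save_graph(conn, [("Ken", "Dave", 10), ("Ken", "Jill", 1), ("Dave", "Bob", 6)])
graph = load_graph(conn)
print(graph)
```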
71
Interesting Resources
• Books
  • “Linked: How Everything Is Connected to Everything Else and What It Means,” Albert-Laszlo Barabasi
  • “The Tipping Point,” Malcolm Gladwell
• Web sites
  • http://www.insna.org/
  • http://www.ire.org/sna/
  • http://faculty.ucr.edu/~hanneman/nettext/
• Software
  • The R project http://www.r-project.org/
  • SNA tools for R http://erzuli.ss.uci.edu/R.stuff/
  • UCINET trial download http://www.analytictech.com/downloaduc6.htm
  • Pajek http://vlado.fmf.uni-lj.si/pub/networks/pajek/
72
Questions?
• If you have questions, comments, or suggestions, please contact me
David Loshin
301-754-6350