
Introduction to Social Network and Link Analysis

David Loshin

Knowledge Integrity, Inc.

TDWI 2007 Spring Conference

Boston, MA


© 2006 Knowledge Integrity Incorporated
www.knowledge-integrity.com
(301) 754-6350


Half-Day Agenda

• Introduction to Networks, Social and Otherwise
• Network Connectivity Basics
• Link and Network Analysis
• Issues and Considerations for BI

In this talk, we will discuss the notion of connectivity and why models for analyzing connections can add value to a business intelligence initiative. By reviewing the ways that objects interact through networks, we will explore whether the results of this analysis can enhance profiles, predictive analytics, and general business intelligence.

Objectives:

• Understand network connectivity basics
• Explore ways to represent networks
• Understand the types of analysis that can be performed
• Envision network analytics, data extraction, and preparation


Introduction to Networks, Social and Otherwise


Networks, Links, and Coincidences?

• How many people do you know?
  - Family, friends, co-workers, conference attendees
  - Dozens? Hundreds? Thousands?
• How well do you know them?
  - Very close, know them well, acquaintances, “just met”
• Concepts of connectivity
  - You know 1,000 people
  - They each know 1,000 people
  - Therefore, you are potentially connected to 1,000,000 people through just one intermediary
  - Through two intermediaries, the network could potentially extend to 1,000,000,000 people!
• “Small World Theory”: we are all connected through a very small number of links (see Milgram, Bacon)

The notion of connectivity is intriguing, especially when considering individuals, other types of parties, and the knowledge that can be derived through the analysis of connections.

Consider the example in this slide. Let’s assume that we all know about 1,000 people. Is it really true that each individual is therefore linked to 1,000,000 people? Conceptually, that would be true only if each of the 1,000 people I know knew a completely different set of 1,000 people. In reality, we all seem to run in similar circles, so there is a great likelihood that many of the people that I know are the same people that you know.

The consequence of this is the effective “self-organization” of communities (as well as sub-communities and sub-sub-communities). By examining the relationships that exist among groups of people, we can learn who the influencers are, who is influenced, who spans critical communication boundaries, and how information (or commerce, or viruses, etc.) flows through the selected community.
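The effect of overlapping circles on reach can be illustrated with a small sketch. The network below is a purely hypothetical toy model (people arranged in a circle, each knowing only their nearest neighbors), and the names `ring_lattice` and `reach` are made up for illustration; it is not drawn from the slides.

```python
from collections import deque


def ring_lattice(n, k):
    """Toy network: n people in a circle, each knowing the k nearest on either side."""
    return {i: {(i + d) % n for d in range(-k, k + 1) if d != 0} for i in range(n)}


def reach(graph, start, max_links):
    """Count the distinct people reachable from start in at most max_links steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, dist = frontier.popleft()
        if dist == max_links:
            continue  # do not expand beyond the allowed number of links
        for friend in graph[person]:
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, dist + 1))
    return len(seen) - 1  # exclude the starting person


network = ring_lattice(1000, 5)          # everyone knows 10 people
direct = reach(network, 0, 1)            # 10 direct acquaintances
naive_estimate = direct * direct         # 100 if no circles overlapped
two_step = reach(network, 0, 2)          # far fewer, because friends share friends
print(direct, naive_estimate, two_step)
```

In this toy model the two-step reach is only a fifth of the naive estimate, because most friends of friends are already friends.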


Euler’s Insight

Bridges of Konigsberg

One pastime of the residents of Konigsberg was to walk around town over the bridges between the different land areas of town. One game was to see if one could start at one location, walk over every bridge just once, and end up at the starting point. Mathematician Leonhard Euler abstracted the problem into a “graph”: a collection of nodes and links between them. By examining the graph, he was able to determine, based on the degree of each node (the number of links that touch it), that the challenge of the bridges of Konigsberg was actually impossible. More importantly, this insight created the branch of mathematics referred to as graph theory, which is the fundamental basis of network (and consequently, social network) analysis.
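Euler's degree argument can be sketched in a few lines. The sketch assumes the standard layout of the puzzle (four land areas joined by seven bridges); the land-area labels and function names are illustrative only. A closed walk that crosses every link exactly once requires every node of a connected graph to have an even degree.

```python
from collections import Counter

# The seven bridges as links between four land areas (labels are illustrative).
BRIDGES = [
    ("island", "north"), ("island", "north"),
    ("island", "south"), ("island", "south"),
    ("island", "east"), ("north", "east"), ("south", "east"),
]


def degrees(links):
    """Count how many links touch each node."""
    counts = Counter()
    for a, b in links:
        counts[a] += 1
        counts[b] += 1
    return counts


def has_closed_tour(links):
    """True when every node has even degree (the graph is assumed to be connected)."""
    return all(d % 2 == 0 for d in degrees(links).values())


print(dict(degrees(BRIDGES)))
print(has_closed_tour(BRIDGES))
```

Every land area has an odd degree, so the walk the residents were looking for cannot exist.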


Network and Link Analysis

• Linkages exist everywhere
  - Between individuals (“MCI Friends and Family”)
  - Between locations (“Bridges of Konigsberg”)
  - Between other types of objects (“Telephone network”)
  - Between individuals and other kinds of objects (“Purchasing Preferences”)
  - Between businesses (“D&B corporate hierarchies”)
• There are different kinds of links
• Each link has some sort of attribution
• Analyzing networks can provide insight for evaluating behavior patterns for different intelligence activities

There are many applications that rely on the power of the network. Each of these applications represents some attempt to exploit the different kinds of connections that exist among small groups of individuals and larger groups of individuals, as well as how the groups themselves interact. Applications may be designed to seek out some interesting pattern within the network or to exploit the communication and information exchanges provided by the network.

Every node and each of its corresponding links carries certain characteristics. Each node represents an entity, while each link carries attributes that describe the nature of the relationship.


Applications of Network/Link Analysis

• Enforcement: Criminal analysis, money laundering
• Fraud detection: Spambot detection, call pattern analysis
• Marketing: Customer behavior analysis, segmentation, collaborative filtering
• Community analyses: Account proxy (account used by more than one individual, many accounts used by one individual), research collaboration, communities of interest


More Applications…

• Health care: Contagion, disease control
• Physical: Supply chain analysis
• Transfer/Communications: Spheres of influence, information flows, business partnerships
• Formal relationships: Working relationships, influential individuals, ownership, accountability, corporate structure
• Informal relationships: Friendship networks, extended families, social interactions, insider networks


Evidence is All Around…

• Databases
  - Transaction systems, logs, data warehouses
• Semi-structured data
  - Email, web pages, public records, filings
• Unstructured data
  - News items, prospectuses, filings

Typical business intelligence applications focus on the ability to organize information for reporting and analysis across one or more dimensions, but are not usually configured to enable network analysis. Yet data warehouses contain significant amounts of connectivity information that is suitable for network and link analysis. Other sources of information also provide connectivity data: transaction systems, database logs, and software activity logs, as well as less structured sources such as emails, web logs, electronic public data filings, and other public records (e.g., real estate transactions, Uniform Commercial Code filings). In addition, text analysis applications can extract data about individuals from unstructured sources to establish connections.


Example: Death Notice

• Richard J. Palaima, Of Mattapan, Suddenly, Nov. 18, 2002. Beloved husband of Jonadee (Badayos) Palaima. Devoted son of Madelyn L. (George) Palaima of Braintree, and the late Richard A. Palaima. Devoted brother of John A. Palaima of Braintree, nephew of Catherine Cunningham of Rockland, Cousin & Godson of Robert Cunningham of Rockland. Funeral from the Mortimer N. Peck-Russell Peck Funeral Home, 516 Washington St., BRAINTREE, on Saturday at 9 a.m. Funeral Mass in St. Gregory's Church, Dorchester, at 10 a.m. Relatives and friends invited. Visiting hours Friday 2-4 & 7-9 p.m. Memorial donations may be sent to St. Gregory's Church, 2215 Dorchester Ave., Dorchester 02124.

This example was taken from the Boston Globe, and was available online from 11/20/2002 to 11/21/2002. Death, birth, engagement, and wedding notices are good examples of publicly available information (published in the newspaper), configured in semi-structured form, that provides a lot of data about connections. In this example, we have a description of one individual and his immediate family, his location, and his religious affiliation.


Example: Extracted Entities and Their Links

[Figure: graph of the entities extracted from the death notice and the links among them.]

Entities (nodes):

• People: Richard J. Palaima, Jonadee (Badayos) Palaima, Madelyn L. (George) Palaima, Richard A. Palaima, John A. Palaima, Catherine Cunningham, Robert Cunningham
• Places: Mattapan, MA; Braintree, MA; Rockland, MA; Dorchester, MA
• Organization: St. Gregory's Church
• Living state: Deceased

Link types (edges): married to, mother of, father of, brother of, sister of, is cousin of, is godson/godfather of, lives in, located in, is religiously affiliated with, has living state of
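One simple way to hold extracted entities and links is a list of (subject, link type, object) triples. The sketch below is only an assumption about how such data might be stored; it uses a subset of the relationships stated in the death notice, and the helper names `links_of` and `residents` are hypothetical.

```python
RJ = "Richard J. Palaima"

# (subject, link type, object) triples taken from the text of the notice.
TRIPLES = [
    (RJ, "lives in", "Mattapan, MA"),
    (RJ, "married to", "Jonadee (Badayos) Palaima"),
    (RJ, "has living state of", "Deceased"),
    (RJ, "is religiously affiliated with", "St. Gregory's Church"),
    ("Madelyn L. (George) Palaima", "mother of", RJ),
    ("Madelyn L. (George) Palaima", "lives in", "Braintree, MA"),
    ("Richard A. Palaima", "father of", RJ),
    ("Richard A. Palaima", "has living state of", "Deceased"),
    ("John A. Palaima", "brother of", RJ),
    ("John A. Palaima", "lives in", "Braintree, MA"),
    ("Catherine Cunningham", "lives in", "Rockland, MA"),
    ("Robert Cunningham", "is cousin of", RJ),
    ("Robert Cunningham", "lives in", "Rockland, MA"),
    ("St. Gregory's Church", "located in", "Dorchester, MA"),
]


def links_of(entity):
    """All triples in which the entity appears as subject or object."""
    return [t for t in TRIPLES if entity in (t[0], t[2])]


def residents(place):
    """Entities linked to a place by a 'lives in' link, in alphabetical order."""
    return sorted(s for s, link, o in TRIPLES if link == "lives in" and o == place)


print(len(links_of(RJ)))
print(residents("Braintree, MA"))
```

Even this small set of triples supports connection-style questions, such as which extracted individuals share a location.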


Good SNA Resource

• Many examples taken from Robert A. Hanneman and Mark Riddle’s online text
  - Introduction to social network methods
  - http://faculty.ucr.edu/~hanneman/nettext/


Integrating Network/Link Analysis with BI

• Network information may be embedded within the data warehouse
• However:
  - Representations may not be appropriate for analysis
  - Data may need to be transformed and managed using non-relational data structures
  - Analysis lends itself to visual representation
  - Must understand concepts associated with networks, connectivity, and qualification of linkage
• Objective: Gain a conceptual understanding of
  - Network data and its representation
  - Characteristics of network relationships
  - Types of analysis performed


Network Connectivity Basics


Representing Network Data

• What is network data?
• Two types of objects:
  - Actors (entities that participate in the network)
  - Links (established relationships between the actors)
• Analysis focuses on:
  - Who the actors are
  - How they relate to the community as a whole
  - How the actors organize within the framework
• Different approaches to representation:
  - Rectangular data
  - Adjacency matrices
  - Graph representation

The next set of slides introduces basic concepts of social networks:

- Actors
- Links
- Representations


Rectangular Data

• Relational structure provides a “rectangular” view of the “who knows who” relationship

ID  Name     Sex  Age  Degree
1   John     M    23   4
2   Betsy    F    25   2
3   George   M    31   3
4   Martha   F    27   1
5   Sam      M    34   2
6   Abigail  F    32   2

From  To
1     2
1     3
1     6
1     5
2     1
2     3
3     1
3     2
3     4
4     3
5     1
5     6
6     1
6     5

Standard “rectangular” data, as it appears in most databases, can be used to manage network links, but may be problematic for analysis.

Links are represented via an association table, and while this provides information about each individual, the rectangular format makes it difficult to assess “holistic” information, either about the modeled community as a whole or about segments or patterns that emerge.
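To make the contrast concrete, here is a minimal sketch of turning the rows of a link table into an adjacency matrix. It reads the slide's link table as (from, to) pairs of actor IDs; the function names are hypothetical.

```python
# (from, to) rows of the link table; IDs are 1-based, as in the ID column.
LINKS = [
    (1, 2), (1, 3), (1, 6), (1, 5), (2, 1), (2, 3), (3, 1),
    (3, 2), (3, 4), (4, 3), (5, 1), (5, 6), (6, 1), (6, 5),
]


def to_adjacency(links, size):
    """Build a size x size matrix with a 1 wherever a (from, to) row exists."""
    matrix = [[0] * size for _ in range(size)]
    for src, dst in links:
        matrix[src - 1][dst - 1] = 1
    return matrix


def degree(matrix, node_id):
    """Number of links leaving the given actor (row sum)."""
    return sum(matrix[node_id - 1])


adjacency = to_adjacency(LINKS, 6)
for row in adjacency:
    print(row)
print([degree(adjacency, node_id) for node_id in range(1, 7)])
```

Once the links are in matrix form, whole-network questions (is every link reciprocated, who has the most links) become simple row and column operations.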


Adjacency Matrix

• This adjacency matrix represents the undirected “who knows who” network among our actors

                         Choice
Chooser   John  Betsy  George  Martha  Sam  Abigail
John      -     1      1       0       1    1
Betsy     1     -      1       0       0    0
George    1     1      -       1       0    0
Martha    0     0      1       -       0    0
Sam       1     0      0       0       -    1
Abigail   1     0      0       0       1    -

In an adjacency matrix:

- A ‘0’ means that there is no link between the chooser and choice
- A ‘1’ means that there is a link between the chooser and choice

An adjacency matrix lends itself to certain types of analysis that cannot easily be done through standard rectangular representations.


Graphs

• Graphs contain vertices (nodes) and edges (links)

[Figure: graph of the “who knows who” network, with vertices John, Sam, George, Betsy, Martha, and Abigail.]

A graph is an abstract representation of the same set of information contained within the adjacency matrix.

In a graph:

- A vertex represents an actor
- An edge represents a link between actors

Graph representations can feed visualization front ends, which support different analytical processes and methods.
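Because a graph and an adjacency matrix carry the same information, one can be derived from the other. The following sketch converts the “who knows who” matrix into a dictionary of neighbors, one common in-memory form of a graph; the function name `to_graph` is hypothetical, and the diagonal is written as 0.

```python
NAMES = ["John", "Betsy", "George", "Martha", "Sam", "Abigail"]

# Rows and columns follow the order of NAMES.
MATRIX = [
    [0, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0],
]


def to_graph(names, matrix):
    """Map each vertex to the list of vertices it shares an edge with."""
    return {
        name: [names[j] for j, cell in enumerate(row) if cell == 1]
        for name, row in zip(names, matrix)
    }


graph = to_graph(NAMES, MATRIX)
for vertex, neighbors in graph.items():
    print(vertex, "->", neighbors)
```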


Connectivity Concepts

• Relationships and links
  - Binary, directed, signed
• Measures of relationships
  - Grouped ordinal, ranked ordinal, categorical, interval measures

The link between two individuals can be simple, such as the “who knows who” relationship, or can be much more complex. The different classifications of connections may be based on how much information they carry. The next slides describe different levels of information and provide some examples.


Binary Connectivity

• Represents a relationship determined by the answer to a true/false question
• Examples:
  - Person A and person B know each other
  - Organization X and organization Y contribute to the same charity
  - Person A and person B have purchased product P

A binary connection essentially represents the positive response to a true/false question, while the absence of the link reflects a negative response. Note that in the examples provided here, the connection is symmetric, or undirected. In other words, if A knows B, then B knows A.


Directed Connectivity

• The link is established based on how a question relates two entities in a directed way
• Examples:
  - Person A has emailed person B
  - Person A has visited web site W
  - Organization X has purchased services from organization Y

A directed link differs from an undirected link in that there is no assumption of reciprocity. For example, A may have emailed B, but that does not mean that B emailed A. If the relationship exists in both directions, then there will be two directed arcs: one from A to B and one from B to A.


Signed Connectivity

• The link describes the nature of the relationship (e.g., positive, neutral, or negative)
• Examples:
  - Person A hates product P
  - Organization X has been categorized as an approved vendor
  - Person A has provided a neutral score for Airline U’s service

This is the first characteristic of the connection that carries a value, although the values indicate only coarse-grained data about the relationship. In this case, the sign indicates the nature of the relationship. For example, +1 is a positive connection, 0 is a neutral connection, and -1 is a negative connection.

This notion suggests that descriptive metadata about links can embed more complex knowledge about the network and how information flows through the network.


Grouped and Ranked Ordinal

• The links have magnitude, such as “dislikes” = -1, “strongly dislikes” = -2, “vehemently dislikes” = -3
• This provides a more meaningful description of the connective characteristics
• Examples:
  - Individuals and vacation destinations (desire to visit)
  - Job references (references)
• In ranked ordinal, links are ranked in order of magnitude
  - Actor X ranks the other actors from most liked to least liked

In a grouped ordinal model, the links carry both sign (to indicate the positive/negative nature) and magnitude of the relationship. The greater the absolute magnitude, the stronger the connective characteristic. Grouped ordinal measures can characterize different quantities:

- the “strength” of the connection
- the frequency of the interaction
- the intensity of the relationship

In the ranked ordinal model, instead of gauging the links based on a quantitative measure, the links are ordered based on their relative rank. While this is not common in network analysis, the information is often easy to assemble. For example, one can calculate the order of “email connectivity” by counting the number of emails that are exchanged and building the ranking from the counts.
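The email-counting idea can be sketched as follows. The log format (a list of sender/recipient pairs), the names in it, and the function `rank_contacts` are all hypothetical; ties are broken alphabetically purely to keep the output stable.

```python
from collections import Counter


def rank_contacts(email_log, actor):
    """Rank the people an actor exchanged email with, most messages first."""
    counts = Counter()
    for sender, recipient in email_log:
        if sender == actor:
            counts[recipient] += 1
        elif recipient == actor:
            counts[sender] += 1
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, name, n) for rank, (name, n) in enumerate(ordered, start=1)]


# Made-up log of (sender, recipient) pairs.
log = [("ann", "bob"), ("bob", "ann"), ("ann", "cy"), ("dee", "ann"), ("ann", "bob")]
for rank, name, messages in rank_contacts(log, "ann"):
    print(rank, name, messages)
```

The counts themselves are a quantitative measure; keeping only the rank column is what turns them into a ranked ordinal link.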


Categorical

• Relationships are defined by category
  - A business relationship is type “1,” a personal relationship is type “2,” etc.

In a categorical linkage model, the attribution of the link reflects a segmentation of the relationships based on their qualitative type.


Interval Measures

• Rankings reflect scaling
  - The difference between 2 and 3 is the same as the difference between 20 and 21

Interval measures not only provide a ranked order, they also capture the relative difference in “intensity” based on the interval order. This approach is the most sophisticated measurement framework. Interval measures can always be reduced in complexity to one of the other measurement types.
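As a rough illustration of that last point, the sketch below reduces a set of interval-scaled link strengths to binary, signed, and ranked ordinal form. The scores and function names are invented for the example, and treating a strength of 0 as "no link" is an assumption.

```python
def to_binary(strength):
    """1 if any link exists, 0 otherwise (a strength of 0 is treated as no link)."""
    return 1 if strength != 0 else 0


def to_signed(strength):
    """+1 for a positive link, -1 for a negative link, 0 for neutral."""
    return (strength > 0) - (strength < 0)


def to_ranked(strengths):
    """Rank targets from strongest (rank 1) to weakest link strength."""
    ordered = sorted(strengths, key=lambda name: -strengths[name])
    return {name: rank for rank, name in enumerate(ordered, start=1)}


# Hypothetical interval-scaled strengths of actor X's links to four others.
scores = {"A": 12, "B": -3, "C": 0, "D": 7}
print({name: to_binary(s) for name, s in scores.items()})
print({name: to_signed(s) for name, s in scores.items()})
print(to_ranked(scores))
```

Each reduction discards information, which is why the conversion only works in this direction.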


Matrix Analysis

        Col 1   Col 2   Col 3   Col 4
Row 1   (1,1)   (1,2)   (1,3)   (1,4)
Row 2   (2,1)   (2,2)   (2,3)   (2,4)
Row 3   (3,1)   (3,2)   (3,3)   (3,4)

A matrix is a rectangular arrangement of link data. Each row and column is assigned an identifier, and each cell within the matrix is indexed by its (row, column) address. For example, the cell in the second row and the third column (shaded on the slide) is addressed “(2,3).” Matrices are one way to capture a representation of a network. Each cell contains the information associated with the link between the entity represented by the row and the entity represented by the column.


Example: Adjacency Matrices

[Figure: directed graph on four nodes A, B, C, and D, with arcs A→B, A→C, B→C, C→A, C→B, C→D, and D→C.]

Adjacency matrix:

     A  B  C  D
A    -  1  1  0
B    0  -  1  0
C    1  1  -  1
D    0  0  1  -

An adjacency matrix presents the network connectivity using labeled rows and columns, with the value in each cell representing the nature of the link. In this example, a directed graph represents the relationships among a set of four people. The adjacency matrix shows the same set of relationships: if there is a directed arc from one person to another, then the corresponding labeled cell has a “1,” and it has a “0” otherwise. In this case, there is no concept of a relationship from an entity to itself, so the diagonal, which represents the self-directed arcs, is left with a “-.”

For an undirected graph, the relationships are essentially reciprocal, and the matrix would be symmetric about the diagonal.
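The symmetry property gives a quick test for whether a matrix describes a directed or an undirected network. The sketch below applies it to the four-node example, reading rows as the source of an arc and columns as the target (an assumption about the slide's convention) and writing the diagonal as 0; the function names are hypothetical.

```python
NODES = ["A", "B", "C", "D"]

# Row = source of the arc, column = target; the diagonal is written as 0.
DIRECTED = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]


def is_symmetric(matrix):
    """True when every arc is matched by an arc in the opposite direction."""
    size = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(size) for j in range(size))


def one_way_arcs(nodes, matrix):
    """Arcs that are not reciprocated."""
    size = len(matrix)
    return [
        (nodes[i], nodes[j])
        for i in range(size)
        for j in range(size)
        if matrix[i][j] == 1 and matrix[j][i] == 0
    ]


print(is_symmetric(DIRECTED))
print(one_way_arcs(NODES, DIRECTED))
```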


More About Matrices

• Multiple relationships can be captured in matrices of higher dimensionality

Relationship 1:

     A  B  C  D
A    -  1  1  0
B    0  -  1  0
C    1  1  -  1
D    0  0  1  -

Relationship 2:

     A  B  C  D
A    -  1  1  0
B    0  -  1  1
C    0  1  -  1
D    0  1  1  -

Relationship 3:

     A  B  C  D
A    -  0  1  0
B    1  -  0  0
C    1  0  -  1
D    1  0  1  -

Relationship 4:

     A  B  C  D
A    -  1  0  1
B    0  -  0  0
C    1  1  -  1
D    1  1  0  -

A matrix captures one set of relationships, but the analysis may need to capture multiple relationships that exist between the same set of actors. In this case, we can layer a set of two-dimensional matrices into a third dimension to capture the complete set. In this slide, each matrix is at its own layer, and we can examine the associations between all actors for one relationship by looking at one layer, or we can look at all the links between any pair of actors by looking across the third dimension. In this slide, the highlighted cells for position (C, B) show the links between C and B.
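Layering can be sketched as a list of matrices, one per relationship. The cell values below are transcribed from the slide's four matrices as best they can be read, with the diagonal written as 0; the helper names are hypothetical.

```python
NODES = ["A", "B", "C", "D"]

# One 4 x 4 matrix per relationship; LAYERS[r][i][j] is the link from i to j.
LAYERS = [
    [[0, 1, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]],
    [[0, 1, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]],
    [[0, 0, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [1, 0, 1, 0]],
    [[0, 1, 0, 1], [0, 0, 0, 0], [1, 1, 0, 1], [1, 1, 0, 0]],
]


def one_relationship(layers, index):
    """All links among the actors for a single relationship (one layer)."""
    return layers[index]


def across_layers(layers, nodes, source, target):
    """The links between one pair of actors, read across the third dimension."""
    i, j = nodes.index(source), nodes.index(target)
    return [layer[i][j] for layer in layers]


print(one_relationship(LAYERS, 0))
print(across_layers(LAYERS, NODES, "C", "B"))
```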


Matrix Operations

• Matrix transpose
• Matrix addition and subtraction
• Matrix multiplication

Transpose (the matrix on the right is the transpose of the matrix on the left):

     A  B  C  D          A  B  C  D
A    -  1  1  0     A    -  0  1  0
B    0  -  1  0     B    1  -  1  0
C    1  1  -  1     C    1  1  -  1
D    0  0  1  -     D    0  0  1  -

Addition:

     A  B  C  D        A  B  C  D        A  B  C  D
A    -  1  1  0   A    -  0  1  0   A    -  1  2  0
B    0  -  1  0 + B    1  -  1  0 = B    1  -  2  0
C    1  1  -  1   C    1  1  -  1   C    2  2  -  2
D    0  0  1  -   D    0  0  1  -   D    0  0  2  -

Multiplication:

[Figure: worked example of the matrix product A * B = C.]

Matrix A is the transpose of matrix B if, for each cell (i,j), the value of B(j,i) is the same as A(i,j).

Matrix addition and subtraction are simplest: if we are adding matrices A and B into matrix C, the value of C(i,j) is equal to A(i,j) + B(i,j). Subtraction is the same, except we subtract B(i,j) from A(i,j).

Matrix multiplication is more complex: the value of C(i,j) is equal to the sum of A(i,k) multiplied by B(k,j), for k = 1 to the number of columns of matrix A, which must equal the number of rows of matrix B. In the example here, C(2,1) = (A(2,1) * B(1,1)) + (A(2,2) * B(2,1)) + (A(2,3) * B(3,1)), which is equal to (1*1) + (2*3) + (1*1) = 8.
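The three operations can be written out directly from these definitions. This is a plain-Python sketch with hypothetical function names; a numerical library would normally be used instead. The multiplication check uses only the row of A and the column of B quoted in the notes.

```python
def transpose(m):
    """Swap rows and columns: result[j][i] == m[i][j]."""
    return [list(column) for column in zip(*m)]


def add(a, b):
    """Cell-by-cell sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def multiply(a, b):
    """C(i,j) = sum over k of A(i,k) * B(k,j); columns of a must equal rows of b."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# The four-node directed example, with the diagonal written as 0.
M = [[0, 1, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
print(transpose(M))
print(add(M, transpose(M)))

# Row 2 of A times column 1 of B, as in the notes: (1*1) + (2*3) + (1*1).
print(multiply([[1, 2, 1]], [[1], [3], [1]]))
```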


Adjacency Matrices and Multiplication

• Multiplying an adjacency matrix by itself once results in a matrix that counts the number of paths of length 2 between nodes

[Figure: the directed graph on nodes A, B, C, and D.]

     A  B  C  D        A  B  C  D        A  B  C  D
A    0  1  1  0   A    0  1  1  0   A    1  1  1  1
B    0  0  1  0 * B    0  0  1  0 = B    1  1  0  1
C    1  1  0  1   C    1  1  0  1   C    0  1  3  0
D    0  0  1  0   D    0  0  1  0   D    1  1  0  1

Raising an adjacency matrix to the power X results in a matrix counting the number of paths of length X between nodes. In this example, we see that from node B to A there is one path of length 2, and from C to itself there are 3 paths of length 2. In turn, computing the Boolean square of the adjacency matrix tells us whether there exists a path of length 2 between any two nodes.

Why is this relevant? Because the nature of network analysis is to explore connectivity, it is valuable not only to review direct links between actors, but also to explore how connected the actors are, including the strength/weakness of connections, the distances, “influence,” and the robustness of the connections, among other properties.
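A small sketch of path counting by repeated multiplication, using the four-node example with rows read as the start of a path and columns as its end. The function names are hypothetical, and the counted paths are allowed to revisit nodes.

```python
def multiply(a, b):
    """Standard matrix product of two square matrices of the same size."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def count_paths(adjacency, length):
    """Raise the adjacency matrix to the given power (length >= 1)."""
    result = adjacency
    for _ in range(length - 1):
        result = multiply(result, adjacency)
    return result


def boolean(matrix):
    """Reduce counts to 1/0: does at least one path exist?"""
    return [[1 if cell else 0 for cell in row] for row in matrix]


NODES = ["A", "B", "C", "D"]
ADJACENCY = [[0, 1, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]

squared = count_paths(ADJACENCY, 2)
b, a, c = NODES.index("B"), NODES.index("A"), NODES.index("C")
print(squared[b][a])      # paths of length 2 from B to A
print(squared[c][c])      # paths of length 2 from C back to C
print(boolean(squared))   # Boolean square: is there any path of length 2?
```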


Example: Knoke Information Exchange Network

• Map of the exchange of information between 10 organizations involved in the local political economy of social welfare services in a Midwestern city


Example: Knoke Information Network

• Graph and adjacency matrix for the Knoke Information Network


Visualization Techniques

• Characterizing attribution of actors by shape, size of node, color
• Characterizing relationship/link by line size, type, thickness, decorations
• Example:
  - Blue for non-government, red for government
  - Square for generalists, circle for specialists


Graph Analysis

• Simple properties tell a lot about interaction and connectivity
• Social structures reflect aspects of both
  - Global properties: the way the entire population interacts
  - Local properties: the ways that individuals within small groups interact
• Analyze both the physical structure and the patterns of structure

Global properties are those that describe aspects of the entire community, while local properties describe how small groups and individuals interact within the communities. These properties are analyzed based on the kinds of structures that exist inside the graph (subgraphs, cliques, components) as well as the patterns of the structures. An example of this latter point might be the recognition of a common linkage pattern that carries some sociological meaning, such as the relationships between a pair of parents and their children.

Locality: Dyads and Triads

• The most common subsets to review are
  - Dyads: groups of two actors
  - Triads: groups of three actors
• With directed data, there are 4 possible relationships between 2 actors
• With directed data, there are 64 possible relationships among 3 actors
• Relationships exhibit hierarchy, equality, exclusion, “social standing,” and patterns of behavior
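The counts of 4 and 64 follow from treating each possible directed arc as either present or absent: a dyad has 2 possible arcs (2^2 = 4 configurations) and a triad has 6 (2^6 = 64). A short enumeration, with a hypothetical function name, confirms this.

```python
from itertools import permutations, product


def possible_configurations(actors):
    """Every present/absent combination of the directed arcs among the actors."""
    arcs = list(permutations(actors, 2))  # ordered pairs, e.g. (A, B) and (B, A)
    return [dict(zip(arcs, states)) for states in product((0, 1), repeat=len(arcs))]


dyads = possible_configurations("AB")
triads = possible_configurations("ABC")
print(len(dyads), len(triads))
for configuration in dyads:
    print(configuration)
```

The four dyad configurations are: no link, A to B only, B to A only, and a reciprocated link.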


Simple Graph Properties

� Counts

� Number of actors

� Number of possible connections

� Number of connections present

� Characteristics of network

� Size of population,

� How small groups differ from large groups: “cohesion, solidarity, moral density”

� Characteristics of individuals

� Number of connections

� Density of the network

� Source vs. sink

The first place to begin in network analysis is at the global level, looking at properties that describe the entire network. For example, the counts associated with the graph provide some insight into its “density” – the number of possible connections vs. the number of connections actually present. Next come characteristics of the network, as reflected by the size of the population and a gross-level review of groupings within the network. At a more granular level, examining the direct relationships among the individual entities and their relative connectivity gives some insight into the population itself.

Size and Density

� Network size

� The count of the number of nodes

� Potential links

� There are (k*(k-1)) unique ordered pairs of actors

� The number of possible relationships grows quadratically with the number of actors

� Density is the ratio of links to the number of possible links

� The proportion of all possible links that are present

� Equal to the sum of links/number of possible links

� Provides insight into movement across the network, qualifications of specific actors within the network

Many network analyses focus on the flow of “information” across the network, and that becomes a recurring theme. In the next few slides, let’s look at network properties and consider their “information exchange” features. For example, the ability to propagate information across the network is related both to its size and its density. The larger the network is, the more connections are needed to effectively propagate information, which is why we explore its size and its density.
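As a rough illustration of the size and density definitions, the following Python sketch computes density for a directed network as links present divided by the k*(k-1) possible ordered pairs. The function name and the small edge list are assumptions made for the example.

```python
def density(nodes, links):
    """Ratio of directed links present to the k*(k-1) possible ordered pairs."""
    k = len(nodes)
    possible = k * (k - 1)
    if possible == 0:
        return 0.0  # a network with fewer than two actors has no possible links
    # Use a set so that a link listed twice is only counted once.
    return len(set(links)) / possible

nodes = ["Ken", "Dave", "Jill", "Bob"]
links = [("Ken", "Dave"), ("Ken", "Jill"), ("Dave", "Bob")]
print(density(nodes, links))  # 3 links out of 12 possible ordered pairs
```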

Degree

� The number of edges/links attached to the actor

� The upper limit on number of connections each actor may have is (k-1)

� In-degree is the number of directed links into the node

� The out-degree is the number of directed links out of the node

� High in-degree indicates a “sink”

� High out-degree indicates a “source”

The degree of a node describes its level of connectedness in the network. Here we distinguish between undirected and directed graphs. In undirected graphs, we just look at the degree, but in directed graphs, which have arcs that travel from one node to another, we discuss in-degree, which is the number of directed arcs into a node, and out-degree, which is the number of directed arcs that leave a node.
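In-degree and out-degree can be tallied directly from a list of directed links. This is a minimal sketch; the function name and the sample links are illustrative assumptions.

```python
from collections import Counter

def in_out_degree(links):
    """Return (in_degree, out_degree) counters for a list of (source, target) links."""
    out_degree = Counter(source for source, _ in links)
    in_degree = Counter(target for _, target in links)
    return in_degree, out_degree

links = [("Ken", "Dave"), ("Ken", "Jill"), ("Dave", "Bob")]
in_degree, out_degree = in_out_degree(links)
# Ken only sends links (a "source"); Bob only receives one (a "sink").
print(out_degree["Ken"], in_degree["Ken"], in_degree["Bob"])
```

A `Counter` returns 0 for actors that never appear on one side of a link, which keeps the lookup simple.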

Reachability and Point Connectivity

� Reachability:

� An actor is reachable by another if there are connections that can trace between them

� Provides insight into communication capability, robustness

� Connectivity:

� The number of nodes that have to be removed so that one actor could no longer reach another

� Provides insight into “redundancy” and robustness
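Both ideas can be sketched in a few lines of Python. Reachability is a breadth-first search, and point connectivity is computed here by brute force (trying ever larger sets of removed nodes), which is only practical for small graphs. The graph and function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def reachable(adj, source, target, removed=frozenset()):
    """Breadth-first search: can source reach target while avoiding removed nodes?"""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen and nxt not in removed:
                seen.add(nxt)
                queue.append(nxt)
    return False

def point_connectivity(adj, source, target):
    """Fewest other nodes whose removal cuts source off from target.

    Returns None when no such set exists (the two actors are directly linked).
    """
    others = [n for n in adj if n not in (source, target)]
    for size in range(len(others) + 1):
        for removed in combinations(others, size):
            if not reachable(adj, source, target, frozenset(removed)):
                return size
    return None

# Hypothetical undirected network: A reaches D through either B or C.
adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
       "D": ["B", "C", "E"], "E": ["D"]}
print(reachable(adj, "A", "E"))
print(point_connectivity(adj, "A", "D"), point_connectivity(adj, "A", "E"))
```

The redundant routes from A to D mean two nodes must be removed to separate them, while A and E depend on the single node D.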

Distance

� How far is it between nodes?

� Conceptually, the number of nodes that must be traversed to establish the connection between two actors

� Walk: A sequence that shows a traversal in the graph between two actors (ABC, ABDC, ABEBC)

� Cycle: A closed walk of distinct actors except for the originating node, which is also the destination (BCDB)

� Path: A walk in which each actor and each relation in the graph may be used at most one time

There may be different walks of different lengths that connect two actors

Distance can be assessed based on the characteristics of different kinds of walks:

The total number of walks of a particular length between any two actors

Distances can be scaled based on the size of the network

To assess lengths in instances where links have value, sum the values along the shortest path, or take the minimum of the sums across all paths of size n or smaller
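The number of walks of a particular length between two actors can be read from powers of the adjacency matrix. The sketch below assumes an undirected graph with links A-B, B-C, B-D, C-D, and B-E; that link set is a guess that is merely consistent with the example walks on the slide (ABC, ABDC, ABEBC), since the original figure is not available.

```python
def count_walks(adj_matrix, length):
    """Entry [i][j] of the result counts walks of exactly `length` links from i to j."""
    n = len(adj_matrix)
    # Start from the identity matrix, which represents walks of length 0.
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(length):
        # Multiply by the adjacency matrix to extend every walk by one link.
        result = [
            [sum(result[i][m] * adj_matrix[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return result

# Rows and columns are ordered A, B, C, D, E (assumed link set, see above).
M = [
    [0, 1, 0, 0, 0],  # A
    [1, 0, 1, 1, 1],  # B
    [0, 1, 0, 1, 0],  # C
    [0, 1, 1, 0, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
A, C = 0, 2
for length in (2, 3, 4):
    print(length, count_walks(M, length)[A][C])
```

Walks may revisit actors, which is why the count rises at length 4 even though there are only two simple paths from A to C.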

Geodesic Distance

� Geodesic distance is the number of relations in the shortest walk from one actor to another

� Flow: similar to bandwidth capacity – how many different paths are there that connect two actors

� Diameter: Largest geodesic distance in the graph

Geodesic distance is used to characterize the most efficient or optimal connection between two actors. In any network, two actors may be connected via numerous paths, but depending on the nature of the connectivity, it might be assumed that communication between any pair of actors would be performed along the shortest path.

Dense networks have mostly short geodesic distances. The largest geodesic distance for each actor is called its “eccentricity.” In graphs that are not completely connected,
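Geodesic distance, eccentricity, and diameter can be sketched with a breadth-first search over an unweighted graph. The seven-node line graph (A through G) is an illustrative assumption, and this sketch only measures distances to actors that are reachable from the starting actor.

```python
from collections import deque

def geodesic_distances(adj, source):
    """Shortest number of links from source to every actor it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def eccentricity(adj, node):
    """Largest geodesic distance from node to any actor it can reach."""
    return max(geodesic_distances(adj, node).values())

def diameter(adj):
    """Largest geodesic distance found anywhere in the graph."""
    return max(eccentricity(adj, node) for node in adj)

# Hypothetical line graph A-B-C-D-E-F-G.
names = "ABCDEFG"
line = {name: [] for name in names}
for a, b in zip(names, names[1:]):
    line[a].append(b)
    line[b].append(a)

print(geodesic_distances(line, "A")["G"], eccentricity(line, "D"), diameter(line))
```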

A Simple Network

[Figure: network diagram of eleven actors: David, Bob, Helen, Joe, Rene, Kate, Jack, Len, Ted, Jill, Frank]

Univariate Statistics and Inferencing

� An examination of the statistics across rows (i.e., “sources”) or columns (“sinks”)

Mean: proportion of the remaining nodes to which this node sends a link

Sum: number of links to remaining nodes

Variance: How variable or “predictable” the actor is with respect to others

The statistics here review the differences between the “roles” each node takes on as sources of information. In this example, there are some significant sources – 2, 3, 5, and 8 are similar in terms of their role as information providers. Reviewing these statistics allows us to make some inferences about the population. For example:

Actors that have high out-degree may be “communicators” or “influencers”

Actors with low out-degree are less likely to be influencers unless they are connected to the “right” other actors

Actors with high in-degree may be “powerful” in terms of information gathering

Actors with high in- and out-degree may be “facilitators” – they receive information and pass it along

Actors with high out-degree and low in-degree may be “wannabes” or “outsiders”
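The row and column statistics can be sketched over a binary adjacency matrix, where row i holds the links actor i sends and column j holds the links actor j receives. The 4x4 matrix below is invented for illustration and is not the example from the slide.

```python
from statistics import pvariance

def row_statistics(matrix):
    """Sum, mean, and variance of each actor's outgoing links (diagonal ignored)."""
    stats = []
    for i, row in enumerate(matrix):
        others = [value for j, value in enumerate(row) if j != i]
        total = sum(others)
        stats.append({
            "sum": total,                     # number of links sent (out-degree)
            "mean": total / len(others),      # proportion of remaining nodes linked
            "variance": pvariance(others),    # how uneven the actor's ties are
        })
    return stats

def column_statistics(matrix):
    """Same statistics for incoming links, computed on the transposed matrix."""
    transposed = [list(column) for column in zip(*matrix)]
    return row_statistics(transposed)

# Hypothetical directed network of four actors.
matrix = [
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
]
rows = row_statistics(matrix)
cols = column_statistics(matrix)
print(rows[0], cols[2])
```

In this made-up network actor 0 is a strong source (it sends to everyone else) and actor 2 is a strong sink (everyone else sends to it).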

Centrality

� The concept of centrality is bound to the concept of “power” or “influence” within the network

� The way that actors are embedded within networks imposes constraints on, or offers opportunities for the actor and the network

� Provides qualification of favorable position, control, essentialness within the network

� Three types of measures:

� Degree

� Closeness

� Betweenness

Examples

[Figure: three example graphs labeled Star, Circle, and Line]

These three graphs demonstrate different kinds of centrality characteristics. In the star graph, node A has a high measure of centrality, but in the Circle graph no node has any greater centrality than any other. In the line graph, there are differing approaches to looking at centrality. In one, the edge nodes (A and G) have less centrality than the others, but a different approach may incorporate position in the line into the centrality measures also (e.g., D is more central than C or E, etc.).

Degree Centrality

� Degree centrality is based on the number of links in and out of each node

� Actors with high in-degree are “prominent”

� Actors with high out-degree are “influential”

� Normalized degrees are based on percentage of remaining actors

Closeness and Betweenness

� Closeness measures how close each node is to the others in the network

� Betweenness characterizes the degree to which any individual node exists between other nodes

The geodesic distance enables actors in favorable positions to communicate faster. Closeness examines the shortest distances between a node and each of the other nodes. For example, in the star network, the A node has a higher degree of closeness to the other nodes than any other one. In the circle network, the closeness measure is essentially identical for each of the nodes. However, in the line network, the node in the center of the line is “closer” in total to the other nodes than the others.

Betweenness is a measure of how a node in the network lies on the critical paths between other nodes. For example, in the star network, node A has a high level of betweenness, since it is on the path between every other pair of nodes.
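To make the star, circle, and line comparison concrete, here is a minimal Python sketch of closeness and betweenness for small, connected, undirected graphs. It assumes seven nodes labeled A through G in each graph, uses one common normalization of closeness ((n-1) divided by the sum of geodesic distances), and counts betweenness by brute force over node pairs; none of these choices is prescribed by the slides.

```python
from collections import deque

def build_graph(edges):
    """Build an undirected adjacency mapping from a list of (a, b) links."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def shortest_paths(adj, source):
    """Return (distance, number of shortest paths) from source to reachable nodes."""
    dist, count = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                count[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                count[nxt] += count[node]  # node is a predecessor on a shortest path
    return dist, count

def closeness(adj, node):
    """(Reachable nodes) / (sum of geodesic distances); higher means closer."""
    dist, _ = shortest_paths(adj, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def betweenness(adj, node):
    """Sum over pairs (s, t) of the share of shortest s-t paths that pass through node."""
    dist_v, count_v = shortest_paths(adj, node)
    others = [n for n in adj if n != node]
    score = 0.0
    for i, s in enumerate(others):
        dist_s, count_s = shortest_paths(adj, s)
        for t in others[i + 1:]:
            if node not in dist_s or t not in dist_s:
                continue  # s cannot reach node or t
            if dist_s[node] + dist_v[t] == dist_s[t]:
                score += count_s[node] * count_v[t] / count_s[t]
    return score

names = "ABCDEFG"
star = build_graph([("A", n) for n in names[1:]])
circle = build_graph(list(zip(names, names[1:] + names[0])))
line = build_graph(list(zip(names, names[1:])))

print(closeness(star, "A"), betweenness(star, "A"))
print(closeness(line, "D"), betweenness(line, "D"))
```

In the star, A sits on the only route between every pair of leaves; in the circle every node gets the same closeness; in the line, D scores highest on both measures.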

Grouping and Graph Basics

� The structure of connectivity in which an actor is embedded within some structure, which is then embedded within a larger structure, etc.

� Components

� Clique

� Blocks and Cutpoints

� Lots of other stuff!!!

Two vertices are in a connected component if there exists a path between them.

Connected components can be drawn as subgraphs with empty space between them.

A clique is a subgraph where every node is connected to every other node.

Blocks are locations in the graph that would become disjoint if a node or link were removed. That removed link or node is called a cutpoint.

Link and Network Analysis

Structural Analysis

� Evaluating substructures that exist and interact within the network

� Dyads and Triads compose many graphs

� Looking at position and structure within graph provides insight into social structure and embeddedness

� Dyads and Triads form into larger graph substructures

� Evaluating overlapping membership in graph substructures exposes “influential” entities within the network

Cliques

� A clique is a maximally connected subgraph

� In a clique, each node is connected to every other node

Cliques indicate a close relationship among the members, and often exist within communities based on entity profile similarities. For example, individuals with the same racial, ethnic, or religious identities may form smaller, tighter group relationships.

The smallest cliques are dyads, followed by triads. These building blocks may be combined into larger cliques as well.

By looking at the types of relationships between nodes in the cliques, you can see behavior patterns emerge as well. For example, in a triad, the link between two of the members may be much stronger than between either of those two and the third member.

Another interesting aspect is to evaluate overlapping membership. Actors that are members of multiple cliques may have greater influence, while sets of actors that share clique memberships may be particularly close also.
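One way to explore cliques and overlapping membership is to enumerate the maximal cliques and count how many each actor belongs to. The sketch below uses the basic Bron-Kerbosch algorithm (without pivoting), a standard technique that the slides do not name; the six-actor graph is invented for illustration.

```python
from collections import Counter

def maximal_cliques(adj):
    """Enumerate maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))  # nothing can extend this clique
            return
        for node in list(candidates):
            expand(current | {node}, candidates & adj[node], excluded & adj[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adj), set())
    return cliques

# Hypothetical network: two triads sharing actor C, plus a dyad E-F.
edges = [("A", "B"), ("A", "C"), ("B", "C"),
         ("C", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

cliques = maximal_cliques(adj)
# Actors that appear in several cliques are candidates for greater influence.
membership = Counter(node for clique in cliques for node in clique)
print(sorted(sorted(c) for c in cliques))
print(membership.most_common(2))
```

Here C and E each belong to two cliques, so they bridge otherwise separate groups.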

Example

� In our Knoke example, the nodes for COMM and MAYOR share 5 clique memberships

N-Clique and N-Clans

� The definition of a clique is very restrictive

� There are subgraph relationships that relax the clique requirements

� An N-clique is a subgraph where every node is connected to every other node at a distance of N or less

� N-cliques may be connected by nodes that are not members of the clique, suggesting a stricter definition for an n-clan

� An n-clan has the additional constraint that the connections must be made through other members of the n-clique

For n-cliques, the most frequent value used is 2. This is conceptually equivalent to “a friend of a friend.”

There are other approaches to modifying the constraints associated with clique-style connectivity, such as:

K-plex, which defines the clique if every node has a connection to all but K out of N nodes

K-core, which defines the clique if every node is connected to at least K out of N nodes

Components

� Components are subgraphs that are connected within the network, but are disjoint from other subgraphs
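Components can be found by repeatedly picking an unvisited actor and collecting everything reachable from it. A minimal sketch for an undirected graph, with a made-up six-actor network:

```python
def connected_components(adj):
    """Return the connected components of an undirected graph as a list of sets."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        # Depth-first traversal from an actor not yet assigned to a component.
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components

# Hypothetical network with three components, one of them an isolated actor.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"},
       "D": {"E"}, "E": {"D"}, "F": set()}
print(sorted(sorted(c) for c in connected_components(adj)))
```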

Blocks and Cutpoints

� If a node or a link were removed, how would that affect the connectivity structure?

� A node that, when it is removed subdivides the graph into components is called a “cutpoint”

� Those divisions are called “blocks”

By removing a node in the graph and bisecting the network into blocks, we identify nodes that are key to the communication or information exchange patterns.
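A direct, if inefficient, way to find cutpoints is to remove each node in turn and check whether the number of components goes up. The sketch below takes that brute-force approach on a hypothetical graph of two triangles joined by a single link; faster algorithms exist for large networks.

```python
def count_components(adj, skip=None):
    """Count connected components, ignoring the node `skip` if one is given."""
    seen = set() if skip is None else {skip}
    count = 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return count

def cutpoints(adj):
    """Nodes whose removal splits the graph into more components."""
    baseline = count_components(adj)
    return sorted(n for n in adj if count_components(adj, skip=n) > baseline)

# Hypothetical network: triangle A-B-C and triangle D-E-F joined by the link C-D.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
       "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"}}
print(cutpoints(adj))
```

Only C and D hold the two triangles together, so they are the actors whose loss would interrupt information exchange.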

Summary: Link Analysis

� Explores the relationships between objects, and how objects are linked

� Identify nodes that are “key” within a network

� Determine which links are critical to the operations of the network

� Assess the existence of relevant sub-networks

� Evaluate “spheres of influence”

� Clustering entities into communities of interest

� Geographic

� Intellectual

� Psychographic

Example – Collaborative Filtering

Customers who bought this item also bought:

Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications by Larissa T. Moss

Performance Dashboards: Measuring, Monitoring, and Managing Your Business by Wayne W. Eckerson

Enterprise Dashboards: Design and Best Practices for IT by Shadan Malik

The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition) by Ralph Kimball

Business Intelligence for the Enterprise by Mike Biere

Business Intelligence: The Savvy Manager's Guide (The Savvy Manager's Guides) (Paperback) by David Loshin

"Imagine that you are the sales manager for a large retail organization and that you were able, within some probability, to predict how much money..."

Key Phrases: productivity analytics, business rules system, business rules approach, Postal Service, Data Warehousing Institute, United States

(5 customer reviews)
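A "customers who bought this item also bought" list can be approximated by linking items that appear in the same purchase and ranking the links by weight. The sketch below is a simple item co-occurrence count, not a description of any retailer's actual method, and the purchase histories are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def also_bought(baskets):
    """Map each item to a Counter of the items purchased together with it."""
    co_counts = {}
    for basket in baskets:
        # Each pair of distinct items in a basket strengthens the link between them.
        for a, b in combinations(sorted(set(basket)), 2):
            co_counts.setdefault(a, Counter())[b] += 1
            co_counts.setdefault(b, Counter())[a] += 1
    return co_counts

def recommend(co_counts, item, top_n=3):
    """Return the items most often bought together with `item`."""
    return [other for other, _ in co_counts.get(item, Counter()).most_common(top_n)]

# Invented purchase histories (titles shortened for readability).
baskets = [
    ["Savvy Manager's Guide", "Data Warehouse Toolkit", "Performance Dashboards"],
    ["Savvy Manager's Guide", "Data Warehouse Toolkit"],
    ["Savvy Manager's Guide", "Data Warehouse Toolkit", "BI Roadmap"],
    ["Savvy Manager's Guide", "Performance Dashboards"],
    ["Data Warehouse Toolkit", "BI Roadmap"],
]
co_counts = also_bought(baskets)
print(recommend(co_counts, "Savvy Manager's Guide", top_n=2))
```

In network terms, the items are nodes, each co-purchase is a weighted link, and the recommendation is the set of strongest neighbors.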

Example – Organizational Behavior

� Evaluation of communication patterns between individuals within an organization

� Attribution of actors with demographic data (age, seniority, gender, department, education, etc.)

� Assess:

� Is work being performed within or outside of the formal structure of the organization?

� Are there self-organized “invisible barriers” emerging based on individual characteristics?

� Are there informal communities of interest that may seed innovation across division boundaries?

� How effective are the different communication channels?

Example - Terrorist Network

� Connectivity map by Valdis Krebs, based on data available from news sources on the world wide web

Other Examples/Resources

� Analyzing degree, centrality to assess risk of infection in a population (http://intl-aje.oxfordjournals.org/cgi/content/abstract/162/10/1024)

� Integrating raw data from multiple sources, including phone records, bank transactions, surveillance reports, vehicle sales to create a criminal network analysis program (http://ai.bpa.arizona.edu/COPLINK/publications/crimenet/Xu_CACM.doc)

� Enabling targeted marketing within self-organizing networks (www.linkedin.com, www.myspace.com)

� Krebs’ article on Identifying Terrorist Networks (http://firstmonday.org/issues/issue7_4/krebs/)

� Corporate board members and influencers (http://www.theyrule.net/)

Issues and Considerations for Business Intelligence

Challenges

� Establishing a business justification

� Network Models and Metadata

� Data warehouse data extraction and transformation

� Semantic Analysis, Entity Extraction, and Establishing Linkage

� Graph Management

Business Justification

� Are all of these examples proving known concepts after the fact?

� Where can this technique add value to existing activities?

� What are the costs related to implementation and socialization?

One of the criticisms of network and link analysis is the belief that the analysis exposes interesting notions that are actually already known. In evaluating the terrorist data, for example, are we determining (after the fact) that certain parties were connected, even though that fact was already known? Alternatively, one might ask the same question a different way: Today, would this kind of analysis contribute to the determination of suspicious group behavior that would warrant action?

At a more concrete level, we can evaluate the types of applications to which SNA is applied and review what benefits are expected out of the process and how those benefits are measured. For example, consider Amazon’s use of collaborative filtering – the cost is not significant, but there is a perception of high value, especially as it facilitates self-organized predictive modeling.

Data Extraction and Transformation

� Linearized or rectangular data is not necessarily suitable for network or link analysis

� Synchronization of warehouse data with network data

� Challenge:

� Develop models for representation of network data

� Provide services for transformation into and out of graph structures

� Provide front-end applications for visualization

Networks and Data Models

� SNA applications require data configured to represent nodes, links, and associated link weights

� Example:

� Nodes: Ken Dave Jill Bob

� Arcs:

� Ken Dave 10

� Ken Jill 1

� Dave Bob 6

Different applications expect the input data representing the network to be configured in a way that can be parsed easily and configured into the tool. In the example here, the node names are enumerated, followed by an enumeration of arcs consisting of the source node, the target node, and the weight of the link. In turn, the applications will maintain an internal representation of the network (perhaps in an adjacency matrix), as well as maintain additional metadata related to actions taken by the analyst, which means that the internal representation of the network may be modified and written out during the analysis.

The challenge is to create the transformation mechanisms to extract rectangular data from traditional data sets, organize the data into the appropriate network representation, and transform the output of the SNA application back into rectangular format for use in traditional BI activities. This requires understanding the SNA application’s model as well as how the connectivity is represented within the source data sets.
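The round trip described above can be sketched as three small functions: parse a node-and-arc listing, build an adjacency matrix, and flatten the matrix back into rectangular (source, target, weight) rows. The text layout with `Nodes:` and `Arcs:` headers is an assumption modeled on the slide and is not the input format of any particular SNA tool.

```python
def parse_network(text):
    """Parse a 'Nodes:' line and an 'Arcs:' section into lists of nodes and arcs."""
    nodes, arcs = [], []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.lower().startswith("nodes:"):
            nodes = line.split(":", 1)[1].split()
            section = "nodes"
        elif line.lower().startswith("arcs:"):
            section = "arcs"
        elif section == "arcs":
            source, target, weight = line.split()
            arcs.append((source, target, float(weight)))
    return nodes, arcs

def to_adjacency_matrix(nodes, arcs):
    """Build a weighted adjacency matrix; rows are sources, columns are targets."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for source, target, weight in arcs:
        matrix[index[source]][index[target]] = weight
    return matrix

def to_rows(nodes, matrix):
    """Flatten the matrix back into rectangular (source, target, weight) rows."""
    return [(nodes[i], nodes[j], weight)
            for i, row in enumerate(matrix)
            for j, weight in enumerate(row) if weight]

text = """
Nodes: Ken Dave Jill Bob
Arcs:
Ken Dave 10
Ken Jill 1
Dave Bob 6
"""
nodes, arcs = parse_network(text)
matrix = to_adjacency_matrix(nodes, arcs)
print(to_rows(nodes, matrix))
```

The rows produced by `to_rows` can be loaded into an ordinary table, which is the form a traditional BI tool expects.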

Social Network Metadata

� Actors

� Label or name

� Identifier

� Actor Demographics

� Links

� Relational characteristic

� Connectivity measure

� Weight/value

� Other issues to consider:

� “Link distance”

� Positioning

� Graph structure

The network embeds relationships, but the nature of how those relationships correspond to the modeled objects must be maintained. This means that the metadata associated with each of the objects should be captured, especially if it can be visualized along with the network itself. For example, in the Knoke network, the shapes and colors used indicated different aspects of the modeled actors. The same is true for the links – different types may be represented using different line types, while the weights may be indicated using line thickness or even distance between the nodes. Because visualization is critical, the positioning of nodes on the “template” of the map may be relevant, as well as the shape that the graph should take (e.g., circle, hub and spokes, random node placement, etc.).

Eventually, most analyses will need to be reintegrated with the original source data sets; identifiers need to be assigned to each node, while these identifiers also need to be linked back to the original data.

Semantic Analysis and Entity Extraction

� Semi-structured and unstructured text contain references to entities and their relationships

� Challenge:

� Provide ability to identify entities within a document

� Characterize entity types based on context

� Establish connectivity on more than a naïve basis

� Transform extracted information into format suitable for integration and analysis

Entity Extraction

� Recall our earlier example:

Richard J. Palaima, Of Mattapan, Suddenly, Nov. 18, 2002. Beloved husband of Jonadee (Badayos) Palaima. Devoted son of Madelyn L. (George) Palaima of Braintree, and the late Richard A. Palaima. Devoted brother of John A. Palaima of Braintree, nephew of Catherine Cunningham of Rockland, Cousin & Godson of Robert Cunningham of Rockland. Funeral from the Mortimer N. Peck-Russell Peck Funeral Home, 516 Washington St., BRAINTREE, on Saturday at 9 a.m. Funeral Mass in St. Gregory's Church, Dorchester, at 10 a.m. Relatives and friends invited. Visiting hours Friday 2-4 & 7-9 p.m. Memorial donations may be sent to St. Gregory's Church, 2215 Dorchester Ave., Dorchester 02124.

� Text mining applications can identify names, locations, roles, and organizations within semi- and unstructured data

� Often the relationship may be as simple as “appearing in the same context”

One approach for evaluating connections is to analyze text, extract the entities and related metadata (locations, roles, etc.), insert that data into a database, then extract it for the purposes of network analysis. By analyzing many similar document corpuses (e.g., public records, filings, directories, news articles) and inserting them into a data set, linkages can emerge from the data. For a good example, consult www.theyrule.net.
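When the relationship is as simple as "appearing in the same context," links can be built by counting how often two entities show up in the same document. The sketch below assumes the entity extraction step has already happened and starts from lists of entity names per document; the document identifiers and entity labels are placeholders.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_links(documents):
    """Weight each pair of entities by the number of documents mentioning both.

    `documents` maps a document id to the list of entity names extracted from it.
    """
    weights = Counter()
    for entities in documents.values():
        # Sort so that each pair is stored under a single, consistent key.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights

# Placeholder output of a hypothetical entity extraction step.
documents = {
    "doc1": ["Person A", "Person B", "Org X"],
    "doc2": ["Person A", "Org X"],
    "doc3": ["Person B", "Person C"],
}
weights = cooccurrence_links(documents)
for (a, b), weight in weights.most_common():
    print(a, b, weight)
```

Each (entity, entity, weight) triple is already in the node and arc form that network analysis tools expect.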

Establishing Linkage

� Quality of data used to refer to entities may be variant and suspect

� Challenge:

� Effectively characterize how attribution contributes to similarity scoring

� Exploit data quality tools for standardization and linkage from different kinds of source data

Data cleansing, matching, and linkage tools have been used for many years for identifying duplicates among data records, as well as using similarity scoring mechanisms for householding and establishing hierarchies. The same techniques can be used to examine attributes associated with each record, parse and standardize the data, then apply the similarity scoring and linkage capabilities to the candidate entities.
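The parse, standardize, and score sequence can be illustrated with a deliberately simple scheme: names are lowercased, stripped of punctuation, and split into tokens, and two names are scored by the Jaccard similarity of their token sets. Jaccard scoring and the 0.6 threshold are stand-ins chosen for the sketch; commercial data quality tools use richer parsing and matching rules. The reordered variant of the name is invented for the example.

```python
import re

def standardize(name):
    """Lowercase a name, strip punctuation, and return its set of tokens."""
    return frozenset(re.sub(r"[^a-z0-9\s]", " ", name.lower()).split())

def similarity(name_a, name_b):
    """Jaccard similarity of the standardized token sets, from 0.0 to 1.0."""
    a, b = standardize(name_a), standardize(name_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def link_records(names, threshold=0.6):
    """Return pairs of names whose similarity meets the threshold."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = similarity(first, second)
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs

names = ["Richard J. Palaima", "PALAIMA, Richard J", "Richard A. Palaima"]
print(link_records(names))
```

The first two strings standardize to the same tokens and are linked, while "Richard A. Palaima" differs in the middle initial, scores 0.5, and stays separate, which matters here because the earlier example names two different people.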

Graph Management

� Graph data structures are different from relational databases

� Graphs do not inherently provide persistence

� Large networks take up large amounts of memory space

� Issues:

� Managing large graphs in a performance-efficient manner

� Manipulating graph models in real time

� Persistence of graph models

� Drawing conclusions about what is demonstrated in the graph

Some of the interesting challenges that need to be addressed include:

Providing a usable graph management utility that provides reasonable performance

Providing persistence for graphs

Enabling transformation into and out of graph format

Interesting Resources

� Books

� “Linked: How Everything Is Connected to Everything Else and What It Means,” Albert-Laszlo Barabasi

� “The Tipping Point,” Malcolm Gladwell

� Web sites

� http://www.insna.org/

� http://www.ire.org/sna/

� http://faculty.ucr.edu/~hanneman/nettext/

� Software

� The R project http://www.r-project.org/

� SNA tools for R http://erzuli.ss.uci.edu/R.stuff/

� UCINET trial download http://www.analytictech.com/downloaduc6.htm

� Pajek http://vlado.fmf.uni-lj.si/pub/networks/pajek/

Questions?

� If you have questions, comments, or suggestions, please contact me

David Loshin

301-754-6350

[email protected]