
6

Graph-based Proximity Measures

Kevin A. Wilson, Nathan D. Green, Laxmikant Agrawal, Xibin Gao, Dinesh Madhusoodanan, Brian Riley, and James P. Sigmon

North Carolina State University

CONTENTS

6.1 Defining the Proximity of Vertices in Graphs
6.2 Shared Nearest Neighbor
    6.2.1 Proximity Graphs
    6.2.2 Computing the Similarity of Web Pages
    6.2.3 Shared Nearest Neighbor Definition
    6.2.4 Strengths and Weaknesses
6.3 Evaluating Relatedness Using Neumann Kernels
    6.3.1 Calculating Co-citation and Bibliographic Coupling
    6.3.2 Document and Term Correlation
    6.3.3 Neumann Kernels in Practice
    6.3.4 Derivation of Neumann Kernels
    6.3.5 Choosing the Decay Factor
    6.3.6 Relationships among Neumann Kernels, HITS, and Co-citation/Bibliographic Coupling
    6.3.7 Strengths and Weaknesses
6.4 Applications
    6.4.1 Co-citation Analysis with Shared Nearest Neighbor
6.5 Bibliographic Notes
    6.5.1 Inter-graph Proximity
    6.5.2 In-graph Proximity
    6.5.3 Node Importance
6.6 Exercises
Bibliography

6.1 Defining the Proximity of Vertices in Graphs

In order to apply several different graph-based data mining techniques like classification and clustering (discussed further in Chapter 9, Classification, and Chapter 8, Cluster Analysis, respectively), it is necessary to first define proximity measures between vertices in a single graph. This chapter focuses on within-graph proximity measures that will be used in the application of several data mining techniques later in the book. We introduce two different algorithms for measuring proximity between vertices in a graph:

1. Shared Nearest Neighbor (SNN). Shared Nearest Neighbor defines proximity, or similarity, between two vertices in terms of the number of neighbors (i.e., directly connected vertices) they have in common. SNN is described in detail in Section 6.2.3.

2. The Neumann Kernel. The Neumann Kernel is a generalization of HITS (described in Chapter 5) that models the relatedness of vertices within a graph based on the number of immediate and more distant connections between vertices. It uses a tunable parameter to control how much weight is given to the more distant connections, which can produce results based entirely on the immediate neighbors or results that take the full structure of the graph into consideration, like HITS. Section 6.3 discusses Neumann Kernels.

Before we proceed, it is important to define some common terminology and notation. Table 6.1 defines the key terms in this chapter with which the reader should be familiar. To ensure consistency, Table 6.2 defines the notation that will be used throughout the chapter.


TABLE 6.1: Glossary of terms used in this chapter

Citation graph: A representation of the citation relationships (cited references and citing articles) between objects (e.g., articles, documents, or papers).

Co-citation matrix: A matrix that connects vertices cited by the same source.

Bibliographic coupling matrix: A matrix that connects vertices with one or more shared bibliographic references.

Diffusion factor: A tunable parameter that controls the balance between relatedness and importance.

Term-document matrix: A matrix in which the rows represent terms, the columns represent documents, and the entries represent a function of their relationship (e.g., the frequency of a given term in a document).

Document correlation matrix: A matrix in which the rows and the columns represent documents, and the entries represent the semantic similarity between two documents.

Term correlation matrix: A matrix in which the rows and the columns represent terms, and the entries represent the semantic similarity between two terms.

Neumann Kernel: A specific kernel function applied to computing term-document similarity.


TABLE 6.2: Description of notation used in this chapter

General notation:
G         A directed, unweighted, simple graph
V(G)      Vertex set of graph G
E(G)      Edge set of graph G
|V(G)|    Number of vertices in graph G
A         Adjacency matrix derived from graph G
A^-1      Inverse of matrix A
I         Identity matrix
|δ+(v)|   Outdegree of vertex v
|δ−(v)|   Indegree of vertex v
∆+(G)     Maximum outdegree of graph G
∆−(G)     Maximum indegree of graph G

HITS notation:
x         Column vector containing the authority weights
y         Column vector containing the hub weights
I         Operation that updates the set of authority weights (x)
O         Operation that updates the set of hub weights (y)
k         Number of iterations to run the algorithm
x(i)      Authority weight vector after the ith iteration
y(i)      Hub weight vector after the ith iteration
x(0)      Initial authority weight vector
y(0)      Initial hub weight vector
xj        Authority weight for vertex j
yj        Hub weight for vertex j

Neumann Kernel notation:
X         Term-document matrix
K         Document correlation matrix
T         Term correlation matrix
C         Citation matrix
B         Bibliographic coupling matrix
P         Co-citation coupling matrix
λm        Maximum eigenvalue of a matrix
γ         Diffusion/decay factor

SNN notation:
k         Number of nearest neighbors to include in the proximity calculation


6.2 Shared Nearest Neighbor

In Shared Nearest Neighbor (SNN), vertices are characterized by their neighbors. The underlying assumption is that if two vertices have several neighbors in common, then they are more likely to be similar. Figure 6.1 provides a high-level overview of the SNN algorithm.

This section begins with an introduction to proximity graphs, illustrated through an example of similar web pages. We then discuss the concept of nearest neighbors, a proximity measure based on the number of neighbors of a given vertex. Finally, we discuss the SNN proximity measure in detail.

6.2.1 Proximity Graphs

A proximity, or similarity, graph on a set of objects is a graph obtained by connecting two objects in the set by an edge if the two objects are similar to each other in some sense. Proximity measures are used to determine the extent to which two vertices are connected or belong to a common group. The specific definition of proximity depends on the application for which it is used. For example, in a co-authorship network, authors might be connected if they publish in the same field or in the same journals. In other cases, proximity can estimate the likelihood that a missing link will exist in the future, as we discussed in Chapter 5, Link Analysis. For instance, on social networking sites, the missing link might represent the communities a user would like to join. Proximity measures can also be used to track the propagation of a product or an idea within a social network setting.

FIGURE 6.1: A high-level overview of the SNN algorithm described in detail in Algorithm 6: from the adjacency matrix for a graph, compute the proximity matrix, generate the KNN graph, and apply the parameter k to generate the SNN graph.

Consider the graphs shown in Figures 6.2 and 6.3. If we define two vertices to be similar if they are connected by an edge, then vertex s1 in Figure 6.2 is similar to a1, and vertex a1 is similar to t1. If we define two vertices to be similar if they share common neighbors, then vertices s2 and t2 in Figure 6.3 can be considered similar because they share a2 and a3 as common neighbors. This definition of proximity forms the basis of the SNN proximity measure. Though vertices s2 and t2 are considered similar for sharing two neighbors, we can generalize this concept so that only vertices that share at least k neighbors are considered similar, for some integer k.

FIGURE 6.2: Example proximity graph (the path s1–a1–t1) in which vertices are similar only if they are connected by an edge.

FIGURE 6.3: Proximity graph to show proximity based on vertices sharing neighbors (s2 and t2 share the neighbors a2 and a3).

Before discussing SNN similarity, it is important to understand the concept of the k-Nearest Neighbor (KNN) graph. The KNN graph is a proximity graph created by connecting each vertex in an edge-weighted graph to its k nearest vertices. The result is a graph in which each vertex v ∈ V(G) is linked with k of its nearest neighbors, where k is between 0 and |V(G)|.

A naïve approach for constructing the KNN graph would compare every pair of vertices in the graph and run in O(|V(G)|^2) time; however, the KNN graph can be calculated in O(k |V(G)| log(|V(G)|)) [6]. The KNN graph has applications in cluster analysis and outlier detection.

For example, consider the proximity matrix shown in Table 6.3. This table gives pairwise proximity values among six objects. For a value of k = 2, the corresponding KNN graph appears in Figure 6.4.

In the rest of this chapter, we focus on creating an SNN graph.


TABLE 6.3: An example of a proximity matrix

points  x1   x2   x3   x4   x5   x6
x1      0    1.5  1.6  2.8  1.7  2.1
x2      1.5  0    2    3    4    4.5
x3      1.6  2    0    1.1  1.2  2.6
x4      2.8  3    1.1  0    3    2
x5      1.7  4    1.2  3    0    1.1
x6      2    4.5  2.6  2    1.1  0

FIGURE 6.4: The KNN graph obtained from the proximity matrix in Table 6.3 with k = 2 (vertices x1 through x6).
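As an illustration, the following R sketch builds a KNN graph from the proximity matrix in Table 6.3, treating the entries as distances (smaller values mean closer objects). The symmetrization step reflects one common convention (keep an edge if either endpoint selects the other), so the result may differ in small details from Figure 6.4.

library(igraph)
# Pairwise proximity values from Table 6.3
# (the x1-x6 entries are kept exactly as printed in the table)
D <- matrix(c(0,   1.5, 1.6, 2.8, 1.7, 2.1,
              1.5, 0,   2,   3,   4,   4.5,
              1.6, 2,   0,   1.1, 1.2, 2.6,
              2.8, 3,   1.1, 0,   3,   2,
              1.7, 4,   1.2, 3,   0,   1.1,
              2,   4.5, 2.6, 2,   1.1, 0),
            nrow=6, byrow=TRUE)
k <- 2
A <- matrix(0, 6, 6)
for (i in 1:6) {
  # indices of the k nearest vertices to i (position 1 is i itself)
  nn <- order(D[i, ])[2:(k + 1)]
  A[i, nn] <- 1
}
# keep an undirected edge if either endpoint selected the other
A <- pmax(A, t(A))
G.knn <- graph.adjacency(A, mode="undirected")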

6.2.2 Computing the Similarity of Web Pages

This section introduces an alternative to direct similarity measures [6], Shared Nearest Neighbor. SNN is an indirect approach to similarity based on the principle that if two vertices have at least k neighbors in common, then they can be considered similar to one another, even if a direct edge between them does not exist.

We illustrate the concept of an SNN using the undirected graph of web pages in Figure 6.5. In the figure, the vertices of the graph represent web pages, and the edges represent hyperlinks between the web pages. The adjacency matrix appears in Table 6.4. In R, this graph can be constructed as follows.

# Load the required packages
library(igraph)
library(ProximityMeasure)
# Adjacency matrix for the web page graph
A <- matrix(c(0, 1, 0, 0, 1, 0,
              1, 0, 1, 0, 1, 0,
              0, 1, 0, 1, 0, 0,
              0, 0, 1, 0, 1, 1,
              1, 1, 0, 1, 0, 0,
              0, 0, 0, 1, 0, 0),
            nrow=6, ncol=6, byrow=FALSE)
# Label the vertices with their respective web pages
G <- graph.adjacency(A)
V(G)$label <- c("Wikipedia", "Google", "Bing",
                "Yahoo", "Altavista", "Rediffmail")

FIGURE 6.5: Undirected graph of web page links for the SNN proximity measure.

TABLE 6.4: Adjacency matrix for the undirected proximity graph illustrated in Figure 6.5

            Wiki  Google  Bing  Yahoo  Altavista  Rediff
Wikipedia    0     1       0     0      1          0
Google       1     0       1     0      1          0
Bing         0     1       0     1      0          0
Yahoo        0     0       1     0      1          1
Altavista    1     1       0     1      0          0
Rediffmail   0     0       0     1      0          0

We first load the required packages, igraph and ProximityMeasure, then define the adjacency matrix for the graph. Finally, we label the vertices to match the graph shown in Figure 6.5. We can observe from the graph that the vertices representing Altavista and Bing have two shared neighbors, namely Yahoo and Google, whereas Yahoo and Google share Altavista and Bing as neighbors. We can identify these "shared nearest neighbors" in R by using the SNN_GRAPH function to identify all the vertices that have two shared neighbors:

# Convert the adjacency matrix to an undirected graph
G <- graph.adjacency(A, mode=c("undirected"), weighted=NULL)
# Set the minimum number of shared neighbors for the SNN graph
k <- 2
# Call the SNN function
G2 <- SNN_GRAPH(A, k)
# Plot the SNN graph returned by SNN_GRAPH
tkplot(G2, vertex.label.color="black", vertex.size=29)

Prior to running the algorithm, we convert the adjacency matrix to an R graph and set the minimum number of shared neighbors, k = 2. We then run the SNN algorithm and plot the result, which is shown in Figure 6.6. Thus, the SNN_GRAPH function returns a graph of connected vertices that have at least k shared neighbors, where k is an input parameter that can be changed as required by the application.

In Figure 6.6, the vertices that have at least two common neighbors are connected by an edge. At a value of k = 2, "Altavista" and "Bing" are similar to each other, as are "Google" and "Yahoo." Having worked through an example, we are now in a position to formally define the Shared Nearest Neighbor proximity measure.

6.2.3 Shared Nearest Neighbor Definition

This section formally introduces the Shared Nearest Neighbor graph.


FIGURE 6.6: Graph illustrating the results of the SNN algorithm applied to the graph shown in Figure 6.5, with k = 2.

An SNN graph is a special type of KNN graph in which an edge exists between two vertices vi and vj if there are at least k vertices adjacent to both vi and vj. The formal definition of the SNN graph appears below.

Definition 6.1 Shared Nearest Neighbor graph
Let V = {v1, v2, v3, ..., vn} be the set of nodes in a KNN graph G. In the SNN graph derived from G, the neighbors of vi are the nodes vj, j ≠ i, such that vi and vj have at least k neighbors in common or, equivalently, such that there exist at least k distinct paths of length two between vi and vj in G.

FIGURE 6.7: A KNN graph (the two black vertices i and j are discussed in the text).

In Figure 6.7, each of the two black vertices i and j has eight nearest neighbors, including each other. Four of those nearest neighbors are shared, shown in gray. Thus, the two black vertices are similar when k = 4 because they share four common neighbors. This similarity is indicated by an edge between them in the SNN graph in Figure 6.8.

FIGURE 6.8: An SNN graph derived from the KNN graph of Figure 6.7, where k = 4.

Algorithm 6 illustrates how the SNN graph can be calculated for an unweighted KNN graph G. Since there are three nested loops in the algorithm, the computational complexity of Algorithm 6 is O(n^3). Note that if G is a weighted graph, we can apply this algorithm to the KNN graph of G.

Input: G—an undirected graph
Input: k—a natural number (number of shared neighbors)
Output: G′—Shared Nearest Neighbor graph

1  Initialize the SNN graph G′ with |V(G)| vertices and no edges
2  foreach i = 1 to |V(G)| do
3      foreach j = i + 1 to |V(G)| do
4          counter = 0
5          foreach m = 1 to |V(G)| do
6              if vertex i and vertex j both have an edge with vertex m then
7                  counter = counter + 1
8          end
9          if counter ≥ k then
10             Connect an edge between vertex i and vertex j in G′
11         end
12     end
13 return G′

Algorithm 6: SNN Algorithm
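For concreteness, here is a direct (unoptimized) R translation of Algorithm 6; the function name snn_naive is ours, and the vectorized sum() call stands in for the innermost loop over m. The ProximityMeasure function SNN_GRAPH used earlier performs the same task.

# A sketch of Algorithm 6 in base R; A is a binary adjacency matrix
snn_naive <- function(A, k) {
  n <- nrow(A)
  S <- matrix(0, n, n)        # adjacency matrix of the SNN graph G'
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # count vertices adjacent to both i and j (lines 4-8)
      counter <- sum(A[i, ] == 1 & A[j, ] == 1)
      if (counter >= k) {     # connect i and j in G' (lines 9-11)
        S[i, j] <- 1
        S[j, i] <- 1
      }
    }
  }
  graph.adjacency(S, mode="undirected")
}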

A more efficient way to calculate the SNN graph is to use the fact that the number of common neighbors between two different vertices equals the number of paths of length two between them. If A is the adjacency matrix of graph G, then A^2 counts the number of paths of length two between every pair of vertices, and if we use efficient matrix multiplication, this calculation requires less than O(n^3) time.
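A sketch of this matrix-based approach, under the assumption that A is a binary, symmetric adjacency matrix with a zero diagonal:

A2 <- A %*% A          # entry (i,j) counts length-two paths from i to j
diag(A2) <- 0          # ignore paths from a vertex back to itself
S <- (A2 >= k) * 1     # connect pairs with at least k shared neighbors
G.snn <- graph.adjacency(S, mode="undirected")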

6.2.4 Strengths and Weaknesses

One of the main advantages of SNN is that it considers not only direct associations between vertices but also indirect connections. This ability can be used to find similarities between vertices that are not adjacent.

Another advantage of SNN in the context of clustering (see Chapter 8, Cluster Analysis) is its ability to handle clusters of varying size, shape, and density. One situation that SNN handles well is when two nodes are relatively close but belong to different clusters, as illustrated in Figure 6.9.

FIGURE 6.9: Figure indicating the ability of shared nearest neighbor to form links between nodes in different clusters.

However, SNN does not take into account the weights of the links between the vertices in a graph. For example, SNN could establish a link between two vertices whose common neighbors are only weakly related to them, which may represent poor similarity.


6.3 Evaluating Relatedness Using Neumann Kernels

With SNN, we dealt with undirected but possibly edge-weighted graphs. In this section, we introduce a concept of proximity that works for both directed and undirected graphs. As an example of a directed graph, if nodes correspond to journal articles, then directed edges may correspond to a citation of one article by another.

Ito et al. [10] and Shimbo et al. [16] developed a kernel-based framework for calculating both importance and relatedness. They show that the HITS algorithm (described in Chapter 5, Link Analysis) can be generalized using a Neumann Kernel, resulting in an efficient and powerful mechanism for determining "relatedness" between vertices.

When considering relatedness, we emphasize two techniques: co-citation and bibliographic coupling. Co-citation is calculated for a pair of documents and measures how many documents cite both documents in the pair. Bibliographic coupling measures how many citations the two documents have in common. Each of these can be easily obtained from the adjacency matrix of the graph; however, unlike web hyperlink data, both of the resulting graphs are undirected (there is no order to citations).

6.3.1 Calculating Co-citation and Bibliographic Coupling

To illustrate how the Neumann Kernel represents relatedness and importance, we consider an example of citations in journal articles [10]. Figure 6.10 represents eight journal articles and the citations that exist between them. In this graph, each vertex represents a journal article, and each directed edge represents a citation of one article by another. We can define the adjacency matrix for this graph using R.

FIGURE 6.10: Directed graph modeling journal article citations (vertices n1 through n8), where each edge indicates a citation of one article by another. For example, article n5 cites both articles n1 and n2.


# Adjacency matrix for the citation graph
C <- matrix(c(0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0,
              1, 1, 0, 0, 0, 0, 0, 0,
              0, 1, 1, 0, 0, 0, 0, 0,
              0, 0, 1, 1, 0, 0, 0, 0,
              0, 0, 1, 1, 0, 0, 0, 0),
            nrow=8, ncol=8, byrow=TRUE)

As stated above, co-citation and bibliographic coupling are two measures of vertex relatedness. Figure 6.11 illustrates the co-citation graph and bibliographic coupling graph for the citation graph shown in Figure 6.10. In the bibliographic coupling graph, two journal articles (vertices) are connected if both of them cite at least one common journal article. In Figure 6.11 (A), nodes n5 and n6 are connected by an edge of weight 1 because they cite only one common node, n2. Similarly, in Figure 6.11 (B), nodes n3 and n4 are connected by an edge of weight 2 because they are both cited by nodes n7 and n8.

Both of these graphs can be generated directly from the adjacency matrix of the citation graph. The formal definitions of the bibliographic coupling and co-citation matrices are as follows.

Definition 6.2 Bibliographic coupling matrix
The bibliographic coupling matrix, B, is defined as:

B = C C^T,

where C is the adjacency matrix of the citation graph.

Definition 6.3 Co-citation matrix
The co-citation matrix, P, is defined as:

P = C^T C,

where C is the adjacency matrix of the citation graph.

FIGURE 6.11: Bibliographic coupling graph (A) and co-citation graph (B) for the citation graph in Figure 6.10.

We can calculate the co-citation matrix P as follows:


P = C^T C                        (6.1)

  = [ 0 0 0 0 1 0 0 0
      0 0 0 0 1 1 0 0
      0 0 0 0 0 1 1 1
      0 0 0 0 0 0 1 1
      0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 ]          (6.2)

  × [ 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0
      1 1 0 0 0 0 0 0
      0 1 1 0 0 0 0 0
      0 0 1 1 0 0 0 0
      0 0 1 1 0 0 0 0 ]          (6.3)

  = [ 1 1 0 0 0 0 0 0
      1 2 1 0 0 0 0 0
      0 1 3 2 0 0 0 0
      0 0 2 2 0 0 0 0
      0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 ]          (6.4)

In R, this calculation can be performed using the matrix multiplication and matrix transpose functions:

t(C) %*% C

Similarly, we can calculate the bibliographic coupling matrix as B = C C^T. In R, we calculate this matrix using:

C %*% t(C)

Co-citation and bibliographic coupling can be applied as measures of relatedness in a domain such as journal article citations. Although the application appears limited, the technique generalizes to many different domains through the use of Neumann Kernels. In the next section, we show that the concepts of co-citation and bibliographic coupling are analogous to the concepts of the "document correlation matrix" and the "term correlation matrix" from the information retrieval field. We leverage this equivalence to provide a formal definition of the Neumann Kernel.


TABLE 6.5: Term-document matrix for the sentences "I like this book" and "We wrote this book"

        D1  D2
I        1   0
like     1   0
this     1   1
book     1   1
We       0   1
wrote    0   1

6.3.2 Document and Term Correlation

The Neumann Kernel method was originally developed by Kandola, Shawe-Taylor, and Cristianini [12] for information retrieval, and it can be applied to documents, papers, articles, journals, and other types of data. Shimbo et al. [16] describe the algorithm in terms of matrix operations on a term-document matrix.

Definition 6.4 Term-document matrix
The term-document matrix X represents a relationship between documents and terms in those documents, where documents are represented as columns, terms are represented as rows, and entries in the matrix are the number of times a given term appears in a given document. A term-document matrix can be viewed as a bipartite graph in which one partite set corresponds to the documents and the other corresponds to the terms contained in the documents.

Consider the two short documents:

D1 = "I like this book," and
D2 = "We wrote this book."

The term-document matrix X for the above example is shown in Table 6.5. For example, the term 'book' appears once in each of the documents, so the corresponding entries in the matrix are both one.

Given a term-document matrix, we can generate a document correlation matrix, which summarizes the common terms between a series of documents. In the previous example, documents D1 and D2 have two terms in common, 'this' and 'book.' The document correlation matrix for this example appears in Table 6.6. We formally define the document correlation matrix below.


TABLE 6.6: Document correlation matrix for the sentences "I like this book" and "We wrote this book"

     D1  D2
D1    4   2
D2    2   4

TABLE 6.7: Term correlation matrix for the sentences "I like this book" and "We wrote this book"

        I  like  this  book  We  wrote
I       1   1     1     1     0    0
like    1   1     1     1     0    0
this    1   1     2     2     1    1
book    1   1     2     2     1    1
We      0   0     1     1     1    1
wrote   0   0     1     1     1    1

Definition 6.5 Document correlation matrix
The document correlation matrix K is defined by:

K = X^T X,

where X is the original term-document matrix.

We can also define a term correlation matrix T, which represents the co-occurrence of terms in a set of documents. The term correlation matrix for our earlier example is shown in Table 6.7. For example, the term 'this' appears twice with the term 'book' (once in each of the documents), so the corresponding value in the term correlation matrix is 2. We formally define the term correlation matrix below.

Definition 6.6 Term correlation matrix
The term correlation matrix T is defined by:

T = X X^T,

where X is the original term-document matrix.
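The matrices in Tables 6.5 through 6.7 can be reproduced in a few lines of R, which also makes the two definitions concrete:

# Term-document matrix X from Table 6.5
X <- matrix(c(1, 0,
              1, 0,
              1, 1,
              1, 1,
              0, 1,
              0, 1),
            nrow=6, byrow=TRUE,
            dimnames=list(c("I", "like", "this", "book", "We", "wrote"),
                          c("D1", "D2")))
K <- t(X) %*% X   # document correlation matrix (Table 6.6)
T <- X %*% t(X)   # term correlation matrix (Table 6.7)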

6.3.3 Neumann Kernels in Practice

Neumann Kernels provide a generalized framework for scoring vertices in terms of importance (similar to HITS) or relatedness (similar to co-citation and bibliographic coupling) based on the value of a configurable parameter, the decay factor, denoted by γ. We discuss how to choose a value for γ and the implications of different values in the next two sections. First, though, we illustrate the general operation of the Neumann Kernel algorithm, followed by its formal derivation. This section ties together the previous sections on the co-citation and bibliographic coupling relatedness measures and the term and document correlation matrices.

TABLE 6.8: Adjacency matrix for the directed graph in Figure 6.10

     n1  n2  n3  n4  n5  n6  n7  n8
n1    0   0   0   0   0   0   0   0
n2    0   0   0   0   0   0   0   0
n3    0   0   0   0   0   0   0   0
n4    0   0   0   0   0   0   0   0
n5    1   1   0   0   0   0   0   0
n6    0   1   1   0   0   0   0   0
n7    0   0   1   1   0   0   0   0
n8    0   0   1   1   0   0   0   0

Table 6.8 defines the adjacency matrix for the graph shown in Figure 6.10. Given this citation matrix, we can use R to calculate the Neumann Kernel:

# Load the required packages
library(igraph)
library(ProximityMeasure)
# Encode the graph as an adjacency matrix
C <- matrix(c(0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0,
              1, 1, 0, 0, 0, 0, 0, 0,
              0, 1, 1, 0, 0, 0, 0, 0,
              0, 0, 1, 1, 0, 0, 0, 0,
              0, 0, 1, 1, 0, 0, 0, 0),
            nrow=8, ncol=8, byrow=TRUE)
# Create a graph from the adjacency matrix
G <- graph.adjacency(C, mode=c("directed"), weighted=NULL)
# Call the Neumann Kernel function for graph G
Neumann_Kernel(G)


6.3.4 Derivation of Neumann Kernels

The Neumann Kernel in its original form is defined in terms of the term-document matrix X. In preparation for computing the Neumann Kernels, we first compute the document correlation matrix K = X^T X and the term correlation matrix T = X X^T [10]. The Neumann Kernel defines two matrices, Kγ and Tγ, with γ as a decay factor. These matrices are defined mathematically as follows:

Nγ(K) = Kγ = Nγ(X^T X) = X^T X + γ(X^T X)^2 + ...   (6.5)

Nγ(T) = Tγ = Nγ(X X^T) = X X^T + γ(X X^T)^2 + ...   (6.6)

Equation 6.5 computes a weighted sum of (X^T X)^n over every natural number n. Since the (i, j) element of (X^T X)^n counts the number of paths of length n between vertices i and j in the graph with adjacency matrix X^T X, each element of the kernel matrix is a weighted sum of the number of paths between vertices. In other words, these equations relate not just documents or terms that share an edge, but also documents or terms that are more distantly connected [10].

In the above equations, Kγ gives the similarity between documents and Tγ gives the similarity between terms, since X^T X (which equals the document correlation matrix K) appears in Equation 6.5 and X X^T (which equals the term correlation matrix T) appears in Equation 6.6. Each of these equations uses γ to control the balance between importance and relatedness.

Now we simplify the equations so that we can compute the values of Kγ and Tγ more easily. Equations 6.5 and 6.6 can be rewritten as:

Kγ = X^T X Σ_{n=0}^∞ γ^n (X^T X)^n   (6.7)

Tγ = X X^T Σ_{n=0}^∞ γ^n (X X^T)^n   (6.8)

Since K = X^T X and T = X X^T, Equations 6.7 and 6.8 become:

Kγ = K Σ_{n=0}^∞ γ^n K^n   (6.9)

Tγ = T Σ_{n=0}^∞ γ^n T^n   (6.10)

These equations still require calculating K^n and T^n, which is computationally expensive, so we use the idea of a convergent geometric series to reduce the time complexity of the algorithm. A well-known mathematical fact is that the geometric series

Σ_{n=0}^∞ a^n,

where a is a scalar, converges to 1/(1 − a) under the condition that |a| < 1. Similarly, the geometric series

Σ_{n=0}^∞ A^n,

where A is a matrix, converges to (I − A)^-1 under the condition that |λm| < 1, where λm is the maximum eigenvalue of A. From the Perron-Frobenius theorem [1], the maximum eigenvalue λm of an adjacency matrix lies between the maximum indegree and the maximum outdegree of the graph.

Using the notations ∆−(G) and ∆+(G) to represent the maximum indegree and outdegree of G, respectively, we can write the lower bound for the maximum eigenvalue as:

λm ≥ min{∆−(G), ∆+(G)}.   (6.11)

If γ is chosen such that γ < 1/λm, Equations 6.9 and 6.10, respectively, converge to:

Kγ = K(I − γK)^-1   (6.12)

Tγ = T(I − γT)^-1   (6.13)

When γ = 0, the above two equations become:

Kγ = K,   (6.14)

Tγ = T,   (6.15)

which are the document correlation matrix and the term correlation matrix, respectively. In other words, at γ = 0 the Neumann Kernel gives the relatedness between documents and between terms. Conversely, as γ grows larger, the kernel output approaches the importance ranking produced by HITS.

Equations 6.12 and 6.13 involve two matrix multiplications, one inverse calculation, and one matrix subtraction, a substantial reduction in the computational cost of the kernel calculation relative to Equations 6.9 and 6.10.

Figure 6.12 provides a high-level overview of the Neumann Kernel algorithm.

We give a pseudocode description of the Neumann Kernel technique in Algorithm 7. The input to this algorithm is a directed graph G.


FIGURE 6.12: A high-level overview of the Neumann Kernel algorithm described in detail in Algorithm 7: from the adjacency matrix for a graph, calculate the K and T matrices, compute the diffusion factor, and calculate the Neumann Kernel matrices.

Input: G—a directed graph representing a collection of items (e.g., web pages)
Output: Kγ and Tγ, the Neumann Kernel matrices

1 A ← adjacency matrix of G
2 Compute A^T
3 K ← A^T A
4 T ← A A^T
5 Compute ∆+(G) and ∆−(G)
6 Choose γ between 0 and 1/min{∆+(G), ∆−(G)}
7 Compute Kγ ← K(I − γK)^-1
8 Compute Tγ ← T(I − γT)^-1
9 return Kγ, Tγ

Algorithm 7: A pseudocode description of the Neumann Kernel algorithm


Line 2 calculates the transpose of the adjacency matrix of G, and lines 3 and 4 multiply the adjacency matrix and its transpose to get matrices K and T. Line 5 computes the maximum indegree and outdegree of G, and line 6 uses these values to choose a value for the decay factor γ. How best to choose this value is discussed in Section 6.3.5. Finally, lines 7 and 8 apply Equations 6.12 and 6.13 to calculate Kγ and Tγ, and line 9 returns these two matrices.
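For readers who want to see the computation end to end, the following sketch implements the steps of Algorithm 7 in base R; the function name neumann_kernel_sketch and the specific choice of γ (half its upper bound) are our own assumptions, not the ProximityMeasure implementation.

# A minimal sketch of Algorithm 7; A is the adjacency matrix of G
neumann_kernel_sketch <- function(A) {
  K <- t(A) %*% A                        # line 3: document correlation
  T <- A %*% t(A)                        # line 4: term correlation
  max.out <- max(rowSums(A))             # line 5: maximum outdegree
  max.in  <- max(colSums(A))             # line 5: maximum indegree
  gamma <- 0.5 / min(max.out, max.in)    # line 6: gamma inside the valid range
  Id <- diag(nrow(A))
  K.gamma <- K %*% solve(Id - gamma * K) # line 7: Equation 6.12
  T.gamma <- T %*% solve(Id - gamma * T) # line 8: Equation 6.13
  list(Kgamma=K.gamma, Tgamma=T.gamma)   # line 9
}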

6.3.5 Choosing the Decay Factor

The decay factor γ is a mechanism for controlling how much relative weight is given to vertices farther away in the graph when calculating relatedness. The underlying assumption is that vertices closer to the vertex in question are more strongly related than those farther away, and our choice of γ allows us to balance the contributions of closer and more distant vertices.

Choosing a value of γ = 0 gives no weight at all to vertices outside of the immediate neighbors. For this choice of γ, the output of the Neumann Kernel is simply the relatedness among the vertices, i.e., the document correlation matrix K and the term correlation matrix T, as noted in Equations 6.14 and 6.15.

Larger values of γ increase the contribution of the vertices farther away in the graph, giving us a value that scores the importance of each node, like HITS. As discussed previously, the value of γ must be less than 1/λm so that the geometric series in Equations 6.5 and 6.6 converge to Equations 6.12 and 6.13. Since we know that λm must lie between ∆+(G) and ∆−(G), we need to choose γ to be less than

1 / min{∆+(G), ∆−(G)}.
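For the citation graph of Figure 6.10, this bound is easy to compute from the adjacency matrix C defined earlier:

max.out <- max(rowSums(C))    # maximum outdegree, here 2
max.in  <- max(colSums(C))    # maximum indegree, here 3
1 / min(max.out, max.in)      # gamma must be chosen below 0.5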

In the next section, we discuss more concrete examples of how the Neumann Kernel, the document and term correlation matrices, and HITS compare.

6.3.6 Relationships among Neumann Kernels, HITS, and Co-citation/Bibliographic Coupling

As we have discussed previously, Neumann Kernels provide a framework for calculating the importance and relatedness of vertices within a graph. This technique is a generalization of both the HITS algorithm and the co-citation/bibliographic coupling matrix technique. We illustrate this concept using the graph shown in Figure 6.10 and the associated adjacency (citation) matrix in Table 6.8. In order to compare these techniques, we first apply the Neumann Kernel using different values of γ to calculate relatedness between the vertices of the graph. We can then assess the importance of each node by taking the sum of its relatedness values to the other vertices.


Choosing a value of γ = 0 results in the co-citation matrix (see Figure 6.11 (B)):

K0 = P                           (6.16)
   = C^T C                       (6.17)
   = [ 1 1 0 0 0 0 0 0
       1 2 1 0 0 0 0 0
       0 1 3 2 0 0 0 0
       0 0 2 2 0 0 0 0
       0 0 0 0 0 0 0 0
       0 0 0 0 0 0 0 0
       0 0 0 0 0 0 0 0
       0 0 0 0 0 0 0 0 ]         (6.18)

If we compare the sums of the rows in this matrix, we see that using γ = 0 (or, equivalently, constructing the co-citation matrix) would rank the authority of the papers as:

n3 > n2 = n4 > n1 > n5 = n6 = n7 = n8.
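This ranking can be checked directly in R by summing the rows of the co-citation matrix:

K0 <- t(C) %*% C   # the Neumann Kernel at gamma = 0 is the co-citation matrix
rowSums(K0)        # 2 4 6 4 0 0 0 0, so n3 > n2 = n4 > n1 > the rest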

However, if we increase the decay factor to γ = 0.207, we get the following matrix for K:

K0.207 = [ 10  34  82  58  0 0 0 0
           34 127 315 223  0 0 0 0
           82 315 805 572  0 0 0 0
           58 223 572 408  0 0 0 0
            0   0   0   0  0 0 0 0
            0   0   0   0  0 0 0 0
            0   0   0   0  0 0 0 0
            0   0   0   0  0 0 0 0 ],   (6.19)

which gives a ranking of:

n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8.

We now compare these results to the HITS algorithm. We can run HITS on our example graph using the following R code:

# Load the required packages
library(igraph)
library(ProximityMeasure)
# Adjacency matrix for the citation graph
C <- matrix(c(0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0,
              1, 1, 0, 0, 0, 0, 0, 0,
              0, 1, 1, 0, 0, 0, 0, 0,
              0, 0, 1, 1, 0, 0, 0, 0,
              0, 0, 1, 1, 0, 0, 0, 0),
            nrow=8, ncol=8, byrow=TRUE)
# Create a graph from the adjacency matrix
G <- graph.adjacency(C, mode=c("directed"), weighted=NULL)
# Number of iterations
k <- 10
# Call the HITS function for graph G with k = 10 iterations
output <- HITS(G, k)
# Authority and hub vectors for the nodes
output

TABLE 6.9: Authority and hub scores (from the HITS algorithm) for the citation data example in Figure 6.10

Article  Authority   Hub
n1       0.08466727  0.00000000
n2       0.31116868  0.00000000
n3       0.77293020  0.00000000
n4       0.54642878  0.00000000
n5       0.00000000  0.18042241
n6       0.00000000  0.49413333
n7       0.00000000  0.60136512
n8       0.00000000  0.60136512

The hub and authority scores produced by HITS appear in Table 6.9. Notably, the authority scores from HITS rank the vertices as

n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8,

which matches the Neumann Kernel authority ranking using γ = 0.207 [10]. As the parameter γ increases, the measures induced by the kernel assume the character of global importance rather than relatedness; at γ = 0.207, the authority ranking produced by the Neumann Kernel matches the authority ranking of HITS. Thus, by varying the decay factor γ, the Neumann Kernel can calculate authority and hub scores that use only local information, as in the co-citation matrix, or scores that depend on the global structure of the graph, as in the HITS algorithm.


6.3.7 Strengths and Weaknesses

The weaknesses of the Neumann Kernel are similar to those of HITS, described in Chapter 5, Link Analysis. Its primary weakness, like HITS, is topic drift [5]. A new weakness appears when comparing Neumann Kernels with other kernel methods, such as the Laplacian diffusion kernel (see Chapter 4, An Introduction to Kernels). Unlike Neumann Kernels, Laplacian matrices penalize loops, which was shown to be detrimental in a study comparing graph kernel methods for collaborative recommendation systems [7]. These same types of errors may appear in a co-citation graph structure [10].

The main strength of the Neumann Kernel is its ability to balance relatedness and importance, much like a generalized version of HITS: by tuning the decay factor, one can strike whatever balance between the two the application requires. The Neumann Kernel technique also subsumes other known algorithms, such as co-citation coupling and bibliographic coupling [16], which is a very useful feature for link analysis.

6.4 Applications

In this section, we discuss a co-citation analysis example using Shared Nearest Neighbor with a real-world data set. The data set can be downloaded from the book website, along with the associated R code.

6.4.1 Co-citation Analysis with Shared Nearest Neighbor

To illustrate the Shared Nearest Neighbor (SNN) algorithm, we apply it to a subset of the citation data set obtained from http://vlado.fmf.uni-lj.si/pub/networks/data/. The graph is a paper citation network [2] from the 2001 Graph Drawing Conference proceedings. It contains a vertex for every paper in the proceedings of every Graph Drawing conference between the years 1994 and 2000, where the edges represent the citation of one paper by another. The dataset consists of 311 papers and 646 citation relationships. We can use the following R code to load the citation data and construct the SNN graph.

# Load libraries
library(igraph)
library(ProximityMeasure)
# Load the citation data
data(citation)
# Convert the data to a graph
m <- as.matrix(citation[, 1:2])
G <- graph.edgelist(m)
# Get the adjacency matrix
A <- get.adjacency(G)
# Choose an appropriate value for k (shared neighbors)
k <- 3
# Call SNN_GRAPH(adjacency_matrix, k)
G1 <- SNN_GRAPH(A, k)
# Plot the SNN graph
plot(G1)

In the above code, we construct the 3-SNN graph for the Graph Drawing citation graph. In the resulting graph, two vertices (papers) are joined by an edge if those papers cite at least three Graph Drawing papers in common. This graph might allow us to tell whether different authors have a tendency to cite the same papers.

6.5 Bibliographic Notes

In this chapter, we have detailed two techniques that deal with in-graph proximity. Alternatively, one can consider the proximity between multiple graphs. Three ways to calculate inter-graph proximity are examining graph isomorphisms [4], graph edit distances [8], and maximum common subgraphs [14].

6.5.1 Inter-graph Proximity

As a counter to other approaches, such as edge slides or edge rotations [9], isomorphism-based measures judge the distance between two graphs as the distance between the isomorphic mappings of vertices u and v. The distance between the two graphs under this method grows as the u, v mappings differ from their isomorphic translations [4].

A similar technique for calculating proximity between graphs is graph edit distance. Instead of computing the distances between pairs of vertices in isomorphic mappings, graph edit distance measures the number of edits (vertex or edge relabelings, insertions, or deletions) needed until the two graphs are equal. There are a variety of methods for calculating graph edit distance; a survey of several different techniques is given in [8].

A third technique, the Maximum Common Subgraph (MCS), identifies the largest subgraph common to a pair of graphs. Intuitively, more similar graphs have a larger common subgraph. Major applications for MCS include biology and chemistry, which often compare chemical structures for similarity. For more information on this subject, see [14] and [15].


6.5.2 In-graph Proximity

There are other algorithms like SNN that compare the structure of a single graph to determine similarity between vertices. One technique similar in concept to SNN is SimRank, in which two vertices are similar if they are referenced (linked) by similar objects. To do this, vertices are paired and assigned a similarity score. The scores propagate to neighboring vertices in an iterative fashion until the similarity scores converge. See [11] for details on how these similarity scores are calculated and propagated.

6.5.3 Node Importance

We have discussed the Neumann Kernel technique and its relation to the HITS algorithm for determining the importance of a node in a graph, though there are many other algorithms for this task. One such technique, popularized by Google, is the PageRank algorithm. PageRank, which is discussed in more detail in Chapter 5, Link Analysis, essentially performs a random walk on the Web. Though PageRank doesn't explicitly define authority and hub nodes, the contributed ranks resemble the authority/hub relationship modeled in algorithms such as HITS [3]. Another algorithm, SALSA, ranks websites based on a specific query. SALSA essentially performs two random walks, a forward and a backward Markov chain walk; these two walks are equivalent to the authority/hub behavior of HITS, while being computationally lighter [13].

6.6 Exercises

1. Given the term-document matrix

   X = [ 1 2 3
         4 5 6
         7 8 9 ],

   calculate:

(a) The document correlation matrix for X

(b) The term correlation matrix for X

2. Consider the graph shown in Figure 6.13:

(a) Construct the adjacency matrix that corresponds to the graph in Figure 6.13.

(b) List all the pairs of nodes that have at least two shared neighbors in common.


(c) Do any of the nodes share three neighbors? If so, list all the pairs that do and the neighboring nodes they have in common.

(d) Suppose node 3 is removed from the graph. Do we have any pairs of nodes that still have at least two shared neighbors in common? Justify your answer.

FIGURE 6.13: Example graph for Exercise 2 (vertices labeled 1 through 8).

3. Given the directed graph shown in Figure 6.14:

FIGURE 6.14: Example graph for Exercise 3 (vertices labeled 0 through 4).

(a) Give the adjacency matrix, A, for this graph.

(b) Calculate the transpose of the adjacency matrix, A^T.

(c) Calculate the document correlation matrix, A^T A.

Page 30: Graph-based Proximity Measures · A na ve approach for constructing the KNN graph would compare every pair of vertices in the graph and run in O(jV(G)j2) time; however, the KNN graph

164 Practical Graph Mining with R

(d) Calculate the term correlation matrix, A A^T.

(e) Calculate the maximum indegree of the graph.

(f) Calculate the maximum outdegree of the graph.

(g) Calculate the valid range for the decay factor γ in the Neumann Kernel.

(h) Verify your results using the igraph package. List the R commands used.

4. (a) Describe how the shared nearest neighbor algorithm groups vertices.

(b) Based on your answer to part (a), how many total nodes are required at minimum to find a pair of nodes with n shared nearest neighbors? Explain your answer.

(c) Given the list of vertices with x and y coordinates shown below, construct an adjacency matrix of the vertices, where vertices are considered to be neighbors (adjacent) if they are within a Euclidean distance of 4. Note that the vertices should not be considered adjacent to themselves.

Node  Position     Node  Position
A     (1, 2)       F     (9, 3)
B     (4, 6)       G     (4, 1)
C     (4, 4)       H     (7, 1)
D     (5, 4)       I     (2, 7)
E     (2, 6)       J     (8, 5)

(d) Using the adjacency matrix that you have created, find and list all the pairs of nodes that have at least two shared neighbors in common.

(e) List all the pairs of nodes that have at least three shared neighbors in common.

5. (a) What does it mean for a matrix to be symmetric?

(b) Is the matrix A^T A symmetric for any adjacency matrix A? Justify your answer.

(c) Is the matrix A A^T symmetric for any adjacency matrix A? Justify your answer.

6. An alpha group is a set of vertices in which each vertex is connected to every other vertex. Figure 6.15 shows an alpha group of size 4.

(a) Suppose we are looking for pairs of vertices that share at least two neighboring nodes. How many pairs of vertices would we get for an alpha group of size 4? How many in an alpha group of size 5? Size 6?

(b) Using your results from part (a), devise a function that finds how many pairs of vertices have at least two common neighbors in an alpha group of size n. (Hint: also think about alpha groups of size 3 or less.)


FIGURE 6.15: Example graph for Exercise 6 (an alpha group on the vertices A, B, C, and D).

7. (a) Describe how the HITS algorithm relates to the Neumann Kernel technique.

(b) How does the decay factor γ affect the output of the Neumann Kernel algorithm?

Bibliography

[1] A. Berman and R. Plemmons. Nonnegative Matrices in the Mathematical Sciences. New York: Academic Press, 1979.

[2] T. Biedl and F. Brandenburg. Graph-drawing contest report. In Proceedings of the 16th International Symposium on Graph Drawing, pages 82–85, 2002.

[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30:107–117, 1998.

[4] G. Chartrand, G. Kubicki, and M. Schultz. Graph similarity and distance in graphs. Aequationes Mathematicae, 55:129–145, 1998.

[5] D.J. Cook and L.B. Holder. Mining Graph Data. Wiley-Interscience, 2006.

[6] L. Ertoz, M. Steinbach, and V. Kumar. A new shared nearest neighbor clustering algorithm and its applications. In Proceedings of the 1st Workshop on Clustering High Dimensional Data and Its Applications, 2002.

[7] F. Fouss, L. Yen, A. Pirotte, and M. Saerens. An experimental investigation of graph kernels on a collaborative recommendation task. In Proceedings of the 6th International Conference on Data Mining, pages 863–868, 2006.

[8] X. Gao, B. Xiao, D. Tao, and X. Li. A survey of graph edit distance. Pattern Analysis & Applications, 13:113–129, 2010.

[9] W. Goddard and H.C. Swart. Distances between graphs under edge operations. Discrete Mathematics, 161:121–132, 1996.

[10] T. Ito, M. Shimbo, T. Kudo, and Y. Matsumoto. Application of kernels to link analysis. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 586–592, 2005.

[11] G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538–543, 2002.

[12] J.S. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. In Proceedings of the 16th Annual Conference on Neural Information Processing Systems, pages 657–664, 2002.

[13] M.A. Najork. Comparing the effectiveness of HITS and SALSA. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, pages 157–164, 2007.

[14] J.W. Raymond, E.J. Gardiner, and P. Willett. RASCAL: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45:631–644, 2002.

[15] J.W. Raymond and P. Willett. Maximum common subgraph isomorphism algorithms for the matching of chemical structures. Journal of Computer-Aided Molecular Design, 16:521–533, 2002.

[16] M. Shimbo, T. Ito, D. Mochihashi, and Y. Matsumoto. On the properties of von Neumann kernels for link analysis. Machine Learning, 75:37–67, 2009.