community discovery in social networks via heterogeneous link … · 2014-09-08 · 3 problem...

Community Discovery in Social Networks via Heterogeneous Link Associationand Fusion

Lei Meng ∗ Ah-Hwee Tan ∗

Abstract

Discovering social communities of web users through clus-

tering analysis of heterogeneous link associations has drawn

much attention. However, existing approaches typically re-

quire the number of clusters a prior, do not address the

weighting problem for fusing heterogeneous types of links

and have a heavy computational cost. In this paper, we

explore the feasibility of a newly proposed heterogeneous

data clustering algorithm, called Generalized Heterogeneous

Fusion Adaptive Resonance Theory (GHF-ART), for discov-

ering communities in heterogeneous social networks. Differ-

ent from existing algorithms, GHF-ART performs real-time

matching of patterns and one-pass learning which guarantee

its low computational cost. With a vigilance parameter to

restrain the intra-cluster similarity, GHF-ART does not need

the number of clusters a prior. To achieve a better fusion of

multiple types of links, GHF-ART employs a weighting func-

tion to incrementally assess the importance of all the feature

channels. Extensive experiments have been conducted to an-

alyze the performance of GHF-ART on two heterogeneous

social network data sets and the promising results compar-

ing with existing methods demonstrate the effectiveness and

efficiency of GHF-ART.

1 Introduction

Clustering [4] for discovering communities in social net-works [1], aiming at identifying groups of users withcommon interests and behavior, has been an importanttask for the understanding of collective social behaviorand associative mining such as social link prediction andrecommendation [2, 3]. However, with the popularity ofsocial websites such as Facebook, users may communi-cate and interact with each other easily and diversely,such as posting blogs and tagging documents. The avail-ability of those social media data, on one hand, facili-tates the extraction of rich link information among usersfor further analysis. On the other hand, new challengeshave been identified for traditional clustering techniqueson community discovery from heterogeneous social net-works, such as the scalability to large social networks,

∗School of Computer Engineering, Nanyang TechnologicalUniversity, Singapore, Email: {meng0027, asahtan}@ntu.edu.sg

techniques for link representation and methods for thefusion of heterogeneous types of links.

In the recent years, many works have been done onthe clustering of heterogeneous data. Existing meth-ods may be considered in four categories: multi-viewclustering approach [5, 12, 13, 17], spectral clusteringapproach [11, 14, 21, 23], matrix factorization approach[6, 15], aggregation approach [8, 9]. However, they haveseveral limitations for clustering heterogeneous sociallinks in practice. Firstly, existing algorithms typical-ly involve iterative optimization which does not scalewell to big data sets. Secondly, most of them need thenumber of clusters a prior, which is hard to decide inpractice. Thirdly, most of those algorithms do not con-sider the weighting problem when fusing multiple typesof links. Since different types of links have their ownmeanings and levels of feature values, equal or empiricalweights for them may bias their importance in similaritymeasure and may not yield optimal performance.

In this paper, we explore the feasibility of General-ized Heterogeneous Fusion Adaptive Resonance Theory(GHF-ART) for identifying user groups in the heteroge-neous social networks. GHF-ART [10], extended fromFusion ART [16], has been proposed for clustering we-b multimedia data through the fusion of an arbitraryrich level of heterogeneous data resources such as im-ages, articles and surrounding text. For clustering datapatterns of social networks, we develop a set of specificfeature representation and learning rules for GHF-ARTto handle various heterogeneous types of social links,including relational links, textual links in articles andtextual links in short text.

GHF-ART has several key properties different fromexisting approaches. First, GHF-ART performs onlineand one-pass learning so that the clustering process canbe done in just a single round of pattern presentation.Second, GHF-ART does not need the number of clus-ters a prior. Third, GHF-ART globally and locally eval-uates the similarity between patterns across and in eachfeature channel. Fourth, the obtained similarities fromall the feature channels are fused by a weighting func-tion, termed Robustness Measure (RM ), which adap-tively tunes the weights of different feature channels.

We analyze the performance of GHF-ART on twopublic social network data sets, namely the YouTubedata set [8] and the BlogCatalog data set [7], in termsof parameter selection, clustering performance compari-son, performance in Robustness Measure and time cost.From our experimental results, we analyze the param-eter selection methods for GHF-ART and show thatGHF-ART outperforms and is much faster than manyexisting heterogeneous data clustering algorithms.

The remainder of paper is summarized as follows.Section 2 reviews existing works on the problem ofheterogeneous data clustering. Section 3 formulates theproblem of community discovery in the heterogeneoussocial networks. The technical details of GHF-ART aredescribed in Section 4. Section 5 presents the analysisof experimental results and the last section concludesour work.

2 Related Work

Our work on identifying social groups of users via het-erogeneous social links is related to the problem of het-erogeneous data clustering. Considering different mod-el formulation, existing approaches can be categorizedinto four categories: 1) The multi-view clusteringapproach [5, 12, 13, 17] considers to use two clusteringmodels for two types of independent features. Subse-quently, the learnt parameters of them are further re-fined by learning from each other iteratively. However,this approach is restricted to two types of links. 2) Thespectral clustering approach [11, 14, 21, 23] typi-cally models each feature modality as a graph and usesdifferent unified objective function to identify an overallbest cut of the graphs, which is typically an embeddingvector and needs traditional clustering algorithms to ob-tain the final results. 3) The Matrix factorizationapproach [6, 15] factorizes a similarity matrix into t-wo or three matrices and minimize the reconstructionerror objective, which identifies the cluster membershipof patterns by finding a cluster indicator matrix thatcontains the projection values of each data pattern toa pre-defined number of clusters. 4) The aggregationapproach [8, 9] follows the idea of first obtaining therelational vectors [8] or similarities [9] between patternsfor each type of features and then integrating them toproduce the final results.

3 Problem Statement

The community discovery problem in the heterogeneoussocial networks is to identify the user’s social groupsby evaluating different types of links between userssuch that group members interact with each other morefrequently and share more common interests than thoseoutside the group.

Figure 1: The architecture of GHF-ART for integratingK types of feature vectors.

Considering a set of users U = {u1, . . . , uN} andtheir associated multiple types of links L = {l1, . . . , lK},which, as described in section 1, may be the contactlinks, textual links or visual links. Each user un canbe represented by a multi-channel input pattern I ={x1, . . . ,xK}, where xk is a feature vector extractedfrom the k-th link.

The community discovery task, in this work, is toidentify a set of clusters C = {c1, . . . , cJ} accordingto the similarities among the user patterns evaluatedwithin and across different types of links. As a result,given a user uN ∈ cJ and two users up ∈ cJ and uq 6∈ cJ ,for ∀p, q such that up, up ∈ U , we have SuN ,up

> SuN ,uq,

where SuN ,updenotes the overall similarity between

uN and up. Namely, users in the same cluster mayconsistently have a higher degree of similarity in termsof all types of links than those belonging to the otherclusters.

4 GHF-ART for Clustering HeterogeneousSocial Links

As shown in Fig. 1, GHF-ART consists of K indepen-dent feature channels in the input field designated tohandle an arbitrarily rich level of heterogeneous linksand a category field. The clustering process of GHF-ART processes one input pattern at a time, which com-prises four key steps: 1) Category choice: select abest-matching cluster, called a winner, across all thefeature channels; 2) Template matching: Evaluateif the degree of similarity between the input pattern Iand the winner satisfies a threshold, called the vigilancecriteria, for each feature channel; 3) Resonance andReset: If the vigilance criteria is violated, a reset oc-curs so that a new winner is selected from the rest ofthe clusters in the category field; Otherwise, a resonanceoccurs which leads to the learning of winner from theinput pattern for all the feature channels. 4) NetworkExpansion: If no cluster meets the vigilance criteria,a new cluster is generated to encode the new pattern.The dynamics of GHF-ART is summarized as follows.

Input vectors: Let I = {xk|Kk=1} denote the multi-channel input pattern, where xk is the feature vectorfor the k-th feature channel. Note that the min-maxnormalization should be employed to make sure that the

input values are in the interval [0, 1]. The complementcoding [18] is used to normalize the input feature vectorthrough which xk is concatenated with its complementvector xk in the input field such that xki = 1− xki .

Weight vectors: Let {wkj |Kk=1} denote the weight vec-

tors associated with the j-th cluster cj in the categoryfield F2.

Parameters: The GHF-ART’s dynamics is determinedby choice parameter α > 0, learning parameter β ∈[0, 1], contribution parameters γk ∈ [0, 1] for k =1, . . . ,K and vigilance parameters ρ ∈ [0, 1].

4.1 Heterogeneous Link Representation

4.1.1 Density-based Features for RelationalLinks Relational links, such as contact and co-subscription links, uses the number of interactions asthe strength of connection between users. Considering aset of users U = {u1, . . . , uN}, each user un is represent-ed by a feature vector FDn = [fn,1, . . . , fn,N ], whereinfn,N reflects the density of interactions between the userun and the N -th user in the user set U .

4.1.2 Text-similarity Features for ArticlesText-similarity features are used to represent the ar-ticles of users with long paragraphs such as blogs. Con-sidering the word list G = {g1, . . . , gM} of all the Mdistinct keywords from the articles of a set of usersU = {u1, . . . , uN}, the feature vector of un can be repre-sented by FAn = [fn,1, . . . , fn,M ], where fn,M indicatesthe importance of keyword gM to represent the user un,which can be valued by the term frequency-inverse doc-ument frequency (tf-idf).

4.1.3 Tag-similarity Features for Short TextTag-similarity features are used to represent short text,such as tags and comments. The key difference ofshort text from article is that short text consists offew but meaningful words. Considering a set of userU = {u1, . . . , uN} and the corresponding word listG = {g1, . . . , gH} of all the H distinct tags, thefeature vector of user un can be expressed by FSn =[fn,1, . . . , fn,H ]. Following the representation methodfor meta-information in [10], fn,h for h = 1, ...,H isgiven by

(4.1) fS,h =

{1, if gh ∈ Gn0, otherwise

.

4.2 Heterogeneous Link Fusion for SimilarityMeasure GHF-ART measures the similarity betweenthe input pattern and each cluster in the category fieldthrough a two-way similarity measures: a bottom-up

measure to select a winner cluster by considering theoverall similarity across all the feature channels; and atop-down measure to evaluate if the similarity for eachfeature channel meets the vigilance criteria threshold.

4.2.1 Bottom-Up Similarity for CategoryChoice In the first step, a choice function is employedto evaluate the overall similarity between the inputpattern and the template weight of each cluster in thecategory field, which is defined by

T (cj , I) =

K∑k=1

γk|xk ∧wk

j |α+ |wk

j |,(4.2)

where the fuzzy AND operation ∧ is defined by (p ∧q)i ≡ min(pi,qi), and the `1 norm |.| is defined by |p| ≡∑i pi. The choice function evaluates the proportion

of intersection between the feature vectors of the inputpattern and the prototypes of the winner across all thefeature channels so that the winner cluster with the bestmatching feature distribution in the category field isidentified.

4.2.2 Top-Down Similarity for TemplateMatching After identifying the winner cluster, amatch function is used to evaluate if the selectedwinner matches the input pattern in terms of eachfeature channel. For the k-th feature channel, thematch function is defined by

M(cj∗ ,xk) =

|xk ∧wkj∗ |

|xk|.(4.3)

If the match function values for all the K featurechannels satisfies the vigilance criteria defined byM(cj∗ ,x

k) > ρ for k = 1, ...K, a resonance occurs sothat the input pattern is categorized into the winnercluster. Otherwise, a reset occurs to select a new win-ner from the rest of the clusters in the category field.

4.3 Learning from Heterogeneous Links

4.3.1 Learning from Density-based and Text-similarity Features Assuming the k-th feature chan-nel is for the density-based features, the learning func-tion for the k-th feature channel of the winner clustercj∗ is defined by

wkj∗ = β(xk ∧wk

j∗) + (1− β)wkj∗ .(4.4)

We can observe that the values of the new weight vectorwill not be greater than the old ones so that this learningfunction may incrementally identify the key featuresby preserving the key features which have stably highvalues while depressing the features which are unstablein values.

4.3.2 Learning from Tag-similarity FeaturesAssuming the k-th feature channel is for the tag-sinilarity features of short text, given the k-th featurevector xk = [xk1 , . . . , x

kH ] of the input pattern I, the

winner cluster cj∗ with L users and the correspondingweight vector wk

j∗ = [wkj∗,1, . . . , wkj∗,H ] of cj∗ for the

k-th feature channel, the learning function for wkj∗,h isdefined by

wkj∗,h =

{ηwkj∗,h if xkh = 0

η(wkj∗,h + 1L ) otherwise

,(4.5)

where η = LL+1 . (4.5) models the cluster prototype for

the tag-similarity features by the probabilistic distribu-tion of tag occurrences. Thus, the similarity betweentag-similarity features can be considered as the numberof common words. During each round of learning, thekeywords with high frequency to occur in the clusterare given high weights while those of the noisy wordsare incrementally decreased.

4.4 Adaptive Weighting of HeterogeneousLinks GHF-ART employs the Robustness Measure (R-M ) to adaptively tune γ for different feature channel-s, which evaluates the importance of different featurechannels by considering the intra-cluster scatters.

Considering a cluster cj with L users, each of whichis denoted by Il = {x1

l , . . . ,xKl } for l = 1, . . . , L and the

corresponding weight vectors for the K feature channelsdenoted by Wj = {w1

j , . . . ,wKj }, the Difference for the

k-th feature channel of cj is defined by

Dkj =

1L

∑l |wk

j − xkl ||wk

j |.(4.6)

Considering all the clusters, the contribution parameterfor the k-th feature channel γk is defined by

γk =exp(− 1

J

∑j D

kj )∑K

k=1 exp(− 1J

∑j D

kj ).(4.7)

The respective incremental update equations forthe contribution parameters are further derived for thefollowing two cases:

• Resonance in existing cluster: Assuming theinput pattern IL+1 = {x1

L+1, . . . ,xKL+1} is assigned

to an existing cluster cj . For the k-th featurechannel, the corresponding update equations forthe density-based and text-similarity features andtag-similarity features are defined by (4.8) and (4.9)respectively:(4.8)

Dkj =

η

|wkj |

(|wkj |Dk

j + |wkj − wk

j |+1

L|wk

j − xkL+1|)

Algorithm 1 GHF-ART

Input: Input patterns In = {xk|Kk=1}, α, β and ρ.1: set n = 1.2: repeat3: Present In = {xk|Kk=1} into the input field.4: For ∀cj , calculate the choice function T (cj , In) in

(4.2).5: Identify the winner cluster cj∗ so that j∗ =

arg maxj:cj∈F2 T (cj , In).6: Calculate the match function M(cj∗ ,x

k) for k =1, . . . ,K in (4.3).

7: If ∃k such that M(cj∗ ,xk) < ρk, set T (cj∗ , In) =

0, go to 5.8: If cj∗ exists, Update wk

j∗ for k = 1, . . . ,Kaccording to (4.4) and (4.5) respectively andupdate γ according to (4.7)-(4.9).

9: If no cluster meets the vigilance criteria, Createa new node cJ+1 such that wk

J+1 = xk for k =1, . . . ,K, update γ according to (4.10)

10: n = n+ 1.11: until All the input patterns are presented.Output: Cluster Assignment Array {An|Nn=1}.

(4.9)

Dkj =

η

|wkj |

(ηDkj + |wk

j − ηwkj |+

1

L|wk

j − xkL+1|).

After the update for all feature channels, the newcontribution parameter can then be obtained bycalculating (4.7).

• Generation of new cluster: When generating anew cluster, the differences of other clusters remainunchanged. Therefore, it just introduces a propor-tionally change of the Difference. Considering therobustness Rk (k = 1, ...,K) for all of the featurechannels, the update equation for the k-th featurechannel is derived as:

(4.10) γk =Rk∑Kk=1 R

k=

(Rk)η∑Kk=1(Rk)η

.

4.5 Time Complexity Comparison The timecomplexity of GHF-ART with Robustness Measure hasbeen demonstrated to be O(nincnf ) in [10], where niis the number of input patterns, nc is the number ofclusters and nf denotes the number of features acrossall of the feature channels.

In comparison with existing heterogeneous dataclustering algorithms, the time complexity of LMF [6] isO(tninc(nc + nf )), PMM [8] is O(n3i + tncninf )), SRC[14] is O(tn3i + ncninf )) and NMF [15] is O(tncninf ),where t is the number of iteration. We can observe thatGHF-ART has a much lower time complexity.

Figure 2: The clustering performance of GHF-ART onthe YouTube data set in terms of SSE-Ratio by varyingthe values of α, β and γ respectively.

5 Experiments

5.1 YouTube Data Set

5.1.1 Data Description The YouTube data set 1

is a heterogeneous social network data set, which is o-riginally used to study the community detection prob-lem via heterogeneous interactions of users. This dataset contains 15, 088 users from YouTube website andinvolves five different types of links, including contactnetwork, co-contact network, co-subscription network,co-subscribed network and favorite network. Detaileddescriptions can be found in [8].

5.1.2 Evaluation Measure Since there is no groundtruth labels of users in this data set, we adopt five eval-uation measures to evaluate cluster quality: 1) Cross-Dimension Network Validation (CDNV ) [8]; 2) Aver-age Density (AD) measures the probability if two user-s have connection in the same cluster which is aver-aged by the number of clusters and feature modalities;3) Intra-cluster sum-of-squared error (Intra-SSE ) mea-sures the weighted average of SSE within clusters acrossfeature modalities; 4) Between-cluster SSE (Between-SSE ) measures the average distance between two clustercenters in a clustering to evaluate how well-separatedthe clusters are from each other; and 5) SSE-Ratio =Intra-SSE/Between-SSE takes both the precision andrecall aspects of the clustering performance.

5.1.3 Parameter Selection Analysis We initial-ized α = 0.01, β = 0.6 and ρ = 0.6 and studied thechange in performance of GHF-ART in terms of SSE-Ratio by varying one of them while fixing others, asshown in Fig. 2. We observe that despite some smallfluctuations, the performance of GHF-ART is roughlyrobust to the change in the values of α and β. Re-garding the vigilance parameter ρ, we find that the per-formance is improved when ρ increases up to 0.65 anddegrades when ρ > 0.85. To study the reasons, we an-

1http://socialcomputing.asu.edu/datasets/YouTube

Figure 3: The cluster structures generated by GHF-ART on the Youtube data set in terms of different valuesof vigilance parameter ρ.

alyzed the cluster structures generated under differentvalues ρ, which is shown in Fig. 3. We observe that theincrease of ρ leads to the generation of more clusters,which may contribute to the compactness of clusters.At ρ = 0.9, a significant number of small clusters aregenerated, which degrades the performance in terms ofrecall.

To study the selection of ρ, we analyzed the clusterstructure at ρ = 0.5 and 0.7 in which the best perfor-mance is obtained. We observe that when ρ increasesfrom 0.5 to 0.7, the number of small clusters with lessthan 100 patterns increases. Therefore, we may assumethat when a suitable ρ is reached, the number of smallclusters starts to increase. In this case, the number ofsmall clusters is nearly 10% of the total number of clus-ters. If this idea works, an interesting empirical way toselect a reasonable value of ρ may be tuning the value ofρ until a small amount of small clusters are identified.

5.1.4 Clustering Performance Comparison Wecompared the performance of GHF-ART with fourexisting heterogeneous data clustering algorithms asdescribed in section 2, namely the Spectral RelationalClustering (SRC) [14], Linked Matrix Factorization(LMF) [6], Non-negative Matrix Factorization (NMF)[15] and Principal Modularity Maximization (PMM)[8]. Since SRC and PMM need K-means to obtain thefinal clusters, we also employed K-means with Euclideandistance metric as a baseline.

To make a fair comparison, since GHF-ART needsto perform min-max normalization, we applied thenormalized data as input to the other algorithms. ForGHF-ART, we fixed α = 0.01 and β = 0.6. For K-means, we concatenated the feature vectors of the fivetypes of links. For SRC, we set them the same values ofGHF-ART. The number of iteration for K-means, SRC,LMF, NMF and PMM was set by 50.

We obtained the clustering results of GHF-ARTwith different values of ρ ranging from 0.3 to 0.9 andthose of K-means, SRC, LMF, NMF and PMM withdifferent pre-defined number of clusters ranging from

Table 1: The clustering Results of GHF-ART, K-means, SRC, LMF, NMF and PMM under the best setting ofpre-defined number of clusters (“k”) (ρ = 0.6 and 0.65 when k = 35 and 37 respectively for GHF-ART) in termsof CDNV , Average Density (AD), Intra-SSE, Between-SSE and SSE-Ratio on the YouTube data set.

CDNV AD Intra-SSE Between-SSE SSE-Ratiovalue k value k value k value k value k

K-means 0.2446 43 0.0572 40 7372.4 41 9.366 40 774.14 41SRC 0.2613 37 0.0691 35 6593.6 36 10.249 35 652.34 36LMF 0.2467 39 0.0584 38 6821.3 41 9.874 37 694.72 40NMF 0.2741 36 0.0766 35 6249.5 36 10.746 34 591.57 35PMM 0.2536 36 0.0628 37 6625.8 37 9.627 34 702.25 35

GHF-ART 0.2852 37 0.0834 37 5788.6 37 10.579 35 563.18 37

20 to 100. The best performance of each algorithm foreach evaluation measure is reported in Table 1. Weobserve that the best performance of each algorithmis typically achieved with 34 − 41 clusters. GHF-ARTusually achieves the best performance with ρ = 0.65which is more consistent than other algorithms. GHF-ART outperforms other algorithms in terms of all theevaluation measures except between-SSE, in which theresult of GHF-ART is still competitive to the best one.

5.1.5 Correlation Analysis of HeterogeneousNetworks We first ran GHF-ART under α = 0.01,β = 0.6 and ρ = 0.65 and showed the track of contribu-tion parameters for each type of links during clusteringin Fig. 4. We observed that the weights for all types offeatures begin with 0.2. The sudden change at n = 1500is due to the the incrementally presenting of new pat-terns. After n = 12000, the weight values of all types offeatures become stable.

We further analyzed the probability that pairsof connected patterns fall into the same cluster tostudy how each type of relational network affects theclustering results, which is shown in Fig. 5. We observethat the order of relational networks is consistent withthe results shown in Fig. 4, which demonstrates theperformance of Robustness Measure. Contact networkachieves much higher probability than other relationalnetworks. This may be due to that the contact networkis much sparser than the other four networks. As thus,the links of contact network are more representative andless links of patterns exist between clusters.

5.2 BlogCatalog Data Set

5.2.1 Data Description The BlogCatalog data set2 is crawled in [7] and used for discovering the over-lapping social groups of users. It consists of the rawdata of 88, 784 users, each of which involves the friend-ship to other users and the published blogs. Each blog

2http://dmml.asu.edu/users/xufei/datasets.html#Blogcatalog

Figure 4: Track of contribution parameters for fivetypes of links during clustering with the increase in thenumber of input patterns.

Figure 5: The probability that pairs of patterns fallinto the same cluster if connected in each of the fiverelational networks.

of a user is described by several pre-defined categories,user-generated tags and six snippets of blog content.

We extracted three types of links, including afriendship network and two textual similarity networksin terms of blog content and tags. By filtering infrequentwords from tags and blogs, we obtained 66, 418 users,6, 666 tags and 17, 824 words from blogs. As suggestedin [7], we used the most frequent category in the blogsof a user as the class label and got 147 class labels.

5.2.2 Evaluation Measure With the ground truthlabels, we used Average Precision, Cluster Entropy andClass Entropy [22], Purity [19] and Rand Index [20]

Table 2: The clustering Results of GHF-ART, K-means, SRC, LMF, NMF and PMM under the best setting ofpre-defined number of clusters (“k”) (ρ = 0.15, 0.2 and 0.25 when k = 158, 166 and 174 respectively for GHF-ART) on the BlogCatalog data set in terms of Average Precision(AP), Cluster Entropy (Hcluster), Class Entropy(Hclass), Purity and Rand Index (RI ).

AP Hcluster Hclass Purity RIvalue k value k value k value k value k

K-means 0.6492 185 0.5892 185 0.5815 165 0.6582 185 0.5662 170SRC 0.7062 175 0.5163 175 0.4974 160 0.7167 175 0.6481 170LMF 0.6626 175 0.5492 175 0.5517 155 0.6682 175 0.6038 165NMF 0.7429 175 0.4836 175 0.4883 155 0.7791 175 0.6759 165PMM 0.6951 170 0.5247 170 0.5169 165 0.6974 170 0.6103 165

GHF-ART 0.7884 174 0.4695 174 0.4865 158 0.8136 174 0.6867 166

Figure 6: The clustering performance of GHF-ART onthe BlogCatalog data set in terms of Rand Index byvarying the values of α, β and γ respectively.

Figure 7: The cluster structures generated by GHF-ART on the BlogCatalog data set in terms of differentvalues of vigilance parameter ρ.

as clustering evaluation measures. Average Precision,Cluster Entropy and Purity evaluate the intra-clustercompactness. Class Entropy evaluates how well theclasses are represented by the minimum number of clus-ters. Rand Index considers both cases. In our exper-iments, Average Precision is defined by the weightedsum of precision of all the clusters.

5.2.3 Parameter Selection Analysis We studiedthe influence of parameters to the performance of GHF-ART on the BlogCatalog data set with initial settingsα = 0.01, β = 0.6 and ρ = 0.2, as shown in Fig. 6.We observed that, consistent with those in Fig. 2, theperformance of GHF-ART is robust to the change inthe choice and learning parameters. As expected, the

performance of GHF-ART varies a lot due to the changein ρ. This curve may also be explained by the samereason for that in Fig. 2.

To validate our findings to select a suitable ρin section 5.1.3, we analyzed the cluster structurescorresponding to the four key points of ρ, as shownin Fig. 7. At ρ = 0.2, we observe that nearly 20small clusters with less than 100 patterns are generated.Interestingly, we find that the number of small clustersis also around 10% of the total number of clusters, whichfits the findings that we observed on the YouTube dataset in section 5.1.3. This demonstrates the feasibility ofthe empirical way to select ρ: Run GHF-ART severaltimes by tuning the value of ρ until 10% of the identifiedclusters are small clusters having less than 100 patterns.

5.2.4 Clustering Performance Comparison Wecompared the performance of GHF-ART on the Blog-Catalog data set with the same set of algorithms com-pared in the YouTube data set under the same param-eter settings as mentioned in section 5.1.4, except thenumber of clusters. We varied the value of ρ from 0.1 to0.4 with an interval of 0.05 and the number of clustersfrom 150-200 with an interval of 5. The best perfor-mance for each algorithm with the number of clustersis shown in Table 2. We observe that GHF-ART ob-tained much better performance (at least 4% improve-ment) than other algorithms in terms of Average Pre-cision, Cluster Entropy and Purity. This indicates thatGHF-ART may well identify similar patterns and pro-duce more compact clusters. Competitive performanceis obtained by SRC and NMF in terms of Class Entropy.Considering the number of clusters under the best set-tings, we find that GHF-ART identifies a similar num-ber of clusters to other algorithms, which demonstratesthe effectiveness of GHF-ART.

5.2.5 Case Study We further studied the identifiedcommunities by GHF-ART. First, we listed the discov-ered five biggest clusters, as shown in Table 3. We ob-

Table 3: The five biggest clusters identified by GHF-ART with class labels, top tags, cluster size and Precision.Cluster Rank Class Label Top Tags Cluster Size Precision

1 Personal music, life, art, movies, Culture 2692 0.74422 Blogging news, blog, blogging, SEO, Marketing 2064 0.81663 Health health, food, beauty, weight, diet 1428 0.76934 Personal life, love, travel, family, friends 1253 0.68715 Entertainment music, movies, news, celebrity, funny 1165 0.6528

Figure 8: Time cost of GHF-ART, K-means, SRC,LMF, NMF and PMM on the BlogCatalog Dataset withthe increase in the number of input patterns.

serve that those clusters are well formed to reveal theuser communities since more than 1000 patterns aregrouped with a reasonable level of precision. We al-so observe that most of the top tags discovered by thecluster weight values are semantically related to theircorresponding classes. Interestingly, the clusters ranked1 and 4 belong to the class “Personal”. This may bebecause, according to our organized statistics, “Person-al” is much larger than other classes. However, in thetop 5 tags, only “life” is shared by them. To have an in-sight of the relation between these two clusters, we plotthe tag clouds for them. As shown in Fig. 9, we ob-serve that the two clusters share more key tags such as“love”, “travel”, “personal” and “film”. Furthermore,when looking into the large amount of smaller tags inthe clouds, we find that such tags in Fig. 9(a) are morerelated to “music” and enjoying “life”, such as “game”,“rap” and “sport”, while those in Fig. 9(b) are morerelated to “family” life, such as “kids”, “parenting” and“wedding”. Therefore, although the shared key tagsindicate their strong relations to the same class “Per-sonal”, they are separated into two communities due tothe difference in the potential trends of sub-key tags.

5.2.6 Time Cost Analysis To evaluate the efficien-cy of GHF-ART on big data, we further analyzed thetime cost of GHF-ART, K-means, SRC, LMF, NMF andPMM with the increase in the number of input pattern-s. To make a fair comparison, we set the number ofclusters k = 166 for K-means, SRC, LMF, NMF andPMM and set ρ = 0.2 for GHF-ART so that the num-

Figure 9: The tag clouds generated for the (a) 1st and(b) 4th biggest clusters. A larger font of tag indicates ahigher weight in the cluster

bers of the generated clusters for all the algorithms arethe same. In Fig. 8, we can observe that GHF-ARTruns much faster than the other algorithms. Where-as the other algorithms incur a great increase of timecost with the increase in the number of input patterns,GHF-ART maintains a relatively small increase. Thisdemonstrates the scalability of GHF-ART to big data.

6 Conclusion

In this paper, we explored the feasibility of GHF-ARTfor the community discovery problem in the heteroge-neous social networks. Comparing with existing hetero-geneous data clustering algorithms [6, 8, 14, 15], as men-tioned in section 2, for clustering heterogeneous socialnetworks, GHF-ART has several advantages including:1) Scalability to big data: GHF-ART performs real-time matching of patterns and one-pass learning whichguarantee the low computational cost; 2) No need ofthe number of clusters a prior: GHF-ART employs avigilance parameter to restrain the intra-cluster simi-larity so that clusters may be incrementally identified;3) Considering heterogeneity of links: GHF-ART con-

siders different representation of learning functions forheterogeneous types of links, which is flexible and mayproduce better representation for heterogeneous links;4) Incorporating global and local similarity evaluation;and 5) Weighting algorithm for heterogeneous link fu-sion.

We have analyzed the performance of GHF-ARTin terms of parameter selection, clustering performancecomparison, performance in Robustness Measure, timecost and case studies. We show that, although GHF-ART needs to set three parameters, the performanceof GHF-ART is robust to the choice and learningparameters. We have further illustrated an empiricalway to select a suitable value for the vigilance parameterwhich has been demonstrated to be feasible on both theYouTube and BlogCatalog data sets. The experimentalresults also demonstrate the effectiveness of RobustnessMeasure and show that GHF-ART outperforms existingheterogeneous data clustering algorithms and has amuch lower computational cost.

Although our work has so far obtained encourag-ing experimental results, there are several directions forfurther investigation. Firstly, as GHF-ART uses featurevectors to represent social links, the dimension of thosefor relational networks are the number of users, whichresults in a high space complexity. Therefore, featurereduction techniques or hashing methods are preferredto be employed to reduce computer consumption. Sec-ondly, visual data such as images and videos are becom-ing more important in our social life and should also beconsidered as an important social link between users.Thus, identifying a social network data set with visuallinks and studying the feasibility of GHF-ART for effec-tive fusion of visual links together with relational andtextual links will be included into our future work.

References

[1] J. Yang and J. Leskovec, Defining and Evaluating Net-work Communities based on Ground-truth, In ICDM,pp. 745–754, 2012.

[2] Y. Dong, J. Tang, S. Wu, J. Tian, N. V. Chawla, J.Rao and H. Cao, Link Prediction and Recommendationacross Heterogeneous Social Networks, In ICDM, pp.181–190, 2012.

[3] Y. Yang, N. Chawla, Y. Sun and J. Han, PredictingLinks in Multi-Relational and Heterogeneous Networks,In ICDM, pp. 755–764, 2012.

[4] J. J. Whang, X. Sui, Y. Sun and I. S. Dhillon, Scalableand Memory-Efficient Clustering of Large-Scale SocialNetworks, In ICDM, pp. 705–714, 2012.

[5] I. Drost, S. Bickel and T. Scheffer, Discovering Com-munities in Linked Data by Multi-View Clustering,From Data and Information Analysis to Knowledge En-gineering, pp. 342–349, 2006.

[6] W. Tang, Z. Lu and I. S. Dhillon, Clustering withMultiple Graphs, In ICDM, pp. 1016–1021, 2009.

[7] X. Wang, L. Tang, H. Gao and H. Liu, DiscoveringOverlapping Groups in Social Media, In ICDM, pp.569–578, 2010.

[8] L. Tang, X. Wang and H. Liu, Uncovering Groups viaHeterogeneous Interaction Analysis, In ICDM, pp. 503–512, 2009.

[9] G. Bisson and C. Grimal, Co-clustering of Multi-ViewDatasets: a Parallelizable Approach, In ICDM, pp.828–833, 2012.

[10] L. Meng, A.-H. Tan and D. Xu, Semi-supervised Het-erogeneous Fusion for Multimedia Data Co-clustering,IEEE Transactions on Knowledge and Data Engineer-ing, 2013.

[11] D. Zhou and C. J. C. Burges, Spectral Clustering andTransductive Learning with Multiple Views, In ICML,pp. 1159–1166, 2007.

[12] K. Chaudhuri, S. M. Kakade, K. Livescu and K. Sridha-ran, Multi-View Clustering via Canonical CorrelationAnalysis, In ICML, pp. 129–136, 2009.

[13] A. Kumar and H. Daume III, A Co-training Approachfor Multi-View Spectral Clustering, In ICML, pp. 393–400, 2011.

[14] B. Long, X. Wu, Z. Zhang and P. Yu, Spectral Clus-tering for Multi-Type Relational Data, In ICML, pp.585-592, 2006.

[15] Y. Chen, L. Wang and M. Dong, Non-negative MatrixFactorization for Semisupervised Heterogeneous DataCoclustering, IEEE Transactions on Knowledge andData Engineering, vol. 22, no. 10, pp. 1459–1474, 2010.

[16] A.-H. Tan, G. A. Carpenter and S. Grossberg, Intelli-gence through Interaction: Towards a Unified Theoryfor Learning, In LNCS, vol. 4491, pp. 1094–1103, 2007.

[17] S. Bickel and T. Scheffer, Multi-View Clustering, InICDM, pp. 19–26, 2004.

[18] G. A. Carpenter, S. Grossberg and D. B. Rosen,Fuzzy ART: Fast Stable Learning and Categorizationof Analog Patterns by an Adaptive Resonance System,Neural Networks, vol. 4, no. 6, pp. 759–771, 1991.

[19] Y. Zhao and G. Karypis, Criterion functions for doc-ument clustering: experiments and analysis, TechnicalReport, 2002.

[20] R. Xu and D. C. Wunsch II, BARTMAP: A ViableStructure for Biclustering, Neural Networks, vol. 24,no. 7, pp. 709–716, 2011.

[21] M. Rege, M. Dong and J. Hua, Graph TheoreticalFramework for Simultaneously Integrating Visual andTextual Features for Efficient Web Image Clustering,In WWW, pp. 317–326, 2008.

[22] J. He, A.-H. Tan, C.-L. Tan and S-Y. Sung, On Quanti-tative Evaluation of Clustering Systems, In Clusteringand Information Retrieval, Kluwer Academic Publish-ers, pp. 105–133, 2003.

[23] X. Wang, B.Qian, J. Ye and I. Davidson, Multi-Objective Multi-View Spectral Clustering via ParetoOptimization, In SDM, pp. 234–242, 2013.

community discovery in social networks via heterogeneous link … · 2014-09-08 · 3 problem...

Documents