paper1

Improving Trend Analysis Using Social Network Features

ABSTRACT

In recent years, large volumes of data have been extensively studied by organizations trying to extract and use information produced by people on the internet. In this context, trend analysis is one of the most important areas explored by researchers. Typically, good prediction results are hard to obtain because of the complexity of the problem. This paper goes beyond simple trend identification methods by including the structure of the information sources, i.e., social network metrics, as an additional dimension to model and predict trends through time. The results show that the inclusion of such metrics improved the accuracy of the prediction. To evaluate the developed trend prediction approach, our experiments used the titles of the publications of all Brazilian PhDs in Computer Science for the periods analyzed.

CCS Concepts

• Computing methodologies → Model development and analysis; Machine learning approaches;

Keywords

social network; trend analysis; data mining

1. INTRODUCTION

The data produced by people on the internet has grown enormously, mainly because of the great number of business and social media applications. This data can be stored, treated, organized and analyzed to be useful for many kinds of organizations. Trend analysis is one of the research areas that can provide insights about users' behavior on the World Wide Web. For example, e-commerce companies can analyze users' purchase behavior to improve their logistics and sales processes, whilst state managers can identify potential research areas to invest in.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'16, April 4-8, 2016, Pisa, Italy
Copyright 2016 ACM 978-1-4503-3739-7/16/04. . . $15.00
http://dx.doi.org/xx.xxxx/xxxxxxx.xxxxxxx

The problem of finding significant trends in this large volume of data is challenging. The challenge arises from the dynamic nature of the data combined with other factors, such as the influence of the entities that produce the data. To model and predict trends, many studies have used time series and content-based approaches.

Trends can also be analyzed from the social structure of the data sources. The social structure plays an important role in the dynamics of information spreading. Although several works have studied trend identification from time series and content-based data, and others have focused on social network techniques, the literature still lacks research that combines these two concepts. This work starts from the premise that combining social network and time series data can achieve better results in trend modeling and prediction. Thus, this work aims to improve trend prediction capability by attaching social network metrics to time series and content-based trend analysis models.

In this paper, we propose an approach that constructs a prediction model combining time series with social network metrics. The approach was tested and validated using the titles of the publications of Brazilian PhDs in Computer Science and then compared to a time series regression model.

This paper is organized as follows. Section 2 describes some basic concepts and related work. Section 3 details the methodology used. The results are described in Section 4. Finally, conclusions are presented in Section 5.

2. RELATED WORK

2.1 Time Series and Content Based Trend Analysis

Analyzing the behavior of information frequency through time to identify patterns has been the focus of trend analysis studies in recent years. Most studies have used only time as the independent variable, together with variables usually extracted from the meta-analysis of data [18]. Frequency indexes for trend analysis of terms [1, 17] and price for stock market applications [16] have been well explored by researchers.

Sometimes the variables have to be built and not just extracted. The identification of trends in texts (such as chats, forums, and academic papers) is one of these cases [18]. For these applications, the step of mining the data and building the dependent variables becomes very important. Thus, many techniques have been developed to extract topics from text in order to automate the whole trend analysis process [9]. Furthermore, besides just identifying topics with potential for popularity, some works have gone deeper, trying to understand the behavior of these topics. In recent years, some studies [8, 13] have built models to understand the stability of topics or to classify topics as trends depending on the behavior of their subtopics.

Figure 1: Schematic data flow of the work process

2.2 Social Network Trend Analysis

The use of social network theory in trend analysis can be seen as an important improvement in the area. Information is produced by individuals, and individuals have social characteristics that matter in the way that information diffuses [3]. The structure of the social connections is not the only factor: some individuals have more influence than others, and methods are being developed to identify them [14]. Another important issue in the area is finding the starting point of the trends in the network, rather than finding the most influential nodes, and it has been a research topic as well [2]. Therefore, social network information is being used in several ways to predict trends based on the network behavior [12].

Several kinds of indicators can be derived from scholarly data, and knowledge can be discovered in a quantitative way [10]. Research productivity, for example, can be measured by models that use citation indexes and academic social network analysis [4]. Our work also uses data from a science and technology system, aiming to identify trending areas and themes of research.

3. METHODOLOGY

In this work, we assessed the curricula of 5,642 Brazilian PhDs in Computer Science who published scientific papers between 1991 and 2012. For each curriculum, we gathered the titles of all published full papers. Then, we automatically extracted terms from these titles and performed trend analysis for each term. Figure 1 illustrates the schematic data flow used in this work. Each method is described below.

3.1 Data Gathering

Brazil has a unique curricula platform called Lattes Platform¹. This system maintains the curricula of the major researchers working in Brazil and currently contains more than 4 million registered curricula. In this work, all the information used comes from the Lattes Platform.

For data gathering, all the Computer Science PhDs' curricula were selected for the periods analyzed (comprising 5,642 curricula). The information extracted was organized and stored in a relational database using the methodology described by Digiampietri et al. [6]. From these curricula, 55,710 titles were identified from papers published between 1991 and 2012. For the experiments, the prediction was made for the year 2012.

3.2 Term Extraction

In this step, the goal was to automate the data preparation. The first part of term extraction was to split the titles into subsets of words or sequences of words without stop-words. As an example, the title Social Network Analysis For Digital Media was split into the following terms: Social, Network, Analysis, Digital, Media, Social Network, Network Analysis, Digital Media and Social Network Analysis.
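The splitting step can be illustrated with a minimal sketch. The stop-word list, the function name `candidate_terms` and the `max_len` limit are assumptions made for illustration only; the paper does not state which stop-word list or maximum term length was used.

```python
import re

# Hypothetical, deliberately small stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def candidate_terms(title, max_len=3):
    """Return every word n-gram (1..max_len words) that does not cross a stop word."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    # Split the title into runs of consecutive non-stop words.
    segments, current = [], []
    for word in words:
        if word.lower() in STOP_WORDS:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(word)
    if current:
        segments.append(current)
    # Emit all contiguous n-grams inside each run, shortest n-grams first.
    terms = []
    for n in range(1, max_len + 1):
        for segment in segments:
            for start in range(len(segment) - n + 1):
                terms.append(" ".join(segment[start:start + n]))
    return terms

example_terms = candidate_terms("Social Network Analysis For Digital Media")
```

With this stop-word list, the example title yields the nine terms listed above, because "For" separates the title into two runs of words.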

With all the possible sets of terms, we used a scoring system to identify the most important terms. This scoring method was based on the adjacent frequency of the words that compose the terms. The equation used to calculate the importance of each candidate term is

\[
LRF(CT) = f(CT) \times \left( \prod_{i=1}^{T} (LF(N_i) + 1)(RF(N_i) + 1) \right)^{1/T} > 1.0
\]

where f(CT) is the frequency of the candidate term CT, N_i is the i-th of the T words that compose CT, and LF(N_i) and RF(N_i) indicate the frequencies of the left and right candidates of N_i, respectively. This equation is described in detail by Nakagawa et al. [11].
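As a rough illustration of the score, the sketch below computes it over stop-word-free word runs. It assumes that LF(N) and RF(N) count how often word N occurs with a neighbouring word immediately to its left or right; this is one possible reading, and the exact counting rule is the one given by Nakagawa et al. [11]. The function name `lrf_scores` and the input format are hypothetical.

```python
from collections import Counter
from math import prod

def lrf_scores(segments, max_len=3):
    """segments: list of word lists (stop-word-free runs taken from titles)."""
    term_freq = Counter()   # f(CT): occurrences of each candidate term
    left_freq = Counter()   # LF(N): times word N has a neighbour on its left (assumed reading)
    right_freq = Counter()  # RF(N): times word N has a neighbour on its right (assumed reading)
    for seg in segments:
        for i, word in enumerate(seg):
            if i > 0:
                left_freq[word] += 1
            if i < len(seg) - 1:
                right_freq[word] += 1
        for n in range(1, max_len + 1):
            for start in range(len(seg) - n + 1):
                term_freq[tuple(seg[start:start + n])] += 1
    scores = {}
    for term, freq in term_freq.items():
        # Geometric-mean style factor over the T words of the term, exponent 1/T.
        product = prod((left_freq[w] + 1) * (right_freq[w] + 1) for w in term)
        scores[" ".join(term)] = freq * product ** (1.0 / len(term))
    return scores

example_scores = lrf_scores([
    ["social", "network"],
    ["social", "network", "analysis"],
    ["network"],
])
```

Terms whose score exceeds the threshold in the equation (1.0) would then be kept as candidates and ranked by score.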

We observed that the composed terms were more significant than the simple terms for the subjects discussed in the publications. Therefore, we selected the 1,638 most important composed terms. These 1,638 terms were extracted in the three different periods used for the experiments, as explained in Section 4.

3.3 Social Network Analysis

3.3.1 The network

In this paper, the social network was modeled based on scientific collaboration. Sonnenwald defined scientific collaboration as the interaction among two or more scientists sharing the same goal to facilitate the development of tasks [15]. More specifically, the network was built according to joint publications (coauthorship relationships). The social network was modeled as an undirected graph composed of vertices (authors) and edges (coauthorships).
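A minimal sketch of such a graph, assuming each publication is available as a list of author identifiers (the input format and the function name `build_coauthorship_graph` are hypothetical, not taken from the paper):

```python
from itertools import combinations

def build_coauthorship_graph(papers):
    """papers: iterable of author-id lists, one list per publication.
    Returns an undirected graph as {author: set of coauthors}."""
    graph = {}
    for authors in papers:
        unique = sorted(set(authors))
        for author in unique:
            # Single-author papers still create a vertex, with no edges.
            graph.setdefault(author, set())
        for a, b in combinations(unique, 2):
            # Store the edge in both directions because the graph is undirected.
            graph[a].add(b)
            graph[b].add(a)
    return graph

example_graph = build_coauthorship_graph([
    ["ana", "bruno"],
    ["bruno", "carla", "ana"],
    ["davi"],
])
```

An adjacency-set representation is used here only to keep the sketch dependency-free; a graph library would serve equally well.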

Figures 2, 3 and 4 were adapted from [7], which used the same dataset as this work but aimed to characterize the social network of Brazilian Computer Science PhDs. In Figure 2, each node is a Brazilian state, and it is possible to see the interstate relations. In Figures 3 and 4, each node represents a PhD in Computer Science, colored according to his or her Brazilian state. The algorithm that produced these figures uses a force-directed approach, in which there is a repulsion force between nodes that are not related and an attraction force between nodes that are related. The two networks differ because Figure 3 contains only edges between nodes from the same state, whereas the network in Figure 4 contains all the edges of this academic social network.

¹ http://lattes.cnpq.br/

Figure 2: Computer science PhDs social network - Brazilian states

Figure 3: Brazilian Computer Science PhDs social network

3.3.2 The metrics

Social network metrics consist of different characteristics that can be quantified. In the proposed approach, some metrics were selected to become part of the set of independent variables. The selection was based on assumptions about the capability of each metric to explain information spreading. For example, one of the assumptions is that a node included in the giant component of a network is more capable of disseminating information through the network than a node that is not included. The metrics selected are: giant composition, shortest path to the most central node, degree centrality, eigenvector centrality, page rank centrality, betweenness centrality, closeness centrality, clustering coefficient, structural equivalence with the most central node and community average centrality. The centrality metrics can explain how important a node is in the network, the shortest path metric indicates how far the node is from the central node, and the structural equivalence shows how similar the node being analyzed is to the most central node. The most important/central node was used as the reference. To justify the use of the most central node as the reference, Table 1 shows the difference in degree and eigenvector centrality between the most central node and the other top ten most important nodes in the network.

Table 1: Eigenvector centrality and degree of the top ten central nodes

Top important nodes   Eigenvector Centrality   Degree
1                     1.000                    67
2                     0.986                    45
3                     0.944                    45
4                     0.845                    31
5                     0.825                    35
6                     0.799                    29
7                     0.798                    37
8                     0.763                    24
9                     0.745                    30
10                    0.744                    27
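Three of the simpler metrics named above (degree centrality, shortest path to the most central node and giant component membership) can be sketched without external libraries. The toy graph is invented for illustration, and choosing the reference node by degree is an assumption; the paper does not state which centrality defines the most central node.

```python
from collections import deque

def degree_centrality(graph):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {v: (len(nbrs) / (n - 1) if n > 1 else 0.0) for v, nbrs in graph.items()}

def bfs_distances(graph, source):
    """Hop distance from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def giant_component(graph):
    """Vertex set of the largest connected component."""
    seen, best = set(), set()
    for v in graph:
        if v not in seen:
            component = set(bfs_distances(graph, v))
            seen |= component
            if len(component) > len(best):
                best = component
    return best

# Invented toy network: a four-node component plus an isolated pair.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"},
}
centrality = degree_centrality(toy)
central = max(centrality, key=centrality.get)   # reference node (by degree, an assumption)
dist_to_central = bfs_distances(toy, central)   # unreachable nodes are simply absent
giant = giant_component(toy)
```

Nodes outside the component of the reference node have no finite shortest path to it, so any real pipeline needs a convention for that case; the paper does not say which one was used.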

Figure 4: Brazilian Computer Science PhDs social network - Reorganized

After the selection of metrics, all related metrics were calculated for each node. Then, for each of the 1,638 extracted terms, the metrics of the nodes were summed. However, instead of building the variables with the summed metrics, the nodes were grouped into communities (we used the algorithm proposed by Clauset et al. [5] to identify the communities). We assumed that, for most of the variables, using the communities' metrics instead of the individual node metrics would better explain information diffusion. Node-level values would yield misleading social metrics for terms that are widespread within a single community but not across the network as a whole: if many nodes of the same community use the same term, the summed metrics become skewed. This intra-community balancing reflects the assumption that information inside a community propagates quickly and tends to become common knowledge among the nodes belonging to it. Therefore, the final values of the variables were based on the communities' values. For example, let us assume that a term A is used by two authors: author 1 and author 2. If author 1 and author 2 are in the same community, the metrics would be calculated from the average of that community's metrics for the authors that used term A. But if author 1 and author 2 are in different communities, the average metric values of each community would be calculated and these results would then be summed. The details of the metrics are described below.

Giant composition: number of nodes in the giant component;
Shortest path to the most central node: the shortest path to the most central node;
Degree centrality: average degree centrality of the nodes inside the community;
Eigenvector centrality: average eigenvector centrality of the nodes inside the community;
Page rank centrality: average page rank centrality of the nodes inside the community;
Betweenness centrality: average betweenness centrality of the nodes inside the community;
Closeness centrality: average closeness centrality of the nodes inside the community;
Clustering coefficient: average value of the clustering coefficient of the nodes inside the community;
Structural equivalence with the most central node: average value of the structural equivalence of the nodes inside the community;
Community average centrality: average centrality of all community nodes.
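The community-level aggregation described above can be sketched as follows. This is one possible reading, stated as an assumption: each metric is averaged over all nodes of a community, and a term receives the sum of those averages over the distinct communities of its authors. The names `term_metric`, `community_of` and `node_metric` are hypothetical.

```python
from collections import defaultdict

def term_metric(term_authors, community_of, node_metric):
    """Sum, over the communities touched by a term, of the community's average metric."""
    # Average the node-level metric inside each community (assumed: over all its nodes).
    totals, counts = defaultdict(float), defaultdict(int)
    for node, value in node_metric.items():
        totals[community_of[node]] += value
        counts[community_of[node]] += 1
    community_avg = {c: totals[c] / counts[c] for c in totals}
    # Each community contributes once, however many of its authors used the term.
    touched = {community_of[author] for author in term_authors}
    return sum(community_avg[c] for c in touched)

# Invented toy data: authors 1 and 2 share community 0, author 3 is in community 1.
community_of = {"author1": 0, "author2": 0, "author3": 1}
degree = {"author1": 0.2, "author2": 0.4, "author3": 0.5}
same_community = term_metric(["author1", "author2"], community_of, degree)
different_communities = term_metric(["author1", "author3"], community_of, degree)
```

In the toy data, a term used by two authors of the same community gets that community's average (0.3), while a term used across two communities gets the sum of both averages (0.8), which mirrors the author 1 / author 2 example in the text.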

Finally, the social network metrics were put together with the time series prediction, forming the independent variables of the dataset.

For evaluation and model comparison, we used the Relative Absolute Error (RAE). The equation for RAE is

\[
RAE = \frac{\sum_{i=1}^{n} |f_i - y_i|}{\sum_{i=1}^{n} |y_i - \bar{y}|}
\]

where:

\[
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.
\]
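For reference, the RAE can be computed directly from its definition, reading f_i as the predicted value and y_i as the observed value of the i-th instance, which is the usual convention. The function name is hypothetical.

```python
def relative_absolute_error(predicted, actual):
    """Total absolute error relative to that of always predicting the mean of `actual`."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    numerator = sum(abs(f - y) for f, y in zip(predicted, actual))
    denominator = sum(abs(y - mean_actual) for y in actual)
    if denominator == 0:
        raise ValueError("RAE is undefined when all actual values are equal")
    return numerator / denominator

rae_example = relative_absolute_error([1.0, 2.0, 4.0], [1.0, 2.0, 3.0])
rae_mean_baseline = relative_absolute_error([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])
```

A value of 1.0 (100%) means the model is no better than always predicting the mean, so values above 100%, such as those of the parametric models in Table 2, indicate a model that does worse than that baseline.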

4. EXPERIMENTS

The experiments aim to measure and compare the improvement of the proposed model against the time series model.

Figure 5: Examples of regression curves

First, we ran experiments on the time series data and then on the proposed model. Furthermore, as a specific goal, we wanted to assess the importance of the period of analysis for the results. Therefore, we used three different periods: 1991-2011, 2002-2011 and 2007-2011.

4.1 Time Series Trend Analysis

Given one dependent variable and a set of independent variables, a regression analysis can be formulated as

\[
Y \approx f(X, \beta)
\]

where the dependent variable Y can be approximated by the independent variables X and the respective parameters β for a function f.

Before adding the social network variables to the trend analysis, we performed time series regression analysis with the TF-IDF importance index (a widespread index for measuring term importance) as the dependent variable and time (year) as the independent variable. For each term extracted by the method described in Section 3.2, TF-IDF was calculated for each year in each of the three experimentation periods. Then, parametric and non-parametric methods were applied.

For the parametric tests, we did not use a single kind of regression for all the time series. Instead, we worked with the regression (linear or nonlinear) that best fitted each series, using Ordinary Least Squares to evaluate the fit. The kinds of regression used were linear, exponential, logarithmic, power law and polynomial of degree two to five. The regression curves for some terms are shown in Figure 5.
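The per-series model selection can be sketched as follows. This is a simplified illustration, not the authors' implementation: the exponential and power-law forms are fitted by least squares on log-transformed data (a common shortcut that is not identical to least squares on the original scale), polynomial fits are omitted to keep the sketch dependency-free, and x is assumed to be a positive year index (1, 2, ...) rather than the calendar year.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def candidate_models(xs, ys):
    """Yield (name, predict) pairs; log-based forms are skipped for non-positive data."""
    a, b = fit_line(xs, ys)
    yield "linear", lambda x, a=a, b=b: a + b * x
    if all(x > 0 for x in xs):
        a, b = fit_line([math.log(x) for x in xs], ys)
        yield "logarithmic", lambda x, a=a, b=b: a + b * math.log(x)
    if all(y > 0 for y in ys):
        a, b = fit_line(xs, [math.log(y) for y in ys])
        yield "exponential", lambda x, a=a, b=b: math.exp(a + b * x)
    if all(x > 0 for x in xs) and all(y > 0 for y in ys):
        a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
        yield "power law", lambda x, a=a, b=b: math.exp(a) * x ** b

def best_model(xs, ys):
    """Pick the candidate with the smallest sum of squared residuals on the series."""
    def sse(predict):
        return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))
    return min(candidate_models(xs, ys), key=lambda item: sse(item[1]))

years = [1, 2, 3, 4, 5]
linear_name, linear_predict = best_model(years, [2 * x + 1 for x in years])
exp_name, exp_predict = best_model(years, [3 * math.exp(0.5 * x) for x in years])
```

The selected function is then evaluated at the next index to obtain the prediction for the target year. Selecting by in-sample fit favours flexible curves, which may be one reason the parametric errors in Table 2 are large, although the paper does not analyze this.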

For the non-parametric tests, we used Artificial Neural Network (ANN), Support Vector Machine (SVM) and Rotation Forest, trying to approximate the function that best explains the distribution of the historical series.

Table 2 shows the best RAE results for the three periods for the time series trend analysis. For the parametric methods, the best result was obtained for the longest period, while for the non-parametric methods, the shortest period was the best. We can see that the non-parametric methods produced more accurate results than the parametric ones.

Table 2: Parametric and non parametric regression RAE results for three periods

Period      Parametric   Non Parametric
1991-2011   113.16%      51.52%
2002-2011   136.42%      51.01%
2007-2011   288.14%      50.31%

4.2 Proposed Model: Adding Social Network Features to Trend Analysis

In the previous subsection, we modeled the problem in such a way that the TF-IDF index of each term depends only on the year. In the proposed model, with the addition of the social network information, the problem is modeled in the following way

\[
\text{TF-IDF}(Term) = SNM(Term) + RR(Term),
\]

where SNM is the set of social network metrics built as described in Section 3.3.2 and RR is the best regression result of the term for the prediction year.

For these experiments, we used four prediction techniques: Linear Regression, Artificial Neural Network (ANN), Support Vector Machine (SVM) and Rotation Forest. We varied the parameters of each technique, generating 16 tests for ANN, 9 tests for SVM and 15 tests for Rotation Forest. Furthermore, we varied the type of attribute selection. We generated datasets with all attributes and datasets with attributes selected either by Relief or manually; manual selection is a good method if the analyst has knowledge about the dataset. The exception is linear regression, which was executed for all possible attribute sets, and the set with the lowest RAE was chosen.

Table 3 presents the best results (RAE) for each technique according to the periods and selection methods.

From the techniques' point of view, the best performances, as shown in Table 3, were achieved by Rotation Forest. Rotation Forest achieved the best performance for the shorter periods, while SVM did better for the long period, obtaining better results than Rotation Forest in the 1991-2011 period.

With respect to the periods, the best single result was obtained in the 2007-2011 period (39.28%). However, in general, the best results were obtained in the 2002-2011 period. The average RAE values over the results reported in Table 3 are: 43.77% for 2002-2011; 51.57% for 2007-2011; and 69.68% for 1991-2011. There is an important difference between the models at this point. While the parametric model had better results for long periods (Table 2), the non-parametric and the proposed models had better results for shorter periods. For the proposed model, this can be explained by the dynamism of the network. Metrics built statically from a network spanning a long period can produce mistaken values, i.e., some metrics can indicate characteristics of the network that were true in the past but no longer held at the end of the period.

Table 3: Best results of each prediction technique

Technique     Period   All attributes   Manual selection   Relief
Linear Reg.   91-11    72.18%           -                  -
ANN           91-11    72.95%           73.43%             73.44%
SVM           91-11    65.21%           62.21%             64.15%
Rot. Forest   91-11    71.51%           70.57%             71.18%
Linear Reg.   02-11    53.22%           -                  -
ANN           02-11    45.75%           46.25%             46.23%
SVM           02-11    43.45%           41.05%             41.30%
Rot. Forest   02-11    40.29%           40.04%             40.12%
Linear Reg.   07-11    62.76%           -                  -
ANN           07-11    57.77%           59.02%             57.98%
SVM           07-11    53.31%           52.37%             51.22%
Rot. Forest   07-11    41.68%           39.28%             40.33%

Comparing the best results of the proposed model with the parametric model, we had error reductions of 42%, 70% and 85% for the 1991-2011, 2002-2011 and 2007-2011 periods, respectively. Comparing the best results with the non-parametric model, we had an error increase of 20% for 1991-2011 and error reductions of 20% and 25% for 2002-2011 and 2007-2011, respectively.

Table 4 compares the results of 15 terms across the models. For reference, these terms were selected based on the main trends calculated by the pure time series trend analysis (parametric model). In this table, the real TF-IDF of each term is compared with the predictions of the parametric and non-parametric time series models and with the results of the proposed model. The prediction technique used for the proposed model was Rotation Forest for the period 2007-2011 (the best prediction results presented, as we can see in Table 3).

The accuracy gain displayed in Table 4 is a sample of the improvement in trend analysis obtained by using social network features. For the terms in Table 4, the error produced by the proposed model corresponds, on average, to only 17% and 18% of the error produced by the parametric and non-parametric models, respectively, which do not use social network features.
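The 17% and 18% figures can be checked against the error columns of Table 4. The sketch below assumes that "on average" means the ratio of summed (equivalently, mean) absolute errors over the 15 terms; this reading reproduces the reported percentages, but the paper does not spell out the computation.

```python
# Absolute errors copied from Table 4, one row per term:
# (parametric error, non-parametric error, proposed-model error)
table4_errors = [
    (306.35, 76.74, 11.77),   # service discovery
    (268.97, 94.21, 5.91),    # based approach
    (186.97, 34.76, 1.05),    # information systems
    (124.06, 28.60, 30.35),   # supply chain
    (72.46, 35.14, 24.23),    # web services
    (116.57, 306.69, 20.26),  # product line
    (166.58, 66.95, 8.78),    # motion estimation
    (20.38, 78.65, 50.11),    # social network
    (108.34, 132.50, 12.14),  # business process
    (66.97, 45.29, 3.76),     # time series
    (34.51, 352.45, 14.51),   # neural network
    (68.62, 31.24, 6.52),     # sign language
    (19.09, 120.42, 46.15),   # sao paulo
    (28.39, 24.07, 20.26),    # genetic programming
    (46.05, 94.50, 17.36),    # routing problem
]

parametric_total = sum(row[0] for row in table4_errors)
non_parametric_total = sum(row[1] for row in table4_errors)
proposed_total = sum(row[2] for row in table4_errors)

ratio_vs_parametric = proposed_total / parametric_total
ratio_vs_non_parametric = proposed_total / non_parametric_total
print(f"{ratio_vs_parametric:.1%} of the parametric error")
print(f"{ratio_vs_non_parametric:.1%} of the non-parametric error")
```

Note that the proposed model is not better on every individual term (for example, "social network" and "sao paulo" have lower parametric errors), so the gain is an aggregate one.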

5. DISCUSSION

As discussed before, time series and content-based analysis has been widely used to predict trends. It assumes that all information is generated in the same way, except for the time dimension. However, the content generated by people, primarily on the internet, is clearly influenced by the connections of its generators with other people. Intending to fill this gap, we presented a new concept of trend analysis that adds the social network factor to a content-based trend analysis model. The proposed model achieved better results than the time series based model. In addition to simple prediction techniques such as linear regression, we applied more robust techniques that resulted in even more accurate models. As we supposed, these findings cast light on the issue of trend prediction. Information content and characteristics of the social structure of the information sources can be combined to improve the explanation of the temporal behavior of information.

Table 4: Comparison of the models' results for the top 15 trends of the time series prediction model in 2012

Term                  Real     Parametric   Error    Non parametric   Error    Proposed   Error
service discovery     135.17   441.52       306.35   58.43            76.74    123.39     11.77
based approach        155.19   424.16       268.97   249.40           94.21    161.10     5.91
information systems   147.32   334.29       186.97   182.08           34.76    148.37     1.05
supply chain          174.31   298.37       124.06   145.71           28.60    143.96     30.35
web services          225.28   297.74       72.46    190.14           35.14    201.05     24.23
product line          174.99   291.57       116.57   481.68           306.69   154.73     20.26
motion estimation     107.78   274.36       166.58   174.73           66.95    99.00      8.78
social network        249.05   269.42       20.38    327.70           78.65    198.94     50.11
business process      131.75   240.09       108.34   264.25           132.50   119.61     12.14
time series           150.79   217.76       66.97    196.08           45.29    147.03     3.76
neural network        213.36   178.86       34.51    565.81           352.45   198.85     14.51
sign language         108.21   176.83       68.62    76.97            31.24    101.69     6.52
sao paulo             191.93   172.84       19.09    71.51            120.42   145.79     46.15
genetic programming   128.25   156.64       28.39    104.18           24.07    107.98     20.26
routing problem       101.11   147.16       46.05    195.61           94.50    83.75      17.36

This work explored a concept still little studied by other researchers and, thus, there are some shortcomings to be worked on. The dynamism of the social network is one of them. We modeled the social network over the whole time window; however, slicing the time interval will probably improve the prediction models by better representing how the social structures change over time. Another improvement can be made from the application's point of view. Grouping the extracted terms into topics can be more relevant for scholars than analyzing each term alone. In conclusion, we found that considering the social structure of data sources as a supporting perspective can help to explain the temporal behavior of information.

Acknowledgments

This work was partially funded by FAPESP, CAPES and CNPq.

6. REFERENCES

[1] H. Abe and S. Tsumoto. Evaluating a method to detect temporal trends of phrases in research documents. In 2009 8th IEEE International Conference on Cognitive Informatics, pages 378–383. IEEE, 2009.

[2] Y. Altshuler, W. Pan, and A. Pentland. Trends prediction using social diffusion models. . . . -cultural modeling and prediction, pages 97–104, 2012.

[3] E. Bakshy, I. Rosenn, C. Marlow, and L. Adamic. The role of social networks in information diffusion. In Proceedings of the 21st international conference on World Wide Web, pages 519–528. ACM, 2012.

[4] O. Cimenler, K. A. Reeves, and J. Skvoretz. A regression analysis of researchers' social network metrics on their citation performance in a college of engineering. Journal of Informetrics, 8(3):667–682, July 2014.

[5] A. Clauset, M. E. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[6] L. Digiampietri, J. Mena-Chalco, J. de Jesus Perez-Alcazar, E. F. Tuesta, K. Delgado, and R. Mugnaini. Minerando e caracterizando dados de currículos Lattes. In CSBC 2012 - BraSNAM, Jul. 2012.

[7] L. A. Digiampietri, C. M. Alves, C. C. Trucolo, and R. A. Oliveira. Análise da rede dos doutores que atuam em computação no Brasil. In CSBC 2014 - BraSNAM, pages 33–44, 2014.

[8] N. Kawamae. Theme Chronicle Model: Chronicle Consists of Timestamp and Topical Words over Each Theme. In Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12, page 2065, New York, New York, USA, 2012. ACM Press.

[9] A. Kontostathis, L. Galitsky, and W. Pottenger. A survey of emerging trend detection in textual data mining. Survey of Text . . . , pages 1–44, 2004.

[10] H. Moed, W. Glanzel, and U. Schmoch. Handbook of quantitative science and technology research. 2004 ed, 2004.

[11] H. Nakagawa and T. Mori. A Simple but Powerful Automatic Term Extraction Method. In COLING-02 on COMPUTERM 2002: Second International Workshop on Computational Terminology - Volume 14, COMPUTERM '02, pages 1–7, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.

[12] W. Pan, N. Aharony, and A. Pentland. Composite social network for predicting mobile apps installation. In AAAI, 2011.

[13] H. Park, E. Kim, K.-J. Bae, H. Hahn, T.-E. Sung, and H.-C. Kwon. Detection and Analysis of Trend Topics for Global Scientific Literature Using Feature Selection Based on Gini-Index. In 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence, pages 965–969. IEEE, 2011.

[14] S. Singh, N. Mishra, and S. Sharma. Survey of various techniques for determining influential users in social networks. In Emerging Trends in Computing, Communication and Nanotechnology (ICE-CCN), 2013 International Conference on, pages 398–403, March 2013.

[15] D. H. Sonnenwald. Scientific collaboration. Annual Review of Information Science and Technology, 41(1):643–681, 2007.

[16] L. A. Teixeira and A. L. I. de Oliveira. Predicting stock trends through technical analysis and nearest neighbor classification. In 2009 IEEE International Conference on Systems, Man and Cybernetics, pages 3094–3099. IEEE, 2009.

[17] C. C. Trucolo and L. A. Digiampietri. Trend Analysis of the Brazilian Scientific Production in Computer Science. FSMA, 14:2–9, 2014.

[18] C. C. Trucolo and L. A. Digiampietri. Uma Revisão Sistemática acerca das Técnicas de Identificação e Análise de Tendências. In X Simpósio Brasileiro de Sistemas de Informação (SBSI 2014), pages 639–650, Londrina, 2014.
