arxiv:1706.09859v1 [cs.si] 29 jun 2017 · pdf filevalverde and solé ... 5more...

Complex Networks Analysis for Software Architecture: anHibernate Call Graph Study

Daniel Henrique Mourão Falci Orlando Abreu Gomes Fernando Silva [email protected] [email protected] [email protected]

LAIS – Laboratory for Advanced Information SystemsFUMEC University

Av. Afonso Pena 3880 - 30130-009 - Belo Horizonte - MG - Brazil

Abstract

Recent advancements in complex net-work analysis are encouraging and mayprovide useful insights when appliedin software engineering domain, reveal-ing properties and structures that can-not be captured by traditional met-rics. In this paper, we analyzed thetopological properties of Hibernate li-brary, a well-known Java-based soft-ware through the extraction of itsstatic call graph. The results reveala complex network with small-worldand scale-free characteristics while dis-playing a strong propensity on formingcommunities.

1 Introduction

In the Software Engineering field, one has beenobserving an unceasing search for quantitativemeasures capable of assessing internal softwareattributes such as maintainability, reusabil-ity, agility, among others. The argument isthat these characteristics, also known as ar-chitecturally significant requirements (ARS),when combined, determine the product qual-ity (Chen et al., 2013), what in turn affectsthe development costs, particularly while un-der the software maintenance cycle. Accord-ing to Nelson (2005), this stage responds for ashare between 50% and 80% of the total de-velopment cost.The first measurement processes were based

on metrics that are relatively simple to ob-tain such as the amount lines of code, cyclo-matic complexity and commentaries percent-age on source code (Hudli et al., 1994; Leeet al., 1994). With the strengthen of object-oriented programming paradigm and due tothe high complexity of nowadays systems, a

set of metrics focused on describing the associ-ation levels among software components arose.In this sense, we highlight encapsulation, cou-pling, and cohesion (Allen and Khoshgoftaar,1999; Mitchell, 2003). These measures are par-ticularly useful for the Software architecture,a sub-field of Software Engineering that inves-tigates how to best to segment a system intoindependent and intercommunicating compo-nents, favoring the software attributes men-tioned above (Shaw et al., 1995).A natural choice to study the association

level among software components is throughthe usage of call graphs, where vertexes mayrepresent software elements in a system, andthe edges map their calls, giving shape to acomplex network of relationships (Pezzè andYoung, 2009). Such representation may be dy-namic, indicating that it was captured duringsystem execution, or static that on the con-trary, analyzes the source code structure. Callgraphs have proven its utility in many softwaredevelopment activities such as compiler op-timization and program understanding (Gra-ham et al., 1982; Bohnet and Döllner, 2006).The complex networks theory field, in turn,

boosted by the sharp increase in computingpower in recent years, has evolved signifi-cantly. Its new methods and techniques cre-ated a favorable environment for the develop-ment of new approaches applicable on pattern-discovery, community and anomaly detection,trend prediction and others in several fields,from social to biological sciences (Watts andStrogatz, 1998; Barabási and Albert, 1999;Newman, 2010; Blondel et al., 2008).Considering this scenario, the following re-

search question emerge: "Which software at-tributes may be revealed through the appli-cation of common complex network analy-sis measures on call graphs?". We also in-

arX

iv:1

706.

0985

9v1

[cs

.SI]

29

Jun

2017

Caller type Caller instruction Callee type Callee instructionM org.hibernate.cfg.Configuration::setProperty M java.util.Properties::setPropertyM org.hibernate.cfg.Configuration::addPackage M org.hibernate.boot.MetaDataSources::addPackageM org.hibernate.cfg.Configuration::registerTypeContributor I java.util.List::addM org.hibernate.cfg.Configuration::registerTypeOverride O org.hibernate.type.CustomType::newC org.hibernate.cfg.Configuration C org.hibernate.type.CustomType

Table 1: CallGraphExtractor sample output.

tend to analyze the topological and basalproperties of software in network theory field.In this work, we utilize the Hibernate li-brary, a well-known Object/Relational Map-ping (ORM) framework, widely employed inJava-based enterprise applications.

The rest of this paper is organized as fol-lows: In section 2 we present the related work.In section 3, we expose our methods and ma-terials. In section 4 we discuss our results, andfinally, in section 5 we conclude our work.

2 Related workValverde and Solé (2003) analyzed an exten-sive collection of networks derived from object-oriented software diagrams and call graphs. Intheir work, the authors identified that all ana-lyzed systems displayed small-world and scale-free characteristics, claiming that encapsula-tion - a programming technique that restrictsthe direct access to internal software compo-nents, isolating its internal complexity fromthe rest of the system (Mitchell, 2003) - is theleading cause of such behavior. They also sug-gest that all object-oriented systems may ex-hibit the same behavior.Bhattacharya et al. (2012) analyzed the evo-

lution of eleven popular open source systems,with the purpose of developing error severitypredictors based on static call graphs alongwith staff collaboration graphs. This studyproposed a measure known as NodeRank, aderivation of PageRank, which generates arank of software components whose bugs areprone to higher severity.Operational systems were also analyzed un-

der the light of network theory. Through staticcall graphs, the source code of Linux kernel(version 2.6.27), was investigated by Wangand Ding (2015). The authors described thetopological properties of its strongly connectednetwork portion, known as the giant compo-nent. In their study, they have found small-world and scale-free characteristics. Wang

et al. (2009) on the other hand, analyzed theevolution of filesystem and driver modules.In their study, they captured the static callgraphs for 223 versions of such modules ap-plying common network metrics in a timeline,while proposing a metric to find major struc-tural changes.Ying and Ding (2012) performed a topologi-

cal investigation of static call graphs in a Javaprogram, focusing on the centrality measures.This study though, does not describe any di-rect association to the software engineeringfield.

3 Materials and methods

This paper relied on the static call graph repre-sentation extracted from Hibernate library inits version 5.1.31. All the undertaken analy-sis presented here utilized the software Gephi2and the complex network analysis packagenamed NetworkX3, for the Python language.In general, we chose Gephi to create the net-work’s visual representations and to acquireits basal properties. NetworkX, on the otherhand, was selected to perform data manipula-tions and any deeper investigations.The following subsections will expose the

procedures employed to create our call graphrepresentation and the common metrics uti-lized in this study.

3.1 Call graph extractionTo extract the static call graph structure fromHibernate, we developed a tool named CGE4.This tool is capable of reading the bytecode ofJava classes embedded into JavaArchive files

1We analyzed only the core.jar file, distributedalong with the full version of Hibernate and download-able at http://www.hibernate.org

2Available at https://gephi.org/users/download/

3More information at https://networkx.github.io

4The source code is publicly available at https://github.com/anonymous/softwarename

http://www.hibernate.org

https://gephi.org/users/download/

https://gephi.org/users/download/

https://networkx.github.io

https://networkx.github.io

https://github.com/anonymous/softwarename


(JAR files) with the goal of extracting thecaller and the callee for all instructions of eachclass contained in the referred file. Grosslyspeaking, our software analyzes internal meth-ods of a class, regardless of their visibility(public, private, static, and so on), creatinga relationship table in an output file.Table 1 illustrates the CGE output. The

caller and callee types must assume a value inthe following set: M that stands for a method,O for an object, I for an interface, S for astatic call and C for a class usage relation-ship. Based on standard Java 8TM call nota-tion5, the instruction fields value indicate thefull qualified name of a class and, when avail-able, its respective method separated by the<::> symbol. Therefore, observing the firsttable row, we say that the method setProp-erty from org.hibernate.cfg.Configuration classcalls the method setProperty that belongs tothe class java.util.Properties.

When applied to Hibernate source code, theCGE extracted a relationship table containing143,768 calls.

3.2 Graph modelingThe call graph structure exemplified abovemay serve as the raw material to several graphrepresentations. Its vertex set may containonly objects or methods, or objects and meth-ods combined, using directed or undirectededges, weighted or not, in a k-partite repre-sentation. After experimenting some of thesecombinations, we decided to take a directedrepresentation whose edges are unweighted,and vertexes contain methods and also objectsin a unipartite graph.In our method, for each extracted call, we

create three vertexes: The caller method, thecallee class, and the callee method. If thecaller and callee classes are different, indicat-ing an object transition, we link them withtwo edges observing the rule: [(callerMethod,calleeClass), (calleeClass, calleeMethod)]. Ifthe caller and callee classes are equals, de-noting an internal call, then the rule appliedrule is different: [(callerMethod, calleeClass),(callerMethod, calleeMethod)]. It is worth not-ing that our representation cannot be consid-

5More information at https://docs.oracle.com/javase/tutorial/java/javaOO/methodreferences.html

ered bipartite due to the presence of edgeslinking same type vertexes, in our case, meth-ods. A sample source code and our corre-sponding graph representation are exempli-fied in Figure 1. In the sample source code,for instance, the method doSomething of classSampleNetwork evokes the methodmethod1 ofclass ClassA. In this case we create the threevertexes: doSomething, ClassA, method1. Wealso create two edges linking them in a struc-ture such as [(doSomething, ClassA), (ClassA,method1)].To create our graph representation, we im-

plemented a software6 in Python language,which takes as input the structure generatedby the CGE and, using the NetworkX pack-age, generates the graph in Graph ExchangeXML Format (GEXF) following the rules in-troduced in this section. This tool was appliedin the static call structure captured from Hi-bernate and, in order to constrain the graphrepresentation to the library domain, we dis-carded the calls whose caller or callee instruc-tions did not belong to the "org.hibernate"package, consequently removing any call tothe native Java library. Following these pro-cedures, we obtained a graph comprised of27,556 vertexes and 57,919 edges.

3.3 Network measuresIn this section we provide a general overviewof the metrics utilized in this study, contextu-alizing each of them in software architecturefield.

3.3.1 Vertex degreeThe vertex degree is one of the most funda-mental measures of complex network analysis.In a undirected graph, the degree correspondsto the number of edges linked to a given ver-tex7. In directed graphs, though, the metric ispartitioned into two. The in-degree representsthe number of incoming edges in a vertex whilethe out-degree stands for the outgoing edges.In software architecture, this measure may

be relevant since it provides means to assessthe usage level of a given software component,what may, in turn, reveal overused elements oreven the coupling level among components.

6The source code is publicly available at https://github.com/anonymous/softwarename

7Self-loops are counted twice

https://docs.oracle.com/javase/tutorial/java/javaOO/methodreferences.html





Figure 1: Resulting graph from a sample java source code

3.3.2 BetweennessThe measure known as Betweenness is a cen-trality measure in complex networks theory.It reveals how many shortest paths in the net-work pass through each of its vertexes. In thispaper, we utilized the algorithm proposed byBrandes (2001) to calculate it.We selected this measure based on the per-

ception that it may provide a suitable repre-sentation for object encapsulation (Mitchell,2003). The idea is that an encapsulated soft-ware component shall provide few entry pointsto the rest of the system, hiding its inter-nal connections and structures. This conceptleads to a higher betweeness degree for the en-try components. Therefore, this measure mayindicate encapsulation level of a given softwareelement, a useful information while in softwaredevelopment cycle.

3.3.3 PageRankPageRank is a centrality measure in complexnetwork analysis that is commonly treated asa measure of importance. It is calculatedthrough the analysis of the number of edgesreceived by a vertex and by the quality of thevertexes that bind to it, which is denoted by itsown PageRank. In its calculation procedure,takes into account the out-degree of each ver-tex, normalizing the index (Newman, 2010).In general, the method translates the notionthat a vertex is important if it is connected toother important vertexes.When applied to software architecture, this

measure may be helpful in revealing key soft-ware components, which is obviously impor-

tant in several activities such as software un-derstanding and bug severity prediction.

3.3.4 ModularityModularity is a topological measure that isconcerned with the propensity of a givennetwork to form communities. Formally, itidentifies network sections whose vertexes aredensely connected to each other (internal tothe group) and weakly connected to vertexesbelonging to another vertexes group. In thispaper, we utilize the algorithm proposed byBlondel et al. (2008), which translates thisconcept into a number ranging from 0 to 1,where higher values indicate a higher propen-sity to form communities. Another relevantresult of this algorithm is that it associateseach vertex with a group what may provideuseful information for further processing.In software architecture this technique may

dynamically discover groups of related soft-ware structures, considering the intensityof their relationships. This fact indicatesa remarkable characteristic since traditionalgrouping structure is typically static, packageoriented and established during the artifactcreation. The dynamic community detectionmay be useful in detecting structural problems(code smells) and patterns.

4 Results and DiscussionIn this section we present the basal proper-ties of Hibernate network analyzing its small-world, scale-free and community properties.Table 2 presents the values for some of themain complex network measures. Among

them, we highlight the average shortest path(19.664), which stays on pair with the valuesachieved for the Linux kernel (≈ 21) (Wangand Ding, 2015).

Measure ValueAverage clustering coefficient 0.194Average shortest path 19.644Average degree 2.02Network diameter 62Modularity 0.838Number of communities 446

Table 2: Hibernate network measures

In order to verify the existence of smallworld characteristics (Watts and Strogatz,1998) in our Hibernate network representa-tion, we calculated the link creation proba-bility p as shown by the equation 1, wherel is the number of edges and n representsthe number of vertexes. With the p value,we proceeded to the generation of a randomnetwork based the Erdős–Rényi model (Erdősand Rényi, 1959). For the random network,we registered an average clustering coefficientC = 0 and an average shortest path d = 3.479.These values were obtained from the meanof the values obtained by five random net-works. The value of C is derived from thevery low value of p, which may produce ran-dom graphs with sparse connectivity, so sparsethat may result in a disconnected graph, asit is the case. Through the comparison ofthe values obtained from Hibernate and ran-dom networks (formally reported in the equa-tion 2), one may say that the Hibernate net-work presents small-world characteristics. Itsclustering coefficient is much higher than theone of the random network, while the aver-age shortest path in both networks is close,although the average shortest path obtainedfor Hibernate network is more than five timesgreater than the value achieved by the randomnetwork.

p = 2ln (n− 1) = 2(57, 919)

27, 556 (27, 555) ≈ 0.0001525

(1)

CHibernate � CRandom and dHibernate ≈ dRandom

0, 194 � 0 and 19.664 ≈ 3.479(2)

Figure 2: Network appearance - top modular-ity classes are colored

The Hibernate network degree distributionanalysis, as shown in Figure 3, suggests thatit may be classified as a scale-free network(Barabási and Albert, 1999). It follows apower law distribution (P (k) ∼ k−α) whosemain characteristic is the existence of a fewnumber of vertexes possessing a large numberof connections while the vast majority of ver-texes have a small number of links. The log-log plot reveal the angular coefficient α ≈ 2.6.One of the main properties of scale-free net-works is its failure resistance. This definitionthough, must not be used in a software con-text, since a failure on the smallest softwarecomponent may compromise the whole sys-tem. This finding is in compliance with pre-vious studies (Valverde and Solé, 2003; Wanget al., 2009; Ying and Ding, 2012).The modularity analysis was based on the

algorithm described by Blondel et al. (2008),known as Louvain method. The modularityvalue (M = 0.838) is close to the maximumpossible value (1 ) which reveals a high networkpropensity to form communities. In fact, 446communities were detected by this method,which results in an average value of 61.78vertexes per class. The ten biggest classes,though, holds 47.45% of the total number ofvertexes. The Figure 2 illustrates the aspectof the Hibernate network, where vertexes arecolored according to their modularity class.

Rank Betweeness Degree PageRank1 org.hibernate.type.TypeFactory org.hibernate.internal.CoreMessageLogger org.hibernate.internal.util.StringHelper2 org.hibernate.boot.cfgxml.spi.LoadedConfig org.hibernate.internal.CoreMessageLogger.$logger org.hibernate.internal.CoreMessageLogger3 org.hibernate.boot.MetadataSources org.hibernate.engine.spi.SessionImplementor org.hibernate.internal.CoreLogging4 LoadedConfig::consume org.hibernate.engine.spi.SessionFactoryImplementor org.hibernate.engine.spi.SessionImplementor5 org.hibernate.boot.cfgxml.spi.MappingReference org.hibernate.internal.util.StringHelper org.hibernate.engine.spi.SessionFactoryImplementor6 MappingReference::apply org.hibernate.type.Type org.hibernate.type.Type7 org.hibernate.boot.internal.MetadataBuilderImpl org.hibernate.persister.entity.EntityPersister org.hibernate.internal.CoreMessageLogger.$logger8 MetadataSources::getMetadataBuilder org.hibernate.persister.entity.AbstractEntityPersister org.hibernate.type.AbstractSingleColumnStandardBasicType9 MetadataBuilderImpl::build org.hibernate.dialect.Dialect CoreLogging::messageLogger10 org.hibernate.dialect.Dialect org.hibernate.mapping.PersistentClass org.hibernate.boot.spi.SessionFactoryOptions

Table 3: Top ranked software components

Figure 3: Scale-free analysis

Table 3 shows the top 10 vertexes obtainedfrom the analysis of the main complex net-work analysis measures when applied to Hi-bernate’s call graph. Due to the lack of com-parison ground for the task of describing therelevance of such software components, partic-ularly over the Hibernate source code struc-ture, we limited our analysis to present thetop ranked software elements for each networkmeasure.

5 Conclusion

In this paper, we analyzed the propertiesof a widely employed Java-based softwarethrough its static call graphs. We investi-gated its topology that revealed a small-worldand scale-free network, in compliance with thefindings of Valverde and Solé (2003); Ying andDing (2012); Wang et al. (2009). Furthermore,we have identified that the network exhibits astrong propensity to form communities. Fi-nally, we highlighted the top 10 software com-

ponents for some of the main complex networkanalysis measures.The tools we developed for this work may

benefit research in this field of study as itprovides accessible means to create static callgraphs for Java-based software. In future weplan to compare the traditional measures ap-plied in software engineering domain with theones obtained through the application of com-plex network analysis measures. We also planto investigate the application of modularity asway of finding higher level software compo-nents and its respective relevance within sys-tem.

References

Allen, E. B. and Khoshgoftaar, T. M.(1999). Measuring coupling and cohesion:An information-theory approach. In Soft-ware Metrics Symposium, 1999. Proceed-ings. Sixth International, pages 119–127.IEEE.

Barabási, A.-L. and Albert, R. (1999). Emer-gence of scaling in random networks. sci-ence, 286(5439):509–512.

Bhattacharya, P., Iliofotou, M., Neamtiu, I.,and Faloutsos, M. (2012). Graph-basedanalysis and prediction for software evolu-tion. In Proceedings of the 34th Interna-tional Conference on Software Engineering,pages 419–429. IEEE Press.

Blondel, V. D., Guillaume, J.-L., Lambiotte,R., and Lefebvre, E. (2008). Fast unfoldingof communities in large networks. Journalof statistical mechanics: theory and experi-ment, 2008(10):P10008.

Bohnet, J. and Döllner, J. (2006). Visual ex-ploration of function call graphs for featurelocation in complex software systems. InProceedings of the 2006 ACM symposium onSoftware visualization, pages 95–104. ACM.

Brandes, U. (2001). A faster algorithm for be-tweenness centrality. Journal of mathemat-ical sociology, 25(2):163–177.

Chen, L., Babar, M. A., and Nuseibeh,B. (2013). Characterizing architecturallysignificant requirements. IEEE software,30(2):38–45.

Erdős, P. and Rényi, A. (1959). On randomgraphs i. Publ. Math. Debrecen, 6:290–297.

Graham, S. L., Kessler, P. B., and Mckusick,M. K. (1982). Gprof: A call graph execu-tion profiler. In ACM Sigplan Notices, vol-ume 17, pages 120–126. ACM.

Hudli, R. V., Hoskins, C. L., and Hudli, A. V.(1994). Software metrics for object-orienteddesigns. In Computer Design: VLSI inComputers and Processors, 1994. ICCD’94.Proceedings., IEEE International Confer-ence on, pages 492–495. IEEE.

Lee, Y.-S., Liang, B.-S., and Wang, F.-J.(1994). Some complexity metrics for object-oriented programs based on informationflow: A study of c++ programs. J. Inf. Sci.Eng., 10(1):21–50.

Mitchell, J. C. (2003). Concepts in pro-gramming languages. Cambridge UniversityPress.

Nelson, M. L. (2005). A survey of reverse engi-neering and program comprehension. arXivpreprint cs/0503068.

Newman, M. (2010). Networks: an introduc-tion. Oxford University Press Inc., NewYork.

Pezzè, M. and Young, M. (2009). Teste eanálise de software: processos, princípios etécnicas. Bookman Editora.

Shaw, M., DeLine, R., Klein, D. V., Ross,T. L., Young, D. M., and Zelesnik, G.(1995). Abstractions for software architec-ture and tools to support them. IEEE trans-actions on software engineering, 21(4):314–335.

Valverde, S. and Solé, R. V. (2003). Hierar-chical small worlds in software architecture.arXiv preprint cond-mat/0307278.

Wang, L., Wang, Z., Yang, C., Zhang, L.,and Ye, Q. (2009). Linux kernels as com-plex networks: A novel method to study

evolution. In Software Maintenance, 2009.ICSM 2009. IEEE International Conferenceon, pages 41–50. IEEE.

Wang, Y.-F. and Ding, D.-W. (2015). Topol-ogy characters of the linux call graph. In In-formation Science and Control Engineering(ICISCE), 2015 2nd International Confer-ence on, pages 517–518. IEEE.

Watts, D. J. and Strogatz, S. H. (1998). Col-lective dynamics of ‘small-world’networks.nature, 393(6684):440–442.

Ying, L. and Ding, D.-w. (2012). Topologystructure and centrality in a java sourcecode. In Granular Computing (GrC), 2012IEEE International Conference on, pages787–789. IEEE.

arxiv:1706.09859v1 [cs.si] 29 jun 2017 · pdf filevalverde and solé ... 5more...

Documents