

Page 1: Using BabelNet for Analysing COBIT 5 - ULisboa · Enterprise IT (artifact, state or a capability improvement) Metrics measure the achievement of each goal. Practices high-level requirements

Using BabelNet for Analysing COBIT 5

Nuno Teles Silva
[email protected]

Instituto Superior Técnico, Lisboa, Portugal

November 2016

Abstract

Throughout the years, Enterprises have acknowledged the importance of investing in the adoption of IT Governance (ITG) frameworks. Good-practice frameworks, such as COBIT 5, emerged as major drivers for implementing effective ITG. COBIT 5 is an important framework that assists Enterprises in achieving their objectives for both Governance and Management of Enterprise IT. As the manuals of COBIT are written in a natural language and represented in informal diagrams, activities such as the assessment and analysis of the COBIT PRM are error-prone and time-consuming tasks. Relying on a language with such flexibility, COBIT manuals create great difficulties for organizations trying to understand and use the framework. Due to the problems that Stakeholders face to assess, analyse and apply the concepts resulting from the PRM, in this work we apply ontology techniques to automatically assess the COBIT PRM and semantic techniques to infer whether specific PRM elements are used in the same way or context.

Keywords: IT Governance, COBIT 5 PRM, Ontology, Word Sense Disambiguation, Similarity Measure, Lexical Knowledge Base

1. Introduction

The rapid growth of Information Technology (IT) has made IT recognized as a strategic weapon for raising competitive advantage among Enterprises. Organizations strive to increase overall business performance by using IT in an efficient, practical and strategic way. Therefore, the strategic use of IT became a crucial factor for organizations to gain competitive advantage and enhance protection against a dynamic and threatening environment [9].

As studies have shown, misalignment or the lack of alignment between IT and business strategies is one of the main reasons why organizations fail to realize the full potential benefits of their IT investments [4, 17]. Assuring the alignment of IT and business strategy is not a trivial task, nor is accepting the risk associated with IT investment. A way to address the mentioned misalignment is to focus on IT Governance (ITG).

COBIT, currently in its fifth edition, is a good-practice framework for the management and governance of IT. The COBIT Enabling Processes guide provides a generic process model, the Process Reference Model (PRM), that defines all the processes and activities one should find, according to the idea of best practice, in an IT department or organization [20, 12]. As this guide is written in a natural language and represented in several informal diagrams, activities such as the assessment of these models imply manually searching for contents in the documents. Consequently, manually inferred assessments are erroneous and inefficient.

Aside from that, as reported by [7], there is a need to investigate COBIT 5's intellectual foundations, design, applicability and internal consistency. COBIT manuals are based on a natural language, i.e. a language used by people in general for daily communication [21]. Whereas artificial languages are characterized by self-created vocabularies and strict grammars, the syntactic and semantic flexibility of a natural language makes this type of language natural to human beings [21]. Relying on a language with such flexibility, COBIT manuals create great difficulties for organizations trying to understand and use the framework. This happens because of the potential presence of ambiguity and abstraction between concepts.

In this work, we use ontologies as a means to overcome the first aforesaid problem. Ontologies enable the representation of conceptual models, as well as the use of inference mechanisms to automatically validate and analyse such representations. We will reuse an existing ontological representation of COBIT 5 and a set of SPARQL queries that enable automatic analysis of the COBIT PRM. The application of the aforementioned queries will generate different representative samples of PRM elements.

We want to use semantic similarity techniques as


a way to perform internal consistency checking and thus ascertain whether similar elements are present in the PRM. A Prototype, relying on different semantic techniques, will be developed and executed in the context of a Case Study, and the results will be evaluated using a clustering technique.

The remainder of the document is structured as follows. Section 2 describes the related work complementary to this work. Section 3 describes the definition of a Prototype and all the modules that compose it. Section 4 demonstrates the utility of the solution through a Case Study. Section 5 interprets the results of the demonstration. Section 6 concludes the work, presenting contributions and lessons learned.

2. Related Work

COBIT 5 [19] provides a comprehensive framework that helps Enterprises create optimal value from IT investments, realizing benefits while reducing risk and resource use.

2.1. COBIT 5 PRM

COBIT 5 includes a Process Reference Model (PRM) that defines and describes in detail all the processes normally found in an Enterprise. Each Process belonging to the PRM has common dimensions [20]:

• Stakeholders have predefined interests, responsibilities, and roles.

• Goals describe the outcomes of processes or Enterprise IT (an artifact, a state or a capability improvement).

• Metrics measure the achievement of each goal.

• Practices are high-level requirements influenced by policies and procedures.

• Activities are the main actions taken to operate a process.

• Work Products support the operation of processes.

2.2. Ontology Technologies

In order to support the representation of ontologies, many languages were developed in the past decade. The W3C has developed a set of standards and technologies, also known as Semantic Web standards.

RDF is an infrastructure for describing metadata about Web resources and information about things that can be identified on the Web [6]. The data model consists of three major components: Resources, Properties, and Statements. RDF Schema is a semantic extension of RDF [42] that provides a data modeling vocabulary for describing groups of related resources and relationships. OWL [41] offers a semantic markup language for publishing and sharing ontologies. SPARQL is the W3C standard for accessing and querying RDF graphs.
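As an illustration of the triple-based data model and SPARQL-style pattern matching, the following sketch uses plain Python with hypothetical COBIT-flavoured IRIs (not the vocabulary of the actual ontology):

```python
# RDF reduces to (subject, predicate, object) triples; a SPARQL basic
# graph pattern is matched against them, binding variables ("?x").
triples = [
    ("cobit:EDM01", "rdf:type", "cobit:Process"),
    ("cobit:EDM01", "cobit:hasWorkProduct", "cobit:wp1"),
    ("cobit:APO01", "rdf:type", "cobit:Process"),
]

def match(pattern, triples):
    """Return variable bindings for each triple matching the pattern."""
    results = []
    for t in triples:
        if all(p.startswith("?") or p == v for p, v in zip(pattern, t)):
            results.append({p: v for p, v in zip(pattern, t)
                            if p.startswith("?")})
    return results

processes = match(("?p", "rdf:type", "cobit:Process"), triples)
```

A real query engine adds joins, filters and aggregation over such patterns, but the matching core is the same.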

2.3. Semantic Similarity

Accurate measurement of semantic similarity between words is crucial for various tasks, for example word sense disambiguation [32], clustering [39], information retrieval [40], synonym extraction [10] and ontology alignment [5].

Measuring semantic similarity between two words remains a complicated exercise. A measure of semantic similarity is a function that quantifies the resemblance between two words. Word semantic similarity approaches [13] can be categorized into corpus-based and knowledge-based similarity measures.

Two acclaimed corpus-based measures are LSA and HAL. HAL and LSA are similar in their ultimate goal, i.e. both methods capture the meaning of a word using co-occurrence information [24]. Nevertheless, in LSA the co-occurrence is measured against text units of paragraphs or documents, whereas in HAL a word-by-word matrix is produced based on word co-occurrences in a text corpus, using a moving window of a predefined width (normally ten).
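A minimal sketch of the HAL construction, assuming the common weighting scheme in which closer co-occurrences receive higher weights (window size minus distance plus one); corpus and window size are invented for illustration:

```python
from collections import defaultdict

def hal_matrix(words, window=10):
    """Word-by-word co-occurrence counts over a moving window.

    Words co-occurring closer to the target contribute higher weights
    (window - distance + 1), as commonly described for HAL."""
    m = defaultdict(int)
    for i, w in enumerate(words):
        for j in range(max(0, i - window), i):
            m[(w, words[j])] += window - (i - j) + 1
    return m

corpus = "the process defines activities the process measures goals".split()
m = hal_matrix(corpus, window=2)
```

Row and column vectors of this matrix then serve as word meaning representations.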

In the past few years, different knowledge-based measures have been shown to be beneficial when used in specific contexts. In general, semantic similarity is influenced by a number of different information sources (depth, path length, information content and gloss).

First of all, works such as Rada et al. [34] form the basis of some edge-based methods. Rada et al. demonstrated that the minimum number of edges separating two concepts in an is-a hierarchy is a metric for measuring the conceptual distance and therefore for assessing similarity.
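Edge counting reduces to a shortest-path search in the is-a graph; a sketch over a small hypothetical taxonomy:

```python
from collections import deque

# Child -> parent edges of a toy is-a hierarchy (invented concepts).
isa = {"plan": "artifact", "report": "artifact",
       "artifact": "entity", "process": "entity"}

def neighbours(c):
    ns = [p for ch, p in isa.items() if ch == c]
    ns += [ch for ch, p in isa.items() if p == c]
    return ns

def path_length(a, b):
    """Minimum number of is-a edges separating concepts a and b (BFS)."""
    seen, q = {a}, deque([(a, 0)])
    while q:
        c, d = q.popleft()
        if c == b:
            return d
        for n in neighbours(c):
            if n not in seen:
                seen.add(n)
                q.append((n, d + 1))
    return None

dist = path_length("plan", "process")
```

The smaller the distance, the more similar the concepts are taken to be.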

Among information theory-based works, for instance, Resnik [35] proposed a theory to calculate the semantic similarity of two concepts based on the notion of shared information content. In agreement with [22], the information content of a concept depends on the probability of encountering an instance of that concept in a corpus. That is, the probability is calculated as the frequency of a concept divided by the total frequency. Alternative works, such as [37], propose new ways to compute IC. Sánchez et al. consider the whole set of subsumers (leaves) when computing the information content values.
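Following that definition, IC(c) = -log p(c) with p(c) = freq(c) / total; in Resnik's measure the similarity of two concepts is then the IC of their least common subsumer. A direct sketch (frequencies are invented):

```python
import math

def information_content(freq, total):
    """IC(c) = -log p(c), where p(c) = freq(c) / total corpus frequency.

    Rare concepts carry high information content; frequent, general
    concepts carry low information content."""
    return -math.log(freq / total)

ic_rare = information_content(5, 1000)      # rare concept: high IC
ic_common = information_content(500, 1000)  # common concept: low IC
```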

Li et al. present a measure that combines different information sources, for example the contributions of path length, depth, and local density. Path length and depth contributions are derived from a lexical database using a nonlinear function [22]. IC


is used to represent the local density of concepts in a corpus.
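The nonlinear combination in Li et al. is commonly cited as sim = e^(-α·l) · tanh(β·h), where l is the path length, h the depth of the subsuming concept, and α ≈ 0.2, β ≈ 0.45 the reported defaults; a sketch under that assumption:

```python
import math

def li_similarity(l, h, alpha=0.2, beta=0.45):
    """Li et al.-style word similarity: path length l attenuates the
    score exponentially, while subsumer depth h contributes through a
    hyperbolic tangent. Parameter values are the commonly cited defaults."""
    return math.exp(-alpha * l) * math.tanh(beta * h)
```

With l = 0 and a deep subsumer the score approaches 1; longer paths decay the score exponentially.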

Gloss-based methods appeared as promising means of measuring semantic similarity through gloss information [32]. The first method was originally proposed by Lesk [2]. The Lesk method compares two concepts by analysing the overlaps that occur between their definitions. The Adapted Lesk measure [2, 32] introduces a new scoring algorithm that takes into consideration the size of possible overlaps of semantically related concepts. The Gloss Vector measure estimates semantic relatedness based on occurrences of second-order co-occurrence vectors [32].
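The original Lesk idea reduces to counting shared gloss words; a minimal sketch with invented glosses:

```python
def lesk_overlap(gloss1, gloss2):
    """Original Lesk: score two senses by the number of distinct words
    their definitions share."""
    return len(set(gloss1.lower().split()) & set(gloss2.lower().split()))

score = lesk_overlap("a formal record of business transactions",
                     "a record kept of business activity")
```

The adapted variants additionally reward longer contiguous overlaps and extend glosses with those of related concepts.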

2.4. Short Text Similarity

Most of the existing studies concerning Text Similarity focus on calculating the similarity between documents or long sentences [16], while only a few publications are available on calculating the similarity between very short texts or sentences [38]. Techniques for detecting similarity between short texts have been categorized into three main groups [24]: word co-occurrence, corpus-based and feature-based methods.

Word co-occurrence methods, also called bag-of-words methods, are widely applied in IR systems. By using an equivalent representation for documents and queries, it is possible to retrieve relevant documents based on the similarity between a query and a document.

Alternative approaches [24] use lexical dictionaries to compute the similarity between two sentences. Pairs of words between the two sentences are formed and their meanings are translated into a set of patterns. The similarity is computed using a simple pattern matching algorithm.

Li et al. [24] presented a fully dynamic, automatic and adaptable method for computing sentence similarity. The proposed method forms a joint word set using all the distinct words in the pair of sentences. A semantic vector for each of the sentences is built using information from a lexical knowledge base and word significance derived from a corpus. Subsequently, two word order vectors are created, one for each sentence, and at the end overall sentence similarity is defined as a combination of semantic similarity and word order similarity.

Another category focuses on extracting statistical information about words in huge corpora. Two acclaimed methods are LSA and HAL. LSA and HAL can be applied at the sentence level. Sentence vectors are composed by adding together the vectors of each word within a sentence. Similarity between two sentences is then calculated using a distance metric such as cosine distance.
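A sketch of this sentence-level use of word vectors, with tiny two-dimensional toy vectors standing in for corpus-derived ones:

```python
import math

def sentence_vector(word_vectors):
    """Sentence vector: element-wise sum of the word vectors."""
    return [sum(dims) for dims in zip(*word_vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

s1 = sentence_vector([[1, 0], [0, 1]])  # two toy word vectors
s2 = sentence_vector([[2, 0]])
sim = cosine(s1, s2)
```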

The main idea of the features-based method is to

represent a sentence using a set of predefined features [24]. Each word belonging to a sentence is represented in terms of that feature set. Features such as HUMAN, SOFTNESS, and POINTNESS could be used to derive labeled training data from the similarity of the compared sentences. Finally, a classifier could learn a function from the training data and generate similarity values for test sets.

2.5. Lexical Knowledge Base

Lexical Knowledge Bases (LKB) are digital knowledge bases that provide lexical information on the words of a particular language [14]. An LKB stores different types of information: senses, i.e. associations between lemmas and meanings, and semantic relations. BabelNet [27] follows the traditional structure of a lexical knowledge base. The main concepts and relations present in BabelNet result from the intersection of two important sources [28]:

• WordNet provides a set of synsets and semantic relations. A synset is a concept identified by a set of multilingual lexicalizations.

• Wikipedia, a multilingual Web-based encyclopedia, contributes Wiki-pages as concepts and semantically unspecified relations.

According to [29, 30], information encoded in the text dump of BabelNet can be effectively accessed programmatically. Important operations defined in BabelNet classes are getSynsets, getSenses, getSynset and getGloss [31].

2.6. Word Sense Disambiguation

Word Sense Disambiguation (WSD) is the capacity to computationally determine which sense of a word is invoked by its use in a particular context [26]. WSD is usually performed on one or more texts, i.e. collections of words (w1, w2, ..., wn). The WSD task is described as follows: assign the appropriate senses to all or some of the words, creating a mapping from words to senses. There are three main approaches to conducting WSD [26]: supervised, unsupervised and knowledge-based.

Supervised methods use machine-learning techniques and, for this reason, require vast amounts of annotated data and a huge human effort to build them. Annotated data might not be available for all the words of interest.

Unsupervised methods rely on unlabeled corpora to produce a set of clusters. As no reference inventory of senses is shared, the evaluation of cluster results is an arduous process [26].

Knowledge-based approaches exploit knowledge resources such as dictionaries, thesauri and ontologies. Since the resources they rely on are progressively upgraded, they are important methods to consider [26].

Conforming to [32], measures of semantic relatedness can be used to perform word sense disambiguation. A target polysemous word wt in a text T can be disambiguated by choosing the sense sti that maximizes the relatedness(sti, sjk) to the senses of the other words wj in a given window of context.
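That maximization can be sketched directly; the relatedness table and sense labels below are hypothetical stand-ins for a real measure:

```python
def disambiguate(target_senses, context_senses, relatedness):
    """Choose the target sense maximizing total relatedness to the
    senses of the surrounding context words."""
    return max(target_senses,
               key=lambda s: sum(max(relatedness(s, cs) for cs in senses)
                                 for senses in context_senses))

# Toy symmetric relatedness table over invented sense labels.
rel = {("bank.finance", "money.asset"): 1.0,
       ("bank.river", "money.asset"): 0.1,
       ("bank.finance", "deposit.act"): 0.9,
       ("bank.river", "deposit.act"): 0.2}

def relatedness(a, b):
    return rel.get((a, b), rel.get((b, a), 0.0))

best = disambiguate(["bank.finance", "bank.river"],
                    [["money.asset"], ["deposit.act"]], relatedness)
```

The financial sense wins because the context senses relate to it far more strongly.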

Babelfy is an approach that aims to perform both multilingual WSD and Entity Linking. The approach exploits semantic relations between word meanings and named entities from BabelNet [28]. According to [25], there are three main steps to disambiguate sentences:

1. A set of semantic signatures (related vertices) is generated and connected by means of random walks.

2. For each fragment of text having at least one lexicalization (in BabelNet), its possible meanings are listed.

3. A semantic interpretation is produced by connecting the candidate meanings using the semantic signatures. All the fragments are disambiguated by means of a centrality measure.

As stated in [25], the Babelfy system is accessible through a web interface and a Java RESTful API. Important operations defined in Babelfy classes are babelfy, getSource, getScore, setAnnotationResource and addAnnotatedFragments.

2.7. Clustering

Clustering is known as an attractive approach for finding similarities in data and putting similar data into groups [15]. The goal is that the objects within a group should be similar to one another and dissimilar from the objects in other groups.

Some clustering techniques rely on specifying the number of clusters a priori. A well-accepted technique that uses this strategy is the k-centers method [11].

Affinity propagation (AP) is a graph-theoretic clustering method that does not require the number of clusters to be prescribed a priori; it discovers a number of exemplars (clusters) automatically through a message-passing procedure. AP takes as input a similarity matrix s and a set of self-similarities s(k, k) [11]. These values are commonly referred to as "preferences".

Considering each data point as a node in a network, the message-passing method iteratively transmits real-valued messages until one of the following conditions is met [11]: a fixed number of iterations is reached, changes in the messages fall below a threshold, or decisions stay constant. There are two kinds of messages exchanged between data points [11]:

• "Responsibility" r(i, k) reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i.

• "Availability" a(i, k) reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points.
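The two message updates above can be sketched with the standard damped update rules; this is a simplified illustration (no convergence detection), with a toy similarity matrix built from four points on a line:

```python
def affinity_propagation(S, iters=100, damping=0.5):
    """Simplified affinity propagation with Frey-Dueck style updates.

    S: similarity matrix as a list of lists; the diagonal S[k][k]
    holds the preferences. Returns the exemplar chosen for each point."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iters):
        for i in range(n):
            for k in range(n):
                # evidence that k should be the exemplar for i
                m = max(A[i][kk] + S[i][kk] for kk in range(n) if kk != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - m)
        for i in range(n):
            for k in range(n):
                # accumulated support for i choosing k as its exemplar
                if i == k:
                    a = sum(max(0.0, R[ii][k]) for ii in range(n) if ii != k)
                else:
                    a = min(0.0, R[k][k] + sum(max(0.0, R[ii][k])
                                               for ii in range(n)
                                               if ii not in (i, k)))
                A[i][k] = damping * A[i][k] + (1 - damping) * a
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]

# Four points on a line; similarity = negative squared distance,
# preferences set to the minimum similarity (as done in Section 5.1).
xs = [0, 1, 10, 11]
S = [[-(a - b) ** 2 for b in xs] for a in xs]
for k in range(4):
    S[k][k] = -121
exemplars = affinity_propagation(S)
```

The two nearby pairs end up sharing exemplars, i.e. two clusters emerge without the cluster count being specified.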

3. Proposal

The research problem identifies potential problems in understanding and using COBIT manuals. In this section, we propose the definition of a Prototype adequate for extracting similarity values between short-text elements belonging to the PRM.

3.1. Prototype Requirements

From the analysis of the research problem and solution objectives, several functional requirements regarding the Prototype were captured:

RQ1: It has to generate an overall score of semantic similarity, given work products and practice names.

RQ2: It has to disambiguate polysemous words within the sentences.

RQ3: It has to compute sentence similarity by using a short-text similarity measure.

RQ4: Similar work products and practice names must be identified using a clustering technique.

3.2. Prototype Architecture

Conforming to [7], there is a need to investigate COBIT 5's intellectual foundations, design, applicability and internal consistency.

Figure 1: Prototype architecture.

Most of the short-text similarity methods show some limitations when used in certain contexts. Methods such as bag of words, LSA and HAL, for example, use vectors of limited size to compute similarity. As a sentence is usually represented in


a very high-dimensional space, extremely sparse vectors can arise and lead to less accurate calculation results. Additionally, important syntactic information is ignored.

Li et al. proposed a method that overcomes the previous limitations; it adjusts the semantic information dynamically according to the size of the input sentences and additionally considers syntactic information. The authors in [23] mentioned that the proposed method does not currently conduct word sense disambiguation for polysemous words. Previous works tend to calculate the similarity between two sentences based on the comparison of their nearest meanings, and this method is no exception. Nonetheless, it is important to disambiguate sentences and integrate WSD, since the nearest meanings do not represent their actual meanings [18]. We took this into consideration, so the developed Prototype integrates WSD.

The high-level architecture of the Prototype is presented in Fig. 1 and comprises two different phases:

• Disambiguation Phase - includes all the activities related to the preprocessing and disambiguation of the sentences.

• Calculation Phase - includes all the activities related to the calculation of an overall score of similarity.

3.2.1 Disambiguation Phase

The disambiguation phase starts before the processing of the sentences is carried out. First of all, a co-occurrence matrix was created in the same way as described for the Gloss Vector measure. This matrix records the number of times a word co-occurs with each distinct word of the text corpora. The reference corpus (all gloss definitions of WordNet) was preprocessed using map and reduce techniques.

The process begins effectively at step 1, where the content of work products and practice names is targeted for preprocessing. In this case, some preprocessing steps were applied to transform the text into a structured format: specific characters were removed and acronyms were replaced by their expanded forms.

At step 2, the "babelfy" operation of the Babelfy API is executed. Since the intention is to restrict the disambiguated entries to WordNet sources only, a constraint was created for that purpose. Similarly, pre-annotated fragments were devised to help the disambiguation process. In special cases, expressions such as "return on investment" or "human resources" will be disambiguated into different

synsets, one for each word, if no pre-annotated fragment is used. However, BabelNet, and especially WordNet, specifies a unique synset to express such a sequence of words. To prevent the occurrence of such problems, semantic annotations were created to restrict the Babelfy instance and exclude the remaining synsets.

Figure 2: Disambiguation Phase.

Babelfy outputs a set of disambiguated instances at step 3. Each produced semantic annotation is associated with a confidence score (the Babelfy score). When the score falls below 0.7, a back-off strategy based on the most common sense is used [3]. An alternative strategy was adopted to refine all the most common senses. The fragment corresponding to the semantic annotation is sent to a lemmatizer that performs the lemmatization process. At step 6, all the BabelSynsets corresponding to that lemma are retrieved through the BabelNet API. For each BabelSynset, its gloss is augmented with definitions coming from direct semantic relationships such as hypernyms, meronyms and disambiguated glosses.

Finally, the previously disambiguated instances (with satisfactory scores) are reused to discover which BabelSynset maximizes the score of the Gloss Vector measure.

3.2.2 Calculation Phase

The Calculation Phase begins when all the activities related to the disambiguation of the sentences have been completed successfully.

At step 1, each fragment of text within a sentence has a corresponding semantic annotation, i.e. a BabelSynset representing its actual meaning. First of all, during step 1, a joint set J is constructed


Figure 3: Calculation Phase.

dynamically to represent the semantic information of the compared sentences. Each unique word or fragment of text (of the input sentences) appears in the joint set. At step 2, two raw semantic vectors, s1 and s2, are generated with the same size as the joint set. Each entry in the raw semantic vector is calculated according to the following conditions:

• If word wi of J appears in sentence Ti, si is set to 1.

• If wi is not contained in Ti, a semantic similarity score is computed between wi and each word in sentence Ti using the Li et al. measure. The most similar word in Ti (with the maximum score) is selected. If the maximum score exceeds a preset threshold, si is set to that maximum; otherwise si is set to zero.
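The construction of a raw semantic vector under these two conditions can be sketched as follows, with a toy word-similarity function standing in for the Li et al. measure:

```python
def semantic_vector(joint, sentence, word_sim, threshold=0.2):
    """Raw semantic vector over the joint word set J: an entry is 1 when
    the word occurs in the sentence; otherwise it takes the best
    similarity to any sentence word, if that exceeds the threshold."""
    v = []
    for w in joint:
        if w in sentence:
            v.append(1.0)
        else:
            best = max(word_sim(w, t) for t in sentence)
            v.append(best if best > threshold else 0.0)
    return v

def toy_sim(a, b):  # hypothetical stand-in for the Li et al. measure
    if a == b:
        return 1.0
    return 0.8 if {a, b} == {"goal", "objective"} else 0.0

v = semantic_vector(["goal", "metric", "objective"],
                    ["objective", "metric"], toy_sim)
```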

The Li et al. measure was implemented by extending the WS4J library. As the WS4J library already provides operations that allow users to compute the shortest path and, at the same time, find the depth of the lcs, the implementation is much simplified. Alongside this, the significance of each entry in the raw semantic vector is weighted by the information content derived from the BNC corpus. IC is calculated through a file with SynsetId occurrences.

During step 3, word order vectors are created taking into account the joint set word indexes. Finally, at step 4, the overall similarity is calculated.

4. Demonstration

This section shows that the aforestated Prototype is adequate to process semantic information and

therefore to generate similarity values between different elements belonging to the PRM.

4.1. Representative sample calculation

Before putting the Prototype into practice, an existing ontological representation of COBIT 5, available at http://goo.gl/xc2WiO, together with a set of SPARQL queries, was used to generate distinct samples. As depicted in Fig. 4, different combinations of relevant processes, i.e. the individuals having the highest numbers of work products, were applied to extract different sets of work products. By using this approach, we have a better probability of selecting the PRM elements having more influence on the execution flow of processes.

Figure 4: SPARQL query that combines processes to generate work products.

The size of the samples was calculated using a simple statistical formula. Yamane [1] provides a simplified formula. With a confidence level of 95%, precision e of 0.05 and a population size N of 478, the application of the formula n = N / (1 + N·e²) estimates that our sample needs at least 217 elements for work products. To satisfy this constraint, the query was limited to the 28 most relevant processes. For simplicity, Fig. 4 shows only the three most relevant processes.
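Yamane's formula can be checked with a few lines (population size and precision taken from the text above):

```python
def yamane(N, e=0.05):
    """Yamane's simplified sample size: n = N / (1 + N * e^2)."""
    return N / (1 + N * e * e)

n = yamane(478)  # 478 work products, precision 0.05 -> roughly 217.8
```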

Two PRM samples were generated and part of their content is presented in Figs. 5 and 6.

Figure 5: An excerpt of the generated sample for Work Products.


Figure 6: An excerpt of the generated sample for Practice Names.

The sample for work products has 237 distinct elements, while that for practice names has 184 distinct elements. Both samples give good coverage of COBIT's total work products (almost 50%, i.e. 237/478) and practice names (88%, i.e. 184/210).

4.2. Similarity calculation

In this section, we apply the Prototype defined in Sections 3.1 and 3.2 to both PRM samples. In the first place, the Prototype was executed with the sample corresponding to work products. The result (see Fig. 7) was a matrix with 237 rows and 237 columns. For practice names (see Fig. 8), the SPARQL query produces 27966 results.

Figure 7: An excerpt of the Work Products similarity results.

Figure 8: An excerpt of the Practice Names similarity results.

5. Evaluation

This section presents the evaluation of the proposed solution. Some clustering techniques are explored as a basis for grouping sentences of the PRM.

5.1. Affinity Propagation application

Cluster analysis of texts is a well-established problem in IR. Documents are typically represented as data points (vectors) in a high-dimensional vector space. Since data points using this representation lie in a metric space [33], there are plenty of algorithms suited to clustering such points.

In our case, we have the different pairwise similarities that resulted from the demonstration step. Algorithms such as AP are viable options to apply in this case. The AP algorithm was initialized with the following information:

• The number of iterations was set to 100.

• Preference entries were defined using the minimum similarity value.

• The damping factor was set to 0.5.

• Thresholds of 0.15 and 0.10 were used.

The AP algorithm is executed several times until the objective function is maximized. According to [8], the objective function maximizes the net similarity S, which is the sum of the similarities of non-exemplar data points to their exemplars plus the sum of the exemplar preferences.

5.2. Results of the Cluster Analysis

Before applying the AP algorithm, 90 of the 237 work products were identified and removed from the initial set. Work products in this situation had average similarity values (to all data points) lower than 0.15. The same applies to practice names, where 40 of the 184 were removed using a lower threshold, 0.10.
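This pre-filtering step can be sketched as follows, under one reading of "average similarity to all data points" that includes the element itself; the helper name is hypothetical:

```python
import numpy as np

def drop_low_similarity_elements(sim_matrix, elements, threshold):
    """Remove elements whose average similarity to all points is below threshold."""
    sim = np.asarray(sim_matrix, dtype=float)
    keep = sim.mean(axis=1) >= threshold  # row-wise average similarity
    kept = [e for e, flag in zip(elements, keep) if flag]
    return sim[np.ix_(keep, keep)], kept  # matrix restricted to kept elements
```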

The execution of AP produced 33 clusters for work products and 50 for practice names. After generating the results, the silhouette method was employed as a cluster validity mechanism. The formula described in [36] was applied at the cluster level. The positive values obtained for work products and practice names, 0.12 and 0.06 respectively, indicate that the objects lie well within their clusters, with satisfactory results.
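The silhouette validation can be reproduced by converting similarities into distances, since a precomputed-metric silhouette expects a distance matrix; a sketch assuming similarities in [0, 1] and scikit-learn:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_from_similarity(sim_matrix, labels):
    """Mean silhouette coefficient computed from a similarity matrix (sketch)."""
    dist = 1.0 - np.asarray(sim_matrix, dtype=float)  # similarity -> distance
    np.fill_diagonal(dist, 0.0)  # self-distance must be exactly zero
    return silhouette_score(dist, labels, metric="precomputed")
```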

5.3. Work Products Results Interpretation

Pairwise similarity values were assembled to compute intra-cluster similarity values. The major conclusions that result from the analysis of the cluster content and Figs. 9 and 10 are:

• Examples of clusters relating work products with higher intra-cluster values (higher than 0.6) are wp-cluster-#14, wp-cluster-#18 and wp-cluster-#1.

• Examples of clusters relating work products with lower intra-cluster values (lower than 0.5) are wp-cluster-#25, wp-cluster-#3 and wp-cluster-#15.

• Most of the time, cluster sizes and intra-cluster similarity values vary inversely. For example, clusters such as wp-cluster-#14 and wp-cluster-#18 have smaller sizes but higher intra-cluster similarity values.

• A total of 79 out of 147 work products (53% of our sample) reside in clusters with intra-cluster similarity above 0.5. If we decrease the limit to 0.4, we note that 143 out of 147 work products (93% of our sample) exceed that limit. To support this, there is great affinity between the WPs "Remedial actions and assignments" and "Noncompliance remedial actions".
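The intra-cluster similarity values discussed above can be computed as the mean pairwise similarity among each cluster's members, with singleton clusters scoring 0 (matching the interpretation in Section 5.4); a hypothetical sketch:

```python
from itertools import combinations
import numpy as np

def intra_cluster_similarity(sim_matrix, labels):
    """Mean pairwise similarity within each cluster; singletons score 0."""
    sim = np.asarray(sim_matrix, dtype=float)
    result = {}
    for cluster in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cluster]
        pairs = list(combinations(members, 2))
        if not pairs:  # a singleton cluster has no pairs to compare
            result[cluster] = 0.0
        else:
            result[cluster] = float(np.mean([sim[i, j] for i, j in pairs]))
    return result
```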

Figure 9: Intra-cluster similarity for Work Products (bar chart of the 33 wp-clusters, ordered from highest to lowest; values range from 0 to 1).

Figure 10: Cluster sizes of Work Products.

5.4. Practice Names Results Interpretation

The major conclusions that result from the analysis of the cluster content and Figs. 11 and 12 are:

• Examples of clusters relating practice names with higher intra-cluster values (higher than 0.6) are pn-cluster-#40, pn-cluster-#15 and pn-cluster-#39.

• A total of 9 out of 12 clusters with higher intra-cluster similarity contain practice names pertaining to different domains. For example, the pair {"Perform a post-implementation review.", "Conduct post-resumption review."} belongs to the domains BAI and DSS.

• Examples of clusters relating practice names with lower intra-cluster values (lower than 0.5) are pn-cluster-#44, pn-cluster-#21 and pn-cluster-#6.

• A total of 41 out of 144 practice names (only 28% of our sample) reside in clusters with intra-cluster similarity above 0.5. Several clusters appear with an intra-cluster value of 0, i.e. a size of 1. The existence of these clusters means that, after execution of the AP algorithm, no other element shares the same exemplar, i.e. practice name.

Figure 11: Intra-cluster similarity for Practice Names (bar chart of the 50 pn-clusters, ordered from highest to lowest; values range from 0 to 0.8).

Figure 12: Cluster sizes of Practice Names.

6. Conclusion

COBIT 5 provides a comprehensive framework that assists Enterprises in achieving their objectives for the Governance and Management of Enterprise IT. Some critics argue that there is still a lack of theoretical foundation from an academic viewpoint that explores COBIT as a research artifact.

As this official guide is written in a natural language and represented in some informal diagrams, activities such as the assessment of these models imply manually searching for contents in the documents and, for this reason, turn out to be massively resource-consuming.


On the other hand, COBIT manuals create great difficulties for organizations to understand and use the framework. From the previous problems, we concluded that scientific analysis and research on top of these models is missing.

Ontology techniques were used to narrow this gap, i.e. to understand and assess the logical structures of PRM and generate its semantics.

Semantic techniques were used to ensure internal consistency checking and thus ascertain whether similar elements are present in PRM.

6.1. Lessons learned

• DSRM provided us with guidance on how to conduct the evaluation of IT artifacts and, principally, how to structure a successful research work.

• Assessments made on top of PRM benefit from having a representation with Ontologies. Previously, manual assessments of COBIT PRM would cost a huge amount of time and were thereby susceptible to errors. Using ontology techniques, supplementary assessments made on top of PRM achieve satisfactory results and take just a couple of minutes to execute.

• Measuring semantic similarity between two sentences can be a challenging exercise. If COBIT manuals were written in a normalized vocabulary, this task would be simplified. Therefore, we had to implement some preprocessing steps in order to normalize the content of some sentences as well as to disambiguate polysemous words.

• Similarity techniques are crucial for conducting natural language processing tasks. Different algorithms have pros and cons that must be evaluated. Some algorithms use a pre-compiled word list to represent sentences in a very high-dimensional space; others require manual compilation and ignore syntactic information within the texts.

• The AP method is useful for analysing and finding similarities in data. We concluded that 79 out of 147 work products (53% of our sample) reside in clusters with intra-cluster similarity above 0.5, whereas only 41 out of 144 practice names (28% of our sample) do so. These observations allow us to conclude that WPs tend to be semantically similar to each other, while the same does not occur for practice names.
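The sentence normalization mentioned in the lessons above can be illustrated with a toy sketch; the stop-word list here is a small illustrative stand-in, and the actual pipeline additionally disambiguated polysemous words using BabelNet:

```python
import re

# A tiny stop-word list for illustration only; the real preprocessing used a
# fuller normalization pipeline plus word sense disambiguation via BabelNet.
STOP_WORDS = {"a", "an", "and", "for", "of", "the", "to"}

def normalize_sentence(sentence):
    """Lowercase, strip punctuation, and drop stop words (illustrative sketch)."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```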

References

[1] S. Ajay and B. Micah. Sampling techniques & determination of sample size in applied statistics research: an overview. IJECM, 2014.

[2] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. IJCAI, 2003.

[3] J. Camacho-Collados, C. D. Bovi, A. Raganato, and R. Navigli. A Large-Scale Multilingual Disambiguation of Glosses. In Proceedings of the Tenth International LREC, 2016.

[4] Y. E. Chan. Why Haven't we Mastered Alignment? The Importance of the Informal Organization Structure. MIS Quarterly, 2002.

[5] V. Cross and X. Hu. Using semantic similarity in ontology alignment. In P. Shvaiko, J. Euzenat, T. Heath, C. Quix, M. Mao, and I. F. Cruz, editors, OM, CEUR Workshop Proceedings. CEUR-WS.org, 2011.

[6] J. Davies, D. Fensel, and F. Van Harmelen. Towards the Semantic Web: Ontology-driven Knowledge Management. Wiley, 2003.

[7] S. De Haes, W. Van Grembergen, and R. S. Debreceny. COBIT 5 and Enterprise Governance of Information Technology: Building Blocks and Research Opportunities. JIS, 2013.

[8] D. Dueck. Affinity Propagation: Clustering Data by Passing Messages. PhD thesis, 2009.

[9] M. Faryabi, A. Fazlzadeh, B. Zahedi, and H. A. Darabi. Alignment of Business and IT and Its Association with Business Performance: The Case of Iranian Firms. IJBM, 2012.

[10] O. Ferret. Testing semantic similarity measures for extracting synonyms from a corpus. ELRA, 2002.

[11] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 2007.

[12] M. Goeken and S. Alter. Towards Conceptual Metamodeling of IT Governance Frameworks: Approach, Use, Benefits. In 42nd HICSS, 2009.

[13] W. Gomaa and A. Fahmy. A survey of text similarity approaches. IJCA, 2013.

[14] I. Gurevych, J. Eckle-Kohler, and M. Matuschek. Linked Lexical Knowledge Bases: Foundations and Applications. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2016.

[15] K. Hammouda. A Comparative Study of Data Clustering Techniques. Tools of Intelligent Systems Design: Course Project SYDE 625, 2000.

[16] V. Hatzivassiloglou, J. L. Klavans, and E. Eskin. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. In 1999 Joint SIGDAT EMNLP, 1999.

[17] J. C. Henderson and N. Venkatraman. Strategic Alignment: Leveraging Information Technology for Transforming Organizations. IBM Syst. J., 1999.

[18] C. Ho, M. A. A. Murad, R. A. Kadir, and S. C. Doraisamy. Word Sense Disambiguation-based Sentence Similarity. In Proceedings of the 23rd COLING: Posters, Stroudsburg, PA, USA, 2010. ACL.

[19] ISACA. COBIT 5: A Business Framework for the Governance and Management of Enterprise IT. 2012.

[20] ISACA. COBIT 5: Enabling Processes. 2012.

[21] M. C. Lee, J. W. Chang, and T. C. Hsieh. A Grammar-Based Semantic Similarity Algorithm for Natural Language Sentences. TSWJ, 2014.

[22] Y. Li, Z. A. Bandar, and D. McLean. An approach for measuring semantic similarity between words using multiple information sources. IEEE TKDE, 2003.

[23] Y. Li, D. McLean, Z. Bandar, J. D. O'Shea, and K. Crockett. Sentence Similarity Based on Semantic Nets and Corpus Statistics (full paper). 2006.

[24] Y. Li, D. McLean, Z. A. Bandar, J. D. O'Shea, and K. Crockett. Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE TKDE, 2006.

[25] A. Moro, F. Cecconi, and R. Navigli. Multilingual Word Sense Disambiguation and Entity Linking for Everybody. In Proceedings of the 13th ISWC, 2014.

[26] R. Navigli. Word Sense Disambiguation: A Survey. ACM Comput. Surv., 2009.

[27] R. Navigli and S. P. Ponzetto. BabelNet: Building a very large multilingual semantic network. 2010.

[28] R. Navigli and S. P. Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. AI, 193, 2012.

[29] R. Navigli and S. P. Ponzetto. BabelNetXplorer: A Platform for Multilingual Lexical Knowledge Base Access and Exploration. Companion Volume to the Proceedings of the 21st WWW Conference, 2012.

[30] R. Navigli and S. P. Ponzetto. Multilingual WSD with just a few lines of code: the BabelNet API. In Proceedings of the 50th Annual Meeting of the ACL, 2012.

[31] R. Navigli, D. Vannella, and F. Cecconi. BabelNet API guide.

[32] T. Pedersen, S. Banerjee, and S. Patwardhan. Maximizing semantic relatedness to perform word sense disambiguation. UMSI report, 2005.

[33] S. J. M. Phil, A. C. M. Sc, and M. Phil. Survey on Clustering Algorithms for Sentence Level Text. IJCTT, 2014.

[34] R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. In IEEE SMC, 1989.

[35] P. Resnik. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In Proceedings of the 14th IJCAI, Vol. 1. Morgan Kaufmann Publishers Inc., 1995.

[36] P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. JCAM, 1987.

[37] D. Sanchez and M. Batet. A New Model to Compute the Information Content of Concepts from Taxonomic Knowledge. IJSWIS, 2012.

[38] Shashank, S. Kaur, and S. Singh. Improving Accuracy of Answer Checking in Online Question Answer Portal Using Semantic Similarity Measure. Springer Singapore, 2016.

[39] G. J. Torres, R. B. Basnet, A. H. Sung, S. Mukkamala, and B. M. Ribeiro. A Similarity Measure for Clustering and Its Applications. IJECES, 2008.

[40] G. Varelas, E. Voutsakis, P. Raftopoulou, E. G. M. Petrakis, and E. E. Milios. Semantic similarity methods in WordNet and their application to information retrieval on the web. In Proceedings of the seventh ACM international workshop on WIDM '05, 2005.

[41] W3C. OWL Web Ontology Language Reference, W3C Recommendation, 2014.

[42] W3C. RDF Schema 1.1, W3C Recommendation, 2014.
