
Recasting Program Reverse Engineering through On-Line Analytical Processing

Periklis Andritsos

A thesis submitted in conformity with the requirements for the degree of Master of Science

Graduate Department of Computer Science, University of Toronto

© Copyright by Periklis Andritsos 2000

National Library of Canada, Acquisitions and Bibliographic Services, 395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.


Abstract

Program reverse engineering is the task that helps software engineers understand the architecture of large software systems. We study how the data modeling techniques known as On-Line Analytical Processing (OLAP) can be used to enhance the sophistication and range of reverse engineering tools. This is the first comprehensive examination of the similarities and differences in these tasks, both in how OLAP techniques meet (or fail to meet) the needs of reverse engineering and in how reverse engineering can be recast as data analysis.

We identify limitations in the data modeling tools of OLAP with respect to what is required in the area of reverse engineering. Specifically, multidimensional models assume that while facts may change dynamically, the structure of dimensions is relatively static (both in their dimension values and their relative orderings). We show why this is required in current OLAP solutions and provide new solutions that effectively manage dynamic dimensions.

Acknowledgements

First and foremost, I would like to thank my supervisor Renée J. Miller for her support and encouragement in the accomplishment of this thesis. Her insightful comments inspired my work and her smile made me feel comfortable when in trouble. I feel very fortunate to have Renée as my supervisor.

John Mylopoulos provided me with valuable comments and remarks on my work. He helped in the improvement of this document and deserves a big thank you. I would also like to thank Derek G. Corneil and Rick Holt for their advice and ideas. Bil Tzerpos was the first who listened to my research problem and instigated some of the ideas presented in this write-up. I am very thankful to him and I will always admire the "gentle" way he faces life.

I am deeply indebted to Timos Sellis, who conducted my undergraduate studies and helped me when applying to North American Universities. I also thank Panos Vassiliadis for introducing me to the theory of OLAP systems and his e-mails with the "big brother's" voice from Greece.

I feel very lucky that I have the most smiling, encouraging and pleasant-to-live-with roommate. Themis is the one who guided me during my first steps in Toronto, taught me all the intricacies of living away from family, stood by me in all my rough times and had the patience to put up with my peculiarities and my music.

There are no easy words to thank the Faloutsos brothers, Michalis and Petros, who had the kindness to leave me in charge of their office. Michalis always tried to cheer me up with his perfect sense of humor, while Petros is an excellent office-mate, listener and tennis partner. I greatly thank Rosalia for the joyful smile I receive when I get in the office every morning and all those long lasting conversations that sculpted my mind. My UNIX knowledge would not have been improved without Yiannis Velegrakis and Tasos, whose last name I do not dare to write down. They both made our first moments in Toronto very pleasant and I will never forget all those jokes we were making at the beginning of this "journey". Theodoulos keeps proving to me that distance and time do not matter. He never refused to help when in need and listen to my problems and concerns when in despair. Panayiotis shared my music taste and some of my best times in Toronto when DJing with him. Apart from enhancing my knowledge in music, Nick Koudas gave me priceless suggestions for this work.

Melanie and Florine are admirable for their ability to put up with all the Greekies. Melanie keeps the artistic spirit of the company high and Florine brings an additional laughter when she visits the office. I am grateful to Lucia, Daniel and Natan for letting me deploy my DJing capabilities and introducing me to the quality of Brazilian music with their parties. Attila was always enjoyable, distributing his jokes and stimulating "interesting" conversations. I feel the need to thank Nick Zachariadis for the Californian air he brought every now and then, Vaso for always showing me the optimistic side of it and Stergios for his clean-cut solutions to every problem.

I also feel lucky meeting Angeliki. She is a good listener and the person I like to tease on our favorite issue: "How to spend a Saturday night in Toronto". Angeliki, thank you for enduring my jokes. I am very happy that Indira introduced me to Latin American food, and she deserves special thanks for this and for being a good friend. Many thanks go to the new Greeks that came to the department: Anastasia and George for their creativity and good taste in movies; George, Fanis and Andreas for showing us how to break the record of the number of "pets" one can have at home; Anna for her penetrating smile and Kleoni for demonstrating how small this world is.

Our life in the department would not have been as comfortable and easy without Kathy Yen, our one and only graduate secretary. Kathy, thank you for your help and friendship.

Greece is an integral part of my life, and many friends back home contributed to making it happier. I am gratefully thankful to my "second brother" Yiannis Vrachoritis. He proves that an ocean between two friends is a tiny distance and keeps reminding me of the sunny side of life and how to [illegible] towards that side. I also thank George Gkioulos, Gerasimos Sismanis, Thanassis Vergis, Eva Athanassiou and Sofia Vassalou for all those magnificent school years we had together and I always recall in my mind.

I am also thankful to Vanessa Evaggelatou for being my best female buddy during the university years and listening to my concerns and complaints all the time. Thanos Vitas, Lefteris Stamatogiannakis, Panayiotis Sklavos and Costas Kotsokalis also turned these five years into the most fruitful years of my life and I am very lucky that I had them by my side.

Words do not come easy in my mouth when speaking for Nassia. From the early steps at the university, we were together facing every little moment of life. She taught me how to "always smile" and something more important: how it feels when someone loves you. For six years, she stood by me, believed in me, worked with me, cried with me and experienced all those "simple" pleasures of life with me. Nassiouli, you owned the best spot in my heart and thank you for the happiest six years of my life.

My uttermost thanks belong to my parents Loukas and Vassiliki, for they keep loving me, supporting me and teaching me valuable lessons I could not get from any teacher and school so far. I also thank Thanassis, my brother, because he is the one who tries to remind me what every next step in life brings, and my sister Eleni, because she always had the patience to listen to my problems and brought me up as her own child. Your "touloumpaki" and "mpouloukmpazits" thanks you for all you have done for him. To the youngest of the family, Tina, Maria, Loukas and Yiannakis, I thank them for coming among us giving us their smiles and their young perspective of life, which benefited me even if they did not realize. I also feel very lucky to have Vagelis and Rita as my brother- and sister-in-law, respectively. They made my teenage years look more beautiful.

Finally, I would like to express my deepest feelings to a friend that will never abandon me and has been the most valuable company and precept in my life: my music!

Contents

1 Introduction
1.1 Basic Background
1.2 Motivation
1.3 Contributions of the thesis

2 Handling groups using OLAP and its use in Reverse Engineering
2.1 What is OLAP
2.2 What is Reverse Engineering
2.3 Can OLAP be used for Reverse Engineering?

3 Identifying Partitions and the use of Hierarchies
3.1 Clustering Algorithms
3.1.1 Hierarchical Clustering
3.2 Discussion on the usage of graph theory and clustering algorithms in reverse engineering
3.3 Concept Analysis

4 The Multidimensional Model
4.1 A Multidimensional Model for managing hierarchical clusters
4.2 A Multidimensional Model for managing concepts

5 The Extended SQL(H) Model
5.1 The SQL(H) model
5.2 The Query language for the SQL(H) model
5.3 Semantics of the SQL(H) query language
5.4 Limitations of the model
5.5 The ESQL(H) model
5.6 The ESQL(H) query language
5.7 Sample queries
5.7.1 Dimensional Selection
5.7.2 Hierarchical Join/Aggregation
5.7.3 Hierarchical Join

6 Conclusions

List of Figures

2.1 An example data warehouse
2.2 Multidimensional OLAP (MOLAP) Architecture
2.3 Relational OLAP (ROLAP) Architecture
2.4 Fact table with a graph structure
3.1 An example of agglomerative clustering
3.2 A source code, its variable usage and its concept lattice [LS97]
4.1 The schema for the relations used in Software Bookshelf
4.2 A clustering schema
4.3 A clustering instance
4.4 The DW schema
4.5 Example of a relation matrix
4.6 The concept lattice of the matrix in Table 1
4.7 The extents table for the lattice in Figure 4.6
4.8 The intents table for the lattice in Figure 4.6
4.9 The hierarchy table for the lattice in Figure 4.6
4.10 The extended DW schema
5.1 A Data Warehouse conforming to the SQL(H) model
5.2 The hierarchical domain for concept ids (Cids) of Figure 4.6
5.3 An example Data Warehouse


To wish impossible things.

Robert Smith


to my lovely parents,

for without them this dream would have never become true.

Chapter 1

Introduction

1.1 Basic Background

In recent years, an increasing number of organizations have realized the competitive advantage that can be gained from efficient access to accurate information. Information is a key component in the decision-making process. The more information we have, the better are the chances for successful decisions. A Data Warehouse consolidates information from different data sources and enables the application of sophisticated query tools for faster and better analysis. Data warehouses came into existence to meet the changing demands of enterprises, as On-Line Transaction Processing (OLTP) systems could not cover the analytical needs of the enterprises' competitive environment. OLTP systems are mainly supported by operational databases and automate daily tasks, such as banking transactions [CD97]. Data warehouses, on the other hand, support analytical capabilities by providing an infrastructure for integrated, company-wide, historical data from which the analysis process can be achieved [IH94]. A data warehouse integrates an enterprise, which is comprised of many, older and even incompatible application systems. One of the basic keywords in data warehouse technology is Dimensional Modeling [Fir98]. In Dimensional Modeling, a set of tables and relations forms a model whose purpose is to optimize decision support query performance, relative to a measure or set of measures of the outcome of the business process that is modeled. Using the Dimensional Modeling approach, developers first decide on the business processes that are going to be modeled and then on what each low-level record in the fact table will represent. For example, one would have all the transactions made in a bank, on a specific date and place by all customers, stored in a table and analyze them according to the "time", "geography" and "personal information" dimensions.

The concept of data warehousing has been widely used in business-oriented applications. It enables analysts, managers and executives to gain clear insight into the data of their enterprise, detect anomalies and make strategic decisions for a better position of their company among others in the market. The main characteristics of this technology are [ia588]:

• the multidimensional view of data; and

• the data analysis, which can be performed through interactive and navigational querying of data.

In a multidimensional system, there exists a set of numeric measures (the objects of analysis), which are viewed as points in a multidimensional space consisting of several dimensions. Each dimension can be described by a set of attributes that are related to each other according to a hierarchy. Example dimensions can be "product", "geography", "time", etc., while example measures can be the "dollar amount" of products or the "revenue of employees".
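As an illustration only, the following minimal Python sketch stores a "dollar amount" measure as points whose coordinates are dimension values, and describes the "geography" dimension by attributes ordered in a hierarchy. The names `DIMENSIONS`, `GEOGRAPHY`, `cube` and `slice_cube`, the store identifiers and all figures are assumptions made for this example; they do not come from the thesis.

```python
# Order of the coordinates in every point (hypothetical dimension names).
DIMENSIONS = ("product", "geography", "time")

# Attributes of the "geography" dimension, related by a hierarchy:
# store -> city -> country (finest to coarsest). Values are made up.
GEOGRAPHY = {
    "store_1": {"city": "Toronto", "country": "Canada"},
    "store_2": {"city": "Ottawa", "country": "Canada"},
    "store_3": {"city": "Athens", "country": "Greece"},
}

# Each value of the measure is a point in the space spanned by the dimensions.
cube = {
    ("suit", "store_1", "2000-01"): 120.0,
    ("suit", "store_2", "2000-01"): 80.0,
    ("shirt", "store_1", "2000-02"): 40.0,
    ("suit", "store_3", "2000-02"): 60.0,
}


def slice_cube(cells, dimension, value):
    """Keep only the points whose coordinate on `dimension` equals `value`."""
    axis = DIMENSIONS.index(dimension)
    return {coord: m for coord, m in cells.items() if coord[axis] == value}


suits = slice_cube(cube, "product", "suit")
print(sorted(suits))
print(GEOGRAPHY["store_3"]["country"])
```

Fixing one coordinate, as `slice_cube` does, is the simplest way to look at a lower-dimensional part of the space; the hierarchy attributes are what later allow the same points to be grouped at coarser levels.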

In this work, our main contribution will be to bring data warehouses and reverse engineering together. In particular, we will examine how software engineers can benefit from a multidimensional view of large software and how data analysts can benefit from access to hidden structures in their data, obtained by Reverse Engineering approaches. The hidden structures mostly involve graphs, which we believe can be explored and browsed using data warehousing techniques.

Reverse Engineering involves two phases [Til92]:

1. the identification of the components of a system and their interdependencies; and

2. the extraction of system abstractions and design information.

Intuitively, reverse engineering helps developers understand the architecture of large legacy software systems. Several tools have been designed and built towards this goal, mainly because the majority of legacy systems are undocumented. Even if documentation exists, these systems help people compare the as-implemented with the as-documented or the as-designed structure of the underlying system. Ciao [CFKW95], Dali [IXST], ManSART [YHC97], PBS [FHK+97], Rigi [MOTU93] and SPOOL [KSRP99] are tools that have evolved as products of reverse engineering research.

The above tools basically operate at two levels of abstraction [Til92]:

• the code level of abstraction; and

• the architectural level of abstraction.

At the code level, the focus is on implementation details, such as instantiation of variables, how expressions are affected by certain variable values and which functions call other functions inside a program. On the other hand, systems that operate at the architectural level extract facts that lead to a reconstruction of the actual system. Rigi, Dali and PBS use an entity-relationship model, at the conceptual level, to represent facts about a software system and an individual format for encoding the entities and relations. In addition, they are all 'open', in that they can be retargeted to different fact extractors or programming languages. Fact extractors are specific applications that examine the source code and reveal the interconnections between its entities. However, none of these tools uses a Multidimensional Database (MDDB) to store its facts.
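To make the notion of a fact extractor concrete, here is a minimal sketch that uses Python's standard `ast` module to pull "calls" facts out of Python source text. It is not the extractor of any tool named above: the function `extract_call_facts`, the `(caller, "calls", callee)` tuple format and the sample program are assumptions made for illustration, and only direct calls by simple name are recognised.

```python
import ast

# A small program to analyse (hypothetical example input).
SOURCE = '''
def helper():
    return 1

def foo():
    return helper() + 1

def bar():
    foo()
    helper()
'''


def extract_call_facts(source):
    """Return sorted (caller, "calls", callee) facts for top-level functions."""
    tree = ast.parse(source)
    facts = set()
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            for inner in ast.walk(node):
                # Only direct calls by simple name, e.g. foo();
                # method or attribute calls are ignored in this sketch.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    facts.add((node.name, "calls", inner.func.id))
    return sorted(facts)


facts = extract_call_facts(SOURCE)
for fact in facts:
    print(fact)
```

Each emitted tuple relates two entities of the entity-relationship view described above; a real extractor would also record files, variables and other relation types.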

All the aforementioned systems use graph structures as intermediate or final data structures to model intra- and inter-dependencies of the different components of a program. By intra-dependencies we usually mean any dependencies that exist among the data items of a procedure, a file, etc., and by inter-dependencies the interactions between two or more procedures. Examples include call graphs and Module Dependency Graphs (MDGs).
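The relationship between the two example graphs can be sketched as follows, under the assumption that each procedure is defined in exactly one file and that a file plays the role of a module. The mapping `DEFINED_IN`, the edge list `CALLS` and the function `module_dependency_graph` are hypothetical names introduced for this example.

```python
from collections import defaultdict

# Hypothetical facts: the file that defines each procedure, and the call graph.
DEFINED_IN = {"main": "app.c", "parse": "parser.c", "lex": "parser.c", "alloc": "mem.c"}
CALLS = [("main", "parse"), ("parse", "lex"), ("parse", "alloc"), ("lex", "alloc")]


def module_dependency_graph(calls, defined_in):
    """Lift procedure-level call edges to file-level dependency edges."""
    mdg = defaultdict(set)
    for caller, callee in calls:
        src, dst = defined_in[caller], defined_in[callee]
        if src != dst:
            # A call between procedures of the same file stays inside
            # the module and therefore adds no edge to the MDG.
            mdg[src].add(dst)
    return {module: sorted(deps) for module, deps in mdg.items()}


mdg = module_dependency_graph(CALLS, DEFINED_IN)
print(mdg)
```

Here the four call edges collapse into two module-level edges, which is the kind of coarser graph that architectural-level tools present to the user.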

Understanding the intricate relationships that may exist between the source code components of a software system may become a difficult task, but at the same time it is crucial for the maintenance of the system [MMR+98]. This maintenance will have a negative effect if it is not based on subsystem knowledge. The situation is even worse in the case of huge systems, where an architectural view is not easy to infer from the source code. That is the case where data warehouses, in conjunction with On-Line Analytical Processing (OLAP) systems, can help. OLAP systems are widely used to allow the interactive analysis of data properly modeled in a multidimensional way.

1.2 Motivation

Due to the presence of hidden structures that involve graphs, the proper use and analysis of such structures is of evident value. Identifying the components of a software system and the interactions between them using graph theoretical algorithms is one side of the coin. The other side consists of applying mining algorithms to partition the modules of a program into meaningful and natural regions.

Furthermore, a closer look at a software program reveals interesting information concerning its structure, which has to do with either its physical or its architectural design. The reverse engineering tools we referred to in the previous section (PBS, Rigi, Dali, etc.) basically use the physical structure of a software system under investigation to infer the architectural structure. This happens because it is often the case that documentation is non-existent for a software system (e.g. Linux [BHB99]).

Recently, the applicability of data mining in decision-making tasks has become necessary since its results provide insightful information that reverse engineering tools are not able to reveal. Data mining algorithms, such as Hierarchical Clustering and Concept Analysis, appear to be promising as far as software systems are concerned. Wiggerts gives a substantial analysis as to why and how clustering algorithms help in the renovation and maintenance of legacy systems [Wig97]. At the same time, several authors have been involved in the analysis of systems using Concept Analysis [ST98, vDK99, SR97, $IG99]. The identification of modules, program structure and other characteristics is boosted by the application of such an algorithm.

Both mining techniques mostly unveil:

• dependencies among the entities of the system;

• groupings or partitionings of entities; and

• relationships, especially hierarchical relationships, between entity groupings.

Navigation (or browsing) through these groups and hierarchies might help software engineers maintain or even understand the systems under consideration. In this work, we try to investigate how data warehouse technology and reverse engineering techniques can work together. In particular, we shall focus on how the multidimensional view of data helps in asking complex ad hoc queries over the information extracted by reverse engineering tools.

1.3 Contributions of the thesis

In this thesis our contributions are the following:

• we investigate how techniques, including data mining, can be used to partition and aggregate graphs, including program analysis graphs, using On-Line Analytical Processing techniques;

• we propose a multidimensional model for managing these groupings. Our results enhance the reverse engineering process by permitting integrated browsing and analysis of the data produced by these automated techniques, together with data produced by more human-centric documentation or reverse engineering techniques;

• we identify shortcomings in current OLAP techniques when applied to reverse engineering data. We propose OLAP extensions specifically designed to permit easy updates when the schema is modified by the introduction of new reverse engineering results. In particular, current OLAP models assume the structure and schema of groupings and hierarchies are static. Our solution relaxes this restriction;

• we conclude with an example of querying our multidimensional data.

The thesis is organized as follows.

• In Chapter 2, we give an overview of OLAP and Reverse Engineering systems. We conclude by giving our ideas of how these techniques can work together.

• In Chapter 3, we give the basics of hierarchical and heuristic algorithms for the partitioning of graphs and, after discussing some of the limitations of the aforementioned algorithms, we conclude with our arguments on how to incorporate the results of mining (in particular clustering) algorithms into an On-Line Analytical Processing framework, to enhance the reverse engineering process.

• In Chapter 4, we introduce our multidimensional model for the results of a Hierarchical Clustering and Concept Analysis algorithm.

• In Chapter 5, we give the intuition behind extending an existing data model that gives first-class status to dimensions in a data warehouse (the SQL(H) data model). We enumerate its limitations and give all definitions of our extended ESQL(H) data model and query language.

• In Chapter 6, we conclude and offer suggestions for further research in this area.

Chapter 2

Handling groups using OLAP and its use in Reverse Engineering

This chapter gives an overview of OLAP systems and their usage in the analysis of business data. Moreover, since groups can be identified in data emitted from Reverse Engineering tools, we give our ideas on how we could take advantage of OLAP systems to navigate and query such data.

2.1 What is OLAP

In an OLAP system, data are presented to the user in a multidimensional model, which comprises one or more fact tables and dimensions. A fact table consists of columns, each one corresponding to a dimension, e.g., geography, product, and one (or more) corresponding to the measure (or measures), e.g., sales amount. An example data warehouse containing the dimensions location, time, product and the fact table sales is depicted in Figure 2.1.

Furthermore, OLAP operations, such as roll-up or drill-down, provide the means to navigate along the dimensions of a data cube (we assume that the fact table is the relational representation of a data cube, the n-dimensional presentation of data [GBLPSS]).
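A minimal sketch of these two operations over a relational fact table is given below. It assumes a sales fact table with a store-to-city location hierarchy; the names `STORE_CITY`, `SALES`, `roll_up_to_city` and `drill_down`, and all values, are invented for illustration.

```python
from collections import defaultdict

# Hypothetical location hierarchy: store -> city.
STORE_CITY = {"s1": "Toronto", "s2": "Toronto", "s3": "Ottawa"}

# Fact table rows: (store, day, product, dollar_amount). Values are made up.
SALES = [
    ("s1", "2000-01-03", "suit", 100.0),
    ("s2", "2000-01-03", "suit", 50.0),
    ("s3", "2000-01-03", "suit", 70.0),
    ("s1", "2000-01-04", "shirt", 30.0),
]


def roll_up_to_city(rows):
    """Roll up the location dimension from store to city, summing the measure."""
    totals = defaultdict(float)
    for store, day, product, amount in rows:
        totals[(STORE_CITY[store], day, product)] += amount
    return dict(totals)


def drill_down(rows, city, day, product):
    """Drill down from one city-level cell to the store-level rows behind it."""
    return [r for r in rows
            if STORE_CITY[r[0]] == city and r[1] == day and r[2] == product]


by_city = roll_up_to_city(SALES)
detail = drill_down(SALES, "Toronto", "2000-01-03", "suit")
print(by_city[("Toronto", "2000-01-03", "suit")], detail)
```

Roll-up replaces a dimension value by its parent in the hierarchy and aggregates the measure; drill-down goes the other way and recovers the finer rows that contributed to an aggregated cell.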

Figure 2.1: An example data warehouse, showing (a) the location dimension, (b) the time dimension, (c) the product dimension and the fact table

While OLAP systems have the ability to answer "who?" and "what?" questions, it is their ability to answer "what if?" and "why?" that sets them apart from Data Warehouses. OLAP enables decision-making about future actions. A typical OLAP calculation is more complex than simply summing data, for example: "What would be the effect on suit costs if fabric prices went down by 0.20/inch and transportation costs went up by 0.10/mile?".

Based on the underlying architecture used for an OLAP application, vendors have classified their products as either Multidimensional OLAP (MOLAP) or Relational OLAP (ROLAP).

Multidimensional OLAP uses data stored in a multidimensional database (MDDB) so as to provide OLAP analysis. As shown in Figure 2.2, MOLAP is a two-tier, client/server architecture, in which the MDDB serves as both the database layer and the application logic layer. In the database layer it is responsible for data storage, access and information retrieval, while in the application logic layer it takes care of all the OLAP requests. Finally, the presentation layer integrates with the application logic layer to provide an interface through which users can issue their queries.

Figure 2.2: Multidimensional OLAP (MOLAP) Architecture

On the other hand, Relational OLAP supports OLAP analysis by accessing data stored in relational tables, i.e. a data warehouse. Figure 2.3 depicts the general architecture of a ROLAP system. It is evident that ROLAP is a three-tier, client/server architecture, in which the database layer uses conventional relational databases for data storage, access and information retrieval. At the application logic layer, a ROLAP engine executes the multidimensional reports from the users and integrates with various presentation layers, through which users issue their queries.

Figure 2.3: Relational OLAP (ROLAP) Architecture

Apart from the different ways that the above architectures store their data, they both provide managers with the information they need to make effective decisions about an organization's strategic directions. The key indicator of a successful OLAP application is its ability to provide information as needed, i.e., its ability to provide 'just-in-time' information for effective decision-making. Furthermore, due to the fact that data relationships may not be known in advance, the data model must be flexible. A truly flexible data model ensures that OLAP systems can respond to changing business requirements as needed for effective decision-making.

Although OLAP applications are found in widely divergent functional areas, they all require the following key features [Cou. AP98]:

• Multidimensional view of data, which provides more than the ability to 'slice and dice'; it gives the foundation for analytical processing through flexible access to information. Database design should not prejudice which operations can be performed on a dimension or how rapidly those operations are performed. Managers must be able to analyze data across any dimension, at any level of aggregation, with equal functionality and ease.

• Calculation-intensive capabilities. OLAP databases must be able to do more than simple aggregation. While aggregation along a hierarchy is important, there is more to analysis than simple data roll-ups. Examples of more complex calculations include share calculations (percentage of total) and allocations (which use hierarchies from a top-down perspective).

• Time intelligence. Time is an integral component of almost any analytical application. Time is a unique dimension because it is sequential in character (January always comes before February). True OLAP systems understand the sequential nature of time. At the same time, business performance is almost always judged over time, for example, this month vs. last month, this month vs. the same month last year.
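The calculations named in the last two features can be sketched in a few lines. The example below assumes monthly sales per region and shows a share calculation, a proportional top-down allocation, and a month-over-month comparison that relies on the ordering of time. All names (`SALES`, `share_of_total`, `allocate`, `previous_month`, `month_over_month`) and figures are hypothetical, and proportional allocation is only one of several possible allocation rules.

```python
# Hypothetical monthly sales per region; all figures are made up.
SALES = {
    ("East", "2000-01"): 200.0, ("West", "2000-01"): 300.0,
    ("East", "2000-02"): 250.0, ("West", "2000-02"): 250.0,
}


def share_of_total(sales, month):
    """Share calculation: each region's percentage of the month's total."""
    month_cells = {r: v for (r, m), v in sales.items() if m == month}
    total = sum(month_cells.values())
    return {r: 100.0 * v / total for r, v in month_cells.items()}


def allocate(amount, sales, month):
    """Allocation: split a parent-level amount among regions, top-down,
    in proportion to each region's share of sales."""
    return {r: amount * pct / 100.0
            for r, pct in share_of_total(sales, month).items()}


def previous_month(month):
    """Time intelligence: months are ordered, so each has a predecessor."""
    year, mon = map(int, month.split("-"))
    return f"{year - 1}-12" if mon == 1 else f"{year}-{mon - 1:02d}"


def month_over_month(sales, region, month):
    """Change in the measure between a month and the month before it."""
    return sales[(region, month)] - sales[(region, previous_month(month))]


print(share_of_total(SALES, "2000-01"))
print(allocate(1000.0, SALES, "2000-01"))
print(month_over_month(SALES, "East", "2000-02"))
```

Note that `previous_month` encodes knowledge that no other dimension needs: a product or a region has no natural predecessor, whereas a month always does.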

2.2 What is Reverse Engineering

Many systems, when they age, become difficult to understand and maintain. Sometimes, this task also becomes inefficient due to its high cost. A "Reverse engineering environment can manage the complexities of program understanding by helping the software engineers extract high-level information from low-level artifacts" [Til98].

A major effort has been undertaken in the software engineering community to produce tools that help program analysts uncover the hidden structure of legacy code. We already mentioned Rigi and The Software Bookshelf as two results of this effort [MOTU93, FHK+97]. These systems are basically focused on performing the central reverse engineering tasks presented in [Til98]:

1. Program Analysis. This is the task where source code analysis and restructuring is performed.

2. Plan Recognition. This is the task where common patterns are identified. The patterns can be behavioral or structural, depending on what relationships we are looking for in the code.

3. Concept Assignment. This is the task that allows the software engineers to discover human-oriented patterns in the subject system. This task is still at an early research stage.

4. Redocumentation. This is the task that attempts to build documentation for an undocumented, and probably old, system, describing its architecture and functionality.

From the above, it is obvious that reverse engineering tools try to extract an already existing, but unknown, structure of a software system. This involves the breakdown of the system into either system-oriented or human-oriented partitions that represent natural groupings, i.e., different subsystems or directories of the same system.

The system examination and management is based on the use of graph structures that are produced, and later on presented to the user, taking advantage of features of the original code, such as function calls or file inclusion. In the next section, we discuss how those natural groupings could be handled by an OLAP framework.

2.3 Can OLAP be used for Reverse Engineering?

To the best of our knowledge, program analysts have not taken advantage of a multidimensional view of data that could help them model and analyze the alternative postulated program structures. For instance, we could consider function calls stored in a fact table as in Figure 2.4. In that figure, Function2 is called by Function1.

| Function1 | ... | Function2 | ... |

Figure 2.4: A fact table with a graph structure

In these cases. ive would like to be able to identify useful graph partitions t hat basically

correspond to parts of the graph with a certain property. For instance. if the graph consti-

tiites a systeni's !dodule Dependency Graph. dense regions of the graph may correspond to

separate logical modules or siibsystems of the system.

The issue in question now is how we identify worthy graph partitions given a table like the one in Figure 2.4, i.e., how we horizontally partition the fact table into subtables with a certain structure, and how we efficiently query these systems. Therefore, a particular algorithm has to be used to produce the partitions of the fact table, and in addition an OLAP system is needed to represent and query the results. At this point, we would like to stress that our focus will not be on imposing a specific structure on our table but on extracting its inherent one. This process should be based on the following decision steps:

(1) What is the current format of our data;

(2) What is the algorithm under consideration; and

(3) What models have OLAP researchers proposed and used, and how information about the results of our techniques can be incorporated in them.

Upon this, we can now answer "what if?", "why?" and "where?" questions on the system under consideration. An example OLAP query could be: "What would be the effect on the memory subsystem if the function call to foo() from bar() is omitted and the io.h header file is moved to the /system directory?". Unlike traditional OLAP, the effect will likely not be a numeric aggregate, but rather a new grouping of entities produced either by a query or by mining or program analysis algorithms. This also makes the "multidimensional view of data" and "time intelligence" properties of more importance compared to the "calculation-intensive capabilities" one.

Moreover, current reverse engineering tools do not support version control of a software system. In order to investigate differences among versions of the system, one needs to examine all versions individually and manually find all points of interest. On the other hand, in an OLAP framework, time is treated as a separate dimension, making historical data easier to analyze.
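As a minimal sketch of this idea, assume the function calls of each version are stored as fact rows tagged with a version label. The rows, the version names and the helper functions below are hypothetical and are not taken from any existing tool. Treating the version as a dimension reduces the comparison of two versions to a set difference:

```python
# Hypothetical fact rows: (version, caller, callee).
calls = [
    ("v1", "bar", "foo"), ("v1", "main", "bar"),
    ("v2", "main", "bar"), ("v2", "main", "foo"),
]

def edges_of(version):
    """Slice the fact rows on the version dimension."""
    return {(src, dst) for v, src, dst in calls if v == version}

def diff(old, new):
    """Edges removed from and added to the call graph between two versions."""
    return edges_of(old) - edges_of(new), edges_of(new) - edges_of(old)

removed, added = diff("v1", "v2")
print("removed:", removed)
print("added:", added)
```

With the version kept as an ordinary column, no version has to be inspected manually; the same slicing works for any pair of versions.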

In our work, we shall consider data that are originally in the Rigi Standard Format (RSF) [MWT94, WTMS94], which is used by existing systems, such as Rigi and The Software Bookshelf [FHK+97], to provide understanding of legacy software systems. In the following chapter we present some graph theoretical and mining algorithms that can be used to unveil groups in software engineering data, which can later be modeled in an OLAP environment.

Chapter 3

Identifying Partitions and the use of

Hierarchies

As already mentioned in previous chapters, the problem of analyzing and understanding information related to a software system consists of finding proper and meaningful partitions of a graph. In reverse engineering such graphs include control flow, data flow and resource flow graphs [MC90]. They capture the dependencies or interactions among the software entities that comprise a system.

Graph-based algorithms are used in the software engineering and data mining communities to find natural partitions of the set of vertices, or edges, of a graph. What follows is an overview of these techniques and how they can be used. We conclude with an analysis of which of the aforementioned techniques are suitable for adaptation in an On-Line Analytical Processing (OLAP) system.

3.1 Clustering Algorithms

3.1.1 Hierarchical Clustering

The first family of algorithms that result in groupings of the initial data set is that of clustering algorithms. Their main purpose is to find natural and meaningful partitionings, or clusters. In some problems the produced clusters can be used "as is", while in others they may form the basis for constructing subsequent clusters, thus producing a hierarchy of clusters. This section gives an overview of the two major categories of hierarchical clustering algorithms: agglomerative (or bottom-up) and divisive (or top-down). Before proceeding with the brief description of these algorithms, we introduce the notion of a sequence of partitions that are nested into each other [JD88].

Consider a set $X$ of $k$ data items (in our case this can be the set of nodes or edges). A partition $\mathcal{C}$ of $X$ breaks it into subsets $\{C_1, C_2, \ldots, C_m\}$ such that:

$$C_i \cap C_j = \emptyset, \quad 1 \le i, j \le m, \; i \ne j, \quad \text{and} \quad \bigcup_{i=1}^{m} C_i = X$$

The set $\mathcal{C}$ is called a clustering and each of the $C_i$'s a cluster.

Definition 1 Partition $\mathcal{D}$ is nested into $\mathcal{C}$ if every cluster of $\mathcal{D}$ is a proper subset of a cluster of $\mathcal{C}$.

In the following clusterings, $\mathcal{D}$ is nested in $\mathcal{C}$, but $\mathcal{D}'$ is not:
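As an illustration with hypothetical data (the item names and the clusterings below are assumptions chosen for this sketch), both the partition conditions and the nesting of Definition 1 can be checked mechanically:

```python
def is_partition(clusters, items):
    """True if the clusters are non-empty, pairwise disjoint and cover `items`."""
    seen = set()
    for c in clusters:
        if not c or seen & c:
            return False
        seen |= c
    return seen == set(items)

def is_nested(inner, outer):
    """Definition 1: every cluster of `inner` is a proper subset of a cluster of `outer`."""
    return all(any(d < c for c in outer) for d in inner)

X = {"a", "b", "c", "d"}                 # hypothetical data items
C = [{"a", "b"}, {"c", "d"}]
D = [{"a"}, {"b"}, {"c"}, {"d"}]         # nested in C
D_prime = [{"a", "c"}, {"b"}, {"d"}]     # {"a", "c"} straddles two clusters of C

print(is_partition(C, X), is_nested(D, C), is_nested(D_prime, C))
```

Here every singleton of D lies strictly inside one cluster of C, whereas the cluster {a, c} of D' is contained in no cluster of C.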

Agglomerative Clustering

In agglomerative clustering [JD88], each data point starts as an individual cluster. As the algorithm proceeds, clusters are merged to form larger clusters, thus nesting a clustering into another partition. The merging of clusters is based on a similarity (or dissimilarity) function that decides how similar (or dissimilar) two clusters are.

Figure 3.1 is an example of how the agglomerative function works on a data set of 4 points. A special type of tree structure is used to depict the mergings and clusterings of each level. This structure is called a dendrogram.

Figure 3.1: An example of agglomerative clustering

A large collection of agglomerative algorithms is presented in [JD88].
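The bottom-up procedure can be sketched as follows. The sketch assumes a single-link dissimilarity (the smallest pairwise distance between two clusters), which is only one of the many possible choices; the four one-dimensional points are hypothetical data. Each element of the returned list is one level of the dendrogram.

```python
def single_link(c1, c2, dist):
    """Dissimilarity of two clusters: the smallest pairwise distance."""
    return min(dist(a, b) for a in c1 for b in c2)

def agglomerate(points, dist):
    """Return the list of clusterings, from singletons up to one cluster."""
    clustering = [frozenset([p]) for p in points]
    levels = [list(clustering)]
    while len(clustering) > 1:
        # Pick the closest pair of clusters (ties broken by list order).
        pairs = [(single_link(a, b, dist), i, j)
                 for i, a in enumerate(clustering)
                 for j, b in enumerate(clustering) if i < j]
        _, i, j = min(pairs)
        merged = clustering[i] | clustering[j]
        clustering = [c for k, c in enumerate(clustering) if k not in (i, j)]
        clustering.append(merged)
        levels.append(list(clustering))
    return levels

points = [1.0, 2.0, 6.0, 9.0]  # hypothetical one-dimensional data
levels = agglomerate(points, lambda a, b: abs(a - b))
for level in levels:
    print(sorted(sorted(c) for c in level))
```

Since exactly two clusters are merged per step, four points yield four levels, the last one holding a single cluster.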

Divisive Clustering

Divisive clustering algorithms [JD88] perform the task of clustering in reverse order. Starting with all the data points in a single "big" cluster, such an algorithm iteratively divides the "big" cluster into smaller ones. Algorithms of this type are not very popular due to their high computational complexity: at each step, the number of partitions to consider is exponential [JD88].

Most of the above cases often lead to expensive solutions that may not even be near optimal. These are the cases where we need to employ a smart procedure to identify meaningful and interesting clusters of a graph. Heuristic approaches then come into play in an attempt to find near-optimal solutions in a moderate amount of time. Researchers use variations of already known techniques, such as hill-climbing [KL70, MMR+98], to prune the search space of clusters in order to find "good" clusters in the minimum possible amount of time. Depending on the domain under consideration, different heuristics can be applied, with different results, and particular attention should be paid to their evaluation.

3.2 Discussion on the usage of graph theory and clustering algorithms in reverse engineering

Evaluating various techniques that perform clustering is crucial, and our concentration should be on the following three issues [CWX]:

(1) on what data will the methods be applied;

(2) what is the computational cost of a method; and

(3) how "good" are the clusters.

To the above, we add a fourth issue for consideration, which emanates from the amount of disk and memory space available:

(4) whether the algorithm is suitable for in-memory execution or the data should reside on disk.

We shall be dealing with graph data from the program domain; specifically, we will focus on what has been called Module Dependency Graphs, or MDGs [MMR+98]. An MDG is a directed graph whose nodes are entities of a software system (procedures, files, etc.) and whose edges are relationships between them. The nodes and edges may be accompanied by attributes that depict properties of each procedure (developer, version, fan-in, fan-out, etc.). A software system seems easy to analyze when the number of modules (nodes) is fairly small. In this work, we are interested in analyzing large legacy systems, consisting of several thousands of nodes and edges, which often come undocumented. Our goal is to find partitionings of the MDG such that the produced subsets are natural and represent interesting information inherent in the system. Although there might be some structure inside a software system, we are often unable to single out individual components.
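To make the structure concrete, the following sketch stores a small MDG as a list of labeled edges and derives two of the node attributes mentioned above, fan-in and fan-out. The edge list, the relation label and the function names are hypothetical; repeated tuples stand for parallel edges.

```python
from collections import Counter

# Each edge is (source, relation, target); repeated tuples model parallel edges.
edges = [
    ("main", "call", "foo"),
    ("main", "call", "foo"),      # second call site: a parallel edge
    ("main", "call", "printf"),
    ("foo", "call", "printf"),
]

def fan_out(edges):
    """Number of outgoing edges per node (parallel edges counted separately)."""
    return Counter(src for src, _, _ in edges)

def fan_in(edges):
    """Number of incoming edges per node (parallel edges counted separately)."""
    return Counter(dst for _, _, dst in edges)

print(fan_out(edges)["main"], fan_in(edges)["printf"])
```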

Agglomerative and divisive algorithms have been proposed by Wiggerts [Wig97] as a means of performing hierarchical clustering given an MDG-like graph. Both categories of algorithms are based on a similarity or dissimilarity measure among the nodes of the graph. This measure has to be updated each time a new cluster is formed or split into smaller ones. The measure obviously affects the number and the quality of the clusters, mainly due to the following reasons:

1. a node (module) might end up being in a wrong cluster due to ties;

2. different measures can give different clusterings.

In the hierarchical algorithms, it is not clear what happens if there is more than one edge between two nodes, i.e., if we have a multi-graph. We would say that these algorithms are inefficient, even inapplicable, in such cases. To make matters worse, parallel edges often appear in MDGs, for example when a function calls another function at two different points of the program.

Heuristic approaches seem to alleviate this problem and, moreover, give natural clusters of a system. Mancoridis et al. describe a system that generates meaningful clusters based on the inter- and intra-connections of nodes in MDGs [DMM99, MMCG99, MMR+98]. The clusters conform to the heuristic of "low coupling and high cohesion", which is widely used in software engineering. Low coupling is a software principle which requires that interactions between subsystems should be as few as possible, while high cohesion is a related principle that requires that interactions within a subsystem should be maximized. Inside the described framework, a genetic algorithm is applied to an MDG; it explores, in a systematic way, the extremely large space of partitions and gives a "good" one. Their system, called BUNCH, operates well for any given set of nodes and edges.
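The heuristic itself can be illustrated by simply counting edges. The sketch below is not the objective function used by Mancoridis et al.; it is a deliberately naive measure, on a hypothetical four-node graph, that counts inter-cluster edges (coupling) and intra-cluster edges (cohesion) for a candidate partition.

```python
def coupling_and_cohesion(edges, clusters):
    """Count inter-cluster edges (coupling) and intra-cluster edges (cohesion)."""
    owner = {node: idx for idx, cluster in enumerate(clusters) for node in cluster}
    intra = sum(1 for src, dst in edges if owner[src] == owner[dst])
    inter = len(edges) - intra
    return inter, intra

edges = [("a", "b"), ("b", "a"), ("c", "d"), ("b", "c")]   # hypothetical MDG
good = [{"a", "b"}, {"c", "d"}]
bad = [{"a", "c"}, {"b", "d"}]

print(coupling_and_cohesion(edges, good))  # (1, 3): one crossing edge, three internal
print(coupling_and_cohesion(edges, bad))   # (4, 0): every edge crosses clusters
```

A search procedure such as hill-climbing or a genetic algorithm would compare candidate partitions by a measure of this kind and keep the better ones.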

We should note here the existence of graph theoretical algorithms that try to capture groups in graph structures. Namely, algorithms that investigate strongly connected components or articulation points might be of significant interest for our problem. Strongly connected components identify the "pieces" that comprise a directed graph: two vertices are in the same component if and only if each is reachable from the other by some path [Wes96]. On the other hand, articulation point algorithms find vertices whose deletion disconnects a graph [Ski98]. Although both types of algorithms do not require a significant amount of memory and space [CLR92], their applicability is not proven in the software reverse engineering domain.
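For reference, strongly connected components can be computed with two depth-first traversals. The sketch below uses Kosaraju's algorithm, one standard choice among several, on a hypothetical call graph given as an adjacency dictionary; it is recursive and therefore only suitable for small inputs.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm on a dict {node: [successors]}."""
    order, seen = [], set()

    def visit(u):
        # First pass: record nodes in order of completion.
        seen.add(u)
        for v in graph.get(u, []):
            if v not in seen:
                visit(v)
        order.append(u)

    for u in graph:
        if u not in seen:
            visit(u)

    # Build the reversed graph.
    reverse = {u: [] for u in graph}
    for u, succs in graph.items():
        for v in succs:
            reverse.setdefault(v, []).append(u)

    components, assigned = [], set()

    def collect(u, comp):
        # Second pass: gather everything that can reach u in the original graph.
        assigned.add(u)
        comp.add(u)
        for v in reverse.get(u, []):
            if v not in assigned:
                collect(v, comp)

    for u in reversed(order):
        if u not in assigned:
            comp = set()
            collect(u, comp)
            components.append(comp)
    return components

graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}  # hypothetical graph
components = strongly_connected_components(graph)
print(components)
```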

To the previous algorithms, we could add that of finding cliques in a graph. Intuitively, a clique is a graph in which each pair of vertices is an edge. A complete graph (a graph in which all pairs of vertices form edges) has many subgraphs that are not cliques, but every induced subgraph of a complete graph is a clique [Wes96]. Finding cliques, however, is an NP-complete problem [C\\'3], and in [MM65] Moon and Moser showed that the number of cliques in a graph may grow exponentially with the number of nodes.

In our work, we do not consider any graph theoretical algorithm.

An interesting question that arises when a clustering algorithm is applied has to do with the identity of the clusters. If the algorithm is hierarchical, the question also includes the identity of the levels produced. One way to deal with it is to use existing domain knowledge about the software system. In the following section, we present a more natural technique widely used in the software engineering community, that of Concept Analysis.

3.3 Concept Analysis

Concept analysis is a means to identify groupings of objects that have common attributes. In 1940 G. Birkhoff [Bir40] proved that for every binary relation between "objects" and their "attributes", a lattice can be built, which allows remarkable insight into the structure of the original relation. The following definitions and the example are taken from [LS97].

In concept analysis we consider a relation $\mathcal{T}$ between objects $\mathcal{O}$ and attributes $\mathcal{A}$, hence $\mathcal{T} \subseteq \mathcal{O} \times \mathcal{A}$. A formal context is the triple:

$$\mathcal{C} = (\mathcal{O}, \mathcal{A}, \mathcal{T})$$

For any set of objects $O \subseteq \mathcal{O}$, their set of common attributes is defined by:

$$\sigma(O) = \{a \in \mathcal{A} \mid \forall o \in O : (o, a) \in \mathcal{T}\}$$

while, for any set of attributes $A \subseteq \mathcal{A}$, their set of common objects is given by:

$$\tau(A) = \{o \in \mathcal{O} \mid \forall a \in A : (o, a) \in \mathcal{T}\}$$

A pair $(O, A)$ is called a concept if:

$$A = \sigma(O) \quad \text{and} \quad O = \tau(A)$$

Such a concept corresponds to a maximal rectangle in the table $\mathcal{T}$. A maximal rectangle is a set of objects sharing common attributes.
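These definitions translate directly into code. The following sketch uses a small hypothetical context (three subroutines and three variables, not the example of [LS97]) and enumerates all concepts by closing every subset of objects. This brute-force enumeration is exponential in the number of objects and is meant only to illustrate the definitions.

```python
from itertools import combinations

# Hypothetical context: which subroutine (object) uses which variable (attribute).
T = {("R1", "V1"), ("R1", "V2"), ("R2", "V2"), ("R2", "V3"), ("R3", "V3")}
OBJECTS = sorted({o for o, _ in T})
ATTRIBUTES = sorted({a for _, a in T})

def sigma(objs):
    """Common attributes of a set of objects."""
    return frozenset(a for a in ATTRIBUTES if all((o, a) in T for o in objs))

def tau(attrs):
    """Common objects of a set of attributes."""
    return frozenset(o for o in OBJECTS if all((o, a) in T for a in attrs))

def concepts():
    """Enumerate all concepts by closing every subset of objects."""
    found = set()
    for r in range(len(OBJECTS) + 1):
        for objs in combinations(OBJECTS, r):
            intent = sigma(objs)
            found.add((tau(intent), intent))
    return found

for extent, intent in sorted(concepts(), key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

Every printed pair satisfies both conditions of the definition: the second component is the set of common attributes of the first, and the first is the set of common objects of the second.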

Concept analysis starts with the table $\mathcal{T}$ indicating the attributes of a given set of objects. It then builds up so-called concepts, which are maximal sets of objects sharing certain features. All possible concepts can be grouped into a single lattice, the so-called concept lattice. The smallest concepts consist of few objects having potentially many different attributes; the largest concepts consist of many different objects that have only few attributes in common. A formal context and its concept lattice extracted from a FORTRAN source file are shown in Figure 3.2.

Figure 3.2: A source code, its variable usage and its concept lattice [LS97]

The set of all the concepts of a given table is partially ordered:

$$(O_1, A_1) \le (O_2, A_2) \iff O_1 \subseteq O_2 \iff A_1 \supseteq A_2$$

In the concept lattice, the infimum, or meet, of two concepts is computed by intersecting their extents, the extent of a concept being the set of its objects $O$:

$$(O_1, A_1) \wedge (O_2, A_2) = (O_1 \cap O_2, \; \sigma(O_1 \cap O_2))$$

Thus, an infimum describes the set of attributes common to two sets of objects.

The supremum, or join, is computed by intersecting the intents, the intent of a concept being the set of its attributes $A$:

$$(O_1, A_1) \vee (O_2, A_2) = (\tau(A_1 \cap A_2), \; A_1 \cap A_2)$$

Thus, a supremum describes a set of common objects which share the two sets of attributes.

In order to interpret a concept lattice, we also need to define the following:

$$\mu(a) = \bigvee \{c \in L(\mathcal{C}) \mid a \in intent(c)\}$$

which corresponds to a lattice element labeled with $a$, and

$$\gamma(o) = \bigwedge \{c \in L(\mathcal{C}) \mid o \in extent(c)\}$$

which corresponds to a lattice element labeled with $o$. The property that connects a concept lattice with its table is as follows:

$$(o, a) \in \mathcal{T} \iff \gamma(o) \le \mu(a)$$

Hence, the attributes of object $o$ are just those which show up above $o$ in the lattice, and the objects for attribute $a$ are those which show up below $a$.

Interpreting the concept lattice of Figure 3.2, we have the following, according to the aforementioned definitions:

All subroutines below $\mu(V3)$ (R2, R3, R4) use V3 (and no other subroutines use V3). All variables above $\gamma(R4)$ (V3, V4, V5, V6, V7, V8) are used by R4 (and no other variables are used by R4). Thus, the concept labeled R4 is:

$$c_1 = (\{R4\}, \{V3, V4, V5, V6, V7, V8\})$$

and the concept labeled V5/R2 is:

$$c_3 = (\{R2, R4\}, \{V3, V4, V5\})$$

It is obvious that $c_1 \le c_3$. This can be read as: "Any variable that is used by subroutine R2 is also used by R4". Similarly, $\mu(V5) \le \mu(V3) = \mu(V4)$, which is read as: "All subroutines which use V5 also use V3 and V4". Moreover, the infimum of V5/R2 and V6, V7, V8/R3 is labeled R4, meaning that R4 (and all subroutines below $\gamma(R4)$) uses both V5 and V6, V7, V8.

After all, the lattice uncovers a hierarchy of conceptual clusters implicit in the original table. To handle those conceptual clusters in a multidimensional way and, furthermore, query them, we need to introduce a proper multidimensional model. Several researchers have proposed variations of a multidimensional model [Fir98, IH94, PJ99, Vas98, CT97, AGS97], and our work will center on an extension of it so as to include clusters and concepts. From our description of hierarchical clustering it seems reasonable for our task to use an agglomerative algorithm that incorporates a hierarchy with clusters organized in a "cluster" dimension. The following chapter gives a first approach towards the multidimensional modeling of clustering and concept analysis.

Chapter 4

The Multidimensional Model

This chapter introduces our approach to the multidimensional model that will incorporate the results produced by reverse engineering tools and mining algorithms. First, we give the definitions of the model for the hierarchical clustering algorithms, and then we extend it to include the results of concept analysis.

4.1 A Multidimensional Model for managing hierarchical clusters

Let us consider the following:

• Let $F$ be a set of features over which we perform the clustering. Hence, $F$ is given by:

$$F = \{f_1, f_2, \ldots\}$$

where $f_i$ can be a function-call, a file inclusion, etc.

• Let $A$ be a set of nodes. $A$ is given by:

$$A = \{A_1, A_2, \ldots\}$$

where $A_i$ can be a file, function, variable, etc.

The dependencies (i.e., interactions between entities of the software system) that we have are of the form:

$$A_i \xrightarrow{f_k} A_j$$

where $A_i$, $A_j$ and $f_k$ could, for example, comply with the schema of Figure 4.1, which is used in the Software Bookshelf tool [FHK+97]. Each of the nodes $A_i$ has a domain, denoted by $dom(A_i)$, which corresponds to the values it can take. Intuitively, features represent edge labels on the relations that connect two entities (nodes).

Figure 4.1: The schema for the relations used in Software Bookshelf (entities: function, variable, macro, union, type, enum; relations: funcdcl, funcdef, vardcl, vardef, macrodef, usemacro, uniondef, useunion, typedef, usetype, enumdef, useenum)

Each feature may be accompanied by a set of attributes $B^{f_i}_1, B^{f_i}_2, \ldots, B^{f_i}_k$, where $B^{f_i}_k$ is the $k$-th attribute of feature $f_i$. If $f_i$ = function-call, a set of attributes can be:

$$B^{\text{function-call}}_1 = \text{line number}$$

$$B^{\text{function-call}}_2 = \text{number of parameters passed}$$

In the same sense, a node $A_i$ may also be accompanied by a set of attributes $B^{A_i}_1, B^{A_i}_2, \ldots, B^{A_i}_l$. If $A_i$ = file, the set of attributes that can be defined is the following:

$$B^{\text{file}}_1 = \text{number of lines}$$

$$B^{\text{file}}_2 = \text{number of if-statements}$$

$$B^{\text{file}}_3 = \text{number of function calls}$$

For each feature $f_i$, we create a table with the following schema:

$$f_i(A_j, A_k, \{B^{f_i}\})$$

where $\{B^{f_i}\}$ is the set of attributes for $f_i$. In the above, $f_i$ is a fact table of our Data Warehouse. Since we might have parallel edges, i.e., multiple appearances of a pair $(A_j, A_k)$, we give each pair a unique identifier, and the following is the updated schema of the fact table $f_i$:

$$f_i(id, A_j, A_k, \{B^{f_i}\})$$

For example, if $f_i$ = function-call, then, according to Figure 4.1, $A_j = A_k$ = function, and a fact table could look like the following:

id | func1 | func2  | line # | # of parameters passed
1  | foo   | printf | 25     | 0
2  | foo   | scanf  | 30     | 1
3  | main  | foo    | 100    | 3
4  | main  | printf | 105    | 2

For each node $A_i$ we create a table with the following schema:

$$A_i(A_i, \{B^{A_i}\})$$

where $\{B^{A_i}\}$ is the set of node attributes for the node $A_i$.

From Figure 4.1 we can easily infer that a fact table incorporates a graph structure. For instance, all dependencies of the form of Figure 4.1 constitute a graph. We are interested in splitting the nodes of the graph into meaningful horizontal partitions. Considering each node $A_k$ from the original set as an individual cluster, we can apply a hierarchical clustering algorithm on that set of nodes.

An initial clustering is defined as:

$$D^{f_i}_0 = \{C_1, C_2, \ldots\}$$

where each $C_k$ is a cluster that corresponds to a node $A_k$ which participates in feature $f_i$.

Given a similarity (or dissimilarity) function $G$, we may start by trying to find which clusters can be formed from the initial clustering $D^{f_i}_0$. We call this clustering $D^{f_i}_1$, and $D^{f_i}_0$ is nested in $D^{f_i}_1$ since each cluster of $D^{f_i}_0$ is a proper subset of a cluster of $D^{f_i}_1$.

Definition 2 Operator $\preceq$ denotes the nesting of one clustering into another with respect to a feature $f_i$. Hence, if $D^{f_i}_j$ is nested in $D^{f_i}_{j+1}$, we write:

$$D^{f_i}_j \preceq D^{f_i}_{j+1}$$

We perform consecutive clusterings, say $p$, until we find the final clustering with $\|D^{f_i}_p\| = 1$.

Let $f_i$ = function-call, $A_i$ = function with $dom(A_i)$ = {foo, bar, main, printf, scanf} and a similarity function $G$. The clustering algorithm may give the following clusterings:

$$D^{\text{function-call}}_0 = \{(foo), (bar), (main), (printf), (scanf)\}$$

$$D^{\text{function-call}}_1 = \{(foo, bar), (main), (printf, scanf)\}$$

$$D^{\text{function-call}}_2 = \{(foo, bar, main), (printf, scanf)\}$$

$$D^{\text{function-call}}_3 = \{(foo, bar, main, printf, scanf)\}$$

The schema for the above hierarchical clustering is depicted in Figure 4.2, while its instantiation is depicted in Figure 4.3. Figures 4.2 and 4.3 represent a dimension $D$ and its levels $D^{f_i}_j$.

Figure 4.2: A clustering schema

Figure 4.3: A clustering instance

Definition 3 For each pair of clusterings $D^{f_i}_j$, $D^{f_i}_{j+1}$ such that $D^{f_i}_j \preceq D^{f_i}_{j+1}$, there exists a roll-up function:

$$RUP : D^{f_i}_j \rightarrow D^{f_i}_{j+1}$$

Intuitively, the RUP function aggregates one or more clusters of one clustering ($D^{f_i}_j$) to a cluster of the immediately higher-order clustering in the hierarchy ($D^{f_i}_{j+1}$).

If a cluster rolls up to another cluster with the same elements, i.e., $C : (x_1, x_2, \ldots, x_n)$ rolls up to $C' : (y_1, y_2, \ldots, y_n)$ such that $x_i = y_i$ for all $i$, $1 \le i \le n$, then RUP = identity.
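A minimal sketch of the roll-up function on the function-call example follows. The list-of-sets representation and the names `rup` and `ancestors` are assumptions made for illustration; they are not part of the model's definition.

```python
# The four clusterings of the example, from the finest (level 0) to the coarsest.
levels = [
    [{"foo"}, {"bar"}, {"main"}, {"printf"}, {"scanf"}],
    [{"foo", "bar"}, {"main"}, {"printf", "scanf"}],
    [{"foo", "bar", "main"}, {"printf", "scanf"}],
    [{"foo", "bar", "main", "printf", "scanf"}],
]

def rup(cluster, j):
    """Roll a cluster of level j up to the cluster of level j+1 that contains it."""
    for parent in levels[j + 1]:
        if cluster <= parent:
            return parent
    raise ValueError("clusterings are not nested")

def ancestors(node):
    """Chain of clusters containing `node`, one per level."""
    chain = [{node}]
    for j in range(len(levels) - 1):
        chain.append(rup(chain[-1], j))
    return chain

print(ancestors("main"))
```

Note that the cluster (main) rolls up from level 0 to level 1 unchanged, which is the identity case described above.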

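For each pair of nodes $(A_a, A_b)$ that appears in the fact table $f_i$, i.e., the fact table that corresponds to feature $f_i$, we create two tables:

$$A_a(A_a, \{B^{A_a}\}, \{D^{f_i}\} \setminus D^{f_i}_0) \qquad A_b(A_b, \{B^{A_b}\}, \{D^{f_i}\} \setminus D^{f_i}_0)$$

where $\{D^{f_i}\} \setminus D^{f_i}_0$ is the set of clusterings except for $D^{f_i}_0$, which is represented by $A_a$ and $A_b$ in each table.

The dimension $D^{f_i}$ for the pair $(A_a, A_b)$ is given by: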

After all, the schema of a Data Warehouse (DW) is like the one depicted in Figure 4.4. This example shows how inherent hierarchies in a software system can be revealed using hierarchical clustering and, moreover, how the results of such a clustering algorithm can be modeled in a multidimensional way. Upon the formation of such a model and the storage of the data in the tables, navigation and browsing become easy and efficient.

Figure 4.4: The DW schema

4.2 A Multidimensional Model for managing concepts

In our approach to hierarchical clustering, we create all groupings without actually knowing what each $D^{f_i}_j$ represents. Concept Analysis is a means not only to discover the groupings but also to describe them in a more natural way.

Suppose that we have a relation $F \subseteq A_a \times A_b$, where $A_a$ is the set of objects and $A_b$ is the set of attributes. For instance, $A_a$ can be a set of nodes corresponding to .c files and $A_b$ the set of nodes corresponding to .h files. In this example, $F$ is the relation that depicts file inclusion. $F$ is called the relation matrix in concept analysis, while here it represents the fact table described in the previous section. An example of a relation matrix is given in Figure 4.5. The example was taken from [GMA95].

The relation matrix can be accompanied by several attributes that characterize the edges it represents (e.g., for a relation matrix that represents file inclusion, possible attributes might be the developer of the .c files and the number of lines of the .h files).

Having such a matrix available, we may start building the concept lattice of the above relation, which depicts the maximal rectangles of the relation matrix. The concept lattice for the relation matrix of Figure 4.5 is the one of Figure 4.6.

Figure 4.5: Example of a relation matrix

For each object $O_i$ that appears in a concept $C_i$, we create the table extents with the following schema:

extents(Object, Concept)

The primary key for the above table is the combination of both attributes.

For each attribute $A_i$ that appears in a concept $C_i$, we create a table intents with the following schema:

intents(Attribute, Concept)

The primary key for the above table is the combination of both attributes.

Finally, once the concept analysis algorithm has computed the concepts and the links between them, i.e., the hierarchy inside the concept lattice, we create a table that depicts the child → parent relationships between concepts. The schema of that table is:

hierarchy(Concept1, Concept2)

Figure 4.6: The concept lattice of the relation matrix in Figure 4.5

The primary key for the above table is the combination of both attributes. Concept1 is the child attribute, while Concept2 is the parent one.

The above tables for our example are depicted in Figures 4.7, 4.8 and 4.9.

The schema of Figure 4.4 is now extended to include the above tables. The new schema is given in Figure 4.10. Moreover, given the tables extents, intents and hierarchy, we can compute any concept in the lattice using standard SQL.
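As a sketch of how such queries might look, the following example builds the three tables in an in-memory SQLite database through Python's standard `sqlite3` module and retrieves a concept and its parents. The file names, the concept identifiers and the four-concept lattice are hypothetical; only the three table schemas come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE extents  (Object TEXT, Concept TEXT, PRIMARY KEY (Object, Concept));
CREATE TABLE intents  (Attribute TEXT, Concept TEXT, PRIMARY KEY (Attribute, Concept));
CREATE TABLE hierarchy(Concept1 TEXT, Concept2 TEXT, PRIMARY KEY (Concept1, Concept2));
""")

# Hypothetical context: a.c includes x.h and y.h; b.c includes y.h and z.h.
conn.executemany("INSERT INTO extents VALUES (?, ?)",
                 [("a.c", "c1"), ("b.c", "c2"), ("a.c", "top"), ("b.c", "top")])
conn.executemany("INSERT INTO intents VALUES (?, ?)",
                 [("x.h", "c1"), ("y.h", "c1"), ("y.h", "c2"), ("z.h", "c2"),
                  ("y.h", "top"), ("x.h", "bot"), ("y.h", "bot"), ("z.h", "bot")])
# Concept1 is the child, Concept2 is the parent.
conn.executemany("INSERT INTO hierarchy VALUES (?, ?)",
                 [("c1", "top"), ("c2", "top"), ("bot", "c1"), ("bot", "c2")])

def concept(cid):
    """Return the (objects, attributes) pair of a concept."""
    objs = [r[0] for r in conn.execute(
        "SELECT Object FROM extents WHERE Concept = ? ORDER BY Object", (cid,))]
    attrs = [r[0] for r in conn.execute(
        "SELECT Attribute FROM intents WHERE Concept = ? ORDER BY Attribute", (cid,))]
    return objs, attrs

def parents(cid):
    """Return the immediate parents of a concept."""
    return [r[0] for r in conn.execute(
        "SELECT Concept2 FROM hierarchy WHERE Concept1 = ? ORDER BY Concept2", (cid,))]

print(concept("c1"), parents("bot"))
```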

We understand that for both Hierarchical Clustering and Concept Analysis algorithms, levels are of major importance. We need to efficiently navigate through the different levels of the hierarchy these algorithms produce and infer things that happen above or below a specific level. In general, and for any instance of the Data Warehouse, we need to be able to view the software system from different levels of abstraction (or detail).

The next chapter introduces the SQL(H) multidimensional model that gives first-class status to the dimensions, i.e., the hierarchies they encompass.

Figure 4.7: The extents table for the lattice in Figure 4.6 (columns: Object, Concept)

Figure 4.8: The intents table for the lattice in Figure 4.6 (columns: Attribute, Concept)

Figure 4.9: The hierarchy table for the lattice in Figure 4.6

Figure 4.10: The extended DW schema (including the tables hierarchy, extents and intents)

Chapter 5

The Extended SQL(H) Model

As mentioned at the end of the previous chapter, performing mining algorithms over software data needs to be flexible, in the sense that hierarchies and levels inside them should be well defined and easy to use. In the paper "What can Hierarchies do for Data Warehouses?" Jagadish et al. proposed a new multidimensional model, called SQL(H), which extends the relational data model of SQL and gives first-class significance to the hierarchies in dimensions.

In this chapter, we briefly introduce the SQL(H) model, identify some key weaknesses of this model, and go one step further by extending it to the Extended SQL(H) (or ESQL(H)) model, in order to make it more general and adaptable to our needs.

5.1 The SQL(H) model

Several models have emerged to handle multidimensional data. We can briefly mention the Star and Snowflake schemata as the most prevalent and elegant ones. However, several limitations apply to these models, with heterogeneity within and across levels being one of them (especially for the Star schema). Restricting the case to relational storage of fact and dimension tables (ROLAP architecture), those models require that the complete information concerning the levels of a hierarchy be stored in a single table. The shortcomings are straightforward. For example, in a dimension named location, USA and Monaco are constrained to be modeled in the same way, e.g., within the hierarchy store-city-region-country. But, as we know, Monaco is a city and a country at the same time.

The authors of [JLS99] list the following limitations of the snowflake schema:

• "Each hierarchy in a dimension has to be balanced", i.e., the length from the root to a leaf has to be the same.

• "All nodes at any level of a hierarchy have to be homogeneous", i.e., they should include the same attributes.

Since the hierarchies in the aforementioned models are restricted to be part of the metadata, i.e., they do not have first-class importance, even simple queries have to include sequences of joins, making them hard to read and understand. The SQL(H) model tackles the above problem by introducing an extension of standard SQL.

The SQL(H) model comprises:

• A Hierarchical Domain, which is a collection of attribute values arranged in such a way that they form a tree. New predicates are defined over this domain, and these predicates are:

  - =, which is the standard equality predicate;

  - <, which corresponds to a binary relation over the set of attribute values so that they form a tree;

  - <<, which is the transitive closure of <; and

  - <= (resp. <<=), which corresponds to the relation that represents non-proper child-parent (resp. descendant-ancestor) dependencies.

  In general, we interpret each hierarchical domain as a special data type.

• A Hierarchy Schema, which forms a rooted Directed Acyclic Graph (DAG). In this data structure the root has a special value All. Each node of the DAG accommodates a certain number of attributes including one that has a hierarchical domain, the hierarchical attribute, which is denoted by $A_h$.

• A Hierarchy Instance, which corresponds to a hierarchy schema defined as above. In the instance, all relational tables correspond to exactly one table of the schema, while at the same time no table can straddle hierarchy levels. This means that all the values in the same table belong to the same dimension. Finally, tuples of a specific table are properly related with tuples of a table (or more than one table) above it. This means that given a tuple in a table and the hierarchy of the hierarchical attribute that corresponds to this tuple, we can infer its ancestors.

• A Dimension Schema, which is a name together with a hierarchy schema.

• A Dimension Instance, which is a name together with a hierarchy instance.

• A Data Warehouse Schema, which is a set of fact tables together with a set of dimension schemas. Fact tables are restricted to include hierarchical attributes corresponding to only the leaves of the appropriate dimension.
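The predicates of the hierarchical domain listed above can be sketched over a parent map. The location values, the `PARENT` dictionary and the function names below are hypothetical and serve only to illustrate the four predicates; they are not the implementation of [JLS99]. Note that Monaco hangs directly under the root, which a balanced hierarchy would not allow.

```python
# Hypothetical hierarchical domain: each value maps to its parent (root has None).
PARENT = {
    "All": None,
    "USA": "All", "Monaco": "All",
    "East": "USA",
    "NYC": "East",
}

def child(a, b):
    """a < b: b is the parent of a."""
    return b is not None and PARENT.get(a) == b

def descendant(a, b):
    """a << b: transitive closure of <."""
    node = PARENT.get(a)
    while node is not None:
        if node == b:
            return True
        node = PARENT.get(node)
    return False

def child_or_equal(a, b):
    """a <= b: non-proper child-parent relation."""
    return a == b or child(a, b)

def descendant_or_equal(a, b):
    """a <<= b: non-proper descendant-ancestor relation."""
    return a == b or descendant(a, b)

print(child("NYC", "East"), descendant("NYC", "USA"), child("NYC", "USA"))
```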

Imagine a Data Warehouse that includes the dimensions of location, time and product, and whose fact table captures dollar amounts for sales with respect to these dimensions. The schemas of all tables that could form such a Warehouse are depicted in Figure 5.1. In this figure locId, tId and pId are hierarchical attributes and primary keys for their respective relations.

Recall the example of the concept lattice in Figure 4.6. Trying to use the SQL(H) model to represent the schema of the concept analysis algorithm results, we first observe that we do not have a tree for the hierarchical domain of the first candidate for such an attribute, which is the set of Concept Ids: $\{C_i, 1 \le i \le 12\}$. In order to do so, we need a more general structure. Before introducing such a structure, let us see what extensions the SQL(H) model adds to standard SQL.

Figure 5.1: A Data Warehouse conforming to the SQL(H) model: (a) location dimension, (b) time dimension, (c) product dimension, and the fact table.

5.2 The Query language for the SQL(H) model

To take full advantage of the SQL(7-l) niodel a simple but poweriul estension of standard

SQL is proposed in LJLSSI)!. Considering single block SQL(31) queries. the basic extensions

are the following.

O DIMENSIONS clause: This clause permits the inclusion of dimension names in

n. query. tt is relevant to the tables mcntioned in a FR011 clause of standard SQL.

but they now refer to the tables of a dimension. Sloreover. just like in SQL we can

declare tuple variables. i n a DI3IESSIOXS clause al1 names that corne right after the

dimension name are called dimension cariables. :\lthough. it will be mentioned in

the semantics of the language. dimension variables range over al1 tu ples of al1 tables

appearing in a dimension.

• Hierarchical predicates: in the SELECT, WHERE, HAVING or GROUP BY
  clauses of an SQL query we can include domain expressions (DEs) of the form T.A,
  where T is a tuple variable and A an attribute name. These DEs are compared with
  others, or with values of compatible type. In order to take advantage of the hierarchical
  operators that are defined in the SQL(H) model, we permit DEs of the form V.A
  where V is a dimension variable and A an attribute name. Moreover, we extend
  DEs to include hierarchical domain expressions (HDEs), which are of the form $W.A^h$
  where W is a tuple/dimension variable and $A^h$ a hierarchical attribute. HDEs can
  be compared with each other using the predicates (hierarchical predicates) that are
  defined in the hierarchical domain. For example, given A and B, A < B means that
  A is a child of B.

5.3 Semantics of the SQL(H) query language

For the sake of simplicity, the authors of [JLS99] use uniform SQL(H) queries. Such a query
is of the form:

    SELECT domExpList, aggList
    DIMENSIONS dimList
    FROM fromList
    WHERE whereConditions
    GROUP BY groupbyList
    HAVING haveConditions

Clauses that also appear in standard SQL have the same semantics. The question
is what happens with the newly introduced DIMENSIONS clause and the appearance of
hierarchical predicates in the WHERE clause. As far as the dimension variables of the
DIMENSIONS clause are concerned, "they should range over the set of nodes in the hierarchy
associated with the dimension" [JLS99], i.e., over all heterogeneous tuples of a hierarchy
instance. The result of an SQL(H) query is a table according to the schema imposed by
the sets domExpList and aggList of the SELECT clause.

The semantics of an SQL(H) query is given in the original paper [JLS99]. We present
the semantics more formally in a following section, where we discuss the semantics of
ESQL(H).

5.4 Limitations of the model

The model we just described offers the advantages listed below.

• Adds semantics of hierarchies to the data model and the query language:
  - gives first-class status to the hierarchies by:
    * permitting heterogeneity in dimensions, and
    * introducing hierarchical domains (trees) as first-class objects.

• Permits "dimension independence" of the queries. The DIMENSIONS clause allows
  the definition of dimension variables and, furthermore, allows these variables to range
  over the tuples of the dimension, without taking into account the schema of each
  table. Therefore, the evaluation of an SQL(H) query is the same no matter what
  the schema of the dimension tables it refers to is.

• Allows the fast evaluation of SQL(H) hierarchical queries, based on bitmap indices.

However, there exist limitations that are particularly relevant to reverse engineering
data. As we already mentioned, the concept lattice of Figure 4.6 and the hierarchies that
exist in it cannot be represented by the SQL(H) model. The basic restriction is that the
hierarchical domain must be a tree. Another point is that if some of the hierarchies in the
lattice change in time, those changes might be difficult to capture. The following list gives
the two basic limitations of the model as well as the intuition behind its extension.

• Hierarchical attributes should conform to a domain that has the structure of a tree.
  The example of the concept lattice of Figure 4.6 shows why such a domain becomes
  inappropriate for mining and reverse engineering applications.

• Each level inside a hierarchy must be modeled as a separate set of tables. This implies
  that changes in dimension values (e.g., changes in the number of levels) may lead to
  schema changes.
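The first limitation can be illustrated with a minimal sketch. It assumes a hierarchy stored as a map from each node to the set of its parents; the helper `is_tree` and both data sets are hypothetical, and the lattice fragment is not the actual lattice of Figure 4.6.

```python
def is_tree(parents):
    """Return True if exactly one node has no parent (the root) and every
    other node has exactly one parent. `parents` maps node -> set of parents.
    Cycle detection is omitted in this sketch."""
    roots = [node for node, ps in parents.items() if not ps]
    return len(roots) == 1 and all(len(ps) <= 1 for ps in parents.values())

# Hypothetical location hierarchy: a tree, as the SQL(H) model requires.
location = {
    "All": set(),
    "Canada": {"All"},
    "Ontario": {"Canada"},
    "Toronto": {"Ontario"},
}

# Hypothetical lattice fragment: the concept "X" has two parents,
# so the domain is a partial order but not a tree.
lattice = {
    "top": set(),
    "P": {"top"},
    "Q": {"top"},
    "X": {"P", "Q"},
}

print(is_tree(location))
print(is_tree(lattice))
```

Any node with more than one parent, which is the normal case in a concept lattice, makes the check fail; this is the situation the extended model is meant to accommodate.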

5.5 The ESQL(H) model

To overcome the limitations listed in the previous section we need to provide an extended
model that fits our needs. The key point for this model is to be more general than the
SQL(H) model. The notation used in the following sections is the same as in the paper on
SQL(H). Definitions that are also the same are mentioned to be so. Briefly, in our model
we permit:

• A more general structure for the hierarchical domain, and

• Levels to straddle tables, so that any arbitrary table may contain values from many
  levels.

Definition 4 [Hierarchical Domain]

A hierarchical domain is a partially ordered set $\langle V_H, \le \rangle$ where $V_H$ is a non-empty set of
attributes and $\le$ a binary relation which is reflexive, antisymmetric and transitive.

The following hold:

1. The only predicates defined on this domain are: =, <, <=, <<, <<= (<= is the same
   as $\le$ in the above definition).

2. The equality predicate = has the standard interpretation of syntactic identity.

3. The predicate < is interpreted as a binary relation over $V_H$ such that for every
   $x, y \in V_H$, $x < y \Leftrightarrow x \le y \wedge x \ne y$. The graph $G_<$ over the nodes of $V_H$ can be depicted as
   a Hasse diagram [TM75]. Such a diagram is an undirected graph where all edges are
   considered as arrows from bottom to top, i.e., smaller elements are placed lower.

4. The predicate << is interpreted as the transitive closure of <.

5. For any two elements $u, v \in V_H$, u <= v holds iff either u < v or u = v. Respectively
   for u <<= v.
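A minimal sketch of these predicates follows. It assumes that < is read as the edge (child-parent) relation of the Hasse diagram and that the diagram is stored as a map from each element to the elements directly above it. The diagram below and the function names `lt`, `ltlt`, `le` and `ltlt_eq` are hypothetical, chosen only for illustration.

```python
# Hypothetical Hasse diagram: element -> elements directly above it.
COVERS = {
    "C1": {"C5", "C8"},
    "C5": {"C9"},
    "C8": {"C9", "C10"},
    "C9": {"top"},
    "C10": {"top"},
    "top": set(),
}

def lt(x, y):
    """x < y: y is directly above x in the diagram."""
    return y in COVERS[x]

def ltlt(x, y):
    """x << y: transitive closure of <, found by searching upwards from x."""
    stack, seen = list(COVERS[x]), set()
    while stack:
        node = stack.pop()
        if node == y:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(COVERS[node])
    return False

def le(x, y):
    """x <= y: x < y or x = y."""
    return x == y or lt(x, y)

def ltlt_eq(x, y):
    """x <<= y: x << y or x = y."""
    return x == y or ltlt(x, y)

print(lt("C1", "C5"), ltlt("C1", "top"), ltlt("C5", "C10"))
```

Because an element may have several elements directly above it, the upward search keeps a `seen` set so that shared ancestors are visited only once.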

Figure 5.2: The hierarchical domain for concept ids (Cids) of Figure 4.6.

The partial order of concept ids for the example of Figure 4.6 is given in Figure 5.2.
Intuitively, $V_H$ is an abstract data type that corresponds to hierarchies where predicate
< refers to child-parent relationships and << to (proper) ancestor-descendant ones.
Whenever an attribute A conforms to a domain which is hierarchical, we call A a
hierarchical attribute and denote it by $A^h$.

Definition 5 [Hierarchical Schema]

A hierarchy schema is a triple $H = (G, A, \sigma)$ such that:

(i) G is a collection of nodes of any structure, having a special node All;

(ii) A is an attribute set that contains a unique hierarchical attribute $A^h$; and

(iii) $\sigma : G \rightarrow 2^A$ is a function that associates a node $u \in G$ with a set of attributes
$\sigma(u) \subseteq A$, such that $\forall u \ne All$, $A^h \in \sigma(u)$, and $\sigma(All) = \emptyset$.

All nodes of G, except All, should include the hierarchical attribute $A^h$ in their attribute list.

Imagine that we have a dimension called Concepts and a hierarchical attribute $A^h$ =
Cid that corresponds to Figure 5.2. If the attribute set of the hierarchy is {Cid, Objects,
Attributes}, then this attribute set can be associated with exactly one node of the
hierarchy. Hence, we shall have nodes: {Cid, Objects, Attributes} and { } for the node
All.
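The conditions of Definition 5 can be checked mechanically. The sketch below is one possible encoding and not part of the model itself: the class `HierarchySchema`, its field names and the node name `ConceptNode` are hypothetical, while the attribute names come from the Concepts example.

```python
from dataclasses import dataclass

ALL = "All"  # the special node of the hierarchy schema

@dataclass
class HierarchySchema:
    nodes: set        # G
    attributes: set   # A
    hier_attr: str    # the unique hierarchical attribute A^h
    sigma: dict       # node -> set of attributes associated with it

    def is_valid(self):
        """Check conditions (i)-(iii) of the hierarchy schema definition."""
        if ALL not in self.nodes or self.hier_attr not in self.attributes:
            return False
        if self.sigma.get(ALL, set()):          # sigma(All) must be empty
            return False
        for u in self.nodes - {ALL}:
            attrs = self.sigma.get(u, set())
            # every other node carries A^h and only attributes from A
            if self.hier_attr not in attrs or not attrs <= self.attributes:
                return False
        return True

concepts = HierarchySchema(
    nodes={ALL, "ConceptNode"},
    attributes={"Cid", "Objects", "Attributes"},
    hier_attr="Cid",
    sigma={ALL: set(), "ConceptNode": {"Cid", "Objects", "Attributes"}},
)
broken = HierarchySchema(
    nodes={ALL, "ConceptNode"},
    attributes={"Cid", "Objects", "Attributes"},
    hier_attr="Cid",
    sigma={ALL: set(), "ConceptNode": {"Objects"}},  # Cid is missing
)
print(concepts.is_valid(), broken.is_valid())
```

Note that nothing in the check constrains how the nodes of G relate to each other, which mirrors the "nodes of any structure" wording of the definition.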

Definition 6 [Hierarchy Instance]

A hierarchy instance corresponding to a hierarchy schema $H = (G, A, \sigma)$ is a collection of
tables $\mathcal{H}$ that satisfy the following: each table $r \in \mathcal{H}$ corresponds to a unique node $u \in G$
(except for node All), and r is a table over $\sigma(u)$.

Note that we do not restrict the nodes to form a DAG and we permit the straddling
of tables through the levels of the hierarchy.

Definition 7 [Dimension] [JLS99]

A dimension schema D(H) is a name D together with a hierarchy schema $H = (G, A, \sigma)$.
We refer to attributes A as the attribute set associated with dimension D.

A dimension instance $D(\mathcal{H})$ over a dimension schema D(H) is a dimension name D
with a hierarchy instance $\mathcal{H}$ of H.

Definition 8 [Data Warehouse Schema] [JLS99]

A Data Warehouse Schema in the ESQL(H) model is defined as a set of dimension schemas
$D_i(H_i)$, with associated hierarchical attributes $A_i^h$, $1 \le i \le k$, together with a set of fact
table schemas of the form $f(A_{i_1}^h, \ldots, A_{i_n}^h, B_1, \ldots, B_m)$, where $D_{i_1}, \ldots, D_{i_n}$ are a subset of
the dimensions $D_1, \ldots, D_k$, and $B_j$, $1 \le j \le m$, are additional attributes, including any
measure attributes.

A Data Warehouse, i.e., a fact table and a dimension for a concept analysis framework,
is depicted in Figure 5.3. Note that to store the Objects and Attributes columns of the
tables appearing in that figure, we are taking advantage of the object-oriented features of
SQL3 [Ram97], which permits set-valued attributes.

Figure 5.3: An example Data Warehouse, consisting of the dimension "Concepts" with
columns (Cid, Objects, Attributes) and the fact table "Basic-Concepts".

5.6 The ESQL(H) query language

Before giving some example queries over the data model we just described, we analyze
the semantics of the ESQL(H) query language. The language does not differ from the
SQL(H) query language as far as syntax is concerned. The difference is
that when we evaluate a query, we have to take into account the new, more general
hierarchical domain and the arbitrary number of tables that make up each dimension.

A uniform ESQL(H) query is defined in the same way as in the SQL(H) model [JLS99].
In general, a uniform ESQL(H) query Q is a function from the set of databases to a set of
output tables, under the schema of the attribute list that appears in the SELECT clause.
Taking into account all clauses of a uniform ESQL(H) query we have the following:

• SELECT clause: This clause enforces the schema of the output table. It is
  interpreted as in standard SQL with the addition that it may contain hierarchical and
  dimension attributes, i.e., attributes from the set A of a hierarchy instance.

• DIMENSIONS clause: This clause permits the declaration of dimension names of
  interest. All dimension variables range over the set of all (possibly non-homogeneous)
  tuples of all tables associated with that dimension.

• FROM clause: This clause is interpreted exactly as in standard SQL, i.e., it takes
  the cross product of all tables appearing in it, while all tuple variables declared in it
  range over all (homogeneous) tuples of the fact tables they refer to.

• WHERE clause: To pin down the semantics of this clause, we should recall the
  definition of an instantiation function [JLS99]. Considering all tuple and dimension
  variables, the instantiation function maps them to appropriate tables of the data
  warehouse. Now, the key issue is to properly evaluate each wherecond of the WHERE
  clause of an ESQL(H) query, according to the type of relationship between the
  operators and the operands. Thus:

  - if the wherecond involves attributes from the fact table and operands of the same
    type, the WHERE clause is evaluated exactly as in SQL. This means that all
    tuples satisfying the wherecond will appear in the result.

  - if the wherecond involves attributes from dimension tables which are compared
    to operands of the proper type based on a standard comparison operator, the
    query is again satisfied by all tuples in the dimension tables which appear in
    relationship with the operand.

  - if the wherecond involves hierarchical attributes which are compared to operands
    of the proper type based on a hierarchical predicate, the query is satisfied by all
    tuples that are related to operands according to the hierarchical relationship, i.e.,
    the hierarchical domain of the hierarchical attribute.

  In all the above, all comparisons are performed through the mapping of the
  instantiation function to the appropriate tuples. Taking the concatenation of all the results
  (instantiations) from a query and restricting those tuples to the attributes that appear
  in the SELECT clause, we have the final answer to the ESQL(H) query. If there is no
  measure defined in the fact table of the data warehouse (as in our example), standard
  relational union should be employed instead of concatenation.

• GROUP BY clause: It is interpreted exactly as in standard SQL.

• HAVING clause: It is interpreted exactly as in standard SQL.
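The DIMENSIONS and WHERE semantics above can be sketched in a few lines. This is an illustration under stated assumptions rather than the evaluation procedure of [JLS99]: the dimension is modeled as a dictionary of tables with different schemas, the table names `leaf_concepts` and `inner_concepts`, the data, and the helper names are all hypothetical.

```python
# Hypothetical dimension: two tables with different schemas that share
# the hierarchical attribute "Cid".
concepts_dimension = {
    "leaf_concepts": [{"Cid": "C1", "Objects": {1}}],
    "inner_concepts": [
        {"Cid": "C9", "Objects": {1, 2, 3}, "Attributes": {"a"}},
        {"Cid": "top", "Objects": {1, 2, 3, 4}, "Attributes": set()},
    ],
}
fact_table = [{"Cid": "C1"}]
PARENTS = {"C1": {"C9"}, "C9": {"top"}, "top": set()}  # hierarchical domain

def dimension_tuples(dimension):
    """A dimension variable ranges over all tuples of all its tables."""
    for rows in dimension.values():
        yield from rows

def desc_or_self(x, y):
    """Hierarchical predicate x <<= y over the domain PARENTS."""
    return x == y or any(desc_or_self(p, y) for p in PARENTS[x])

def evaluate(fact, dimension):
    """SELECT C.Cid  DIMENSIONS <dim> C  FROM <fact> F  WHERE F.Cid <<= C.Cid"""
    result = set()  # relational union, since the fact table has no measure
    for f in fact:                              # tuple variable F
        for c in dimension_tuples(dimension):   # dimension variable C
            if desc_or_self(f["Cid"], c["Cid"]):
                result.add(c["Cid"])
    return result

print(sorted(evaluate(fact_table, concepts_dimension)))
```

The query logic never inspects which table a dimension tuple came from, only its hierarchical attribute, which is what allows levels to straddle tables.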

5.7 Sample queries

In order to show the simplicity and power of the ESQL(H) language, we give some example
queries and explain their semantics and their step by step computation.

5.7.1 Dimensional Selection

The following single block ESQL(H) query, Q1, captures the query "find concepts that
contain more than 3 attributes".

    SELECT C.Cid
    DIMENSIONS Concepts C
    WHERE COUNT(C.Attributes) > 3

Here the approach is the same as in SQL(H). C will range over all tuples of the Concepts
table (here the tuples are homogeneous) and select those Cids that satisfy the condition of
the WHERE clause. The resulting table is the following:
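As a rough sketch of how Q1 is evaluated, the snippet below assumes the Concepts table is held as a list of rows whose set-valued columns are Python sets. The three rows are hypothetical sample data, not the contents of Figure 5.3.

```python
# Hypothetical rows of the Concepts dimension table.
concepts = [
    {"Cid": "C1", "Objects": {1}, "Attributes": {"a", "b", "c", "d"}},
    {"Cid": "C9", "Objects": {1, 2, 3}, "Attributes": {"a"}},
    {"Cid": "top", "Objects": {1, 2, 3, 4, 5}, "Attributes": set()},
]

def q1(rows, threshold=3):
    """SELECT C.Cid DIMENSIONS Concepts C WHERE COUNT(C.Attributes) > 3"""
    return [c["Cid"] for c in rows if len(c["Attributes"]) > threshold]

print(q1(concepts))
```

COUNT over a set-valued attribute reduces to the cardinality of the set stored in that column.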

5.7.2 Hierarchical Join/Aggregation

The following single block ESQL(H) query, Q2, captures the query "Find the objects of
each concept that contains over 2 objects".

    SELECT C.Cid, C.Objects
    DIMENSIONS Concepts C
    FROM Basic-Concepts F
    WHERE F.Cid <<= C.Cid
    GROUP BY C.Cid
    HAVING COUNT(C.Objects) > 2

In this case the WHERE clause contains the condition "F.Cid <<= C.Cid", which is of the
form "$W.A^h$ $\theta_h$ opnd". Here, we do not have the hierarchy of the levels in the concept
lattice given by the tables of the dimension Concepts. Thus, we should use the attributes
of the hierarchical domain to get the $\theta_h$-relation. Let's see how the query will be evaluated.

1. i(C)[Cid] ranges over all Cid attributes of the fact table Basic-Concepts, i.e.,
   attributes {C1, C2, C3, C4, C7}. For each of these attributes we compute the
   (reflexive) transitive closure relation tc, and get:

   Now i(opnd) ranges over all hierarchical attributes of table Concepts. Taking also
   into account the HAVING clause condition, we have:

   • Using tc(C1), the instantiation of $\theta_h$-relatives of C1 is:

   • Using tc(C2), the instantiation of $\theta_h$-relatives of C2 is:

   • Using tc(C3), the instantiation of $\theta_h$-relatives of C3 is:

   • Using tc(C4), the instantiation of $\theta_h$-relatives of C4 is:

   • Using tc(C7), the instantiation of $\theta_h$-relatives of C7 is:

     | Cid | Objects     |
     | top | {1,2,3,4,5} |

2. The final result for Q2 is the union of all the above tables.
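The two evaluation steps can be sketched as follows, reading Q2 as "concepts with more than 2 objects". The lattice fragment, the object sets and the fact table contents are hypothetical stand-ins for Figure 4.6 and Figure 5.3, and `tc` and `q2` are illustrative names.

```python
# Hypothetical lattice fragment: concept -> concepts directly above it.
PARENTS = {
    "C1": {"C8"},
    "C2": {"C5", "C8"},
    "C5": {"C9"},
    "C8": {"C9"},
    "C9": {"top"},
    "top": set(),
}
OBJECTS = {
    "C1": {1}, "C2": {2}, "C5": {2, 3}, "C8": {1, 2},
    "C9": {1, 2, 3}, "top": {1, 2, 3, 4, 5},
}

def tc(cid):
    """Reflexive transitive closure: cid together with everything above it."""
    closure, stack = {cid}, [cid]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in closure:
                closure.add(parent)
                stack.append(parent)
    return closure

def q2(fact_cids, min_objects=2):
    """Union, over all fact tuples, of the concepts C with F.Cid <<= C.Cid
    that satisfy the HAVING condition on the number of objects."""
    result = {}
    for cid in fact_cids:                      # step 1: one closure per fact Cid
        for concept in tc(cid):
            if len(OBJECTS[concept]) > min_objects:
                result[concept] = OBJECTS[concept]   # step 2: union of tables
    return result

print(q2(["C1", "C2"]))
```

Storing the result in a dictionary keyed by Cid plays the role of the relational union: a concept reached from several fact tuples appears only once.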

5.7.3 Hierarchical Join

The following ESQL(H) query, Q3, captures the query "Find the immediate breakdown of
concepts with more than 2 objects".

    SELECT C1.Cid AS Concept1, C2.Cid AS Concept2, C2.Objects AS C2Objects
    DIMENSIONS Concepts C1, C2
    FROM Basic-Concepts F
    WHERE F.Cid <<= C1.Cid AND
          C2.Cid < C1.Cid AND
          C1.Cid IN ( SELECT C.Cid
                      DIMENSIONS Concepts C
                      FROM Basic-Concepts F
                      WHERE F.Cid <<= C.Cid
                      GROUP BY C.Cid
                      HAVING COUNT(C.Objects) > 2 )
    GROUP BY C1.Cid, C2.Cid

Taking into consideration the result of query Q2, it is easy to infer what the result of Q3
will be: for all tuples in the result of Q2, give the Cid of its immediate child. The result is
given in the following table:

| Concept1 | Concept2 | C2Objects |
| C9       | C8       | {1,2}     |
| C9       | C5       | {2,3}     |
| C10      | C8       | {1,2}     |
| C10      |          |           |
| C12      | C6       |           |
| C12      | C7       |           |
| top      | C9       | {1,2,3}   |
| top      | C10      | {1,2,4}   |
| top      | C11      | {4,5}     |
| top      | C12      | {2,3,5}   |
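The immediate-child step of Q3 uses the < predicate rather than the closure. A minimal sketch, assuming a hypothetical lattice fragment (not the full lattice of Figure 4.6) and taking the set of qualifying concepts as input instead of re-evaluating the subquery:

```python
# Hypothetical lattice fragment: concept -> concepts directly above it.
PARENTS = {
    "C5": {"C9"},
    "C8": {"C9"},
    "C9": {"top"},
    "top": set(),
}
OBJECTS = {"C5": {2, 3}, "C8": {1, 2}, "C9": {1, 2, 3}, "top": {1, 2, 3, 4, 5}}

def q3(qualifying_cids):
    """For each concept C1 returned by the subquery, list every C2 with
    C2.Cid < C1.Cid (immediate child) together with the objects of C2."""
    rows = []
    for c1 in sorted(qualifying_cids):
        for c2, parents in PARENTS.items():
            if c1 in parents:
                rows.append((c1, c2, OBJECTS[c2]))
    return rows

for row in q3({"C9", "top"}):
    print(row)
```

Only direct edges of the diagram are followed here, so a grandchild such as C5 is reported under C9 but not under top.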

Chapter 6

Conclusions

In this work, we studied ways of putting Reverse Engineering and Data Warehousing
techniques together. Software reverse engineering techniques try to capture the structure of,
usually, undocumented systems so that their understanding and maintenance become
easier. On the other hand, Data Warehousing, and specifically On-Line Analytical Processing
systems, provide the appropriate means to pose complex, ad hoc queries on information
extracted by reverse engineering tools.

We first investigated how several graph-theoretical algorithms can be used in order to
analyze and partition graph structures that are extracted from reverse engineering tools,
such as Rigi and The Software Bookshelf. Most of these algorithms proved to be
inefficient to implement due to time and space constraints and the nature of the graphs that
appear in the results. Most important is the fact that those algorithms do not reveal any
hierarchical structure of the underlying system. In the following chapters, we described how
On-Line Analytical Processing systems handle situations where hierarchies exist. A large
number of researchers have been involved in the study of such systems so as to make their
modeling and querying easier for the naive user. These systems are basically employed by
decision makers who search for trends and future estimates about their company's critical
parameters. To the best of our knowledge OLAP systems have never been comprehensively
studied and employed in the field of reverse engineering.

This thesis presented a new multidimensional model for hierarchical clustering and
concept analysis algorithms. Both types of algorithms are often used by software engineers
and their results yield interesting observations about the systems under consideration.
However, they have never been able to store these results in a natural and easy to use manner.
Our model is the basis for optimal storage and a natural way of querying this data.
Furthermore, we extended the work by Jagadish, Lakshmanan and Srivastava [JLS99], in order to
give a more general multidimensional model which provides first-class status to dimensions.
The basic intuition is that the algorithms mentioned above may give different results
under different parameters, or given different versions of the same program. The extensions
comprise:

• A more general structure for the hierarchical domain of a certain type of attributes,
  called hierarchical attributes; and

• A refined definition of the notion of levels in this model, so that tuples may appear in
  any table of a hierarchy.

Therefore, the hierarchy of levels can be extracted from the hierarchical domain of the
hierarchical attributes and, if new levels appear in the conceptual level, the hierarchy schema
does not need to be changed.

The work presented in this thesis can be extended in several ways. We focus on the
evaluation of complex OLAP queries posed over the ESQL(H) model. In [JLS99], a new
algorithm based on bitmap indices is given in order to compute queries that include the
<<= and = hierarchical predicates. This algorithm does not need to be further extended
for ESQL(H) queries because it is based on a preorder traversal of the hierarchical domain.
In ESQL(H) the hierarchical domain is a partial order where such a traversal can be defined.
However, we need to consider algorithms for evaluating queries including the < hierarchical
predicate. Bitmap indices could help and, moreover, they can provide the appropriate
background for the faster evaluation of queries that entail COUNT and SUM aggregate
functions in their SELECT or HAVING clauses.
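To give a feel for how bitmaps can answer the <<= predicate over a partial order, here is a minimal sketch. It is not the algorithm of [JLS99]: it simply assumes one bit position per element and stores, for each element, a Python integer whose set bits mark the element itself and everything above it. The lattice fragment and all names are hypothetical.

```python
def build_ancestor_bitmaps(parents):
    """Return (position, bitmaps): a bit position for every node, and for each
    node an int whose set bits mark the node and all nodes above it."""
    position = {node: i for i, node in enumerate(parents)}
    bitmaps = {}

    def bitmap(node):
        if node not in bitmaps:
            bits = 1 << position[node]
            for parent in parents[node]:
                bits |= bitmap(parent)      # inherit the parent's ancestors
            bitmaps[node] = bits
        return bitmaps[node]

    for node in parents:
        bitmap(node)
    return position, bitmaps

def desc_or_self(x, y, position, bitmaps):
    """x <<= y holds iff the bit of y is set in the bitmap of x."""
    return bool((bitmaps[x] >> position[y]) & 1)

# Hypothetical lattice fragment: concept -> concepts directly above it.
PARENTS = {
    "C1": {"C8"},
    "C8": {"C9", "C10"},
    "C9": {"top"},
    "C10": {"top"},
    "top": set(),
}
position, bitmaps = build_ancestor_bitmaps(PARENTS)
print(desc_or_self("C1", "top", position, bitmaps))
print(desc_or_self("C9", "C10", position, bitmaps))
```

With this encoding the number of set bits in a bitmap gives the size of the reflexive closure directly, which hints at how COUNT-style aggregates might benefit from the same index.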

Bibliography

[AGS97] Rakesh Agrawal, A. Gupta, and Sunita Sarawagi. Modeling Multidimensional
Databases. In Alex Gray and Per-Åke Larson, editors, Proc. of the 13th Int'l
Conf. on Data Engineering (ICDE), pages 232-243. IEEE Press, 7-11 April
1997.

[AP98] Periklis Andritsos and Athanassia Papagianni. On the development of a tool
that supports OLAP queries. Diploma thesis, Dept. of Electrical and Computer
Engineering, National Technical University of Athens, 1998.

[BHB99] Ivan T. Bowman, Richard C. Holt, and Neil V. Brewster. Linux as a Case
Study: Its Extracted Software Architecture. In Proc. of the 21st Int'l Conf.
on Software Engineering, pages 555-563, Los Angeles, CA, USA, May 1999.
ACM Press.

[Bir40] Garrett Birkhoff. Lattice Theory. AMS Colloquium Public., 23. AMS, New
York, 1940.

[CD97] S. Chaudhuri and U. Dayal. An overview of Data Warehousing and OLAP
technology. SIGMOD Record, 26(1):65-74, March 1997.

[CFKW95] Yih-Farn Chen, Glenn S. Fowler, Eleftherios Koutsofios, and Ryan S. Wallach.
Ciao: A Graphical Navigator for Software and Document Repositories. In IEEE
Proc. of the Int'l Conf. on Software Maintenance, pages 66-75, Nice, France,
October 1995.

[CLR92] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to algorithms.
MIT Press and McGraw-Hill Book Company, 6th edition, 1992.

[COU] OLAP Council. OLAP Council's White Paper. In http://www.olapcouncil.org/.

[CT97] Luca Cabibbo and Riccardo Torlone. Querying Multidimensional Databases. In
Proc. of the 6th Int'l Workshop on Database Programming Languages (DBPL),
pages 319-335, Estes Park, Colorado, 18-20 August 1997.

[CW78] D. G. Corneil and M. E. Woodward. A comparison and evaluation of graph
theoretical clustering techniques. INFOR, 16(1):74-89, February 1978.

[DMM99] D. Doval, S. Mancoridis, and B. S. Mitchell. Automatic clustering of software
systems using a genetic algorithm. In Proc. of the Int'l Conf. on Software Tools
and Engineering Practice, Pittsburgh, PA, August 1999.

[FHK+97] P. J. Finnigan, R. C. Holt, I. Kalas, S. Kerr, K. Kontogiannis, H. A. Müller,
J. Mylopoulos, S. G. Perelgut, M. Stanley, and K. Wong. The Software
Bookshelf. IBM Systems Journal, 36(4):564-593, 1997.

[Fir98] Joseph M. Firestone. Dimensional Modeling and E-R Modeling In The Data
Warehouse. Executive Information Systems Inc., White Paper 8, June 1998.

[GBLP96] Jim Gray, Adam Bosworth, Andrew Layman, and Hamid Pirahesh. Data Cube:
A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and
Sub-Totals. Technical Report MSR-TR-95-22, Microsoft Research, Advanced
Technology Division, Redmond, WA 98052, U.S.A., 15 November 1995.

[GMA95] Robert Godin, Rokia Missaoui, and Hassan Alaoui. Incremental concept
formation algorithms based on Galois (concept) lattices. Computational Intelligence,
11(2):246-267, November 1995.

[IH94] W. H. Inmon and R. D. Hackathorn. Using the Data Warehouse. John Wiley
& Sons, Inc., 1994.

[JD88] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall,
Englewood Cliffs, NJ, 1988.

[JLS99] H. V. Jagadish, Laks V. S. Lakshmanan, and Divesh Srivastava. What can
Hierarchies do for Data Warehouses? In Proc. of the 25th Int'l Conf. on Very Large
Data Bases (VLDB), pages 530-541, Edinburgh, Scotland, UK, 7-10
September 1999.

[KC97] Rick Kazman and S. Jeromy Carrière. Playing Detective: Reconstructing
Software Architecture from Available Evidence. Technical Report CMU/SEI-97-
TR-010, Software Engineering Institute, Carnegie Mellon University, Pittsburgh,
PA 15213, October 1997.

[KL70] B. W. Kernighan and S. Lin. An efficient heuristic for partitioning graphs. Bell
Systems Technical J., 49:291-307, 1970.

[KSRP99] Rudolf K. Keller, Reinhard Schauer, Sebastien Robitaille, and Patrick Page.
Pattern-Based Reverse-Engineering of Design Components. In Proc. of the
21st Int'l Conf. on Software Engineering, pages 226-235, Los Angeles, CA,
USA, May 1999. ACM Press.

[LS97] C. Lindig and G. Snelting. Assessing modular structure of legacy code based
on mathematical concept analysis. In Proc. of the Int'l Conf. on Software
Engineering, Boston, MA, 17-23 May 1997. IEEE Computer Society Press.

[MG99] Renée J. Miller and Ashish Gujarathi. Mining for Program Structure. Int'l
Journal on Software Engineering and Knowledge Discovery, ?(?):???-???, 1999.

[MM65] J. W. Moon and L. Moser. On cliques in graphs. Israel Journal of Mathematics,
3:23-28, 1965.

[MMCG99] S. Mancoridis, B. S. Mitchell, Y. Chen, and E. R. Gansner. Bunch: A clustering
tool for the recovery and maintenance of software system structures. In Proc.
of the Int'l Conf. on Software Maintenance, pages 50-59, Oxford, UK, August
1999. IEEE Computer Society Press.

[MMR+98] S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen, and E. R. Gansner. Using
automatic clustering to produce high-level system organizations of source code.
In Proc. of the Int'l Workshop on Program Understanding, Ischia, Italy, June
1998.

[MOTU93] Hausi A. Müller, Mehmet A. Orgun, Scott R. Tilley, and James S. Uhl. A
reverse engineering approach to subsystem structure identification. Software
Maintenance: Research and Practice, 5(4):181-204, December 1993.

[MU90] Hausi A. Müller and James S. Uhl. Composing Subsystem Structures using
(k,2)-partite Graphs. Technical Report DCS-128-IR, Department of Computer
Science, University of Victoria, March 1990.

[MWT94] Hausi A. Müller, Kenny Wong, and Scott R. Tilley. Understanding Software
Systems Using Reverse Engineering Technology. In Proc. of the 62nd Congress
of L'Association Canadienne Française pour l'Avancement des Sciences
(ACFAS), pages 41-45, Montreal, PQ, 16-17 May 1994.

[PJ99] Torben Bach Pedersen and Christian S. Jensen. Multidimensional Data
Modeling for Complex Data. In Proc. of the 15th Int'l Conference on Data
Engineering (ICDE), pages 336-345, 23-26 March 1999.

[Ram97] Raghu Ramakrishnan. Database Management Systems. McGraw-Hill, 1997.

[Ski99] Steven S. Skiena. The Algorithm Design Manual. Springer-Verlag, Berlin,
Germany / Heidelberg, Germany / London, UK / etc., 1999.

[SR97] Michael Siff and Thomas Reps. Identifying Modules via Concept Analysis. In
Proc. of the Int'l Conf. on Software Maintenance, pages 170-179, Bari, Italy,
September 1997.

[ST98] Gregor Snelting and Frank Tip. Reengineering class hierarchies using concept
analysis. ACM SIGSOFT Software Engineering Notes, 23(6):99-110, November
1998. Proc. of the Int'l Symposium on the Foundations of Software Engineering.

[Til92] Scott R. Tilley. Management Decision Support Through Reverse Engineering
Technology. In Proc. of CASCON '92, pages 319-328, 9-11 November 1992.

[Til98] Scott Tilley. A Reverse-Engineering Environment Framework. Technical
Report CMU/SEI-98-TR-005, Software Engineering Institute, Carnegie Mellon
University, Pittsburgh, PA 15213, April 1998.

[TM75] J. P. Tremblay and R. Manohar. Discrete Mathematical Structures with
Applications to Computer Science. McGraw-Hill, New York, 1975.

[Vas98] Panos Vassiliadis. Modeling Multidimensional Databases, Cubes and Cube
Operations. In Proc. of the 10th SSDBM Conf., Capri, Italy, July 1998.

[vDK99] Arie van Deursen and Tobias Kuipers. Identifying Objects using Cluster and
Concept Analysis. In Proc. of the Int'l Conference on Software Engineering,
pages 246-255, Los Angeles, CA, 16-22 May 1999.

[Wes96] Douglas B. West. Introduction to Graph Theory. Prentice-Hall, 1996.

[Wig97] T. A. Wiggerts. Using Clustering Algorithms in Legacy Systems
Remodularization. In Proc. of the 4th Working Conference on Reverse Engineering, pages
24-32, Amsterdam, Netherlands, 6-8 October 1997.

[WTMS94] Kenny Wong, Scott R. Tilley, Hausi A. Müller, and Margaret-Anne D. Storey.
Structural Redocumentation: A Case Study. IEEE Software, 12(1):46-54,
January 1994.

[YHC97] Alexander S. Yeh, David R. Harris, and Melissa P. Chase. Manipulating
Recovered Software Architecture Views. In Proc. of the 19th Int'l Conf. on Software
Engineering, pages 184-194, Boston, Massachusetts, USA, May 1997. Springer.
