Recasting Program Reverse Engineering through On-Line Analytical Processing
Periklis Andritsos
A thesis submitted in conformity with the requirements for the degree of Master of Science
Graduate Department of Computer Science, University of Toronto
© Copyright by Periklis Andritsos 2000
National Library of Canada, Acquisitions and Bibliographic Services
395 Wellington Street, Ottawa ON K1A 0N4, Canada
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Recasting Program Reverse Engineering through
On-Line Analytical Processing
Periklis Andritsos
Master of Science
Department of Computer Science
University of Toronto 2000
Abstract
Program reverse engineering is the task that helps software engineers understand the
architecture of large software systems. We study how the data modeling techniques known as
On-Line Analytical Processing (OLAP) can be used to enhance the sophistication and range
of reverse engineering tools. This is the first comprehensive examination of the similarities
and differences in these tasks, both in how OLAP techniques meet (or fail to meet) the
needs of reverse engineering and in how reverse engineering can be recast as data analysis.
We identify limitations in the data modeling tools of OLAP that are required in the
area of reverse engineering. Specifically, multidimensional models assume that while facts
may change dynamically, the structure of dimensions is relatively static (both in their
dimension values and their relative orderings). We show why this is required in current
OLAP solutions and provide new solutions that effectively manage dynamic dimensions.
Acknowledgements
First and foremost, I would like to thank my supervisor Renée J. Miller for her support and encouragement in the accomplishment of this thesis. Her insightful comments inspired my work and her smile made me feel comfortable when in trouble. I feel very fortunate to have Renée as my supervisor.
John Mylopoulos provided me with valuable comments and remarks on my work. He helped in the improvement of this document and deserves a big thank you. I would also like to thank Derek G. Corneil and Rick Holt for their advice and ideas. Bil Tzerpos was the first who listened to my research problem and instigated some of the ideas presented in this write-up. I am very thankful to him and I will always admire the "gentle" way he faces life.
I am deeply indebted to Timos Sellis who conducted my undergraduate studies and helped me when applying to North American Universities. I also thank Panos Vassiliadis for introducing me to the theory of OLAP systems and his e-mails with the "big brother's" voice from Greece.
I feel very lucky that I have the most smiling, encouraging and pleasant-to-live-with roommate. Themis is the one who guided me during my first steps in Toronto, taught me all the intricacies of living away from family, stood by me in all my rough times and had the patience to put up with my peculiarities and my music.
There are no easy words to thank the Faloutsos brothers, Michalis and Petros, who had the kindness to leave me in charge of their office. Michalis always tried to cheer me up with his perfect sense of humor, while Petros is an excellent office-mate, listener and tennis partner. I greatly thank Rosalia for the joyful smile I receive when I get in the office every morning and all those long lasting conversations that sculpted my mind. My UNIX knowledge would not have been improved without Yiannis Velegrakis and Tasos, whose last name I do not dare to write down. They both made our first moments in Toronto very pleasant and I will never forget all those jokes we were making at the beginning of this "journey". Theodoulos keeps proving to me that distance and time do not matter. He never refused to help when in need and listen to my problems and concerns when in despair. Panayiotis shared my music taste and some of my best times in Toronto when DJing with him. Apart from enhancing my knowledge in music, Nick Koudas gave me priceless suggestions for this work.
Melanie and Florine are admirable for their ability to put up with all the Greekies. Melanie keeps the artistic spirit of the company high and Florine brings an additional laughter when she visits the office. I am grateful to Lucia, Daniel and Natan for letting me deploy my DJing capabilities and introducing me to the quality of Brazilian music with their parties. Attila was always enjoyable distributing his jokes and stimulating "interesting" conversations. I feel the need to thank Nick Zachariadis for the Californian air he brought every now and then, Vaso for always showing me the optimistic side of it and Stergios for his clean-cut solutions to every problem.
I also feel lucky meeting Angeliki. She is a good listener and the person I like to tease on our favorite issue: "How to spend a Saturday night in Toronto". Angeliki, thank you for enduring my jokes. I am very happy that Indira introduced me to the Latin American food, and she deserves special thanks for this and for being a good friend. Many thanks go to the new Greeks that came to the department: Anastasia and George for their creativity and good taste of movies; George, Fanis and Andreas for showing us how to break the record of the number of "pets" one can have at home; Anna for her penetrating smile and Kleoni for demonstrating how small this world is.
Our life in the department would not have been as comfortable and easy without Kathy Yen, our one and only graduate secretary. Kathy, thank you for your help and friendship.
Greece is an integral part of my life, and many friends back home contributed to making it happier. I am gratefully thankful to my "second brother" Yiannis Vrachoritis. He proves that an ocean between two friends is a tiny distance and keeps reminding me the sunny side of life and how to nremvaziv~ towards that side. I also thank George Gkioulos, Gerasimos Sismanis, Thanassis Vergis, Eva Athanassiou and Sofia Vassalou for all those magnificent school years we had together and I always recall in my mind.
I am also thankful to Vanessa Evaggelatou for being my best female buddy during the university years and listening to my concerns and complaints all the time. Thanos Vitas, Lefteris Stamatogiannakis, Panayiotis Sklavos and Costas Kotsokalis also turned these five years into the most fruitful years of my life and I am very lucky that I had them by my side.
Words do not come easy in my mouth when speaking for Nassia. From the early steps at the university, we were together facing every little moment of life. She taught me how to "always smile" and something more important: how it feels when someone loves you. For six years, she stood by me, believed in me, worked with me, cried with me and experienced all those "simple" pleasures of life with me. Nassiouli, you owned the best spot in my heart and thank you for the happiest six years of my life.
My uttermost thanks belong to my parents Loukas and Vassiliki, for they keep loving me, supporting me and teaching me valuable lessons I could not get from any teacher and school so far. I also thank Thanassis, my brother, because he is the one who tries to remind what every next step in life brings, and my sister Eleni, because she always had the patience to listen to my problems and brought me up as her own child. Your "touloumpaki" and "mpouloukmpazits" thanks you for all you have done for him. To the youngest of the family, Tina, Maria, Loukas and Yannakis, I thank them for coming among us giving us their smiles and their young perspective of life, which benefited me even if they did not realize. I also feel very lucky to have Vagelis and Rita as my brother- and sister-in-law, respectively. They made my teenage years look more beautiful.
Finally, I would like to express my deepest feelings to a friend that will never abandon me and has been the most valuable company and precept in my life: my music!
Contents
1 Introduction
1.1 Basic Background
1.2 Motivation
1.3 Contributions of the thesis
2 Handling groups using OLAP and its use in Reverse Engineering
2.1 What is OLAP
2.2 What is Reverse Engineering
2.3 Can OLAP be used for Reverse Engineering?
3 Identifying Partitions and the use of Hierarchies
3.1 Clustering Algorithms
3.1.1 Hierarchical Clustering
3.2 Discussion on the usage of graph theory and clustering algorithms in reverse engineering
3.3 Concept Analysis
4 The Multidimensional Model
4.1 A Multidimensional Model for managing hierarchical clusters
4.2 A Multidimensional Model for managing concepts
5 The Extended SQL(H) Model
5.1 The SQL(H) model
5.2 The Query language for the SQL(H) model
5.3 Semantics of the SQL(H) query language
5.4 Limitations of the model
5.5 The ESQL(H) model
5.6 The ESQL(H) query language
5.7 Sample queries
5.7.1 Dimensional Selection
5.7.2 Hierarchical Join/Aggregation
5.7.3 Hierarchical Join
6 Conclusions
List of Figures
2.1 An example data warehouse
2.2 Multidimensional OLAP (MOLAP) Architecture
2.3 Relational OLAP (ROLAP) Architecture
2.4 Fact table with a graph structure
3.1 An example of agglomerative clustering
3.2 A source code, its variable usage and its concept lattice [LS97]
4.1 The schema for the relations used in Software Bookshelf
4.2 A clustering schema
4.3 A clustering instance
4.4 The DW schema
4.5 Example of a relation matrix
4.6 The concept lattice of the matrix in Table 1
4.7 The extents table for the lattice in Figure 4.6
4.8 The intents table for the lattice in Figure 4.6
4.9 The hierarchy table for the lattice in Figure 4.6
4.10 The extended DW schema
5.1 A Data Warehouse conforming to the SQL(H) model
5.2 The hierarchical domain for concept ids (Cids) of Figure 4.6
5.3 An example Data Warehouse
To wish impossible things.
Robert Smith
to my lovely parents,
for without them this dream would have never become true.
Chapter 1
Introduction
1.1 Basic Background
In recent years, an increasing number of organizations have realized the competitive
advantage that can be gained from efficient access to accurate information. Information
is a key component in the decision-making process. The more information we have, the
better are the chances for successful decisions. A Data Warehouse consolidates information
from different data sources and enables the application of sophisticated query tools
for faster and better analysis. Data warehouses came into existence to meet the changing
demands of the enterprises, as On-Line Transaction Processing (OLTP) systems could not
cover the analytical needs of the enterprises' competitive environment. OLTP systems are
mainly supported by the operational databases, while they automate daily tasks, such as
the banking transactions [CD97]. Data Warehouses, on the other hand, support analytical
capabilities by providing an infrastructure for integrated, company-wide, historical data
from which the analysis process can be achieved [IH94]. A data warehouse integrates an
enterprise, which is comprised of many, older and even incompatible application systems.
One of the basic keywords in data warehouse technology is Dimensional Modeling [Fir98].
In Dimensional Modeling, a set of tables and relations forms a model, whose purpose is to
optimize decision support query performance, relative to a measure or set of measures of
the outcome of the business process that is modeled. Using the Dimensional Modeling
approach, developers first decide on the business processes that are going to be modeled and
then on what each low level record in the fact table will represent. For example, one would
have all the transactions made in a bank, on a specific date and place by all customers,
stored in a table and analyze them according to the "time", "geography" and "personal
information" dimensions.
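The bank example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table contents, the column layout and the helper name `total_amount_by` are hypothetical and do not come from any system discussed in this thesis.

```python
from collections import defaultdict

# Hypothetical dimension table: customer key -> "personal information" attributes.
CUSTOMERS = {
    "c1": {"age_group": "18-30"},
    "c2": {"age_group": "31-50"},
}

# Hypothetical fact table: one low-level record per bank transaction.
# Columns: date (time), branch city (geography), customer key, amount (measure).
TRANSACTIONS = [
    ("2000-01-03", "Toronto", "c1", 120.0),
    ("2000-01-03", "Ottawa", "c2", 80.0),
    ("2000-02-10", "Toronto", "c2", 50.0),
    ("2000-02-11", "Toronto", "c1", 30.0),
]


def total_amount_by(facts, key_of):
    """Sum the measure over all facts that share the same dimension value."""
    totals = defaultdict(float)
    for fact in facts:
        totals[key_of(fact)] += fact[3]
    return dict(totals)


# The same facts analyzed along three different dimensions.
by_month = total_amount_by(TRANSACTIONS, lambda f: f[0][:7])
by_city = total_amount_by(TRANSACTIONS, lambda f: f[1])
by_age_group = total_amount_by(
    TRANSACTIONS, lambda f: CUSTOMERS[f[2]]["age_group"]
)
```

The fact rows never change between the three calls; only the dimension used for grouping does, which is the point of deciding first what a low-level fact record represents.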
The concept of data warehousing has been widely used in business-oriented applications.
It enables analysts, managers and executives to gain clear insight into data of their
enterprise, detect anomalies and make strategic decisions for a better position of their
company among others in the market. The main characteristics of this technology are [Vas98]:
• the multidimensional view of data; and
• the data analysis which can be performed through interactive and navigational querying
of data.
In a multidimensional system, there exists a set of numeric measures (the objects of
analysis), which are viewed as points in a multidimensional space consisting of several dimensions.
Each dimension can be described by a set of attributes that are related to each other
according to a hierarchy. Example dimensions can be "product", "geography", "time", etc., while
example measures can be the "dollar amount" of products or the "revenue of employees".
In this work, our main contribution will be to bring data warehouses and reverse
engineering together. In particular, we will examine how software engineers can benefit from
a multidimensional view of large software and how data analysts can benefit from access to
hidden structures in their data, obtained by Reverse Engineering approaches. The hidden
structures mostly involve graphs, which we believe can be explored and browsed using data
warehousing techniques.
Reverse Engineering involves two phases [Til92]:
1. the identification of the components of a system and their interdependencies; and
2. the extraction of system abstractions and design information.
Intuitively, reverse engineering helps developers understand the architecture of large
legacy software systems. Several tools have been designed and built towards this goal,
mainly because the majority of legacy systems are undocumented. Even if documentation
exists, these tools help people compare the as-implemented with the as-documented
or the as-designed structure of the underlying system. Ciao [CFKW95], Dali [KC97],
ManSART [YHC97], PBS [FHK+97], Rigi [MOTU93] and SPOOL [KSRP99] are tools
that have evolved as products of the reverse engineering research.
The above tools basically operate at two levels of abstraction [Til92]:
• the code-level of abstraction; and
• the architectural-level of abstraction.
At the code-level, the focus is on implementation details, such as instantiation of variables,
how expressions are affected by certain variable values and which functions call other
functions inside a program. On the other hand, systems that operate at the architectural-level
extract facts that lead to a reconstruction of the actual system. Rigi, Dali and PBS use an
entity-relationship model, at the conceptual level, to represent facts about a software
system and an individual format for encoding the entities and relations. In addition, they are
all 'open', in that they can be retargeted into different fact extractors or programming
languages. Fact extractors are specific applications that examine the source code and reveal the
interconnections between its entities. However, none of these tools uses a Multidimensional
Database (MDDB) to store its facts.
All the aforementioned systems use graph structures as intermediate or final data
structures to model intra- and inter-dependencies of the different components of a program.
By intra-dependencies we usually mean any dependencies that exist among the data items
of a procedure, a file, etc., and by inter-dependencies the interactions between two or more
procedures. Examples include call graphs and Module Dependency Graphs (MDGs).
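As a minimal sketch of these two graph structures, the following Python fragment lifts a call graph to a module dependency graph. The function names, file names and the choice of the file as the unit of "module" are assumptions made for illustration only; they are not the output format of any of the tools cited above.

```python
# Hypothetical facts, of the kind a fact extractor might emit.
# DEFINED_IN maps each function to the file (module) that defines it.
DEFINED_IN = {
    "main": "app.c",
    "parse": "parser.c",
    "lex": "parser.c",
    "emit": "codegen.c",
}

# Call graph: one (caller, callee) pair per function call.
CALLS = [("main", "parse"), ("parse", "lex"), ("main", "emit"), ("emit", "lex")]


def split_dependencies(calls, defined_in):
    """Separate calls inside one file from calls that cross file boundaries."""
    inside, across = [], []
    for caller, callee in calls:
        if defined_in[caller] == defined_in[callee]:
            inside.append((caller, callee))
        else:
            across.append((caller, callee))
    return inside, across


def module_dependency_graph(calls, defined_in):
    """Lift function-level call edges to file-level dependency edges."""
    edges = {
        (defined_in[caller], defined_in[callee])
        for caller, callee in calls
        if defined_in[caller] != defined_in[callee]
    }
    return sorted(edges)


inside_calls, crossing_calls = split_dependencies(CALLS, DEFINED_IN)
mdg = module_dependency_graph(CALLS, DEFINED_IN)
```

Here the module dependency graph is simply the call graph with every function replaced by its defining file and duplicate edges collapsed.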
Understanding the intricate relationships that may exist between the source code
components of a software system may become a difficult task, but at the same time it is crucial
for the maintenance of the system [MMR+98]. This maintenance will have a negative effect
if it is not based on subsystem knowledge. The situation is even worse in the case of huge
systems, where any architectural view of them is not easy to infer from the source code.
That is the case where data warehouses, in conjunction with On-Line Analytical Processing
(OLAP) systems, can help. OLAP systems are widely used to allow the interactive analysis
of data properly modeled in a multidimensional way.
1.2 Motivation
Due to the presence of hidden structures that involve graphs, the proper use and analysis
of such structures is of evident value. Identifying the components of a software system and
the interactions between them using graph theoretical algorithms is one side of the coin.
The other side consists of applying mining algorithms to partition the modules of a program
into meaningful and natural regions.
Furthermore, a closer look at a software program reveals interesting information
concerning its structure, which has to do with either its physical or its architectural
design. The reverse engineering tools we referred to in the previous section (PBS, Rigi, Dali,
etc.) basically use the physical structure of a software system under investigation to infer
the architectural structure. This happens because it is often the case that documentation
is non-existent for a software system (e.g. Linux [BHB99]).
Recently, the applicability of data mining in decision making tasks has become necessary
since its results provide insightful information that reverse engineering tools are not
able to reveal. Data mining algorithms, such as Hierarchical Clustering and Concept
Analysis, appear to be promising as far as software systems are concerned. Wiggerts gives a
substantial analysis as to why and how clustering algorithms help in the renovation and
maintenance of legacy systems [Wig97]. At the same time, several authors have been
involved in the analysis of systems using Concept Analysis [ST98, vDK99, SR97, MG99]. The
identification of modules, program structure and other characteristics is boosted by the
application of such an algorithm.
Both mining techniques mostly unveil:
• dependencies among the entities of the system;
• groupings or partitionings of entities; and
• relationships, especially hierarchical relationships, between entity groupings.
Navigation (or browsing) through these groups and hierarchies might help software engineers
maintain or even understand the systems under consideration. In this work, we try to
investigate how data warehouse technology and reverse engineering techniques can work
together. In particular, we shall focus on how the multidimensional view of data helps in
asking complex ad hoc queries over the information extracted by reverse engineering tools.
1.3 Contributions of the thesis
In this thesis our contributions are the following:
• we investigate how techniques, including data mining, can be used to partition and
aggregate graphs, including program analysis graphs, using On-Line Analytical
Processing techniques;
• we propose a multidimensional model for managing these groupings. Our results
enhance the reverse engineering process by permitting integrated browsing and analysis
of the data produced by these automated techniques, together with data produced by
more human-centric documentation or reverse engineering techniques;
• we identify shortcomings in current OLAP techniques when applied to reverse
engineering data. We propose OLAP extensions specifically designed to permit easy
updates when the schema is modified by the introduction of new reverse engineering
results. In particular, current OLAP models assume the structure and schema of
groupings and hierarchies is static. Our solution relaxes this restriction;
• we conclude with an example of querying our multidimensional data.
The thesis is organized as follows.
• In Chapter 2, we give an overview of OLAP and Reverse Engineering systems. We
conclude giving our ideas of how these techniques can work together.
• In Chapter 3, we give the basics of hierarchical and heuristic algorithms for the
partitioning of graphs and, after discussing some of the limitations of the aforementioned
algorithms, we conclude with our arguments on how to incorporate the results
of mining (in particular clustering) algorithms into an On-Line Analytical Processing
framework, to enhance the reverse engineering process.
• In Chapter 4, we introduce our multidimensional model for the results of Hierarchical
Clustering and Concept Analysis algorithms.
• In Chapter 5, we give the intuition behind extending an existing data model that
gives first-class status to dimensions in a data warehouse (the SQL(H) data model).
We enumerate its limitations and give all definitions of our extended ESQL(H) data
model and query language.
• In Chapter 6, we conclude and offer suggestions for further research in this area.
Chapter 2
Handling groups using OLAP and its
use in Reverse Engineering
This chapter gives an overview of OLAP systems and their usage in the analysis of business
data. Moreover, since groups can be identified in data emitted from Reverse Engineering
tools, we give our ideas on how we could take advantage of OLAP systems to navigate and
query such data.
2.1 What is OLAP
In an OLAP system, data are presented to the user in a multidimensional model, which
comprises one or more fact tables and dimensions. A fact table consists of columns, each one
corresponding to a dimension, e.g., geography, product, and one (or more) corresponding to
the measure (or measures), e.g., sales amount. An example data warehouse containing the
dimensions location, time and product and the fact table sales is depicted in Figure 2.1.
Furthermore, OLAP operations, such as roll-up or drill-down, provide the means to
navigate along the dimensions of a data cube (we assume that the fact table is the relational
representation of a data cube, the n-dimensional presentation of data [GBLPSS]).
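A roll-up along a dimension hierarchy can be sketched as a re-aggregation of the fact table at a coarser level. The sketch below assumes a hypothetical two-level location hierarchy (store, city) and a hypothetical sales fact table; the data and the function name `aggregate` are illustrative only.

```python
from collections import defaultdict

# Hypothetical location hierarchy: each store belongs to one city.
STORE_CITY = {"s1": "Toronto", "s2": "Toronto", "s3": "Ottawa"}

# Hypothetical sales fact table: (store, product, dollar amount).
SALES = [
    ("s1", "shirt", 10.0),
    ("s2", "shirt", 15.0),
    ("s3", "shirt", 5.0),
    ("s1", "suit", 100.0),
]


def aggregate(facts, level_of_store):
    """Sum the measure per (location level, product) cell of the cube."""
    cube = defaultdict(float)
    for store, product, amount in facts:
        cube[(level_of_store(store), product)] += amount
    return dict(cube)


# Finest level of the location dimension (the drill-down view).
store_level = aggregate(SALES, lambda store: store)
# Roll-up: replace each store by its parent in the hierarchy, then re-aggregate.
city_level = aggregate(SALES, lambda store: STORE_CITY[store])
```

Drill-down is the inverse navigation: moving from `city_level` back to `store_level` requires the finer-grained facts, which is why the fact table is kept at the lowest level.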
While OLAP systems have the ability to answer "who?" and "what?" questions, it is
their ability to answer "what if?" and "why?" that sets them apart from Data Warehouses.
[Figure 2.1 panels: (a) location dimension; (b) time dimension; (c) product dimension; fact table (store, day, p-name, DollarAmt)]
Figure 2.1: An example data warehouse
OLAP enables decision-making about future actions. A typical OLAP calculation is more
complex than simply summing data, for example: "What would be the effect on suit costs
if fabric prices went down by 0.20/inch and transportation costs went up by 0.10/mile?".
Based on the underlying architecture used for an OLAP application, vendors have
classified their products either as Multidimensional OLAP (MOLAP) or Relational
OLAP (ROLAP).
Multidimensional OLAP uses data stored in a multidimensional database (MDDB) so
as to provide OLAP analysis. As shown in Figure 2.2, MOLAP is a two-tier, client/server
architecture, in which the MDDB serves as both the database layer and the application
logic layer. In the database layer it is responsible for data storage, access and information
retrieval, while in the application logic layer it takes care of all the OLAP requests. Finally,
the presentation layer integrates with the application logic layer to provide an interface
through which users can issue their queries.
Figure 2.2: Multidimensional OLAP (MOLAP) Architecture
On the other hand, Relational OLAP supports OLAP analysis by accessing data stored
in relational tables, i.e. a data warehouse. Figure 2.3 depicts the general architecture of
a ROLAP system. It is evident that ROLAP is a three-tier, client/server architecture,
in which the database layer uses conventional relational databases for data storage, access and
information retrieval. At the application logic layer, a ROLAP engine executes the
multidimensional reports from the users and integrates with various presentation layers, through
which users issue their queries.
Figure 2.3: Relational OLAP (ROLAP) Architecture
Apart from the different ways that the above architectures store their data, they both
provide managers with the information they need to make effective decisions about an
organization's strategic directions. The key indicator of a successful OLAP application is its
ability to provide information as needed, i.e., its ability to provide 'just-in-time'
information for effective decision-making. Furthermore, due to the fact that data relationships may
not be known in advance, the data model must be flexible. A truly flexible data model
ensures that OLAP systems can respond to changing business requirements as needed for
effective decision making.
Although OLAP applications are found in widely divergent functional areas, they all
require the following key features [Cou, AP98].
• Multidimensional view of data, which provides more than the ability to 'slice
and dice': it gives the foundation for analytical processing through flexible access to
information. Database design should not prejudice which operations can be performed
on a dimension or how rapidly those operations are performed. Managers must be
able to analyze data across any dimension, at any level of aggregation, with equal
functionality and ease.
• Calculation-intensive capabilities. OLAP databases must be able to do more
than simple aggregation. While aggregation along a hierarchy is important, there is
more to analysis than simple data roll-ups. Examples of more complex calculations
include share calculations (percentage of total) and allocations (which use hierarchies
from a top-down perspective).
• Time intelligence. Time is an integral component of almost any analytical
application. Time is a unique dimension because it is sequential in character (January
always comes before February). True OLAP systems understand the sequential
nature of time. At the same time business performance is almost always judged over
time, for example, this month vs. last month, this month vs. the same month last
year.
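Two of the features listed above, share calculations and time intelligence, can be illustrated with a small Python sketch. The sales figures, the "YYYY-MM" month encoding and all function names are assumptions introduced for this example only.

```python
# Hypothetical monthly sales: (month, product) -> dollar amount.
SALES = {
    ("2000-01", "shirt"): 40.0,
    ("2000-01", "suit"): 160.0,
    ("2000-02", "shirt"): 60.0,
    ("2000-02", "suit"): 140.0,
}


def share_of_total(sales, month):
    """Share calculation: each product's fraction of the month's total."""
    total = sum(amount for (m, _), amount in sales.items() if m == month)
    return {p: amount / total for (m, p), amount in sales.items() if m == month}


def previous_month(month):
    """Time is sequential: compute the month that precedes 'YYYY-MM'."""
    year, number = (int(part) for part in month.split("-"))
    if number == 1:
        return f"{year - 1:04d}-12"
    return f"{year:04d}-{number - 1:02d}"


def change_vs_previous_month(sales, month, product):
    """This month vs. last month for one product."""
    return sales[(month, product)] - sales[(previous_month(month), product)]


january_shares = share_of_total(SALES, "2000-01")
shirt_change = change_vs_previous_month(SALES, "2000-02", "shirt")
```

The `previous_month` helper is where the sequential nature of the time dimension shows up: unlike other dimensions, its values have a built-in order that the calculation relies on.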
2.2 What is Reverse Engineering
Many systems, when they age, become difficult to understand and maintain. Sometimes,
this task also becomes inefficient due to its high cost. A "Reverse engineering environment
can manage the complexities of program understanding by helping the software engineers
extract high-level information from low-level artifacts" [Til98].
A major effort has been undertaken in the software engineering community to produce
tools that help program analysts uncover the hidden structure of legacy code. We already
mentioned Rigi and The Software Bookshelf as two results of this effort [MOTU93, FHK+97].
These systems are basically focused on performing the central reverse engineering tasks
presented in [Til98].
1. Program Analysis. This is the task where source code analysis and restructuring
is performed.
2. Plan Recognition. This is the task where common patterns are identified. The
patterns can be behavioral or structural, depending on what relationships we are
looking for in the code.
3. Concept Assignment. This is the task that allows the software engineers to discover
human-oriented patterns in the subject system. This task is still at an early research
stage.
4. Redocumentation. This is the task that attempts to build documentation for an
undocumented, and probably old, system, that describes its architecture and
functionality.
From the above, it is obvious that reverse engineering tools try to extract an already
existing, but unknown, structure of a software system. This involves the break down of
the system either in system-oriented or human-oriented partitions that represent natural
groupings, i.e., different subsystems or directories of the same system.
The system examination and management is based on the use of graph structures that
are produced, and later on presented to the user, taking advantage of features of the
original code, such as function calls or file inclusion. In the next section, we discuss how
those natural groupings could be handled by an OLAP framework.
2.3 Can OLAP be used for Reverse Engineering?
To the best of our knowledge, program analysts have not taken advantage of a
multidimensional view of data that could help them model and analyze the alternative postulated
program structures. For instance, we could consider function calls stored in a fact table as
in Figure 2.4. In that figure, Function2 is called by Function1.
| Function1 | ... | Function2 | ... |
Figure 2.4: A fact table with a graph structure
In these cases, we would like to be able to identify useful graph partitions that basically
correspond to parts of the graph with a certain property. For instance, if the graph
constitutes a system's Module Dependency Graph, dense regions of the graph may correspond to
separate logical modules or subsystems of the system.
The issue in question now is hoiv do ive identify worthy graph partitions. having a
table like the one in Figure 2.4. i.e.. how do ive horizontdly partition the fact table into
subtnbles wi th a certain structure and how do we efficiently query these systems. Therefore.
a particular algorithm h a s to be used to produce the partitions of the fact table. and i n
addition an OLAP system to represent and quer? the results. At this point. we would
like to stress that our Fociis will not be on imposing a specific structure on our table but
estracting its inherent one. This process should be based on the Following decision steps:
(. 1) CVhat is the current format of Our data:
(2) \.Vhat is the algorithm under consideration: and
(3) CVhat models have OLAP researchers proposed and used. and how information about
the results of our techniques can be incorporated in thern.
Upon this, we can now answer "what if?", "why?" and "where?" questions on the
system under consideration. An example OLAP query could be: "What would be the
effect on the memory subsystem if the function call to foo() from bar() is omitted and
the io.h header file is moved to the /system directory?". Unlike traditional OLAP, the
effect will likely not be a numeric aggregate, but rather a new grouping of entities produced
either by a query or by mining or program analysis algorithms. This also makes the
"multidimensional view of data" and "time intelligence" properties of more importance
compared to the "calculation-intensive capabilities" one.
Moreover, current reverse engineering tools do not support version control of a
software system. In order to investigate differences among versions of the system, one needs
to examine all versions individually, and manually find all points of interest. On the other
hand, in an OLAP framework, time is treated as a separate dimension, making historical
data easier to analyze.
In our work, we shall consider data that are originally in the Rigi Standard Format
(RSF) [MWT94, WTMS94], which is used by existing systems, such as Rigi and
The Software Bookshelf [FHK+97], to provide understanding of legacy software systems. In
the following chapter we present some graph theoretical and mining algorithms that can
be used to unveil groups in software engineering data which can later on be modeled in an
OLAP environment.
Chapter 3
Identifying Partitions and the use of
Hierarchies
As already mentioned in previous chapters, the problem of analyzing and understanding
information related to a software system consists of finding proper and meaningful partitions
of a graph. In reverse engineering such graphs include control flow, data flow and resource
flow graphs [MC90]. They capture the dependencies or interactions among the software
entities that comprise a system.
Graph based algorithms are used in the software engineering and data mining
community to find natural partitions of the set of vertices, or edges, of a graph. What follows
is an overview of these techniques and how they can be used. We conclude with an
analysis of which of the aforementioned techniques are suitable for adaptation in an On-Line
Analytical Processing (OLAP) system.
3.1 Clustering Algorithms
3.1.1 Hierarchical Clustering
The first family of algorithms that result in groupings of the initial data set is the one of
clustering algorithms. Their main purpose is to find natural and meaningful partitionings,
or clusters. In some problems the produced clusters can be used "as is", while in others
they may form the basis for constructing subsequent clusters, thus producing a hierarchy of
clusters. This section gives an overview of the two major categories of hierarchical clustering
algorithms: agglomerative (or bottom-up) and divisive (or top-down). Before proceeding
with the brief description of these algorithms, we introduce the notion of a sequence of
partitions that are nested into each other [JD88].
Consider a set $X$ of $k$ data items (in our case this can be the set of nodes or edges).
A partition $C$ of $X$ breaks it into subsets $\{C_1, C_2, \ldots, C_m\}$ such that:
$C_i \cap C_j = \emptyset$, $1 \le i, j \le m$, $i \ne j$, and
$C_1 \cup C_2 \cup \cdots \cup C_m = X$.
The set $C$ is called a clustering and each of the $C_i$'s a cluster.
Definition 1 Partition $P$ is nested into $C$ if every cluster of $P$ is a proper subset of a
cluster of $C$.
In the following clusterings, $P$ is nested in $C$, but $D$ is not:
Agglomerative Clustering
In agglomerative clustering [JD88], each data point starts as an individual cluster. As
the algorithm goes on, clusters are merged to form larger clusters, thus nesting a clustering
into another partition. The merging of clusters is based on a similarity (or dissimilarity)
function that decides how similar (or dissimilar) two clusters are.
Figure 3.1 is an example of how the agglomerative function works on a data set of 4
points. A special type of tree structure is used to depict the mergings and clusterings of
each level. This structure is called a dendrogram.
Figure 3.1: An example of agglomerative clustering
A large collection of agglomerative algorithms is presented in [JD88].
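The merging process described above can be illustrated with a minimal sketch. It assumes single-linkage distance (the smallest distance between any two members) as the dissimilarity function, which is only one of the many choices covered in [JD88]; the function name `agglomerative` and the sample points are hypothetical.

```python
from itertools import combinations

def agglomerative(points, dist):
    """Return the nested clusterings produced by single-linkage merging."""
    # Every data point starts as its own cluster.
    clustering = [frozenset([p]) for p in points]
    levels = [list(clustering)]
    while len(clustering) > 1:
        # Pick the two clusters with the smallest single-linkage distance.
        a, b = min(
            combinations(clustering, 2),
            key=lambda pair: min(dist(x, y) for x in pair[0] for y in pair[1]),
        )
        # Merge them; each new level nests the previous one.
        clustering = [c for c in clustering if c not in (a, b)] + [a | b]
        levels.append(list(clustering))
    return levels

# Four points on a line (illustrative data only).
levels = agglomerative([1.0, 2.0, 6.0, 7.5], lambda x, y: abs(x - y))
for level in levels:
    print(sorted(sorted(c) for c in level))
```

Each printed line corresponds to one level of the dendrogram, from the four singleton clusters up to the single cluster holding every point.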
Divisive Clustering
Divisive clustering algorithms [JD88] perform the task of clustering in reverse order.
Starting with all the data points in a single "big" cluster, such an algorithm iteratively divides
the "big" cluster into smaller ones. Algorithms of this type are not very popular due to
their high computational complexity: at each step, the number of partitions to consider is
exponential [JD88].
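To make the exponential growth concrete, the following sketch counts only the two-way splits of a single cluster: a set of n items can be divided into two non-empty parts in 2^(n-1) - 1 ways. The function name is hypothetical, and an actual divisive algorithm may consider other kinds of splits as well.

```python
def two_way_splits(n):
    """Number of ways to split a cluster of n items into two non-empty parts."""
    return 2 ** (n - 1) - 1 if n >= 2 else 0

for n in (4, 10, 20, 40):
    print(n, two_way_splits(n))
```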
Most of the above approaches often lead to expensive solutions that may not even be near
optimal. These are the cases where we need to employ a smart procedure to identify meaningful
and interesting clusters of a graph. Heuristic approaches, then, come into play in an attempt
to find good solutions in a moderate amount of time. Researchers use variations of already
known techniques, such as hill-climbing [KL70, MMR+98], to prune the search space of
clusters in order to find "good" clusters in the minimum possible amount of time. Depending
on the domain under consideration, different heuristics can be applied, with different results,
and particular attention should be paid to their evaluation.
3.2 Discussion on the usage of graph theory and clustering algorithms in reverse engineering
Evaluating various techniques that perform clustering is crucial, and our concentration
should be on the following three issues [CWX]:
(1) on what data will the methods be applied;
(2) what is the computational cost of a method; and
(3) how "good" are the clusters.
To the above, we add a fourth issue for consideration, which emanates from the amount of
disk and memory space available:
(4) whether the algorithm is suitable for in-memory execution or the data should reside
on disk.
We shall be dealing with graph data from the program domain. Specifically, we will focus
on what has been called Module Dependency Graphs or MDGs [MMR+98]. An MDG is
a directed graph whose nodes are entities of a software system (procedures, files, etc.) and
whose edges are relationships between them. The nodes and edges may be accompanied
by attributes that depict properties of each procedure (developer, version, fan-in, fan-out,
etc.). A software system seems easy to analyze when the number of modules (nodes) is
fairly small. In this work, we are interested in analyzing large legacy systems, consisting of
several thousands of nodes and edges, which often come undocumented. Our goal is to find
partitionings of the MDG in a way that the produced subsets are natural and represent
interesting information inherent in the system. Although there might be some structure
inside a software system, we are often unable to single out individual components.
Agglomerative and divisive algorithms have been proposed by Wiggerts [Wig97] as a
means of performing hierarchical clustering given an MDG-like graph. Both categories of
algorithms are based on a similarity or dissimilarity measure among the nodes of the graph.
This measure has to be updated each time a new clustering is formed or split into smaller
ones. The measure obviously affects the number and the quality of the clusters, mainly for
the following reasons:
1. a node (module) might end up being in a wrong cluster due to ties;
2. different measures can give different clusterings.
In the hierarchical algorithms, it is not clear what happens if there is more than one edge
between two nodes, i.e., when we have a multi-graph. We would say that these algorithms
are inefficient, even inapplicable, in such cases. To make matters worse, parallel edges often
appear in MDGs, for example when a function calls another function at two different points
of the program.
Heuristic approaches seem to alleviate this problem and moreover give natural clusters of
a system. Mancoridis et al. describe a system that generates meaningful clusters based on
the inter- and intra-connections of nodes in MDGs [DMM99, MMCG99, MMR+98]. The
clusters conform to the heuristic of "low coupling and high cohesion", which is
widely used in software engineering. Low coupling is a software principle which requires
that interactions between subsystems should be as few as possible, while high cohesion is a
related principle that requires that interactions within a subsystem should be maximized.
Inside the described framework, a genetic algorithm is applied to an MDG. It explores,
in a systematic way, the extremely large space of partitions and gives a "good" one. Their
system, called BUNCH, operates well for any given set of nodes and edges.
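The "low coupling and high cohesion" heuristic can be illustrated with a small sketch that simply counts intra-cluster and inter-cluster edges for a candidate partition. This is a simplified stand-in and not the objective function used by BUNCH; the function name, the edge list and both partitions are hypothetical.

```python
def coupling_and_cohesion(edges, partition):
    """Count intra-cluster (cohesion) and inter-cluster (coupling) edges."""
    cluster_of = {node: i for i, cluster in enumerate(partition) for node in cluster}
    intra = inter = 0
    for src, dst in edges:
        if cluster_of[src] == cluster_of[dst]:
            intra += 1   # edge stays inside one subsystem
        else:
            inter += 1   # edge crosses a subsystem boundary
    return intra, inter

# A toy MDG given as a list of directed edges (illustrative data only).
edges = [("main", "foo"), ("main", "bar"), ("foo", "bar"),
         ("printf", "scanf"), ("bar", "printf")]
good = [{"main", "foo", "bar"}, {"printf", "scanf"}]
bad = [{"main", "printf"}, {"foo", "bar", "scanf"}]

print(coupling_and_cohesion(edges, good))
print(coupling_and_cohesion(edges, bad))
```

A search procedure such as hill-climbing or a genetic algorithm would compare candidate partitions with a score of this kind, preferring those with many intra-cluster and few inter-cluster edges.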
We should note here the existence of graph theoretical algorithms that try to capture
groups in graph structures. Namely, algorithms that investigate strongly connected
components or articulation points might be of significant interest for our problem. Strongly
connected components identify the "pieces" that comprise a graph, and two vertices are
in the same component if and only if there is a directed path from each one to the other [Wes96]. On
the other hand, articulation point algorithms find vertices whose deletion disconnects a
graph [Ski98]. Although both types of algorithms do not require a significant amount of
time and space [CLR92], their applicability is not proven in the software reverse engineering
domain.
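As an illustration of the first of these techniques, the sketch below finds strongly connected components with Kosaraju's two-pass depth-first search, one standard linear-time method. The choice of algorithm, the function name and the sample edges are assumptions made for the example, not something this work prescribes.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Kosaraju's algorithm on a directed graph given as (source, target) pairs."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # First pass: record nodes in order of DFS completion (iterative DFS).
    order, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Second pass: explore the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        component, todo = set(), [start]
        assigned.add(start)
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components

# Two call cycles joined by a one-way edge (illustrative data only).
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e"), ("e", "d")]
print(strongly_connected_components(edges))
```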
To the previous algorithms, we could add that of finding cliques in a graph. Intuitively,
a clique is a graph in which each pair of vertices is an edge. A complete graph (a graph in
which all pairs of vertices form edges) has many subgraphs that are not cliques, but every
induced subgraph of a complete graph is a clique [Wes96]. Finding cliques, however, is an
NP-complete problem [C\'3], and in [MM65] Moon and Moser showed that the number
of cliques in a graph may grow exponentially with the number of nodes.
In our work, we do not consider any graph theoretical algorithm. An interesting question
that arises when a clustering algorithm is applied has to do with the identity of the clusters.
If the algorithm is hierarchical, the question also includes the identity of the levels produced.
One way to deal with it is to use existing domain knowledge about the software system.
In the following section, we present a more natural technique widely used in the software
engineering community, that of Concept Analysis.
3.3 Concept Analysis
Concept analysis is a means to identify groupings of objects that have common attributes.
In 1940 G. Birkhoff [Bir40] proved that for every binary relation between "objects" and
their "attributes", a lattice can be built, which allows remarkable insight into the structure
of the original relation. The following definitions and the example are taken from [LS97].
In concept analysis we consider a relation $T$ between objects $\mathcal{O}$ and attributes
$\mathcal{A}$, hence $T \subseteq \mathcal{O} \times \mathcal{A}$. A formal context is the triple:
$C = (\mathcal{O}, \mathcal{A}, T)$
For any set of objects $O \subseteq \mathcal{O}$, their set of common attributes is defined by:
$\sigma(O) = \{a \in \mathcal{A} \mid \forall o \in O : (o, a) \in T\}$
while, for any set of attributes $A \subseteq \mathcal{A}$, their set of common objects is given by:
$\tau(A) = \{o \in \mathcal{O} \mid \forall a \in A : (o, a) \in T\}$
A pair $(O, A)$ is called a concept if:
$A = \sigma(O)$ and $O = \tau(A)$
Such a concept corresponds to a maximal rectangle in the table $T$. A maximal rectangle is
a set of objects sharing common attributes.
Concept analysis starts with the table $T$ indicating the attributes of a given set of
objects. It then builds up so-called concepts, which are maximal sets of objects sharing
certain features. All possible concepts can be grouped into a single lattice, the so-called
concept lattice. The smallest concepts consist of few objects having potentially many
different attributes; the largest concepts consist of many different objects that have only
few attributes in common. A formal context and its concept lattice extracted from a
FORTRAN source file are shown in Figure 3.2.
Figure 3.2: A source code, its variable usage and its concept lattice [LS97]
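The definitions of common attributes, common objects and concepts can be made concrete with a brute-force sketch. It closes every subset of objects, which is exponential and only suitable for toy inputs; practical concept analysis tools use far more efficient algorithms. The names `sigma`, `tau` and `concepts`, and the small table, are hypothetical and do not reproduce the data of Figure 3.2.

```python
from itertools import combinations

def sigma(objs, table, attributes):
    """Attributes shared by every object in objs."""
    return frozenset(a for a in attributes if all((o, a) in table for o in objs))

def tau(attrs, table, objects):
    """Objects that have every attribute in attrs."""
    return frozenset(o for o in objects if all((o, a) in table for a in attrs))

def concepts(table):
    """All (extent, intent) pairs of the relation, found by closing object subsets."""
    objects = {o for o, _ in table}
    attributes = {a for _, a in table}
    found = set()
    for k in range(len(objects) + 1):
        for objs in combinations(sorted(objects), k):
            intent = sigma(objs, table, attributes)
            extent = tau(intent, table, objects)  # closure of objs
            found.add((extent, intent))
    return found

# Hypothetical usage table: which subroutine uses which variable.
table = {("R1", "V1"), ("R1", "V2"), ("R2", "V2"), ("R2", "V3"), ("R3", "V3")}
for extent, intent in sorted(concepts(table), key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

Each printed pair is a maximal rectangle of the table: no object can be added to the extent without losing an attribute of the intent, and vice versa.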
The set of all the concepts of a given table conforms to a partial order:
$(O_1, A_1) \le (O_2, A_2) \iff O_1 \subseteq O_2 \ (\iff A_1 \supseteq A_2)$
In the concept lattice, the infimum, or meet, of two concepts is computed by intersecting
their extents, the extent of a concept being the set of its objects $O$:
$(O_1, A_1) \wedge (O_2, A_2) = (O_1 \cap O_2, \sigma(O_1 \cap O_2))$
Thus, an infimum describes the set of attributes common to two sets of objects.
The supremum, or join, is computed by intersecting the intents, the intent of a concept
being the set of its attributes $A$:
$(O_1, A_1) \vee (O_2, A_2) = (\tau(A_1 \cap A_2), A_1 \cap A_2)$
Thus, a supremum describes a set of common objects which share the two sets of attributes.
In order to interpret a concept lattice, we also need to define the following:
$\mu(a) = \bigvee \{c \in L(C) \mid a \in intent(c)\}$
which corresponds to a lattice element labeled with $a$, and
$\gamma(o) = \bigwedge \{c \in L(C) \mid o \in extent(c)\}$
which corresponds to a lattice element labeled with $o$. The property that connects a concept
lattice with its table is as follows:
$(o, a) \in T \iff \gamma(o) \le \mu(a)$
Hence, attributes of object $o$ are just those which show up above $o$ in the lattice, and the
objects for attribute $a$ are those which show up below $a$.
Interpreting the concept lattice of Figure 3.2, we have the following, according to the
aforementioned definitions:
All subroutines below $\mu(V3)$ (R2, R3, R4) use V3 (and no other subroutines use
it). All variables above $\gamma(R4)$ (V3, V4, V5, V6, V7, V8) are used by R4 (and
no other variables are used by R4). Thus, the concept labeled R4 is:
and the concept labeled V5/R2 is:
It is obvious that $c_1 \le c_3$. This can be read as: "Any variable that is used by
subroutine R2 is also used by R4". Similarly, $\mu(V5) \le \mu(V3) = \mu(V4)$, which
is read as: "All subroutines which use V5 will also use V3 and V4". Moreover,
the infimum of V5/R2 and V6, V7, V8/R3 is labeled R4, meaning that R4 (and
all subroutines below $\gamma(R4)$) uses both V5 and V6, V7, V8.
After all, the lattice uncovers a hierarchy of conceptual clusters implicit in the
original table. To handle those conceptual clusters in a multidimensional way and, furthermore,
query them, we need to introduce a proper multidimensional model. Several researchers have
proposed variations of a multidimensional model [Fir98, IH94, PJ99, Vas98, CT97, AGS97]
and our work will center on an extension of it so as to include clusters and concepts.
From our description of hierarchical clustering it seems reasonable for our task to use an
agglomerative algorithm that incorporates a hierarchy with clusters organized in a
"cluster" dimension. The following chapter gives a first approach towards the multidimensional
modeling of clustering and concept analysis.
Chapter 4
The Multidimensional Model
This chapter introduces our approach to the multidimensional model that will incorporate
the results produced by reverse engineering tools and mining algorithms. First, we give the
definitions of the model for the hierarchical clustering algorithms, and then we extend it to
include the results of concept analysis.
4.1 A Multidimensional Model for managing hierarchical clusters
Let us consider the following. Let:
- $F$ be a set of features over which we perform the clustering. Hence, $F$ is given by:
$F = \{f_1, f_2, \ldots, f_n\}$
where $f_i$ can be a function-call, a file inclusion, etc.
- $A$ be a set of nodes. $A$ is given by:
$A = \{A_1, A_2, \ldots, A_m\}$
where $A_i$ can be a file, function, variable, etc.
The dependencies (i.e. interactions between entities of the software system) that we have
are of the form:
$f_k(A_i, A_j)$
where $A_i$, $A_j$ and $f_k$ could, for example, comply with the schema of Figure 4.1, which is
used in the Software Bookshelf tool [FHK+97]. Each of the nodes $A_i$ has a domain, denoted
by $dom(A_i)$, which corresponds to the values it can take. Intuitively, features represent
edge labels on the relations that connect two entities (nodes).
Figure 4.1: The schema for the relations used in Software Bookshelf (relations: funcdcl,
funcdef, vardcl, vardef, macrodef, usemacro, uniondef, useunion, typedef, usetype, enumdef,
useenum; entities: function, variable, macro, union, type, enum)
Each feature may be accompanied by a set of attributes: $B_1^{f_i}, B_2^{f_i}, \ldots, B_k^{f_i}$, where
$B_k^{f_i}$ is the $k$-th attribute of feature $f_i$. If $f_i$ = function-call, a set of attributes can be:
$B_1^{function-call}$ = line number
$B_2^{function-call}$ = number of parameters passed
In the same sense, a node $A_i$ may also be accompanied by a set of attributes:
$B_1^{A_i}, B_2^{A_i}, \ldots, B_l^{A_i}$.
If $A_i$ = file, the set of attributes that can be defined is the following:
$B_1^{file}$ = number of lines
$B_2^{file}$ = number of if statements
$B_3^{file}$ = number of function calls
For each feature $f_i$, we create a table with the following schema:
$f_i(A_j, A_k, \{B^{f_i}\})$
where $\{B^{f_i}\}$ is the set of attributes for $f_i$. In the above, $f_i$ is a fact table of our Data
Warehouse. Since we might have parallel edges, i.e., multiple appearances of a pair $(A_j, A_k)$,
we give each appearance of a pair a unique identifier, and the following is the updated schema
of the fact table $f_i$:
$f_i(id, A_j, A_k, \{B^{f_i}\})$
For example, if $f_i$ = function-call, then, according to Figure 4.1, $A_j = A_k$ = function, and a
fact table could look like the following:
id | func1 | func2  | line # | # of parameters passed
1  | foo   | printf | 25     | 0
2  | foo   | scanf  | 30     | 1
3  | main  | foo    | 100    | 3
4  | main  | printf | 105    | 2
For each node $A_i$ we create a table with the following schema:
$A_i(A_i, \{B^{A_i}\})$
where $\{B^{A_i}\}$ is the set of node attributes for the node $A_i$.
From Figure 4.1 we can easily infer that a fact table incorporates a graph structure. For
instance, all dependencies of the form of Figure 4.1 constitute a graph. We are interested in
splitting the nodes of the graph into meaningful horizontal partitions. Considering each node
$A_i$ from the original set as an individual cluster, we can apply a hierarchical clustering
algorithm on that set of nodes.
An initial clustering is defined as:
$D_0^{f_i} = \{C_0^1, C_0^2, \ldots, C_0^m\}$
where each $C_0^k$ is a cluster that corresponds to a node $A_k$ which participates in feature $f_i$.
Given a similarity (or dissimilarity) function $G$ we may start by trying to find which
are the clusters that can be formed from the initial clustering $D_0^{f_i}$. We call this clustering
$D_1^{f_i} = \{C_1^1, C_1^2, \ldots, C_1^{m'}\}$
and $D_0^{f_i}$ is nested in $D_1^{f_i}$ since each $C_0^k$ of $D_0^{f_i}$ is a proper subset of a $C_1^l$ of $D_1^{f_i}$.
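The fact table with a unique identifier per edge, as described above for the function-call feature, can be sketched with SQLite. The table name, the column names and the rows are illustrative assumptions; row 5 is added here only to show a parallel edge, i.e., a second call from main to printf.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE function_call (
        id       INTEGER PRIMARY KEY,  -- unique identifier, distinguishes parallel edges
        func1    TEXT NOT NULL,        -- calling function
        func2    TEXT NOT NULL,        -- called function
        line_no  INTEGER,              -- feature attribute: line number
        n_params INTEGER               -- feature attribute: number of parameters passed
    )
""")
rows = [
    (1, "foo", "printf", 25, 0),
    (2, "foo", "scanf", 30, 1),
    (3, "main", "foo", 100, 3),
    (4, "main", "printf", 105, 2),
    (5, "main", "printf", 120, 1),  # hypothetical parallel edge
]
conn.executemany("INSERT INTO function_call VALUES (?, ?, ?, ?, ?)", rows)

# Pairs of nodes connected by more than one edge (the multi-graph case).
parallel = conn.execute("""
    SELECT func1, func2, COUNT(*) AS n
    FROM function_call
    GROUP BY func1, func2
    HAVING COUNT(*) > 1
""").fetchall()
print(parallel)
```

Without the `id` column, the two calls from main to printf could only be told apart by their attributes, which is why each appearance of a pair gets its own identifier.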
Definition 2 Operator $\prec$ denotes the nesting of one clustering into another with respect
to a feature $f_i$. Hence, if $D_j^{f_i}$ is nested in $D_{j+1}^{f_i}$, write:
$D_j^{f_i} \prec D_{j+1}^{f_i}$
We perform consecutive clusterings, say $p$, until we find the final clustering with $\|D_p^{f_i}\| = 1$.
Let $f_i$ = function-call, $A_i$ = function with $dom(A_i)$ = {foo, bar, main, printf, scanf} and
a similarity function $G$. The clustering algorithm may give the following clusterings:
$D_0^{function-call}$ = {(foo), (bar), (main), (printf), (scanf)}
$D_1^{function-call}$ = {(foo, bar), (main), (printf, scanf)}
$D_2^{function-call}$ = {(foo, bar, main), (printf, scanf)}
$D_3^{function-call}$ = {(foo, bar, main, printf, scanf)}
The schema for the above hierarchical clustering is depicted in Figure 4.2, while its
instantiation is depicted in Figure 4.3. Figures 4.2 and 4.3 represent a dimension $D$ and its levels $D_j^{f}$.
Figure 4.2: A clustering schema
Figure 4.3: A clustering instance
Definition 3 For each pair of clusterings $D_j^{f_i}$, $D_{j+1}^{f_i}$ such that $D_j^{f_i} \prec D_{j+1}^{f_i}$, there exists a roll-up
function $RUP: D_j^{f_i} \to D_{j+1}^{f_i}$.
Intuitively, the RUP function aggregates one or more clusters of one clustering ($D_j^{f_i}$) to a
cluster of the immediately higher order clustering in the hierarchy ($D_{j+1}^{f_i}$).
If a cluster rolls up to another cluster with the same elements, i.e., $C_j: (x_1, x_2, \ldots, x_n)$
rolls up to $C_{j+1}: (y_1, y_2, \ldots, y_n)$ such that $x_i = y_i$, $\forall i: 1 \le i \le n$, then RUP = identity.
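The nesting relation and the roll-up function can be sketched directly on the four function-call clusterings listed earlier. The function names `is_nested` and `roll_up` are hypothetical. The sketch assumes that a cluster may roll up to an identical cluster (the identity case), so it tests for subsets rather than proper subsets.

```python
def is_nested(lower, upper):
    """True if every cluster of `lower` is contained in some cluster of `upper`."""
    return all(any(c <= d for d in upper) for c in lower)

def roll_up(lower, upper):
    """Map each cluster of `lower` to the cluster of `upper` that contains it."""
    return {c: next(d for d in upper if c <= d) for c in lower}

D0 = [frozenset({n}) for n in ("foo", "bar", "main", "printf", "scanf")]
D1 = [frozenset({"foo", "bar"}), frozenset({"main"}), frozenset({"printf", "scanf"})]
D2 = [frozenset({"foo", "bar", "main"}), frozenset({"printf", "scanf"})]
D3 = [frozenset({"foo", "bar", "main", "printf", "scanf"})]

levels = [D0, D1, D2, D3]
for lower, upper in zip(levels, levels[1:]):
    assert is_nested(lower, upper)
    for cluster, parent in roll_up(lower, upper).items():
        print(sorted(cluster), "->", sorted(parent))
```

In this example the cluster (main) rolls up from the first level to the second unchanged, which is the identity case described above.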
For each pair of nodes $(A_a, A_b)$ that appears in the fact table $f_i$, i.e. the fact table that
corresponds to feature $f_i$, we create two tables:
$A_a(A_a, \{D^{f_i}\} \setminus D_0^{f_i})$ and $A_b(A_b, \{D^{f_i}\} \setminus D_0^{f_i})$
where $\{D^{f_i}\} \setminus D_0^{f_i}$ is the set of clusterings except for $D_0^{f_i}$, which is represented by $A_a$ and $A_b$
in each table.
The dimension $D^{f_i}$ for the pair $(A_a, A_b)$ is given by:
After all, the schema of a Data Warehouse (DW) is like the one depicted in Figure 4.4.
This example shows how inherent hierarchies in a software system can be revealed using
hierarchical clustering, and moreover how the results of such a clustering algorithm can be
modeled in a multidimensional way. Upon the formation of such a model and storage of
the data in the tables, navigation and browsing become easy and efficient.
Figure 4.4: The DW schema
4.2 A Multidimensional Model for managing concepts
In our approach to hierarchical clustering, we create all groupings without actually knowing
what each $D_j^{f_i}$ represents. Concept Analysis is a means not only to discover the groupings
but also to describe them in a more natural way.
Suppose that we have a relation $F \to A_o \times A_a$, where $A_o$ is the set of objects and $A_a$
is the set of attributes. For instance, $A_o$ can be a set of nodes corresponding to .c files and
$A_a$ the set of nodes corresponding to .h files. In this example, $F$ is the relation that depicts
file inclusion. $F$ is called the relation matrix in concept analysis, while here it represents
the fact table described in the previous section. An example of a relation matrix is given
in Figure 4.5. The example was taken from [GMA95].
The relation matrix can be accompanied by several attributes that characterize the edge
it represents (i.e., for a relation matrix that represents file inclusion, possible attributes
might be the developer of the .c files and the number of lines of the .h files).
Having such a matrix available, we may start building the concept lattice of the above
relation, which depicts maximal rectangles of a relation matrix. The concept lattice for the
relation matrix of Figure 4.5 is the one of Figure 4.6.
Figure 4.5: Example of a relation matrix
For each object $O_i$ that appears in a concept $C_j$, we create the table extents with
the following schema:
extents(Object, Concept)
The primary key for the above table is the combination of both attributes.
For each attribute $A_i$ that appears in a concept $C_j$, we create a table intents with
the following schema:
intents(Attribute, Concept)
The primary key for the above table is the combination of both attributes.
Finally, once the concept analysis algorithm has computed the concepts and the links
between them, i.e., the hierarchy inside the concept lattice, we create a table that depicts
the child → parent relationships between concepts. The schema of that table is:
hierarchy(Concept1, Concept2)
Figure 4.6: The concept lattice of the matrix in Table 1.
The primary key for the above table is the combination of both attributes. Concept1 is the
child attribute, while Concept2 is the parent one.
The above tables for our example are depicted in Figures 4.7, 4.8 and 4.9.
The schema of Figure 4.4 is now extended to include the above tables. The new schema
is given in Figure 4.10. Moreover, given the tables extents, intents and hierarchy, we
can compute any concept in the lattice using standard SQL.
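As a rough illustration of how the three tables support such queries, the sketch below builds them in SQLite for a small hypothetical lattice of .c files and the .h files they include, retrieves one concept, and walks the hierarchy upwards with a recursive common table expression. The data, the concept names and the helper functions are assumptions made for the example and do not correspond to Figure 4.6.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE extents   (Object    TEXT, Concept  TEXT, PRIMARY KEY (Object, Concept));
    CREATE TABLE intents   (Attribute TEXT, Concept  TEXT, PRIMARY KEY (Attribute, Concept));
    CREATE TABLE hierarchy (Concept1  TEXT, Concept2 TEXT, PRIMARY KEY (Concept1, Concept2));
""")
# Hypothetical lattice: bot lies below c1 and c2, which both lie below top.
conn.executemany("INSERT INTO extents VALUES (?, ?)",
                 [("a.c", "c1"), ("b.c", "c2"), ("a.c", "top"), ("b.c", "top")])
conn.executemany("INSERT INTO intents VALUES (?, ?)",
                 [("x.h", "c1"), ("common.h", "c1"),
                  ("y.h", "c2"), ("common.h", "c2"),
                  ("common.h", "top"),
                  ("x.h", "bot"), ("y.h", "bot"), ("common.h", "bot")])
conn.executemany("INSERT INTO hierarchy VALUES (?, ?)",   # (child, parent)
                 [("bot", "c1"), ("bot", "c2"), ("c1", "top"), ("c2", "top")])

def concept(name):
    """Return the (objects, attributes) pair stored for one concept."""
    objs = [r[0] for r in conn.execute(
        "SELECT Object FROM extents WHERE Concept = ? ORDER BY Object", (name,))]
    attrs = [r[0] for r in conn.execute(
        "SELECT Attribute FROM intents WHERE Concept = ? ORDER BY Attribute", (name,))]
    return objs, attrs

def ancestors(name):
    """All concepts above `name`, following child -> parent links transitively."""
    rows = conn.execute("""
        WITH RECURSIVE up(c) AS (
            SELECT Concept2 FROM hierarchy WHERE Concept1 = ?
            UNION
            SELECT h.Concept2 FROM hierarchy AS h JOIN up ON h.Concept1 = up.c
        )
        SELECT c FROM up ORDER BY c
    """, (name,))
    return [r[0] for r in rows]

print(concept("c1"))
print(ancestors("bot"))
```

Navigating several levels at once requires either a recursive query like this one or a chain of joins on hierarchy, one per level.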
We understand that for both Hierarchical Clustering and Concept Analysis algorithms,
levels are of major importance. We need to efficiently navigate through the different levels
of the hierarchy these algorithms produce and infer things that happen above or below a
specific level. In general, and for any instance of the Data Warehouse, we need to be able to
view the software system from different levels of abstraction (or detail).
The next chapter introduces the SQL(H) multidimensional model that gives first-class
status to the dimensions, i.e. the hierarchies they encompass.
Figure 4.7: The extents table for the lattice in Figure 4.6 (columns: Object, Concept)
Figure 4.8: The intents table for the lattice in Figure 4.6 (columns: Attribute, Concept)
Figure 4.9: The hierarchy table for the lattice in Figure 4.6 (columns: Concept1, Concept2)
Figure 4.10: The extended DW schema (adds the tables extents, intents and hierarchy)
Chapter 5
The Extended SQL(H) Model
As mentioned at the end of the previous chapter, performing mining algorithms over software
data needs to be flexible, in the sense that hierarchies and levels inside them should be well
defined and easy to use. In the paper "What can Hierarchies do for Data Warehouses"
Jagadish et al. proposed a new multidimensional model, called SQL(H), which extends
the relational data model of SQL and gives first-class significance to the hierarchies in
dimensions.
In this chapter, we briefly introduce the SQL(H) model, we identify some key
weaknesses of this model and go one step further by extending it to the (E)xtended SQL(H) (or
ESQL(H)) model, in order to make it more general and adaptable to our needs.
5.1 The SQL(H) model
Several models have emerged to handle multidimensional data. We can briefly mention
the Star and Snowflake schemata as the most prevalent and elegant ones. However,
several limitations apply to these models, with heterogeneity within and across levels being
one of them (especially for the Star schema). Restricting the case to relational storage
of fact and dimension tables (ROLAP architecture), those models require that the
complete information concerning the levels of a hierarchy be stored in a single table. The
shortcomings are straightforward. For example, having a dimension named location, USA
and Monaco are constrained to be modeled in the same way, e.g. within the hierarchy
store-city-region-country. But, as we know, Monaco is a city and a country at the
same time.
The authors of [JLS99] refer to the limitations of the snowflake schema as the following:
• "Each hierarchy in a dimension has to be balanced", i.e. the length from the root to
a leaf has to be the same.
• "All nodes at any level of a hierarchy have to be homogeneous", i.e., they should
include the same attributes.
Since the hierarchies in the aforementioned models are restricted to be part of the
metadata, i.e. they do not have first-class importance, even simple queries have to
include sequences of joins, making them hard to read and understand. The SQL(H) model
tackles the above problem by introducing an extension of standard SQL.
The SQL(H) model comprises:
• A Hierarchical Domain, which is a collection of attribute values arranged in such
a way that they form a tree. New predicates are defined over this domain, and these
predicates are:
- =, which is the standard equality predicate;
- <, which corresponds to a binary relation over the set of attribute values so that
they form a tree;
- <<, which is the transitive closure of <; and
- <= (resp. <<=), which corresponds to the relation that represents non-proper
child-parent (resp. descendant-ancestor) dependencies.
In general, we interpret each hierarchical domain as a special data type.
• A Hierarchy Schema, which forms a rooted Directed Acyclic Graph (DAG). In this
data structure the root has a special value ALL. Each node of the DAG accommodates
a certain number of attributes including one that has a hierarchical domain, the
hierarchical attribute, which is denoted by $A_h$.
• A Hierarchy Instance, which corresponds to a hierarchy schema defined as above.
In the instance, each relational table corresponds to exactly one table of the schema,
while at the same time no table can straddle hierarchy levels. This means that all the
values in a table belong to the same dimension. Finally, tuples of a specific table
are properly related with tuples of a table (or more than one table) above it. This
means that given a tuple in a table and the hierarchy of the hierarchical attribute
that corresponds to this tuple, we can infer its ancestors.
• A Dimension Schema, which is a name together with a hierarchy schema.
• A Dimension Instance, which is a name together with a hierarchy instance.
• A Data Warehouse Schema, which is a set of fact tables together with a set of
dimension schemas. Fact tables are restricted to include hierarchical attributes
corresponding to only the leaves of the appropriate dimension.
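The hierarchical predicates listed above can be mimicked on a small tree-shaped domain. The sketch below stores a hypothetical location domain as a child-to-parent map rooted at ALL and implements the four predicates as plain functions. It only illustrates their meaning and says nothing about how SQL(H) evaluates them; the names and values are assumptions.

```python
# Hypothetical hierarchical domain for a location dimension, rooted at ALL.
parent = {
    "Toronto": "Ontario",
    "Ontario": "Canada",
    "Canada": "ALL",
    "Monaco": "ALL",   # a city that is also a country: a shorter path to the root
}

def child_of(a, b):
    """a < b : a is a child of b."""
    return parent.get(a) == b

def descendant_of(a, b):
    """a << b : transitive closure of <."""
    a = parent.get(a)
    while a is not None:
        if a == b:
            return True
        a = parent.get(a)
    return False

def child_or_equal(a, b):
    """a <= b : non-proper child-parent relation."""
    return a == b or child_of(a, b)

def descendant_or_equal(a, b):
    """a <<= b : non-proper descendant-ancestor relation."""
    return a == b or descendant_of(a, b)

print(child_of("Toronto", "Ontario"), descendant_of("Toronto", "Canada"))
print(child_of("Toronto", "Canada"), descendant_or_equal("Canada", "Canada"))
```

Because Monaco hangs directly under ALL while Toronto sits two levels deeper, the tree is unbalanced, which is the kind of hierarchy that a balanced snowflake schema cannot express.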
Imagine a Data Warehouse that includes the dimensions of location, time and product,
and whose fact table captures dollar amounts for sales with respect to these dimensions.
The schemas of all tables that could form such a Warehouse are depicted in Figure 5.1.
In this figure locId, tId and pId are hierarchical attributes and primary keys for their
respective relations.
Recall the example of the concept lattice in Figure 4.6. Trying to use the SQL(H)
model to represent the schema of the concept analysis algorithm results, we observe,
first of all, that we do not have a tree for the hierarchical domain of the first candidate for
such an attribute, which is the set of Concept Ids: $\{C_i, 1 \le i \le 12\}$. In order to do so, we
need a more general structure. Before introducing such a structure let us see what extensions
the SQL(H) model adds to standard SQL.
Figure 5.1: A Data Warehouse conforming to the SQL(H) model: (a) location dimension,
(b) time dimension, (c) product dimension, and the fact table.
5.2 The Query language for the SQL(H) model
To take full advantage of the SQL(H) model a simple but powerful extension of standard
SQL is proposed in [JLS99]. Considering single block SQL(H) queries, the basic extensions
are the following.
• DIMENSIONS clause: This clause permits the inclusion of dimension names in
a query. It is similar to the list of tables mentioned in a FROM clause of standard SQL,
but the names now refer to the tables of a dimension. Moreover, just as in SQL we can
declare tuple variables, in a DIMENSIONS clause all names that come right after the
dimension name are called dimension variables. As will be discussed in
the semantics of the language, dimension variables range over all tuples of all tables
appearing in a dimension.
• Hierarchical predicates: in the SELECT, WHERE, HAVING or GROUP BY
clauses of an SQL query we can include domain expressions (DEs) of the form T.A,
where T is a tuple variable and A an attribute name. These DEs are compared with
others, or with values of compatible type. In order to take advantage of the hierarchical
operators that are defined in the SQL(H) model, we permit DEs of the form V.A,
where V is a dimension variable and A an attribute name. Moreover, we extend
DEs to include hierarchical domain expressions (HDEs), which are of the form $W.A_h$,
where W is a tuple/dimension variable and $A_h$ a hierarchical attribute. HDEs can
be compared with each other using the predicates (hierarchical predicates) that are
defined in the hierarchical domain. For example, given A and B, A < B means that
A is a child of B.
5.3 Semantics of the SQL(H) query language
For the sake of simplicity, the authors of [JLS99] use uniform SQL(H) queries. Such a query
is of the form:
SELECT domExpList, aggList
DIMENSIONS dimList
FROM fromList
WHERE whereConditions
GROUP BY groupbyList
HAVING haveConditions
Clauses that also appear in standard SQL have the same semantics. The question
is what happens with the newly introduced DIMENSIONS clause and the appearance of
hierarchical predicates in the WHERE clause. As far as the dimension variables of the
DIMENSIONS clause are concerned, "they should range over the set of nodes in the hierarchy
associated with the dimension" [JLS99], i.e., over all heterogeneous tuples of a hierarchy
instance. The result of an SQL(H) query is a table according to the schema imposed by
the sets domExpList and aggList of the SELECT clause.
The semantics of an SQL(H) query is given in the original paper [JLS99]. We present
the semantics more formally in a following section, where we discuss the semantics of
ESQL(H).
5.4 Limitations of the model
The model we just described offers the advantages listed below.
• Adds semantics of hierarchies to the data model and the query language:
  - gives first-class status to the hierarchies by:
    * permitting heterogeneity in dimensions, and
    * introducing hierarchical domains (trees) as first-class objects.
• Permits "dimension independence" of the queries. The DIMENSIONS clause allows
the definition of dimension variables and, furthermore, allows these variables to range
over the tuples of the dimension, without taking into account the schema of each
table. Therefore, the evaluation of an SQL(H) query is the same no matter what
the schema of the dimension tables it refers to is.
• Allows the fast evaluation of SQL(H) hierarchical queries, based on bitmap indices.
However, there exist limitations that are particularly relevant to reverse engineering
data. As we already mentioned, the concept lattice of Figure 4.6 and the hierarchies that
exist in it cannot be represented by the SQL(H) model. The basic restriction is that the
hierarchical domain must be a tree. Another point is that if some of the hierarchies in the
lattice change in time, those changes might be difficult to capture. The following list gives
the two basic limitations of the model as well as the intuition behind its extension.
• Hierarchical attributes should conform to a domain that has the structure of a tree.
The example of the concept lattice of Figure 4.6 shows why such a domain becomes
inappropriate for mining and reverse engineering applications.
• Each level inside a hierarchy must be modeled as a separate set of tables. This implies
that changes in dimension values (e.g., changes in the number of levels) may lead to
schema changes.
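The tree restriction can be made concrete with a small check. The Python sketch below is only an illustration: the child-to-parents maps `TREE` and `LATTICE` are hypothetical and are not the lattice of Figure 4.6. It tests whether a hierarchy has a single root and whether every other element reaches it through exactly one parent at each step. A concept lattice typically fails this test, because a concept may sit directly below several others.

```python
def is_tree(parents):
    """Return True if the child -> parents map describes a tree.

    `parents` maps every element to the set of its immediate parents;
    every element (including the root) is assumed to appear as a key.
    """
    roots = [node for node, ps in parents.items() if not ps]
    if len(roots) != 1:
        return False
    for node in parents:
        seen = set()
        current = node
        # Walk upwards; each step must have exactly one parent and no cycle.
        while parents[current]:
            if len(parents[current]) != 1 or current in seen:
                return False
            seen.add(current)
            (current,) = parents[current]
    return True


# Hypothetical location hierarchy: every element has a single parent.
TREE = {"All": set(), "Canada": {"All"}, "Toronto": {"Canada"}, "Ottawa": {"Canada"}}

# Hypothetical fragment of a concept lattice: C2 has two parents.
LATTICE = {"top": set(), "C5": {"top"}, "C6": {"top"}, "C2": {"C5", "C6"}}

print(is_tree(TREE))
print(is_tree(LATTICE))
```

The second call reports that the lattice fragment is not a tree, which is the situation the first limitation describes.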
5.5 The ESQL(H) model
To overcome the limitations listed in the previous section we need to provide an extended
model that fits our needs. The key point for this model is to be more general than the
SQL(H) model. The notation used in the following sections is the same as in the paper on
SQL(H). Definitions that are unchanged are marked as such. Briefly, our model
allows:
• A more general structure for the hierarchical domain, and
• Levels to straddle tables, so that any arbitrary table may contain values from many
levels.
Definition 4 [Hierarchical Domain]
A hierarchical domain is a partially ordered set $\langle V_H, \leq \rangle$ where $V_H$ is a non-empty set of
attributes and $\leq$ a binary relation which is reflexive, antisymmetric and transitive.
The following hold:
1. The only predicates defined on this domain are: =, <, <=, <<, <<=. (<= is the same
as $\leq$ in the above definition).
2. The equality predicate = has the standard interpretation of syntactic identity.
3. The predicate < is interpreted as a binary relation over $V_H$ such that for every
$x, y \in V_H$, $x < y \Leftrightarrow x \leq y \wedge x \neq y$. The graph $G_<$ over the nodes of $V_H$ can be depicted as
a Hasse diagram [TM75]. Such a diagram is an undirected graph where all edges are
considered as arrows from bottom to top, i.e., smaller elements are placed lower.
4. The predicate << is interpreted as the transitive closure of <.
5. For any two elements $u, v \in V_H$, u <= v holds iff either u < v or u = v. Respectively
for u <<= v.
Figure 5.2: The hierarchical domain for concept ids (Cids) of Figure 4.6.
The partial order of concept ids for the example of Figure 4.6 is given in Figure 5.2.
Intuitively, $V_H$ is an abstract data type that corresponds to hierarchies where predicate
< refers to child-parent relationships and << to (proper) ancestor-descendant ones.
Whenever an attribute A conforms to a domain which is hierarchical, we call A a
hierarchical attribute and denote it by A_h.
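The four ordering predicates can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it follows the intuitive reading above, in which < is the child-parent edge of the Hasse diagram and << is its transitive closure. The `PARENTS` map and the function names are hypothetical and do not reproduce the lattice of Figure 5.2.

```python
# Hypothetical Hasse diagram: each element maps to the set of its parents.
PARENTS = {
    "C1": {"C5"},
    "C2": {"C5", "C6"},  # two parents, so the domain is not a tree
    "C3": {"C6"},
    "C5": {"top"},
    "C6": {"top"},
    "top": set(),
}


def child_of(x, y):
    """x < y: y is an immediate parent of x."""
    return y in PARENTS[x]


def descendant_of(x, y):
    """x << y: y is reachable from x by one or more parent steps."""
    stack, seen = list(PARENTS[x]), set()
    while stack:
        node = stack.pop()
        if node == y:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return False


def child_or_equal(x, y):
    """x <= y: x < y or x = y."""
    return x == y or child_of(x, y)


def descendant_or_equal(x, y):
    """x <<= y: x << y or x = y."""
    return x == y or descendant_of(x, y)


print(child_of("C2", "C6"), child_of("C2", "top"), descendant_of("C2", "top"))
```

Here C2 is a child of C6 and a proper descendant of top, but not a child of top, which is the distinction between < and << that the hierarchical predicates rely on.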
Definition 5 [Hierarchical Schema]
A hierarchy schema is a triple $H = (G, A, \sigma)$ such that:
(i) G is a collection of nodes of any structure, having a special node All;
(ii) A is an attribute set that contains a unique hierarchical attribute $A_h$; and
(iii) $\sigma : G \rightarrow 2^A$ is a function that associates a node $u \in G$ with a set of attributes
$\sigma(u) \subseteq A$, such that $\forall u \neq All$, $A_h \in \sigma(u)$, and $\sigma(All) = \emptyset$.
All nodes of G, except All, should include the hierarchical attribute $A_h$ in their attribute list.
Imagine that we have a dimension called Concepts and a hierarchical attribute A_h =
Cid that corresponds to Figure 5.4. If the attribute set of the hierarchy is {Cid, Objects,
Attributes}, then this attribute set can be associated with exactly one node of the
hierarchy. Hence, we shall have nodes: {Cid, Objects, Attributes} and { } for the node
All.
Definition 6 [Hierarchy Instance]
A hierarchy instance corresponding to a hierarchy schema $H = (G, A, \sigma)$ is a collection of
tables $\mathcal{H}$ that satisfy the following: each table $r \in \mathcal{H}$ corresponds to a unique node $u \in G$
(except for node All), and r is a table over $\sigma(u)$.
Note that we do not restrict the nodes to form a DAG and we permit the straddling
of tables through the levels of the hierarchy.
Definition 7 [Dimension] [JLS99]
A dimension schema D(H) is a name D together with a hierarchy schema $H = (G, A, \sigma)$.
We refer to attributes A as the attribute set associated with dimension D.
A dimension instance $D(\mathcal{H})$ over a dimension schema D(H) is a dimension name D
with a hierarchy instance $\mathcal{H}$ of H.
Definition 8 [Data Warehouse Schema] [JLS99]
A Data Warehouse Schema in the ESQL(H) model is defined as a set of dimension schemas
$D_i(H_i)$, with associated hierarchical attributes $A_h^i$, $1 \leq i \leq k$, together with a set of fact
table schemas of the form $f(A_h^{i_1}, \ldots, A_h^{i_n}, B_1, \ldots, B_m)$, where $D_{i_1}, \ldots, D_{i_n}$ are a subset of
the dimensions $D_1, \ldots, D_k$, and $B_j$, $1 \leq j \leq m$, are additional attributes, including any
measure attributes.
A Data Warehouse, i.e., a fact table and a dimension for a concept analysis framework,
is depicted in Figure 5.3. Note that to store the Objects and Attributes columns of the
tables appearing in that figure, we are taking advantage of the object-oriented features of
SQL3 [Ram97], which permits set-valued attributes.
Figure 5.3: An example Data Warehouse: the dimension "Concepts" (columns Cid, Objects, Attributes) and the fact table "Basic-Concepts".
5.6 The ESQL(H) query language
Before giving some example queries on the data model we just described, we will try to
analyze the semantics of the ESQL(H) query language. The language does not differ
from the SQL(H) query language as far as syntax is concerned. The difference is
that when we try to evaluate each query, we have to take into account the new, more general
hierarchical domain and the arbitrary number of tables that make up each dimension.
A uniform ESQL(H) query is defined in the same way as in the SQL(H) model [JLS99].
In general, a uniform ESQL(H) query Q is a function:
$Q : \mathcal{D} \rightarrow \mathcal{R}$
where $\mathcal{D}$ is the set of databases, and $\mathcal{R}$ a set of tables of the output, under the schema of
the attribute list that appears in the SELECT clause. Taking into account all clauses of a
uniform ESQL(H) query we have the following:
• SELECT clause: This clause enforces the schema of the output table. It is
interpreted as in standard SQL with the addition that it may contain hierarchical and
dimension attributes, i.e., attributes from the set $A$ of a hierarchy instance.
• DIMENSIONS clause: This clause permits the declaration of dimension names of
interest. All dimension variables range over the set of all (possibly non-homogeneous)
tuples of all tables associated with that dimension.
• FROM clause: This clause is interpreted exactly as in standard SQL, i.e., it takes
the cross product of all tables appearing in it, while all tuple variables declared in it
range over all (homogeneous) tuples of the fact tables they refer to.
• WHERE clause: To pin down the semantics of this clause, we should recall the
definition of an instantiation function [JLS99]. Considering all tuple and dimension
variables, the instantiation function maps them to appropriate tables of the data
warehouse. Now, the key issue is to properly evaluate each wherecond of the WHERE
clause of the ESQL(H) query, according to the type of relationship between the
operators and the operands. Thus:
  - if the wherecond involves attributes from the fact table and operands of the same
  type, the WHERE clause is evaluated exactly as in SQL. This means that all
  tuples satisfying the wherecond will appear in the result.
  - if the wherecond involves attributes from dimension tables which are compared
  to operands of the proper type based on a standard comparison operator, the
  query is again satisfied by all tuples in the dimension tables which stand in
  that relationship with the operand.
  - if the wherecond involves hierarchical attributes which are compared to operands
  of the proper type based on a hierarchical predicate, the query is satisfied by all
  tuples that are related to the operands according to the hierarchical relationship, i.e.,
  the hierarchical domain of the hierarchical attribute.
In all the above, all comparisons are performed through the mapping of the
instantiation function to the appropriate tuples. Taking the concatenation of all the results
(instantiations) from a query and restricting those tuples to the attributes that appear
in the SELECT clause, we have the final answer to the ESQL(H) query. If there is no
measure defined in the fact table of the data warehouse (as in our example), standard
relational union should be employed instead of concatenation.
• GROUP BY clause: It is interpreted exactly as in standard SQL.
• HAVING clause: It is interpreted exactly as in standard SQL.
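The interplay of the DIMENSIONS, WHERE and SELECT clauses can be illustrated with a minimal Python sketch. It is not the evaluation algorithm of [JLS99]. The dimension instance `CONCEPTS_DIMENSION`, its two table schemas, and the helper names are assumptions made for illustration. The sketch shows a dimension variable ranging over the tuples of all tables of a dimension, regardless of their schemas, and the final restriction to the SELECT attributes combined by relational union.

```python
from itertools import chain

# Hypothetical dimension instance made of two tables with different schemas.
CONCEPTS_DIMENSION = [
    [  # table over {Cid, Objects, Attributes}
        {"Cid": "C5", "Objects": {1, 2}, "Attributes": {"a", "b"}},
        {"Cid": "C6", "Objects": {2, 3}, "Attributes": {"b", "c"}},
    ],
    [  # table over {Cid, Objects}
        {"Cid": "top", "Objects": {1, 2, 3}},
    ],
]


def dimension_variable(dimension):
    """Range over all (possibly non-homogeneous) tuples of all tables."""
    return chain.from_iterable(dimension)


def select(dimension, attributes, condition):
    """Keep tuples satisfying `condition`, restricted to `attributes`."""
    result = []
    for row in dimension_variable(dimension):
        if condition(row):
            projected = tuple(row.get(name) for name in attributes)
            if projected not in result:  # relational union: no duplicates
                result.append(projected)
    return result


rows = select(CONCEPTS_DIMENSION, ["Cid"], lambda c: len(c["Objects"]) > 1)
print(rows)
```

Because the condition only mentions attributes that every table carries, the same query text works whatever the schema of each individual table, which is the "dimension independence" property discussed in Section 5.4.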
5.7 Sample queries
In order to show the simplicity and power of the ESQL(H) language, we give some example
queries and explain their semantics and their step by step computation.
5.7.1 Dimensional Selection
The following single block ESQL(H) query, Q1, captures the query "find concepts that
contain more than 3 attributes".
SELECT C.Cid
DIMENSIONS Concepts C
WHERE COUNT(C.Attributes) > 3
Here the approach is the same as in SQL(H). C will range over all tuples of the Concepts
table (here the tuples are homogeneous) and select those Cids that satisfy the condition of
the WHERE clause. The resulting table is the following:
5.7.2 Hierarchical Join/Aggregation
The following single block ESQL(H) query, Q2, captures the query "Find the objects of
each concept that contains over 2 objects".
SELECT C.Cid, C.Objects
DIMENSIONS Concepts C
FROM Basic-Concepts F
WHERE F.Cid <<= C.Cid
GROUP BY C.Cid
HAVING COUNT(C.Objects) > 2
In this case the WHERE clause contains the condition "F.Cid <<= C.Cid", which is of the
form "W.A_h θ_h opnd". Here, we do not have the hierarchy of the levels in the concept
lattice given by the tables of the dimension Concepts. Thus, we should use the attributes
of the hierarchical domain to get the θ_h-relation. Let's see how the query will be evaluated.
1. i(F)[Cid] ranges over all Cid attributes of the fact table Basic-Concepts, i.e., attributes {C1, C2, C3, C4, C7}.
For each of these attributes we compute the (reflexive) transitive closure relation tc,
and get:
Now i(opnd) ranges over all hierarchical attributes of table Concepts. Taking also
into account the HAVING clause condition, we have:
• Using tc(C1), the instantiation of θ_h-relatives of C1 is:
• Using tc(C2), the instantiation of θ_h-relatives of C2 is:
• Using tc(C3), the instantiation of θ_h-relatives of C3 is:
• Using tc(C4), the instantiation of θ_h-relatives of C4 is:
• Using tc(C7), the instantiation of θ_h-relatives of C7 is:
Cid    Objects
top    {1,2,3,4,5}
2. The final result for Q2 is the union of all the above tables:
5.7.3 Hierarchical Join
The following ESQL(H) query, Q3, captures the query "Find the immediate breakdown of
concepts with more than 2 objects".
SELECT C1.Cid AS Concept1, C2.Cid AS Concept2, C2.Objects C2Objects
DIMENSIONS Concepts C1, C2
FROM Basic-Concepts F
WHERE F.Cid <<= C1.Cid AND
      C2.Cid < C1.Cid AND
      C1.Cid IN ( SELECT C.Cid
                  DIMENSIONS Concepts C
                  FROM Basic-Concepts F
                  WHERE F.Cid <<= C.Cid
                  GROUP BY C.Cid
                  HAVING COUNT(C.Objects) > 2 )
GROUP BY C1.Cid, C2.Cid
Taking into consideration the result of query Q2, it is easy to infer what the result of Q3
will be: for all tuples in the result of Q2, give the Cid of each immediate child. The result is
given in the following table:
Concept1    Concept2       C2Objects
C9          [illegible]    {1,2}
C9          [illegible]    {2,3}
C10         [illegible]    {1,2}
C10         [illegible]    [illegible]
C12         C6             [illegible]
C12         C7             [illegible]
top         C9             {1,2,3}
top         C10            {1,2,4}
top         C11            [illegible]
top         C12            {2,3,5}
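The evaluation steps of Q2 and Q3 can be traced with a short Python sketch. The lattice (`PARENTS`), the object sets (`OBJECTS`) and the fact table column (`FACT_CIDS`) below are hypothetical and much smaller than the example of Figure 4.6, and the functions `tc`, `q2` and `q3` are illustrative names. The sketch only mirrors the steps described above: compute the reflexive transitive closure of each fact Cid, keep the related concepts that pass the HAVING condition, take the union, and then list the immediate children of each surviving concept.

```python
# Hypothetical Hasse diagram (child -> parents) and object sets per concept.
PARENTS = {
    "C1": {"C5"},
    "C2": {"C5", "C6"},
    "C3": {"C6"},
    "C5": {"top"},
    "C6": {"top"},
    "top": set(),
}
OBJECTS = {
    "C1": {1}, "C2": {2}, "C3": {3},
    "C5": {1, 2}, "C6": {2, 3}, "top": {1, 2, 3},
}
FACT_CIDS = ["C1", "C2", "C3"]  # Cid column of a hypothetical fact table


def tc(cid):
    """Reflexive transitive closure: cid together with all its ancestors."""
    closure, stack = {cid}, [cid]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in closure:
                closure.add(parent)
                stack.append(parent)
    return closure


def q2(min_objects=2):
    """Concepts C with F.Cid <<= C.Cid for some fact row F and
    more than `min_objects` objects (union over all fact rows)."""
    candidates = set().union(*(tc(cid) for cid in FACT_CIDS))
    return {c: OBJECTS[c] for c in sorted(candidates)
            if len(OBJECTS[c]) > min_objects}


def q3(min_objects=2):
    """Immediate children (C2 < C1) of every concept returned by q2."""
    selected = q2(min_objects)
    return sorted(
        (parent, child, tuple(sorted(OBJECTS[child])))
        for child in PARENTS
        for parent in PARENTS[child]
        if parent in selected
    )


print(q2())
print(q3())
```

Lowering the threshold, for example `q3(1)`, also lists the children of C5 and C6, with C2 appearing under both parents. A tree-shaped hierarchical domain could not represent that case.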
Chapter 6
Conclusions
In this work, we studied ways of putting Reverse Engineering and Data Warehousing
techniques together. Software reverse engineering techniques try to capture the structure of,
usually, undocumented systems so that their understanding and maintenance become
easier. On the other hand, Data Warehousing, and specifically On-Line Analytical Processing
systems, provide the appropriate means to pose complex, ad hoc queries on information
extracted by reverse engineering tools.
We first investigated how several graph-theoretical algorithms can be used in order to
analyze and partition graph structures that are extracted from reverse engineering tools,
such as Rigi and The Software Bookshelf. Most of these algorithms proved to be
inefficient to implement due to time and space constraints and the nature of the graphs that
appear in the results. Most important is the fact that those algorithms do not reveal any
hierarchical structure of the underlying system. In the following chapters, we described how
On-Line Analytical Processing systems handle situations where hierarchies exist. A large
number of researchers have been involved in the study of such systems so as to make their
modeling and querying easier for the naive user. These systems are basically employed by
decision makers who search for trends and future estimates about their company's critical
parameters. To the best of our knowledge, OLAP systems have never been comprehensively
studied and employed in the field of reverse engineering.
This thesis presented a new multidimensional model for hierarchical clustering and
concept analysis algorithms. Both types of algorithms are often used by software engineers
and their results yield interesting observations about the systems under consideration.
However, software engineers have never been able to store these results in a natural and easy to use manner.
Our model is the basis for optimal storage and a natural way of querying this data.
Furthermore, we extended the work by Jagadish, Lakshmanan and Srivastava [JLS99], in order to
give a more general multidimensional model which provides first-class status to dimensions.
The basic intuition is that the algorithms mentioned above may give different results
under different parameters, or given different versions of the same program. The extensions
comprise:
• A more general structure for the hierarchical domain of a certain type of attributes,
called hierarchical attributes; and
• A refined definition of the notion of levels in this model, so that tuples may appear in
any table of a hierarchy.
Therefore, the hierarchy of levels can be extracted from the hierarchical domain of the
hierarchical attributes and, if new levels appear in the conceptual level, the hierarchy schema
does not need to be changed.
The work presented in this thesis can be extended in several ways. We focus on the
evaluation of complex OLAP queries posed over the ESQL(H) model. In [JLS99], a new
algorithm based on bitmap indices is given in order to compute queries that include the
<<= and = hierarchical predicates. This algorithm does not need to be further extended
for ESQL(H) queries because it is based on a preorder traversal of the hierarchical domain.
In ESQL(H) the hierarchical domain is a partial order where such a traversal can be defined.
However, we need to consider algorithms for evaluating queries including the < hierarchical
predicate. Bitmap indices could help and, moreover, they can provide the appropriate
background for the faster evaluation of queries that entail COUNT and SUM aggregate
functions in their SELECT or HAVING clauses.
Bibliography
[AGS97] Rakesh Agrawal, A. Gupta, and Sunita Sarawagi. Modeling Multidimensional
Databases. In Alex Gray and Per-Åke Larson, editors, Proc. of the 13th Int'l
Conf. on Data Engineering (ICDE), pages 232-243. IEEE Press, 7-11 April
1997.
[AP98] Periklis Andritsos and Athanassia Papagianni. On the development of a tool
that supports OLAP queries. Diploma thesis, Dept. of Electrical and Computer
Engineering, National Technical University of Athens, 1998.
[BHB99] Ivan T. Bowman, Richard C. Holt, and Neil V. Brewster. Linux as a Case
Study: Its Extracted Software Architecture. In Proc. of the 21st Int'l Conf.
on Software Engineering, pages 555-563, Los Angeles, CA, USA, May 1999.
ACM Press.
[Bir40] Garrett Birkhoff. Lattice Theory. AMS Colloquium Publications, 25. AMS, New
York, 1940.
[CD97] S. Chaudhuri and U. Dayal. An overview of Data Warehousing and OLAP
technology. SIGMOD Record, 26(1):65-74, March 1997.
[CFKW95] Yih-Farn Chen, Glenn S. Fowler, Eleftherios Koutsofios, and Ryan S. Wallach.
Ciao: A Graphical Navigator for Software and Document Repositories. In IEEE
Proc. of the Int'l Conf. on Software Maintenance, pages 66-75, Nice, France,
October 1995.
[CLR92] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms.
MIT Press and McGraw-Hill Book Company, 6th edition, 1992.
[Cou] OLAP Council. OLAP Council's White Paper. In http://www.olapcouncil.org/.
[CT97] Luca Cabibbo and Riccardo Torlone. Querying Multidimensional Databases. In
Proc. of the 6th Int'l Workshop on Database Programming Languages (DBPL),
pages 319-335, Estes Park, Colorado, 18-20 August 1997.
[CW78] D. G. Corneil and M. E. Woodward. A comparison and evaluation of graph
theoretical clustering techniques. INFOR, 16(1):74-89, February 1978.
[DMM99] D. Doval, S. Mancoridis, and B. S. Mitchell. Automatic clustering of software
systems using a genetic algorithm. In Proc. of the Int'l Conf. on Software Tools
and Engineering Practice, Pittsburgh, PA, August 1999.
[FHK+97] P. J. Finnigan, R. C. Holt, I. Kalas, S. Kerr, K. Kontogiannis, H. A. Müller,
J. Mylopoulos, S. G. Perelgut, M. Stanley, and K. Wong. The Software
Bookshelf. IBM Systems Journal, 36(4):564-593, 1997.
[Fir98] Joseph M. Firestone. Dimensional Modeling and E-R Modeling In The Data
Warehouse. Executive Information Systems Inc., White Paper 8, June 1998.
[GBLP96] Jim Gray, Adam Bosworth, Andrew Layman, and Hamid Pirahesh. Data Cube:
A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and
Sub-Totals. Technical Report MSR-TR-95-22, Microsoft Research, Advanced
Technology Division, Redmond, WA 98052, U.S.A., 15 November 1995.
[GMA95] Robert Godin, Rokia Missaoui, and Hassan Alaoui. Incremental concept
formation algorithms based on Galois (concept) lattices. Computational Intelligence,
11(2):246-267, November 1995.
[IH94] W. H. Inmon and R. D. Hackathorn. Using the Data Warehouse. John Wiley
& Sons, Inc., 1994.
[JD88] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall,
Englewood Cliffs, NJ, 1988.
[JLS99] H. V. Jagadish, Laks V. S. Lakshmanan, and Divesh Srivastava. What can
Hierarchies do for Data Warehouses? In Proc. of the 25th Int'l Conf. on Very Large
Data Bases (VLDB), pages 530-541, Edinburgh, Scotland, UK, 7-10 September
1999.
[KC97] Rick Kazman and S. Jeromy Carrière. Playing Detective: Reconstructing
Software Architecture from Available Evidence. Technical Report CMU/SEI-97-
TR-010, Software Engineering Institute, Carnegie Mellon University, Pittsburgh,
PA 15213, October 1997.
[KL70] B. W. Kernighan and S. Lin. An efficient heuristic for partitioning graphs. Bell
Systems Technical J., 49:291-307, 1970.
[KSRP99] Rudolf K. Keller, Reinhard Schauer, Sébastien Robitaille, and Patrick Pagé.
Pattern-Based Reverse-Engineering of Design Components. In Proc. of the
21st Int'l Conf. on Software Engineering, pages 226-235, Los Angeles, CA,
USA, May 1999. ACM Press.
[LS97] C. Lindig and G. Snelting. Assessing modular structure of legacy code based
on mathematical concept analysis. In Proc. of the Int'l Conf. on Software
Engineering, Boston, MA, 17-23 May 1997. IEEE Computer Society Press.
[MG99] Renée J. Miller and Ashish Gujarathi. Mining for Program Structure. Int'l
Journal on Software Engineering and Knowledge Discovery, ?(?):???-???, 1999.
[MM65] J. W. Moon and L. Moser. On cliques in graphs. Israel Journal of Mathematics,
3:23-28, 1965.
[MMCG99] S. Mancoridis, B. S. Mitchell, Y. Chen, and E. R. Gansner. Bunch: A clustering
tool for the recovery and maintenance of software system structures. In Proc.
of the Int'l Conf. on Software Maintenance, pages 50-59, Oxford, UK, August
1999. IEEE Computer Society Press.
[MMR+98] S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen, and E. R. Gansner. Using
automatic clustering to produce high-level system organizations of source code.
In Proc. of the Int'l Workshop on Program Understanding, Ischia, Italy, June
1998.
[MOTU93] Hausi A. Müller, Mehmet A. Orgun, Scott R. Tilley, and James S. Uhl. A
reverse engineering approach to subsystem structure identification. Software
Maintenance: Research and Practice, 5(4):181-204, December 1993.
[MU90] Hausi A. Müller and James S. Uhl. Composing Subsystem Structures using
(k,2)-partite Graphs. Technical Report DCS-128-IR, Department of Computer
Science, University of Victoria, March 1990.
[MWT94] Hausi A. Müller, Kenny Wong, and Scott R. Tilley. Understanding Software
Systems Using Reverse Engineering Technology. In Proc. of the 62nd Congress
of L'Association Canadienne Française pour l'Avancement des Sciences
(ACFAS), pages 41-45, Montreal, PQ, 16-17 May 1994.
[PJ99] Torben Bach Pedersen and Christian S. Jensen. Multidimensional Data
Modeling for Complex Data. In Proc. of the 15th Int'l Conference on Data
Engineering (ICDE), pages 336-345, 23-26 March 1999.
[Ram97] Raghu Ramakrishnan. Database Management Systems. McGraw-Hill, 1997.
[Ski99] Steven S. Skiena. The Algorithm Design Manual. Springer-Verlag, Berlin,
Germany / Heidelberg, Germany / London, UK / etc., 1999.
[SR97] Michael Siff and Thomas Reps. Identifying Modules via Concept Analysis. In
Proc. of the Int'l Conf. on Software Maintenance, pages 170-179, Bari, Italy,
September 1997.
[ST98] Gregor Snelting and Frank Tip. Reengineering class hierarchies using concept
analysis. ACM SIGSOFT Software Engineering Notes, 23(6):99-110, November
1998. Proc. of the Int'l Symposium on the Foundations of Software Engineering.
[Til92] Scott R. Tilley. Management Decision Support Through Reverse Engineering
Technology. In Proc. of CASCON '92, pages 319-328, 9-11 November 1992.
[Til98] Scott Tilley. A Reverse-Engineering Environment Framework. Technical
Report CMU/SEI-98-TR-005, Software Engineering Institute, Carnegie Mellon
University, Pittsburgh, PA 15213, April 1998.
[TM75] J. P. Tremblay and R. Manohar. Discrete Mathematical Structures with
Applications to Computer Science. McGraw-Hill, New York, 1975.
[Vas98] Panos Vassiliadis. Modeling Multidimensional Databases, Cubes and Cube
Operations. In Proc. of the 10th SSDBM Conf., Capri, Italy, July 1998.
[vDK99] Arie van Deursen and Tobias Kuipers. Identifying Objects using Cluster and
Concept Analysis. In Proc. of the Int'l Conference on Software Engineering,
pages 246-255, Los Angeles, CA, 16-22 May 1999.
[Wes96] Douglas B. West. Introduction to Graph Theory. Prentice-Hall, 1996.
[Wig97] T. A. Wiggerts. Using Clustering Algorithms in Legacy Systems
Remodularization. In Proc. of the 4th Working Conference on Reverse Engineering, pages
24-32, Amsterdam, Netherlands, 6-8 October 1997.
[WTMS94] Kenny Wong, Scott R. Tilley, Hausi A. Müller, and Margaret-Anne D. Storey.
Structural Redocumentation: A Case Study. IEEE Software, 12(1):46-54,
January 1994.
[YHC97] Alexander S. Yeh, David R. Harris, and Melissa P. Chase. Manipulating
Recovered Software Architecture Views. In Proc. of the 19th Int'l Conf. on Software
Engineering, pages 184-194, Boston, Massachusetts, USA, May 1997. Springer.