
Diploma Thesis, October 3, 2006

A Visual Evolution Explorer

Visualize a Release History Database

Daniel Zuberbuhler of Winterthur, Switzerland (01-712-488)

supervised by

Harald Gall
Martin Pinzger

Department of Informatics software evolution & architecture lab


Author: Daniel Zuberbuhler

Project period: 3. April 2006 - 3. October 2006

Software Evolution & Architecture Lab, Department of Informatics, University of Zurich

Acknowledgements

I would like to thank all the people who have supported me and have contributed directly or indirectly to the success of this thesis.

Special thanks go to Professor Harald Gall for giving me the chance to write this diploma thesis, and to Christoph Bommer, who kindly supported the thesis with the resources and data of Siemens.

Many thanks also to Martin Pinzger for his advice and encouragement and for the valuable feedback on the draft of this document.

Many thanks to my parents for proofreading and for their continuous support during my studies.

Abstract

In this thesis we present the design and implementation of a plugin for the Eclipse platform that provides facilities for visualizing the history of a software project. The plugin integrates a frontend for a Release History Database (RHDB) into the Eclipse IDE and provides interfaces for data importers and extension points for visualization modules. The plugin includes a tree map module suitable for visualizing hierarchical data such as the metrics of files contained in a software release.

Furthermore, we describe the implementation of data importers for the versioning and issue tracking systems used in the development of the case study system, a railway automation system. We populate the RHDB with the data from the case study system and evaluate the effectiveness of the plugin with visualizations based on this data.

Zusammenfassung

This diploma thesis presents the design and implementation of a plugin for the Eclipse platform for visualizing the history of a software project. The plugin integrates a frontend for a Release History Database (RHDB) into the Eclipse development environment and offers interfaces for data importers and extension points for visualization modules. The plugin contains a tree map module suited to visualizing hierarchical data, such as the metrics of files contained in a software release.

Furthermore, we describe the implementation of data importers for the versioning and issue tracking systems used in the development of our case study project, a railway automation system. We populate the RHDB with the data of the case study project and evaluate the capability of the plugin on the basis of this data.

Contents

1 Introduction
    1.1 Scope of the work
    1.2 Outline

2 Related Work
    2.1 Release History Database
    2.2 Visualization

3 Background
    3.1 Data Sources
        3.1.1 Versioning
        3.1.2 Issue Tracking
        3.1.3 Linking the Versioning and Issue Tracking Models

4 Approach
    4.1 Introduction
    4.2 RHDB Data Model
        4.2.1 Versioning Model
        4.2.2 Issue Tracking Model
        4.2.3 Combined Model
        4.2.4 Metrics Model
    4.3 Data Preparation
        4.3.1 CMS
        4.3.2 Tracker & Clearquest
        4.3.3 Metric Calculation
    4.4 Visualization
        4.4.1 Data Interface
        4.4.2 Treemap Visualization

5 Implementation
    5.1 Used Technologies
        5.1.1 Eclipse Platform
        5.1.2 Hibernate
        5.1.3 Prefuse
    5.2 Architecture
        5.2.1 Extension Points
    5.3 EvoBrowser Plugin
        5.3.1 RHDB Model Implementation
        5.3.2 Database Connectivity
        5.3.3 User Interface
    5.4 Data Importer for the Case Study Project
        5.4.1 CMS Importer
        5.4.2 Tracker & Clearquest Importer
    5.5 Tree Map Visualization
    5.6 Summary

6 Evaluation
    6.1 Case Study Project
    6.2 Tree Map Visualizations
        6.2.1 System Growth
        6.2.2 Modifications
        6.2.3 Developers
    6.3 Summary and Discussion

7 Conclusion
    7.1 Future Work

List of Figures

3.1 CMS Data Model
3.2 Base Clearcase Data Model
3.3 Clearquest Data Model
3.4 Link CMS – Clearquest – Clearcase
4.1 EvoBrowser Architecture Overview
4.2 Versioning Data Model
4.3 Issue Tracking Data Model
4.4 Combined Versioning & Issue Tracking Data Model
4.5 Metrics Data Model
4.6 Releasemetric Model Example
4.7 CMS Model – EvoBrowser Model Mapping
4.8 Treemap Concepts
4.9 Sample Treemap Visualization
5.1 Release selector GUI
5.2 Visualization GUI
6.1 Visualization: Project growth
6.2 Visualization: Revisions total / Revisions since last Release
6.3 Visualization: Revisions – Authors

List of Tables

4.1 Granularity & Structure of Metric Values
4.2 Relation of Multiple Metric Focus Objects
4.3 Data Interface for Metrics and Visualization Modules

List of Listings

3.1 Format of the Tracker dump file.
4.1 IEvoBrowserVisualization interface
5.1 hibernateMapping extension point schema
5.2 visualization extension point schema
5.3 Hibernate Configuration for stand-alone data importers
5.4 Excerpt of a CMS history log file
5.5 Excerpt of a Tracker export file
5.6 Excerpt of a Clearquest export file

Chapter 1

Introduction

Over the past two decades the level of awareness about the problems involved in maintaining and evolving large software systems has increased. Software aging as an inherent problem was largely disregarded for a long time. Many large systems have since grown old. The effects and the costs of aging software have become evident and cannot be ignored anymore. As a consequence, more resources are put into research in the area of software evolution and maintenance.

In [Leh80] Lehman discusses several laws of software evolution and the life cycle of programs. He argues that several quality aspects such as complexity deteriorate over time unless work is done to maintain or improve them. Parnas identifies several causes for software aging [Par94] and describes ways to prevent or slow down the aging. He also proposes techniques to rejuvenate aged software, such as incremental modularisation or restructuring.

Maintaining a good understanding of the workings and the weaknesses of legacy software systems is not an easy task. But this in-depth knowledge is important for identifying modules in need of restructuring.

In the last few years research has been done on analyzing the history of software projects in the hope that information about the evolution of a system can significantly help with the understanding of its state, and consequently with maintenance decisions.

History data can be extracted from software repositories, such as versioning systems and bug trackers. This data is then stored in a release history database (RHDB). Based on the data, various metrics can be calculated. Through the analysis of these metrics new conclusions about different aspects of the system can be drawn, such as the stability of elements or the logical coupling between elements.

The number of metric values quickly grows to enormous amounts, so filtering techniques have to be applied to generate useful reports. An additional issue is that many facts only become obvious when different metrics are correlated with each other.

Visual perception is especially well suited for promoting the clarity of facts and their interrelations. Therefore the use of graphical representations is a good way to achieve filtering and the combination of different metrics at the same time. A cleverly chosen visualisation type applied to the right metrics can make complex relationships intuitively accessible.

1.1 Scope of the work

The goal of this thesis is to implement a plugin, embedded into the Eclipse platform, that provides means to visualize and analyze the historical data of large legacy systems. The visualisation needs to be configurable so that different aspects of the system can be examined. The plugin has to be designed in a modular fashion to enable the integration of additional data importers and customized, pluggable visualization modules.

The major part of the thesis consisted of integrating a suitable release history database, designing and implementing a visualisation framework, and implementing a base set of visualisation modules.

To demonstrate the capability of our design, we implemented importers for a case study project and populated the RHDB with its history data. Based on this data we present several visualizations highlighting various aspects of the case study system.

1.2 Outline

We start by introducing related work in Chapter 2. In Chapter 3 we describe our data sources, because an idea of their character is needed to understand the heuristics of the data import. Chapter 4 presents our approach and the design of the plugin, while Chapter 5 explains its implementation. In Chapter 6 we demonstrate several visualizations and discuss the results. We complete the thesis with a conclusion and a proposal for future work in Chapter 7.

Chapter 2

Related Work

The principal task of the system developed during this thesis is to visualize the data of a Release History Database. Hence this thesis draws mainly from the two domains of “Release History Databases” and “Software Visualization”. In the following sections we highlight for each domain several related works that we used as inspiration or from which we borrowed concepts.

2.1 Release History Database

Hismo: Modeling History as a First Class Entity

In the article [GD06] Gîrba and Ducasse introduce the concept of explicitly including the notion of History in a meta-model. They identify several requirements for a general evolution meta-model: (1) different detail and abstraction levels, (2) comparison of property evolutions, (3) combination of different property evolutions, (4) selectability, and (5) navigation.

Based on their findings they developed Hismo, an evolution meta-model that includes history as a first class entity. Hismo can be used to build a historical meta-model on top of any snapshot model.

They presented an implementation of Hismo based on the FAMIX model [TDD00].

Populating a Release History Database from Version Control and Bug Tracking Systems

In [FPG03] Fischer et al. introduce the notion of a Release History Database. They realized that version control and bug tracking systems contain large amounts of historical information that can give deep insight into the evolution of a software project. Unfortunately, these systems provide insufficient support for a detailed analysis of software evolution aspects. They introduce an approach for populating a release history database that combines version data with bug tracking data and adds missing information not covered by version control systems, such as merge points. The database allows simple queries to be applied to the structured data to obtain meaningful views showing the evolution of a software project.

They demonstrated that joining the modification report information with the bug report database is useful in several ways. For example, it allows the detection of logically coupled files based on the bugs they have in common, the identification of error-prone classes with affected components or products, and the estimation of code maturity with respect to the probability of remaining bugs and the discovery rate of bugs in earlier releases of the system.


Facilitating software evolution research with Kenyon

In [BEJWKG05] Bevan et al. introduce Kenyon, a framework for facilitating software evolution research. Kenyon presents a meta model approach to release history related to common versioning systems such as CVS or SVN, and it endorses the notion of history in software systems as a set of related “facts”. It provides solutions for the efficient analysis of archived project artifacts, such as those found in source code repositories and bug tracking systems. The results of this resource-intensive “fact extraction” phase are stored efficiently for later use by more experimental types of research tasks, such as algorithm or model refinement.

The aim of Kenyon is to reduce the start-up time associated with software evolution research by providing a framework in which new analysis methods can use any supported source code management system and any supported data type.

Developing a Meta Model for Release History Systems

In his thesis [Mar06] Dane Marjanovic developed a release history meta model. As a base he took different models of the history concept and derived a meta model that describes all of them and incorporates the semantics needed to describe other versioning data models as well. Like Fischer et al. in [FPG03], he combined a versioning data model with an issue tracking data model. His aim was to keep the meta model independent of any specific versioning or issue tracking system.

2.2 Visualization

Tree visualization with tree-maps: 2-d space-filling approach

In his paper [Shn92] Shneiderman proposed the use of tree-maps for visualizing hierarchical data structures. A tree-map layout can use the complete available space for rendering information, which is a significant improvement over the traditional approach of representing tree structures as a rooted, directed graph.

Tree-maps can be used to render trees with thousands of leaf nodes while still being comprehensible. Furthermore, a tree-map can indicate up to two attributes of the leaf nodes by their size and their color coding.
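The space-filling idea can be illustrated with a slice-and-dice layout: the rectangle of a node is divided among its children in proportion to their sizes, and the split direction alternates with every level of the tree. The following is only a minimal sketch under our own assumptions, not the layout code used by the plugin; the names `TreemapSketch`, `Node`, `Rect`, and `layout` are hypothetical, and color coding is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a slice-and-dice tree-map layout. All names are hypothetical.
class TreemapSketch {

    /** A tree node; only leaves carry their own size (e.g. a metric value). */
    static class Node {
        final String name;
        final double size;
        final List<Node> children = new ArrayList<>();

        Node(String name, double size) {
            this.name = name;
            this.size = size;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }

        /** Size of a leaf, or the summed size of all leaves below this node. */
        double total() {
            if (children.isEmpty()) {
                return size;
            }
            double sum = 0;
            for (Node child : children) {
                sum += child.total();
            }
            return sum;
        }
    }

    /** Axis-aligned rectangle: origin (x, y), width w, height h. */
    static class Rect {
        final double x, y, w, h;

        Rect(double x, double y, double w, double h) {
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }
    }

    /** Returns one rectangle per leaf, keyed by leaf name (names must be unique). */
    static Map<String, Rect> layout(Node root, Rect bounds) {
        Map<String, Rect> result = new LinkedHashMap<>();
        place(root, bounds, true, result);
        return result;
    }

    private static void place(Node node, Rect r, boolean splitWidth, Map<String, Rect> out) {
        if (node.children.isEmpty()) {
            out.put(node.name, r);
            return;
        }
        double total = node.total();
        double offset = 0;
        for (Node child : node.children) {
            // Each child gets a slice proportional to its share of the total size.
            double share = total == 0 ? 0 : child.total() / total;
            Rect slice = splitWidth
                    ? new Rect(r.x + offset, r.y, r.w * share, r.h)
                    : new Rect(r.x, r.y + offset, r.w, r.h * share);
            offset += splitWidth ? slice.w : slice.h;
            // Alternate the split direction on every level of the tree.
            place(child, slice, !splitWidth, out);
        }
    }
}
```

In this sketch the area of every leaf rectangle is proportional to the leaf's size, which is the property that lets a tree-map encode one attribute by area while a second attribute remains available for color.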

Visualizing feature evolution based on problem and modification report data

In [FG04] Fischer and Gall use a visualization to depict feature evolution by projecting problem report (PR) dependencies onto (a) feature-connected files and (b) the project directory structure of a software system. The two views show how PRs, features, and the directory tree structure relate. With their approach they are able to uncover hidden dependencies between features and to present them in an easy-to-assess visual form. They can show how the features have co-evolved over time and to which degree they are coupled to each other.

Their visualizations use an automatic layout algorithm to generate a graphical representation of the evolution analysis data, to be used by a software engineer as a diagnostic aid in detecting design erosion or architectural deterioration.

Fractal Figures: Visualizing Development Effort for CVS Entities

In [DLG05] D’Ambros et al. employ fractal figures to visualize the overall development effort at the file level and to illustrate the distribution of the effort among the developers. A fractal figure has similarities to a tree-map layout applied to non-hierarchical data, but additionally allows for gestalt impressions to categorize them.

Their visualizations make it possible to discover files with high development effort in terms of team size and the effort intensity of individual developers.

Visualizing Multiple Evolution Metrics

In their paper [PGFL05] Pinzger et al. present an approach that concentrates on providing integrated, condensed graphical views of the source code and release history data of multiple releases. To visualize metric values they use Kiviat diagrams in which the releases are reflected as “annual rings”.

The diagrams are used for different scenarios: a diagram can show the values of one metric in multiple modules, or it can show the values of multiple metrics in one module. Likewise the diagrams can be used to characterize the relations between modules. A visualization with multiple diagrams can be overlaid with connections between modules to enrich the view with further relation information.

Chapter 3

Background

Our institute is active in software evolution research. Several papers and theses have been published about the calculation of software metrics and the appropriate visualization of the metric values. In this thesis we applied some of the research results to analyze a large industrial software project. To make this possible, we worked together with a large company which develops software for the public transportation sector. They kindly gave us access to the RCS (Revision Control System) and issue tracker data of one of their software projects. Some details about this case study project are given in Section 6.1.

During the life of the case study system different tools have been in use as RCS and issue tracker. Within the scope of this thesis we had to focus on the analysis of one of the RCSes, but we took extensibility into account while designing the plugin so that other datasets can be included without modifying the core plugin.

In the following sections we give an overview of the different RCSes and issue trackers used in this case study project.

3.1 Data Sources

3.1.1 Versioning

Two systems are in use as RCS:

• HP’s Code Management System for OpenVMS (CMS) is the native RCS of the OpenVMS operating system, originally by DEC¹. It is an antiquated tool whose feature set is poor compared with current RCSes.

CMS is used to manage the Ada source files. This part of the system was originally designed to run under OpenVMS, and it would be too costly to change the RCS. During the life of the project, wrapper scripts have been implemented to circumvent the shortcomings and constraints of the CMS.

• IBM’s Rational Clearcase is a state-of-the-art versioning and configuration management system. It is feature-rich, but only a small subset is used in our case study project, namely the part called Base Clearcase. It is used to manage the new Java parts of the project.

¹ Digital Equipment Corporation, bought by Compaq and now part of Hewlett-Packard.

CMS Concepts

Figure 3.1 shows a simplified version of the CMS data model. Only objects and attributes which are relevant for our analysis are shown.

The following list describes the main concepts of the CMS. This information has been taken from [Hew05b] and [Hew05a]. Features that are not relevant to our analysis have been omitted.

Library The CMS stores all the information it needs in a Library. A Library is an OpenVMS directory containing specially formatted files. It serves as a container or repository for the CMS entities Element, Generation, Group, and Class. There are some further CMS entities, but they are of no interest for our analysis and we will not describe them further.

The Library directory cannot contain subdirectories but only files. This makes it impossible to structure the managed project files in a directory tree. A large project with thousands of files, such as our case study, becomes difficult to manage without filesystem-based structuring. Because of this restriction the case study project has been split up into multiple Libraries. The Libraries substitute for the directories in structuring the project. This solves the structuring problem at the cost that the system is no longer globally versioned. This can lead to inconsistencies between the Libraries if a consistent state is not ensured by an external tool.

Element An Element in a CMS Library is a file and all of its versions. It is the basic structural unit. Every time a managed file is modified, a new version is generated. The author of the new version can choose whether to place it in the main branch of the Element or to create a new side branch, called a “Variant”.

Generation A Generation is one specific version of an Element. It has exactly one ancestor Generation (except for the first Generation, which has no ancestor) and can have multiple direct descendant Generations (the main branch and all variant branches).

Group A Group can be used to group multiple Elements together. This can be useful, for example, to check out all the files of a feature or module with one single command. Its usefulness is somewhat restricted because group operations always affect the latest Generation on the main branch of the grouped Elements; variant branches cannot be specified. Besides Elements, a Group can also contain other Groups.

Class A similar concept to the Group is the Class, but it contains Generations instead of Elements. A Class can hold only one Generation of a given Element. It can be used to mark particular configurations of Element Generations. This is used, for example, to mark all Generations which compose a release.

The CMS supports the branching concept only at the file level (Element/Generation); it does not have the capacity to branch the complete project. If a different development branch is needed, one has to keep track manually of which Generations belong to which branch.

In the CMS only files are versioned (Element/Generation), but not Groups and Classes. This implies a severe drawback: of a Group or Class only the present state is known; earlier states cannot be queried. This is especially important when we want to include in our analysis Groups that are used to group the files of modules together. The composition of such Groups will likely change over time. If we want to analyze the composition of a Group at a past time, it must be reconstructed with the help of the history log.

Exactly the same problem exists with Classes, but here it is somewhat less serious because the state of a Class is inherently much more stable, as it contains specific Generations of Elements. For example, the composition of a Class that marks the Generations of a release will hardly change over time, so its last known state is the only one we need to know.
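The relations between Element, Generation, and Class described above can be summarized in a small object model. The following sketch is our own illustration with hypothetical class names; it is neither CMS code nor the model used by the plugin. It omits Library and Group, and it simply lets a newly inserted Generation replace an earlier one of the same Element in a Class, which is a simplification of this sketch and not a statement about how CMS handles that case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of the CMS concepts; all class and method names are assumptions.
class CmsModelSketch {

    /** A file together with all of its versions. */
    static class Element {
        final String name;
        final List<Generation> generations = new ArrayList<>();

        Element(String name) {
            this.name = name;
        }

        /** Creates a new Generation; the ancestor is null only for the first one. */
        Generation newGeneration(String number, Generation ancestor) {
            Generation g = new Generation(this, number, ancestor);
            generations.add(g);
            return g;
        }
    }

    /** One specific version of an Element. */
    static class Generation {
        final Element element;
        final String number;
        final Generation ancestor; // exactly one, or null for the first Generation
        final List<Generation> descendants = new ArrayList<>(); // main branch and variants

        Generation(Element element, String number, Generation ancestor) {
            this.element = element;
            this.number = number;
            this.ancestor = ancestor;
            if (ancestor != null) {
                ancestor.descendants.add(this);
            }
        }
    }

    /** Marks a configuration: only one Generation per Element. */
    static class CmsClass {
        final String name;
        private final Map<Element, Generation> members = new HashMap<>();

        CmsClass(String name) {
            this.name = name;
        }

        /** Sketch simplification: a second Generation of an Element replaces the first. */
        void insert(Generation g) {
            members.put(g.element, g);
        }

        Generation generationOf(Element e) {
            return members.get(e);
        }

        int size() {
            return members.size();
        }
    }
}
```

Keying the members of a Class by Element makes the "one Generation per Element" rule a property of the data structure rather than a check that callers have to remember.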

[Figure 3.1: class diagram with the entities library (library_spec_id), element (element_id, remark_id), generation (generation_id, user_name_id, creation_time, remark_id), group (group_id, remark_id), class (class_id, remark_id), and review (review_user_id, review_remark_id)]

Figure 3.1: A simplified version of the CMS Data Model.

Clearcase Concepts

The following list describes the concepts of Base Clearcase. As an information source we used the book [Whi00]. Features that are not relevant to our analysis have been left out.

Object The design of the Clearcase system follows object-oriented principles. Every atomic element in a VOB which is under version control is called an Object.

VOB A Clearcase repository is called a Versioned Object Base or VOB. It can contain Folders and Files, derived Objects, and metadata of the Objects.

Element An Element is a file system Object: a file or a directory. The Element contains a set of versions which are organized in a version tree.

Branch The Branch Object specifies a linear sequence of Element versions.

Version A certain revision of an Element is represented by the Version Object.

Label A Label Object can be used to tag a Version of one or more Elements with a string value.

Attribute An Attribute is a metadata annotation of an Object. Every Object can have Attributes. The Attribute has the form of a name/value pair.

Hyperlink A Hyperlink is a logical, directed link between two Objects. It is composed of two string identifiers: a from and a to identifier. Hyperlinks are generated automatically for operations like merges (to mark the side branch participants of the merge). They can also be created manually for custom purposes.

Figure 3.2 shows a simplified version of the Base Clearcase data model. Only objects and attributes which are relevant for our analysis are shown.

3.1.2 Issue Tracking

Earlier, a tool called Tracker was used for issue tracking. It has been replaced with the much more powerful and adaptable Rational Clearquest.


[Figure 3.2: class diagram with the objects VOB (Comment), Element (Name, Path), File, Directory, Branch (Name), Version (VerNo), Label (Name), Attribute (Name, Value), and Hyperlink (From, To)]

Figure 3.2: A simplified version of the Base Clearcase Data Model.

Tracker

We had access to the Tracker dataset in the form of a database dump, which is a character-separated text file with quoted fields delimited by semicolons. We do not have detailed information about the Tracker data model, but the data from the dump file should be sufficient for our purpose.

Listing 3.1 shows the format of the Tracker dump file.

"Id";"Rapport Type";"Priority";"Frequency of occurrence";"Source/Category";"Reporter";"Security critical";"estim. Time";"estim. Valid. Time"

Listing 3.1: Format of the Tracker dump file.
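A line in this format can be split into its fields with a few lines of Java. The sketch below is only an illustration and not the importer described later in this thesis; the class and method names are hypothetical, and it assumes that every field is enclosed in double quotes and contains no embedded quote characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading one line of a Tracker dump file.
class TrackerLineSketch {

    // One double-quoted field. ASSUMPTION: fields never contain a quote character.
    private static final Pattern FIELD = Pattern.compile("\"([^\"]*)\"");

    /** Returns the field values of one line, without the surrounding quotes. */
    static List<String> splitFields(String line) {
        List<String> fields = new ArrayList<>();
        Matcher matcher = FIELD.matcher(line);
        while (matcher.find()) {
            fields.add(matcher.group(1));
        }
        return fields;
    }
}
```

Matching the quoted fields instead of splitting at the semicolons keeps the sketch tolerant of stray whitespace between a separator and the next field.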

Clearquest

Rational Clearquest is a flexible and adaptable issue tracking system. The data model of a Clearquest database can be designed according to one’s specific needs; there is no generally valid model.

Figure 3.3 depicts a simplified model of the schema used in our case study project. Clearquest implements an object-oriented concept. Different Issue types are designed as objects, and even inheritance is supported. In our model a supertype called Entity is defined. It contains the attributes that all Issue types have in common. Derived from Entity are the Issue types Bugreport (report of a defect in the system), Changerequest (request for changing an existing feature), and Featurerequest (request for introducing a new feature).

[Figure 3.3: class diagram with the supertype Entity (ID, State, Headline, Notes[], Links, Solution, Test) and the Issue types ChangeRequest (Priority, SafetyCritical, Description, ImpactAnalysis), FeatureRequest (Comment, ReleaseNotes), and TroubleReport (Priority, SafetyCritical, Analysis, ImpactAnalysis)]

Figure 3.3: The custom Clearquest data model as it is used in the case study project.

3.1.3 Linking the Versioning and Issue Tracking Models

Issue trackers and revision control systems are usually linked together to make the best use of the available information. With such a link it is possible to query the affected Elements and Revisions in the RCS for each Issue. Because multiple tools have been used during the life of the project, different means had to be employed to achieve this link:

• At the time when only CMS and Tracker were in use, the issue number was written in the remark field (similar to a commit message) of the Revisions that were made to solve the issue. This was done manually, so errors occurred and the issue number was not always well formatted. Further, when a Revision solved multiple issues, all issue numbers had to be encoded in one remark field.

• The previous method was initially still used when Tracker was replaced by Clearquest. Later a better solution was found, and now CMS Classes are used to mark all Revisions that are linked to an Issue. The Class name contains the Clearquest Issue identifier.

• With the deployment of Clearcase yet another method was introduced: Clearcase natively supports the Hyperlink feature, which can be used to link Revisions to Issues managed in Clearquest.
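For the first two methods the link has to be recovered from free text: a remark field that may mention several issue numbers, or a Class name that contains an identifier. The following sketch shows one way such text could be scanned. It is an illustration only; the notation actually used in the case study is not reproduced here, so `EXAMPLE_PATTERN` is a purely hypothetical convention (the word "issue", optional separators, then digits), and the class and method names are our own.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for recovering issue identifiers from free text.
class IssueLinkSketch {

    // HYPOTHETICAL notation, not the one used in the case study project:
    // the word "issue" in any case, optional separators, then the number.
    static final Pattern EXAMPLE_PATTERN =
            Pattern.compile("(?i)\\bissue[\\s#:]*(\\d+)");

    /**
     * Returns all identifiers captured by group 1 of the given pattern,
     * in order of first appearance and without duplicates.
     */
    static List<String> extractIssueIds(String text, Pattern idPattern) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher matcher = idPattern.matcher(text);
        while (matcher.find()) {
            ids.add(matcher.group(1));
        }
        return new ArrayList<>(ids);
    }
}
```

Passing the pattern as a parameter keeps the scan independent of the notation, so the same method could be applied to remark fields and to Class names with different patterns. A deliberately tolerant pattern is one way to cope with manually entered numbers that are not always well formatted.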

Figure 3.4 shows all RCSes and issue trackers linked together.


[Figure 3.4: the CMS model (library, element, generation, group, class), the Base ClearCase model (VOB, Element, File, Directory, Branch, Version, Label, Attribute, Hyperlink), and the Tracker & ClearQuest model (Entity, ChangeRequest, FeatureRequest, TroubleReport) connected in one diagram]

Figure 3.4: This figure shows how Issues in Clearquest/Tracker are linked directly or indirectly to Generations in CMS and Versions in Clearcase.

Chapter 4

Approach

In this chapter, we present the concepts of our approach and lay out the planned architecture and the internal data model of the EvoBrowser plugin. Further, we present our choice for the first visualization module, which we implemented during this thesis.

4.1 Introduction

The EvoBrowser plugin should be capable of visualizing evolution metrics of software projects that are managed in completely different versioning systems and whose issues are tracked with different issue trackers. To achieve this flexibility, we logically separated the visualization part from the data mining part.

The central object of our architecture is the Release History Database (RHDB), for which we defined its own independent data model. The visualizations are based directly on the RHDB model and are themselves ignorant of the internal models of the original data sources. It is the job of the data importer modules to translate from the proprietary data models of the versioning system and of the issue tracker to the model of the RHDB.

With this separation we have successfully decoupled the visualization part from the specific features of the versioning system and issue tracker in use. To add support for a different versioning system, we just have to implement a new importer module; we do not have to touch the code of the visualization modules.

The data importer modules do not necessarily have to be part of the Eclipse plugin; they can access the RHDB directly from outside of the plugin. When they are implemented as standalone programs and designed appropriately, they can be integrated directly into the build or release process.

On the other hand, integrating the importer modules in the plugin brings a different set of advantages: they can then leverage the features of the Eclipse framework, including those provided by third-party plugins. One use case is to make the data importers aware of the Eclipse Team API¹: when the Eclipse project is configured to use an RCS, this RCS can automatically be used as a data source for an importer without requiring manual configuration by the user.

We respected both alternatives in our implementation of the core plugin and leave the choice between an integrated and a standalone importer to the implementer of the data miner. Our RHDB interface is designed to be used from within the Eclipse plugin, but it can also be accessed from a standalone application.

¹ The Team API provides a standardized interface for the integration of RCS, issue tracker and configuration management systems.
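The separation described in this section can be sketched in a few lines of Java. The sketch is only an illustration under assumptions: the names ReleaseHistoryDb, DataImporter, VisualizationModule, InMemoryRhdb, FakeCmsImporter and FileCountView are hypothetical and are not the interfaces of the EvoBrowser plugin. It only shows that importers write to the RHDB and visualizations read from it, so that neither side depends on the other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical RHDB facade: the only type shared by importers and visualizations.
interface ReleaseHistoryDb {
    void storeRelease(String releaseName, List<String> fileNames);
    List<String> releaseNames();
    List<String> filesOf(String releaseName);
}

// Hypothetical importer contract: translates a proprietary source into RHDB objects.
interface DataImporter {
    void importInto(ReleaseHistoryDb rhdb);
}

// Hypothetical visualization contract: reads only the RHDB model.
interface VisualizationModule {
    String render(ReleaseHistoryDb rhdb);
}

// Minimal in-memory stand-in for the database, for illustration only.
final class InMemoryRhdb implements ReleaseHistoryDb {
    private final List<String> names = new ArrayList<String>();
    private final List<List<String>> files = new ArrayList<List<String>>();

    public void storeRelease(String releaseName, List<String> fileNames) {
        names.add(releaseName);
        files.add(new ArrayList<String>(fileNames));
    }

    public List<String> releaseNames() {
        return new ArrayList<String>(names);
    }

    public List<String> filesOf(String releaseName) {
        int i = names.indexOf(releaseName);
        return i < 0 ? new ArrayList<String>() : new ArrayList<String>(files.get(i));
    }
}

// An importer for one made-up versioning system ...
final class FakeCmsImporter implements DataImporter {
    public void importInto(ReleaseHistoryDb rhdb) {
        rhdb.storeRelease("R1", Arrays.asList("a.c", "b.c"));
        rhdb.storeRelease("R2", Arrays.asList("a.c", "b.c", "c.c"));
    }
}

// ... and a visualization that never learns where the data came from.
final class FileCountView implements VisualizationModule {
    public String render(ReleaseHistoryDb rhdb) {
        StringBuilder out = new StringBuilder();
        for (String name : rhdb.releaseNames()) {
            out.append(name).append(": ")
               .append(rhdb.filesOf(name).size()).append(" files\n");
        }
        return out.toString();
    }
}
```

Supporting another versioning system in this sketch means writing one more DataImporter; FileCountView stays untouched.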


[Figure 4.1 (architecture diagram).
Versioning Systems: CVS, CMS, Clearcase, ...
Issue Tracker: Clearquest, Bugzilla, ...
Data Mining: CVS Importer, CMS Importer, Clearcase Importer, Clearquest Importer, Bugzilla Importer, ... Importer
Release History Database
Visualization (Eclipse Plugin): Visualisation Module, Vis Module, Vis Module, Vis Module]

Figure 4.1: EvoBrowser Architecture Overview

Figure 4.1 shows an overview of the planned system. Clearly visible is the separation between the visualization part and the data mining modules, with the RHDB as the joining point.

This separation gave us natural milestones for the plugin design and implementation process, which are reflected in the following sections: first we define the data model for our RHDB, then we develop the importer for the CMS versioning system, and finally we design and implement a first visualization module.

4.2 RHDB Data Model

The EvoBrowser plugin has to be able to visualize evolution metrics of software projects which are managed in completely different versioning systems. Each of those versioning systems has its own internal data model with different features and peculiarities. We want to use our visualization modules with different projects without the need to adapt the data mining code. To fulfill this requirement, we have to identify the common concepts behind the data models and define our own model which implements the greatest common denominator. Consequently, we can design our visualization modules based on a stable data model and do not have to know about the specific properties of the different versioning systems.

Work has already been done in this domain, so we do not have to develop a new model. Our RHDB model is based on the models introduced by Dane Marjanovic [Mar06] and the model developed in the Evolizer Base [JW05] project.


4.2.1 Versioning Model

Figure 4.2 shows the diagram of our versioning data model. It is a bare-bones model that implements only the most common concepts:

• We assume that the repository contains a hierarchy of directories and files. If this is not the case, it might be possible to simulate it, as we will see in Section 4.3.1. File and Directory are both descendants of SourceUnit. A SourceUnit has a Name and an association with its parent SourceUnit.

• A File object is different from a file in the file system. It is a container for all versions of the file system file. The versions are called Revisions.

• A Revision is identified by its owning File and its version name, and it has a creation timestamp.

• Each Revision has been created by an Author.

• A Revision can be part of one or more Releases. A Release has a name and is composed of one or more Revisions.

• For each Release we know its preceding Release. Two different Releases can have the same Release as parent. This means that the Releases form a tree structure.

[Figure 4.2 (class diagram).
SourceUnit: Name, Parent
Directory, File
Revision: Version name
Release: Name, Parent
Author: Name]

Figure 4.2: Versioning Data Model
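The versioning model can also be written down as plain Java classes. The following is a minimal sketch that follows the concepts listed for this model. The class VersionedFile stands for the File object of the model (renamed here only to avoid confusion with java.io.File), and all field and method names are assumptions made for illustration, not the classes shipped with the plugin.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// SourceUnit: common base of Directory and File, with a name and a parent.
abstract class SourceUnit {
    final String name;
    final Directory parent;            // null for the repository root

    SourceUnit(String name, Directory parent) {
        this.name = name;
        this.parent = parent;
    }

    // Full path, built by walking up the parent association.
    String path() {
        return parent == null ? name : parent.path() + "/" + name;
    }
}

final class Directory extends SourceUnit {
    Directory(String name, Directory parent) { super(name, parent); }
}

// Container for all Revisions of one file system file ("File" in the model).
final class VersionedFile extends SourceUnit {
    final List<Revision> revisions = new ArrayList<Revision>();

    VersionedFile(String name, Directory parent) { super(name, parent); }
}

final class Author {
    final String name;

    Author(String name) { this.name = name; }
}

// A Revision is identified by its owning File and its version name.
final class Revision {
    final VersionedFile file;
    final String versionName;
    final Date created;
    final Author author;

    Revision(VersionedFile file, String versionName, Date created, Author author) {
        this.file = file;
        this.versionName = versionName;
        this.created = created;
        this.author = author;
        file.revisions.add(this);
    }
}

// A Release is a named set of Revisions; the parent links form the release tree.
final class Release {
    final String name;
    final Release parent;              // null for the first Release
    final List<Revision> revisions = new ArrayList<Revision>();

    Release(String name, Release parent) {
        this.name = name;
        this.parent = parent;
    }

    // Number of ancestors, i.e. the depth of this Release in the release tree.
    int depth() {
        return parent == null ? 0 : parent.depth() + 1;
    }
}
```

Two Releases created with the same parent are siblings in the release tree, which is how the sketch expresses the implicit branching of the model.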

Many versioning systems support the concept of different development branches. We currently do not support this concept, because we are mainly interested in the history of the releases. A release is a discrete snapshot from the continuing development of the project. We do not need to know the exact development process between two releases.

Nevertheless, our model implicitly implements some sort of branching concept through the tree structure of the Releases. In our case study project this can be observed with the releases of the development branch and of several maintenance branches.

4.2.2 Issue Tracking Model

Figure 4.3 depicts our issue tracking model. Again we kept it as simple as possible.

For the time being we do not take the Issue lifecycle into account. This can be added at a later time. We currently evaluate which Issues have been resolved in a certain Release. We are able to do this because we know, for every closed Issue, which Revisions have been modified in order to resolve it.

[Figure 4.3 (class diagram).
Issue: Priority, Safety Critical, Resolved Date, Resolver, Modified Revisions
Subclasses of Issue: Bug Report, Change Request, Feature Request]

Figure 4.3: Issue Tracking Data Model

• There are different types of Issues which are implemented as subclasses of Issue:

– Bug Reports describe a defect in the existing code.
– Change Requests are used to propose changes to existing features.
– Feature Requests are used to request new features.

So far the Issue subtypes do not have attributes of their own; they are just used to categorize the Issues. Additional types can be added easily.

• Each Issue has a priority value and a safety-critical flag.

• An Issue has been resolved by an author, the Resolver, at a certain Resolved Date.

• One or more Revisions have been modified in order to resolve the Issue.
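As an illustration, the issue tracking model maps naturally onto a small class hierarchy. The sketch below is an assumption about how such classes could look, not the plugin's code. The modified Revisions are written as plain "file@version" strings so that the example does not depend on the versioning classes, and IssueStats is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Base class: all attributes live here; the subtypes only categorize the Issues.
abstract class Issue {
    final int priority;
    final boolean safetyCritical;
    final String resolver;          // name of the Author who resolved the Issue
    final Date resolvedDate;
    // Revisions modified to resolve the Issue, as "file@version" strings.
    final List<String> modifiedRevisions = new ArrayList<String>();

    Issue(int priority, boolean safetyCritical, String resolver, Date resolvedDate) {
        this.priority = priority;
        this.safetyCritical = safetyCritical;
        this.resolver = resolver;
        this.resolvedDate = resolvedDate;
    }
}

final class BugReport extends Issue {
    BugReport(int p, boolean s, String r, Date d) { super(p, s, r, d); }
}

final class ChangeRequest extends Issue {
    ChangeRequest(int p, boolean s, String r, Date d) { super(p, s, r, d); }
}

final class FeatureRequest extends Issue {
    FeatureRequest(int p, boolean s, String r, Date d) { super(p, s, r, d); }
}

// Hypothetical helper that uses the subtypes purely as categories.
final class IssueStats {
    static int countOfType(List<Issue> issues, Class<? extends Issue> type) {
        int n = 0;
        for (Issue i : issues) {
            if (type.isInstance(i)) n++;
        }
        return n;
    }

    static int countSafetyCritical(List<Issue> issues) {
        int n = 0;
        for (Issue i : issues) {
            if (i.safetyCritical) n++;
        }
        return n;
    }
}
```

Adding a further Issue type in this sketch only requires one more subclass; the counting helper works unchanged.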

4.2.3 Combined Model

Figure 4.4 shows the combined versioning and issue tracking model. On paper this is quite simple to achieve because we know which Revisions were created to resolve an Issue. Therefore we can draw a simple association line between Issue and Revision in the diagram. In reality it is not that simple, because our RHDB does not contain the complete versioning history, but only the configurations which are used to build the Releases. What we see in Figure 4.4 is a snapshot of the state when the Issue has been resolved. It is most unlikely that this snapshot and the configuration of a Release match exactly. Therefore we have to calculate which Releases are associated with which Issues. This is done with a heuristic approach which we describe later in Section 4.3.2.

[Figure 4.4 (class diagram). The versioning model (SourceUnit, Directory, File, Revision, Release, Author) combined with the issue tracking model (Issue, Change Request, Feature Request, Bug Report).]

Figure 4.4: Combined Versioning & Issue Tracking Data Model

4.2.4 Metrics Model

We keep our metrics model flexible because we want to be able to attach metric values to each and every object in our data model. We call the described object the owner of the metric value. But an owner is not always enough to define the context of the value unambiguously; sometimes a further reference to another object is needed.

An example of such a metric is one that describes a characteristic of files in a Release. The same file version might be included in two subsequent Releases, but this does not necessarily mean that the same metric value applies in both instances. Let us assume a “Commits-since-last-release” metric. From Release A to Release B the file X has been modified, so the metric value will be greater than zero. From Release B to Release C the file was unchanged, which means the metric value is zero. Releases B and C contain the same revision of the file, but the metric values are different: the context of the metric value can only be determined if both File and Release are known.

[Figure 4.5 (class diagram).
Metric: Description, Owner Type, Reference Type
Metric Value: Owner, Reference, Value
Association: one Metric (1) to many Metric Values (0..*)]

Figure 4.5: Metrics Data Model

Our metrics model as shown in Figure 4.5 consists of only two objects: a Metric and a MetricValue. The Metric object describes the metric; it is metadata for the MetricValues. It contains a natural language description of the metric, and it defines which object types are used as owner and as reference for the MetricValues. The MetricValue contains the two references to the owner object and the reference object, and a numerical value.

Figure 4.6 shows an example of how a Metric can describe Revisions that are part of a Release. The Revision is the “owner” of the Metric Value, but this value applies only to the referenced Release. The same Revision can be part of other Releases; therefore we have to store a reference to the correct Release in order to be able to identify the context in which the Metric Value applies.
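The owner/reference idea can be made concrete with a short Java sketch. It is an illustration under assumptions: the method names add and valueOf, and the stand-in classes RevisionRef and ReleaseRef, are hypothetical and only serve to keep the example self-contained. The sketch stores the “Commits-since-last-release” example, where the same Revision carries different values in the context of two Releases.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for the Revision and Release objects of the versioning model.
final class RevisionRef {
    final String id;
    RevisionRef(String id) { this.id = id; }
}

final class ReleaseRef {
    final String name;
    ReleaseRef(String name) { this.name = name; }
}

// A value together with its owner and the reference that gives it context.
final class MetricValue {
    final Object owner;
    final Object reference;
    final double value;

    MetricValue(Object owner, Object reference, double value) {
        this.owner = owner;
        this.reference = reference;
        this.value = value;
    }
}

// Metadata: describes a metric and the types of its owner and reference objects.
final class Metric {
    final String description;
    final Class<?> ownerType;
    final Class<?> referenceType;      // null if the metric needs no reference
    final List<MetricValue> values = new ArrayList<MetricValue>();

    Metric(String description, Class<?> ownerType, Class<?> referenceType) {
        this.description = description;
        this.ownerType = ownerType;
        this.referenceType = referenceType;
    }

    // Adds a value after checking owner and reference against the declared types.
    MetricValue add(Object owner, Object reference, double value) {
        if (!ownerType.isInstance(owner)) {
            throw new IllegalArgumentException("wrong owner type");
        }
        boolean badReference = (referenceType == null)
                ? reference != null
                : !referenceType.isInstance(reference);
        if (badReference) {
            throw new IllegalArgumentException("wrong reference type");
        }
        MetricValue v = new MetricValue(owner, reference, value);
        values.add(v);
        return v;
    }

    // Looks up the value of one owner in the context of one reference object.
    Double valueOf(Object owner, Object reference) {
        for (MetricValue v : values) {
            if (v.owner == owner && v.reference == reference) {
                return Double.valueOf(v.value);
            }
        }
        return null;
    }
}
```

With this structure, asking for the value of a Revision without naming the Release is deliberately impossible for a metric that declares a reference type.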


[Figure 4.6 (class diagram). Recoverable labels: SourceUnit, Directory, File, Metric, Metric Value, and the associations owner and reference.]

Figure 4.6: An example of a release metric: The Metric Values describe the Revisions in the context of the referenced Release.

4.3 Data Preparation

In this section we describe the algorithms used to populate the RHDB and to calculate the metric values.

4.3.1 CMS

We use the CMS history log files to populate our RHDB with the versioning data. This is a much cheaper operation than working directly with the CMS Libraries. By processing the log files, we can reconstruct the whole history of a Library.

The CMS data model is simpler than our RHDB versioning data model and no direct mapping is possible:

• Multiple CMS Libraries are used to manage the project source files. Our model, in contrast, is limited to one single repository.

• The Directory concept does not exist in CMS; it can only manage files. However, we can use Directory objects in our RHDB to simulate the different CMS Libraries: we map every Library to a corresponding Directory object.

• In several Libraries, Groups are used to group modules. These Groups can be mapped to subdirectories in the corresponding Library Directory, but manual work is required to define the mapping.

• The Release object does not exist natively in the CMS model. Classes are used to mark the Generations which are part of a Release. In every Library such a Class has to be created. The names of those Classes contain the Release name. With pattern matching we can find all Classes that are associated with a certain Release.
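The pattern matching mentioned in the last item can be sketched with a regular expression. The actual naming convention of the CMS Classes in the case study project is not given here, so the class names used with this sketch (for example REL_V2.1_LIBA) are invented, and ReleaseClassFinder is a hypothetical helper. The only assumption is that the Release name appears in the Class name as a whole token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ReleaseClassFinder {

    // Returns the Class names that contain the Release name as a whole token,
    // i.e. not directly preceded or followed by a letter, a digit or a dot.
    // This keeps "V2.1" from matching inside "V2.10" or "V2.1.3".
    static List<String> classesFor(String releaseName, List<String> classNames) {
        Pattern p = Pattern.compile(
                "(?<![A-Za-z0-9.])" + Pattern.quote(releaseName) + "(?![A-Za-z0-9.])");
        List<String> hits = new ArrayList<String>();
        for (String className : classNames) {
            if (p.matcher(className).find()) {
                hits.add(className);
            }
        }
        return hits;
    }
}
```

Pattern.quote is used so that characters such as the dot in a Release name are matched literally rather than as regular expression operators.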

Figure 4.7 illustrates the mapping from the CMS data model to our RHDB data model.

[Figure 4.7 (mapping diagram between the CMS Model and the EvoBrowser Model). Recoverable labels: SourceUnit, Directory, File, Element, Generation, Release, Person, Library, Group, Class.]

Figure 4.7: Mapping of the CMS Data Model to the EvoBrowser Data Model. Objects with no direct counterpart are painted in white. Notes show possibilities how these objects can be interpolated from the CMS Model.

We decided to divide the versioning data preparation into two steps. We gave our importer its own intermediate database (CMS DB) whose schema is modeled after the CMS data model. In the first step we parse the log files and mirror the complete history in the CMS DB. In the second step we do the mapping from the CMS DB to the RHDB.

The introduction of the CMS DB also proved useful for calculating different metric values. The RHDB contains only the versioning data associated with Releases. For calculating metrics where the data between the Releases is needed, there are two possibilities: either we calculate them while mapping the versioning data to the RHDB, or there must be a means to get back to the original versioning data. The first variant is not optimal because it prevents new metrics from being introduced without reprocessing the whole versioning history. The second variant proved to be easy to implement thanks to the use of the intermediate database.

4.3.2 Tracker & Clearquest

In the scope of this thesis we were interested in associating the Issues with the Releases in which they have been resolved. Therefore we did not take the whole lifecycle of the Issues into account and evaluated only closed Issues. If we later want to visualize values regarding the Issue lifecycle, this can be implemented in the metrics calculation part, and our Issue model does not necessarily have to be redesigned.

After importing the Issues into our RHDB, we calculated the affected Releases with the following heuristic:

1. As a clue and anchor point we take the date when the Issue was closed.

2. We consider a timeframe from one week before to four weeks after the anchor date. The reason why we subtract one week is that the Issue might have been closed late. The size of the timeframe has to be calibrated according to the analyzed project. Our case study project sees a weekly release, therefore we decided that a timeframe of five weeks should be sufficient.

3. Within the timeframe we check every Release in chronological order: does it contain a file that was affected by the Issue, and is the revision of that file the same as or later than the one created for the Issue? If we have a match, the Issue is associated with this Release.

4. When we have a matching Release, we continue checking the Releases of other branches. We do not check later Releases on the same branch of the Release tree.

5. Finally we have a list of all Releases in which the Issue has been solved (assuming there were no other potential matches outside our timeframe).
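The five steps can be sketched in Java as follows. This is a simplified illustration, not the plugin's implementation: the classes Rel and IssueReleaseMatcher are hypothetical, revisions are assumed to be comparable by a plain integer number per file, and a Release counts as being "later on the same branch" when it descends from an already matched Release in the release tree.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Release: date, parent and revision number per file.
final class Rel {
    final String name;
    final LocalDate date;
    final Rel parent;                                   // preceding Release, null for the root
    final Map<String, Integer> revisionOfFile = new HashMap<String, Integer>();

    Rel(String name, LocalDate date, Rel parent) {
        this.name = name;
        this.date = date;
        this.parent = parent;
    }

    // True if 'other' is an ancestor of this Release in the release tree.
    boolean descendsFrom(Rel other) {
        for (Rel r = parent; r != null; r = r.parent) {
            if (r == other) return true;
        }
        return false;
    }
}

final class IssueReleaseMatcher {

    // fixedRevisions: file name -> revision number created to resolve the Issue.
    static List<Rel> affectedReleases(LocalDate closed,
                                      Map<String, Integer> fixedRevisions,
                                      List<Rel> allReleases) {
        // Steps 1 and 2: window of -1 week to +4 weeks around the close date.
        LocalDate from = closed.minusWeeks(1);
        LocalDate to = closed.plusWeeks(4);
        List<Rel> window = new ArrayList<Rel>();
        for (Rel r : allReleases) {
            if (!r.date.isBefore(from) && !r.date.isAfter(to)) window.add(r);
        }
        // Step 3: check the Releases in chronological order.
        window.sort((a, b) -> a.date.compareTo(b.date));

        List<Rel> matches = new ArrayList<Rel>();
        for (Rel r : window) {
            // Step 4: skip later Releases on a branch that already matched.
            boolean laterOnMatchedBranch = false;
            for (Rel m : matches) {
                if (r.descendsFrom(m)) laterOnMatchedBranch = true;
            }
            if (!laterOnMatchedBranch && containsFix(r, fixedRevisions)) {
                matches.add(r);
            }
        }
        return matches;                                 // step 5
    }

    // A Release matches if it holds an affected file in the fixing revision or a later one.
    static boolean containsFix(Rel r, Map<String, Integer> fixedRevisions) {
        for (Map.Entry<String, Integer> e : fixedRevisions.entrySet()) {
            Integer inRelease = r.revisionOfFile.get(e.getKey());
            if (inRelease != null && inRelease >= e.getValue()) return true;
        }
        return false;
    }
}
```

The window bounds are hard-coded here to the values used for the case study; for another project they would have to be calibrated, as noted in step 2.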

4.3.3 Metric Calculation

The base for the calculation of Metric Values is the data in our RHDB. This data alone allows only a very limited set of Metrics to be calculated. Depending on how the data importers are designed, other sources can be included. In our case we can use the CMS DB, which gives us access to the complete versioning history, not only to the Release history like the RHDB.

For this thesis we calculated a number of basic metrics. They are not meant to allow revolutionary new insights, but to demonstrate our visualizations. Our sample metrics are all focussed on Releases. This means the Metric Values describe aspects of a Release, or of objects like Revisions or Issues which are associated with a Release. For all Metrics we calculate the total value since the beginning of the project and the delta compared with the preceding Release. The following list describes the sample metrics for our thesis:

Revision Count is the number of Revisions that have been created for a File. For this value we only consider the Revisions of the same Branch. There is no way to detect a merge operation in CMS, so we cannot derive a reasonable value when the branch of a File is switched between two Releases. We defined that if a File did not exist in the preceding Release, or if its Revision was on a different Branch, the Revision Count value is 1.

These Metric Values can give an indication of the stability of a File if we compare the Total and the Delta value.

Author Count shows the number of different Authors who have committed Revisions of a File. This value might be interesting if compared with the Revision Count.

Issue Count indicates for every File by how many Issues it has been affected. This Metric can be further specialized to show detailed aspects of the Issues. One possibility is to categorize the Issues according to their type and show the number of Bug Reports or Feature Requests. Another interesting value is the number of safety-critical Issues.

With those values it is possible to identify Files which will likely be affected by future issues and may require special attention.
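Two of the sample metrics are simple enough to sketch directly. The code below is a hedged illustration: it assumes that Revisions on one Branch are numbered consecutively, so that the delta between two Releases is the difference of the revision numbers, and the names Rev and SampleMetrics are hypothetical. The special case for new files and switched Branches follows the definition of the Revision Count given above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a Revision: branch, position on the branch, author.
final class Rev {
    final String branch;
    final int number;       // assumed to count up by one per Revision on a Branch
    final String author;

    Rev(String branch, int number, String author) {
        this.branch = branch;
        this.number = number;
        this.author = author;
    }
}

final class SampleMetrics {

    // Delta of the Revision Count between the preceding and the current Release.
    // previous is null if the File did not exist in the preceding Release.
    static int revisionCountDelta(Rev previous, Rev current) {
        if (previous == null || !previous.branch.equals(current.branch)) {
            return 1;       // defined value for new files and switched Branches
        }
        return current.number - previous.number;
    }

    // Author Count: number of different Authors among the Revisions of a File.
    static int authorCount(List<Rev> revisionsOfFile) {
        Set<String> authors = new HashSet<String>();
        for (Rev r : revisionsOfFile) {
            authors.add(r.author);
        }
        return authors.size();
    }
}
```

A File with a high total Revision Count but a delta of zero over several Releases would be a candidate for a stable File in the sense described above.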

4.4 Visualization

4.4.1 Data Interface

One requirement of our plugin is that new Metrics and additional visualization modules should be pluggable without code changes. We met this requirement in the design of the plugin: the plugin is aware of all available Metrics and visualization modules, and for every Metric it can decide which visualizations are compatible. To enable this matching, the characteristics of the underlying data of the Metric and the capabilities of the visualization have to be known. Based on this information we decide in which data structure the Metric Values have to be provided to the visualization. In other words, this is the data interface of the visualization module.

In the following paragraphs we explain the different characteristics:


Structure

Granularity A Metric Value has an associated focus object and it can reference other objects. This means that the focus object can have one general value associated with it, or multiple values that reference other objects.
With our sample Metrics, the focus object is always a Release. A Metric Value without a reference is a value that applies to the Release as a whole. An example is the File count in the Release, or the number of resolved Issues. When a Metric Value references another object, its value applies to the referenced object in the context of the owner object. An example is the count of commits of a File since the last Release.

Structure When a Metric references other objects beside the focus object, those other objects can have a certain structure. It makes sense to respect this structure in the visualization.
Our sample Metrics all reference the Files in a Release. Those Files are arranged in a Directory tree. It can increase the clarity when this tree is reflected in the visualization.

Table 4.1 summarizes the granularity and structure classifications we planned for our plugin, and gives a visualization example for each.

Type                    Metric Example                        Visualization Example
Single Value            Issue/Filecount Ratio of a Release    “Thermometer” Diagram
Multiple Values         Commits per Author of a File          Map, Pie Chart
Multiple Hierarch. V.   Issue Count per File in a Release     Treemap

Table 4.1: Granularity & Structure of Metric Values

Relation

A visualization can display the Metrics of multiple instances of the focus object at the same time. An example would be a visualization that compares Metric Values of multiple Releases. We distinguish two types of visualizations with multiple focus objects:

• Comparison of two arbitrary instances of the focus object. An example is the comparison of Metric Values of two different files.

• Visualization of the development of a metric over multiple instances of an object in chronological order. An example is the examination of the Metric Values of a single File over time.

For example, the metric values of two releases can be compared with each other, or the development of a metric over multiple consecutive releases can be visualized. Table 4.2 shows examples for those two relation types.

Each Metric can have its classifications attached as metadata, and each visualization module can have a capability description which specifies what sort of Metrics it can visualize. If these data are available, the plugin can automatically offer all suitable visualizations for a certain Metric.

We realize that our classifications can by no means be exhaustive. Additional classifications can be added at any time.


Focus                Metric Example                        Visualization Example
Single Object        Issue/Filecount Ratio of a Release    “Thermometer” Diagram
Arbitrary Objects    Issue/Filecount Ratio Comparison      Bar Chart
Consequent Objects   Issue/Filecount Ratio Development     Line Chart

Table 4.2: Relation of Multiple Metric Focus Objects

Data Structure

To make the visualization modules exchangeable, we have to use a set of predefined data structures. For each classification we defined which data structure has to be used; they are listed in Table 4.3.

As a side note, we use the Table and Tree classes of the Prefuse visualization framework² for the implementation of the data interface. This was a decision of convenience, because those two data structure implementations offer all the features we need. Of course they can also be used with visualization modules that are not based on Prefuse.

                     Single Value                Multiple Values          Multiple Hierarch. V.
Single Object        Primitive Type              Table                    Tree
Arbitrary Objects    Set of Primitives           Set of Tables            Set of Trees
Consequent Objects   Ordered Set of Primitives   Ordered Set of Tables    Ordered Set of Trees

Table 4.3: Data Interface for Metrics and Visualization Modules
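The lookup defined by Table 4.3, and the matching of Metrics with compatible visualizations, can be sketched as follows. The enum and class names (Granularity, Relation, Capability, DataInterface) are assumptions made for this illustration, and the data structures are returned as descriptive strings rather than as Prefuse objects to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

enum Granularity { SINGLE_VALUE, MULTIPLE_VALUES, MULTIPLE_HIERARCHICAL_VALUES }

enum Relation { SINGLE_OBJECT, ARBITRARY_OBJECTS, CONSEQUENT_OBJECTS }

// Hypothetical capability description of a visualization module.
final class Capability {
    final String visualization;
    final Granularity granularity;
    final Relation relation;

    Capability(String visualization, Granularity granularity, Relation relation) {
        this.visualization = visualization;
        this.granularity = granularity;
        this.relation = relation;
    }
}

final class DataInterface {

    // Returns the data structure named in Table 4.3 for a classification.
    static String structureFor(Granularity g, Relation r) {
        String element;
        switch (g) {
            case SINGLE_VALUE:    element = "Primitive"; break;
            case MULTIPLE_VALUES: element = "Table";     break;
            default:              element = "Tree";      break;
        }
        switch (r) {
            case SINGLE_OBJECT:
                return g == Granularity.SINGLE_VALUE ? "Primitive Type" : element;
            case ARBITRARY_OBJECTS:
                return "Set of " + element + "s";
            default:
                return "Ordered Set of " + element + "s";
        }
    }

    // Offers every visualization whose capability equals the Metric's classification.
    static List<String> suitable(Granularity g, Relation r, List<Capability> all) {
        List<String> names = new ArrayList<String>();
        for (Capability c : all) {
            if (c.granularity == g && c.relation == r) {
                names.add(c.visualization);
            }
        }
        return names;
    }
}
```

Exact equality of classifications is the simplest possible matching rule; a real plugin could also let a module declare several capabilities.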

The class attribute contains the name of the Java class that implements the visualization. This class must be a subclass of java.awt.Panel and it must implement the interface evoBrowser.visualization.IEvoBrowserVisualization. The content of this interface is shown in Listing 4.1. The actual data structure to be used for the data parameter is one of the subclasses of AbstractTupleSet and is determined by the classification of the visualization.

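package evoBrowser.visualization;

import prefuse.data.tuple.AbstractTupleSet;

public interface IEvoBrowserVisualization {

    // data carries the Metric Values; its concrete type depends on
    // the classification of the visualization (see Table 4.3).
    public void initVisualization(AbstractTupleSet data, String labelField,
            String idField, String[] dataFields);

}

Listing 4.1: Every EvoBrowser visualization module has to implement the IEvoBrowserVisualization interface.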

Thanks to the classification and the common data interface, our plugin is very flexible and can be extended with further Metrics and visualization modules without modifying the plugin code. All it has to do to make the new extensions available is to read the classification metadata of the new Metrics and the capability description of the visualization modules.

² http://prefuse.org/doc/api/


4.4.2 Treemap Visualization

Our sample Metrics are all focussed on describing characteristics of Releases. A Release is composed of Files contained in a Directory tree. We decided that our first visualization should reflect this tree structure. For this task there are mainly two contending visualization types: tree graph and tree map.

Tree graphs are ill-suited to visualizing trees with a high number of leaves, because the visualization quickly becomes cluttered and inaccessible. Our Releases can contain thousands of files, so a tree graph is unusable in that context.

Tree maps were first introduced by Ben Shneiderman [Shn92] as “a method for displaying information about entities with a hierarchical relationship, in a space-constrained environment”. In contrast to the tree graph, the map can handle thousands of nodes in a fixed space.

We decided on the tree map because it is better suited to visualizing the contents of a File/Directory tree. A tree map layout algorithm is comparatively simple, and a map with many leaves is still intuitively comprehensible.

A treemap can show two different values for the tree nodes at the same time. One value is mapped to the node size, or more precisely to the area in which the node is rendered. The other value can be indicated by the color shading of the node.

In a tree map subtrees must be marked specifically, otherwise they are impossible to recognize. We decided to demarcate a subtree by a spacing border. This gives very good results as long as we do not have many subtree levels. In our case study project, only one subtree level is used for the time being, and maybe an additional one at a future stage. Therefore the spacing border is perfectly suitable for our project.

Figure 4.8 illustrates the concepts of our Tree Map visualization.

Figure 4.8: The Treemap Concepts: Value X is mapped to the area of the leaves. In D, the value Y is additionally indicated by the color shade, i.e. on a scale from green to red. Figure A shows only the root node of the tree; its area is equal to the sum of all X values of the leaves. The area size is indicated by the number. In B the tree has three leaves, and the area is divided proportionally to their X values. In C the root node has two leaves and a subtree with three leaves, demarcated by the spacing border. In D we have the same tree as in C, but additionally we have colored the leaves according to their Y values.

The layout algorithm used in our treemap visualization is called “squarified treemap”. It aims to render the nodes as rectangles with a low aspect ratio, that is, as close to squares as possible. The result is that the visualization looks appealing and is more intuitively comprehensible than some other treemap layouts. The algorithm has been described in [BHvW00].
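To illustrate the two ideas involved, area proportional to a value and the aspect ratio that a squarified layout tries to keep low, the following sketch implements the much simpler slice layout. It is explicitly not the squarified algorithm of [BHvW00] and not the Prefuse layout used by the plugin; the classes Rect and SliceLayout are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class Rect {
    final double x, y, w, h;

    Rect(double x, double y, double w, double h) {
        this.x = x;
        this.y = y;
        this.w = w;
        this.h = h;
    }

    double area() {
        return w * h;
    }

    // 1.0 for a square, larger values for thin strips.
    double aspectRatio() {
        return Math.max(w / h, h / w);
    }
}

final class SliceLayout {

    // Splits the bounds into vertical strips whose areas are
    // proportional to the given values (one strip per leaf).
    static List<Rect> layout(double[] values, Rect bounds) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        List<Rect> out = new ArrayList<Rect>();
        double x = bounds.x;
        for (double v : values) {
            double w = bounds.w * v / total;
            out.add(new Rect(x, bounds.y, w, bounds.h));
            x += w;
        }
        return out;
    }
}
```

For the values 6, 3 and 1 in a 100 by 50 area, the smallest leaf becomes a 10 by 50 strip with an aspect ratio of 5. Thin strips of this kind are what a squarified layout avoids by arranging the leaves in rows of nearly square rectangles.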

Figure 4.9 shows a sample output of our treemap visualization module. The visualized Release contains more than 4000 files. Although it might not be visible in the printed version, on screen even the smallest leaf is clearly recognizable. When the leaf count increases drastically, or when we choose a different Metric for calculating the leaf size where the size difference is much bigger, the smaller nodes might not be recognizable anymore. To solve this issue we added a filter option to our visualization. If the small leaves are not recognizable, the big leaves can be filtered out so that the small nodes get more space.

Figure 4.9: Sample Image of our Treemap Visualization. More than 4000 files are shown. Every single file is clearly recognizable despite the high number.

We display labels for the top-level elements of the tree. In this example these are the names of the Directories containing the Files. For elements on a lower level the name, and also the exact Metric Values, are displayed in a tooltip when the mouse pointer hovers over them.

Chapter 5

Implementation

In this chapter we describe how we have implemented our plugin. First we give an overview of the libraries used. Then we continue by outlining the architecture of the plugin and its interface, followed by the description of the plugin implementation itself. The remaining two sections describe the data importer and the visualization module.

5.1 Used Technologies

Besides plain Java, our plugin uses a number of libraries and frameworks from the domains of data persistence, visualization and graphical user interfaces. The major ones are briefly described in the following subsections.

5.1.1 Eclipse Platform

It was a soft requirement for our application to be implemented as a plugin for the Eclipse IDE. Eclipse is used for development in the case study project, therefore it makes sense to integrate the evolution analysis tool as a plugin into this platform.

Eclipse was originally developed as a Java IDE. Since then it has evolved into a versatile and flexible application framework. Its design allows users to adapt it to their needs while keeping the required effort relatively low, thanks to the plugin concept and to many generic modules for common tasks.

Eclipse supports plugins through so-called extension points. At these points plugins can add functionality, which is then called an extension. Plugins can declare new extension points themselves to allow further extensibility. The interface of an extension point has to be defined in the plugin manifest file.

A plugin contains code or other resources like documentation. A plugin has its own classpath and a lifecycle which is managed by its Activator class. Fragments can be used to extend plugins when no separate lifecycle and classpath are needed. They are packaged separately, but their contents are treated as if they were part of the original plugin.

For more detailed information about the Eclipse architecture, refer to the Eclipse documentation¹.

¹ http://help.eclipse.org/


5.1.2 Hibernate

Hibernate² is an object-relational mapping framework. It can be used to transparently persist Java objects in a relational database. It supports the object-oriented idiom and enables us to access the stored objects in an object-oriented way. Manual decomposition and reconstruction of objects when storing them in and querying them from the database is not needed. Nevertheless, we can still use plain SQL if we wish. This can be very useful when performance is more important than convenient access.

We use Hibernate in our plugin to store our RHDB in a relational database. Hibernate enables us to map our RHDB model directly to the relational schema. This saves us much manual work.

5.1.3 Prefuse

Prefuse [HCL05] is an extensible user interface toolkit for crafting interactive visualizations. It provides a set of fine-grained building blocks for constructing tailored visualizations instead of depending on ready-made information visualization “widgets”. It offers an integrated structure in which novel techniques and domain-specific designs can be developed.

The toolkit uses the formalism of a graph (a set of entities and relations between them) as its fundamental data structure. Two special cases of a graph, tree and table (which is a collection of items without relations), are also supported. This enables a broad class of visualizations of structured and unstructured (edge-free) data. Optimized implementations of data structures for tables, graphs and trees are provided.

Prefuse includes a library of layout algorithms, navigation and interaction techniques, integrated search, and more. It is written in the Java programming language using the Java2D graphics library.

The interactive features in particular prove to be very useful for examining a visualization in detail. In a view that contains many objects, some aspects usually are obscured by others. This hidden information can often be revealed by choosing a different focus, by using filtering techniques or by providing a search functionality. Prefuse provides native support for implementing such means.

We use Prefuse in two places: (1) for displaying a graphical representation of the release tree, and (2) to generate the tree map visualization.

5.2 Architecture

We followed the modular design shown in Figure 4.1 and split the EvoBrowser system up into one Eclipse plugin and several plugin fragments.

The plugin is the core of EvoBrowser. It contains the implementation of our RHDB model in the form of Java classes and corresponding Hibernate mapping files. It manages the database connection and provides convenient access to the objects stored in the RHDB. Furthermore it provides a user interface and integrates the available visualization modules. The plugin declares two types of extension points, one for including additional visualization modules, and one to give data importers the opportunity to store their custom models in the RHDB.

The database access is managed centrally by the class called HibernateUtil. One job of the HibernateUtil is to set up the database connection. The database parameters can be configured with a dialog in Eclipse. Additionally it is possible to use the HibernateUtil from outside of Eclipse. In this situation the database configuration has to be supplied in the form of a Hibernate Configuration object. More details about this are given in Section 5.3.2.

² http://www.hibernate.org/


Visualization fragments are used to add further visualizations to the system. Our design allows new visualization modules to be automatically integrated in the user interface. The fragment has to provide the necessary information about the data interface of the visualization modules it provides (see Section 4.4.1).

Data importers can be implemented as fragments or as stand-alone programs. Our importer was initially planned to be controlled from within the plugin and is consequently packaged as a fragment. But because the import process proved to be very time-intensive with our case study project, we decided to integrate it into the release process rather than into the development environment. Therefore we did not implement the GUI elements that would be necessary to control the import process from within Eclipse.

5.2.1 Extension PointsThe plugin must offer a means to integrate the visualization modules and importers that aresupplied by fragments. This is done by leveraging the extension facility of Eclipse. The followingsubsections describe the extension points that are defined by the plugin.

Data Importer

When a data importer uses an intermediate database with its own model, it can use the same database that is configured for the RHDB. In this case the importer must provide an implementation of the model in the form of Java classes, and it must include the corresponding Hibernate mapping files.

For the additional model to be included in the Hibernate configuration, the HibernateUtil, which manages the connection setup, has to be notified. The EvoBrowser plugin provides the extension point hibernateMapping, which can be used to add further mappings to the Hibernate configuration. Listing 5.1 shows the relevant passages of the hibernateMapping schema.

<schema targetNamespace="evoBrowser">
    <annotation>
        <appInfo>
            <meta.schema plugin="evoBrowser" id="hibernateMapping"
                         name="hibernateMapping"/>
        </appInfo>
        ...
    </annotation>

    <element name="extension">
        <complexType>
            <sequence>
                <element ref="mapping" minOccurs="1" maxOccurs="unbounded"/>
            </sequence>
            ...
        </complexType>
    </element>

    <element name="mapping">
        <complexType>
            <attribute name="class" type="string" use="required"> ... </attribute>
        </complexType>
    </element>
</schema>

Listing 5.1: hibernateMapping extension point schema: An extension that implements this extension point provides additional Hibernate mappings. Each mapping declares the fully qualified name of a Java class which is added to the Hibernate configuration.

An importer can provide an extension which implements this extension point. The extension must contain at least one mapping. A mapping has an attribute named class, which has to be the fully qualified name of the class that is added to the Hibernate configuration. To ensure that Hibernate is able to find the mapping description files, they must be located in the same package as the corresponding class files.

Visualization

According to our design, a visualization module must specify its data interface. The plugin defines the extension point visualization for this task. Its schema is shown in Listing 5.2.

<schema targetNamespace="evoBrowser">
    <annotation>
        <appInfo>
            <meta.schema plugin="evoBrowser" id="visualization"
                         name="visualization"/>
        </appInfo>
        ...
    </annotation>

    <element name="extension">
        <complexType>
            <sequence>
                <element ref="vizModule" minOccurs="1" maxOccurs="unbounded"/>
            </sequence>
            ...
        </complexType>
    </element>

    <element name="vizModule">
        <complexType>
            <attribute name="name" type="string" use="required"> ... </attribute>
            <attribute name="description" type="string"> ... </attribute>
            <attribute name="class" type="string" use="required"> ... </attribute>
            <attribute name="structure" type="string" use="required"> ... </attribute>
            <attribute name="relation" type="string"> ... </attribute>
        </complexType>
    </element>
</schema>

Listing 5.2: visualization extension point schema: A visualization module must implement this extension point. It specifies the Java class of the visualization and its data interface.

name The name used to identify the visualization module.

description An optional natural language description which can be used in the user interface.

class The Java class that implements the visualization.

structure {"single" | "table" | "tree"} (see Section 4.4.1)

relation {"single" | "arbitrary" | "sequential"} (see Section 4.4.1)

5.3 EvoBrowser Plugin

The EvoBrowser Plugin consists of several main elements: the implementation of the RHDB model including the Hibernate mapping files, classes for the database connectivity (the HibernateUtil and a convenience class for RHDB access which bundles many frequently used queries), and GUI elements for creating the visualizations (this includes selecting the data to visualize and selecting a compatible visualization module).

The following list gives an overview of the main packages of the plugin:

evoBrowser This is the main package of the plugin. It contains the classes that control the plugin lifecycle, the HibernateUtil, and the database convenience class.

evoBrowser.editors This package contains the evoBrowser editor classes. The central component of the plugin user interface is implemented as an editor and its contributing classes. For more details refer to Section 5.3.3.

evoBrowser.dataModel This package is the container for the different data models, which are stored in subpackages. The Autor class, which is not specific to one particular model, is placed directly in this top-level model package.

evoBrowser.dataModel.issuetracking/.metrics/.versioning The implementations of the different data models are grouped in the respective model subpackages.

evoBrowser.preferencesPages The classes in this package contribute a page to the Eclipse preferences dialog for the database configuration.

evoBrowser.views.releaseTreeView This package provides a “release selector”. It can be used for choosing a release from the release tree for the visualization.

evoBrowser.visualization.treeMap Our tree map implementation and its helper classes are placed in this package.

5.3.1 RHDB Model Implementation

We use Hibernate as the frontend for our RHDB. Hibernate enables us to persist and to query Java objects. This allows us to implement the model shown in Figures 4.4 and 4.5 directly with Java classes. Hibernate has to be instructed how to persist an object. This is done with a mapping file for every persistable class.


5.3.2 Database Connectivity

HibernateUtil

Configuring a Hibernate session is an expensive operation which can take several seconds. It therefore makes sense to manage the session centrally and to reuse it for as long as possible.

When the plugin is started, the plugin activator calls the init() method of HibernateUtil, which initializes the session and stores it in the HibernateUtil for further use. The session can be accessed through the public method getSession().

Furthermore, HibernateUtil contains methods for global transaction handling. This significantly simplifies the transaction management code needed in the classes that access the database.

Before the HibernateUtil can be used, it must be initialized. During this process the database session is set up. There are two init methods:

public void init() is called from within the plugin when the plugin is activated. It configures the Hibernate session with the database connection settings from the preference page. It queries the plugin registry for Hibernate mappings that are supplied by extensions (see Section 5.2.1 for details) and adds them to the standard RHDB mappings.

public void init(Configuration cfg) has to be called when the HibernateUtil is used in a stand-alone data importer. In this case the HibernateUtil has no access to the Eclipse plugin registry or to the database preferences, because it runs in a different context. The settings and mappings have to be supplied as a parameter to the init method. Listing 5.3 shows an example of what the configuration process can look like.

Our implementation of HibernateUtil is heavily inspired by the example proposed in [BK04].

 1 import org.hibernate.cfg.Configuration;
 2 ...
 3 Configuration cfg = new Configuration();
 4 cfg.configure();
 5
 6 cfg.addResource("myDataImporter/model/ObjectA.hbm.xml");
 7 cfg.addResource("myDataImporter/model/ObjectB.hbm.xml");
 8 cfg.addResource("myDataImporter/model/ObjectC.hbm.xml");
 9
10 Properties prop = cfg.getProperties();
11 prop.setProperty(IEvoBrowserPreferences.USERNAME_PREF, "username");
12 prop.setProperty(IEvoBrowserPreferences.PASSWORD_PREF, "password");
13 prop.setProperty(IEvoBrowserPreferences.DRIVER_CLASS_PREF,
       "com.mysql.jdbc.Driver");
14 prop.setProperty(IEvoBrowserPreferences.URL_PREF,
       "jdbc:mysql://localhost/evobrowser");
15 prop.setProperty(IEvoBrowserPreferences.DIALECT_PREF,
       "org.hibernate.dialect.MySQLDialect");
16
17 HibernateUtil.init(cfg);
18 ...

Listing 5.3: An example of how the HibernateUtil can be initialized in a stand-alone context outside of Eclipse. Line 4 causes the standard settings to be loaded. In lines 6–8 new Hibernate mappings are added, and in lines 10–15 the database connection is configured. In line 17 the HibernateUtil is initialized with the configuration.
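
The "initialize once, reuse everywhere" idea behind HibernateUtil can be illustrated with a minimal sketch. It is not the actual HibernateUtil: the class name CentralSessionHolder is hypothetical, and the type parameter S merely stands in for the Hibernate session type so that the example needs no Hibernate classes.

```java
import java.util.function.Supplier;

/**
 * Hypothetical holder for an expensive-to-create resource.
 * S stands in for the Hibernate session type; nothing here calls Hibernate.
 */
final class CentralSessionHolder<S> {
    private S session; // created once, then reused by all callers

    /** Runs the expensive factory on the first call; later calls are ignored. */
    synchronized void init(Supplier<S> expensiveFactory) {
        if (session == null) {
            session = expensiveFactory.get();
        }
    }

    /** Returns the shared session and fails fast if init was never called. */
    synchronized S getSession() {
        if (session == null) {
            throw new IllegalStateException("init() must be called first");
        }
        return session;
    }
}
```

In such a design the plugin activator and a stand-alone importer would differ only in the factory they pass to init; every other class simply asks the holder for the session.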

MetaDbDAO

The MetaDbDAO is a convenience class for accessing the RHDB. It bundles many frequently used queries. For example, it provides methods to query the instances of the model classes by name or by other attributes.

Everything that the methods of this class do can also be done by working directly with the Hibernate session, but collecting these frequently used queries in one place helps to keep the code clean and comprehensible.

5.3.3 User Interface

The user interface integration has two tasks: to provide a GUI for configuring the database settings and to provide a means to configure and control the visualizations. It is not the job of the core plugin to provide user interface integration for data importers.

When a data importer needs GUI elements, for example for controlling the import process, it is the responsibility of the importer fragment to provide its integration. Our case study importer is not integrated because it is designed to be run as a stand-alone application.

Database configuration

The database connection parameters are stored as plugin preferences. The plugin contributes a page to the Eclipse preferences dialog to configure these database parameters.

Data importer fragments that need additional configuration options can add subpages.

Visualization control

We wanted to enable the user to save and restore multiple visualization configurations. Therefore we introduced a new file type, the EvoBrowser file with the extension .evo. An instance of this file type represents the configuration and state of one visualization. The user can create multiple EvoBrowser files with different configurations.

When the user double-clicks such a file, it is opened in the EvoBrowser editor. This editor is the central user interface element of our plugin. It is used to configure the visualization (choose the objects to visualize and the metrics to include in the visualization) and to open the visualization with the desired visualization module.

However, within the scope of this thesis we decided to hardcode the configuration part because of time constraints. We have included only a single visualization and a few metrics (we can load the metric values for all available metrics without a noticeable performance hit), so there is no immediate need for the configuration capability.

The editor has a multitab layout:

• The first tab is used to configure the visualization parameters (omitted for now). This includes choosing the focus type and the optional reference type, the metrics which should be included in the visualization, and the visualization module.

• The second tab provides an appropriate selector for the chosen focus type. As our metrics are all focused on the Release type, we have implemented a release selector which shows all releases as a release tree (see Figure 5.1).

When other metrics with a different focus type are added, one has to ensure that an appropriate selector is available.

• When an object has been chosen in the selector, a new tab is opened with the corresponding visualization (Figure 5.2).

The user interface elements of the visualization tab are specific to the visualization module in use. The GUI of our tree map module is described in detail in Section 5.5.

The editor also provides an action (menu entry and toolbar button) for exporting the visualization as a bitmap. For the time being this works only for prefuse-based visualizations (like our tree map visualization) because it relies on prefuse functionality. For other visualizations it is deactivated.

5.4 Data Importer for the Case Study Project

5.4.1 CMS Importer

Our CMS importer uses an intermediate database, which we call the CMS DB. It is modeled after the native CMS data model (see Section 3.1.1). The importer uses CMS history log files to reconstruct the complete project history in the CMS DB. This data is then used to extract the Release history and to populate the RHDB with it.

These are the steps of the import process:

① Import the CMS history log files into the database.

② Parse the log entries and reconstruct the history in the CMS DB.

③ Identify the CMS Classes that compose the Releases.

④ Map the Releases into the RHDB.

These steps are explained in the following paragraphs:

① Import the CMS history log

A CMS history log contains an entry for every operation on the CMS library that changed its state. The history log can be exported to a text file, in which each log entry corresponds to one line. Each entry has the following format:

date_time userID command files remark

We import the history log file of every CMS library of the case study project into the database. To be able to use the Hibernate facilities, we created a HistoryEntry class. It contains a variable for every field of a history entry, which means that we have to parse the entries in order to assign the variables. Initially we implemented a parser based on the history log format information from the CMS reference manual [Hew05a]. This parser proved to be insufficient because of many irregularities in the real history log files. For example, according to the manual a Generation identifier should consist of the file name followed by the Generation number in parentheses: ILSS_STRING.ADA(6). But often the parentheses are simply dropped: ILSS_STRING.ADA 6. There are several more such inaccuracies which make it difficult to parse the log. We suspect that these “errors” are the result of a feature of the CMS client command parser. It seems that the client is very tolerant and accepts commands that are not well formatted according to the reference manual. Commands are written to the log exactly as the user has typed them, not as the client has interpreted them. Therefore the log files reflect the tolerance of the CMS client.

Figure 5.1: The Release selector of the EvoBrowser Editor shows the complete release tree. The user can click on a Release to open a tab with the corresponding visualization.

Figure 5.2: The visualization panel contains the visualization itself and a number of UI elements to control the visualization interactively.

One particular irregularity proved to be the stumbling block for our initial parser. Remark strings should be delimited by double quotes. When the quotes are missing, it is very difficult to distinguish the individual fields of the history entry. Writing a parser capable of reading these entries would demand a disproportionately high effort.

Finally we found an elegant solution: HP provides a dll³ which can be used to implement custom client applications. We created a client with a modified history output format. In our customized version every field of an entry is delimited by a pipe character '|':

date_time | userID | command | identifier | remark

With the added field delimiter we have no problem identifying the individual fields of an entry, even if it is not well-formed.

Listing 5.4 shows an excerpt of a history log file produced with our customized output format.

6-MAY-1994 16:13:36 | EB | REPLACE | ILBS_MZ_5_MAIL.ADA(2) | erweitert
6-MAY-1994 16:13:39 | EB | REPLACE | ILBS_MZ_5_MAIL_.ADA(3) | erweitert
6-MAY-1994 16:13:41 | EB | REPLACE | ILBS_MZ_MAIL_TASK.ADA(3) | erweitert
6-MAY-1994 16:13:43 | EB | REPLACE | ILBS_MZ_MAIL_TASK_.ADA(2) | erweitert
6-MAY-1994 16:13:45 | EB | REPLACE | ILBS_MZ_TASK.ADA(3) | erweitert
6-MAY-1994 16:13:48 | EB | REPLACE | ILBS_MZ_VERWALTUNGS_TASK.ADA(4) | erweitert
6-MAY-1994 16:15:17 | EB | UNRESERVE | ILBS_MZ.ADA(2) |
6-MAY-1994 16:15:21 | EB | UNRESERVE | ILBS_MZ_2_ABO_VERWALTUNG_.ADA(1) |
6-MAY-1994 16:15:27 | EB | UNRESERVE | ILBS_MZ_VERWALTUNGS_TASK_.ADA(1) |
6-MAY-1994 16:21:41 | HAY | CREATE CLASS | FPV_940506 | Release 33
6-MAY-1994 16:21:50 | HAY | INSERT GENERATION | ILBS_STRING.ADA(2) FPV_940506 | Release 33
6-MAY-1994 16:21:52 | HAY | INSERT GENERATION | ILBS_STRING_.ADA(2) FPV_940506 | Release 33
6-MAY-1994 16:21:54 | HAY | INSERT GENERATION | ILBS_ZDB.ADA(10) FPV_940506 | Release 33
6-MAY-1994 16:21:57 | HAY | INSERT GENERATION | ILBS_ZDB_.ADA(9) FPV_940506 | Release 33

Listing 5.4: Excerpt of a CMS history log file with customized format.

After the import our database contains the complete CMS history log. At this point it is still a chronological collection of commands and not yet an image of the actual CMS model.
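
Splitting an entry in the pipe-delimited format can be sketched as follows. This is a minimal illustration and not the importer's code: the class names ParsedHistoryEntry and HistoryLineParser are hypothetical, and the timestamp pattern is an assumption derived from the entries shown in Listing 5.4.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

/** Hypothetical value class with one variable per field of a history entry. */
final class ParsedHistoryEntry {
    final LocalDateTime dateTime;
    final String userId;
    final String command;
    final String identifier;
    final String remark;

    ParsedHistoryEntry(LocalDateTime dateTime, String userId, String command,
                       String identifier, String remark) {
        this.dateTime = dateTime;
        this.userId = userId;
        this.command = command;
        this.identifier = identifier;
        this.remark = remark;
    }
}

final class HistoryLineParser {
    // Month names in the log are upper case ("MAY"), so parse case-insensitively.
    private static final DateTimeFormatter TIMESTAMP = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("d-MMM-uuuu HH:mm:ss")
            .toFormatter(Locale.ENGLISH);

    /** Splits one line of the form: date_time | userID | command | identifier | remark */
    static ParsedHistoryEntry parse(String line) {
        // A limit of 5 keeps an empty trailing remark and leaves any '|'
        // inside the remark untouched.
        String[] fields = line.split("\\|", 5);
        if (fields.length != 5) {
            throw new IllegalArgumentException("expected 5 fields: " + line);
        }
        return new ParsedHistoryEntry(
                LocalDateTime.parse(fields[0].trim(), TIMESTAMP),
                fields[1].trim(),
                fields[2].trim(),
                fields[3].trim(),
                fields[4].trim());
    }
}
```

Because the delimiter is unambiguous, the parser no longer needs to know whether a remark was quoted or whether a Generation number was written in parentheses; such details can be handled later, field by field.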

② Process the log entries

In this step we parse the commands in the log entries and reconstruct the CMS history in our CMS DB. This is mostly a straightforward operation. We query the log entries in chronological order and apply string matching to the command field:

³ dynamic link library


• For every CREATE command we instantiate a corresponding object {Element | Class | Group} of the CMS model and store it in the CMS DB.

• For a DELETE command we mark the object in the CMS DB as deleted. We do not remove the object from the database because we want to be able to query a past state.

• For a REPLACE (the equivalent of a commit) we query the affected Element from the CMS DB and add a new Generation to its collection of Generations.

• For an INSERT or REMOVE we insert or remove the respective Element into or from the Group, or the respective Generation into or from the Class.

Other commands such as the UNRESERVE in Listing 5.4 are not relevant for our model, so we ignore them.

The mapping process is straightforward as long as the history is complete. However, this is not always the case. Several logs of the case study project are partially purged, which means that the history of these libraries is missing up to a certain date. Additionally, it seems that in some cases a CMS operation is omitted from the history. For example, it happened several times that a Generation which, according to the log, has never been created gets inserted into a Class.

When operations are performed on an object whose history is missing, we have to guess values. In our example, we create the Generation with the timestamp of the insert operation and with empty userid and remark values.
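
The replay of the log entries, including the fallback for missing history, can be sketched in memory as follows. This is only an illustration under stated assumptions: the class CmsHistoryReplayer is hypothetical, it keeps plain maps instead of persisting objects through Hibernate, it covers only a subset of the commands, and apart from the command strings visible in Listing 5.4 (REPLACE, CREATE CLASS, INSERT GENERATION, UNRESERVE) the exact command spellings are assumed.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the CMS DB. */
final class CmsHistoryReplayer {

    /** A Generation such as "ILBS_ZDB.ADA(10)". */
    static final class Generation {
        final String id;
        final String timestamp;
        final String userId;
        final String remark;
        final boolean guessed; // true when the log never recorded its creation

        Generation(String id, String timestamp, String userId, String remark, boolean guessed) {
            this.id = id;
            this.timestamp = timestamp;
            this.userId = userId;
            this.remark = remark;
            this.guessed = guessed;
        }
    }

    final Map<String, Generation> generations = new LinkedHashMap<>();
    final Map<String, Set<String>> classes = new LinkedHashMap<>(); // Class name -> Generation ids
    final Set<String> deleted = new LinkedHashSet<>();              // marked as deleted, never removed

    /** Applies one log entry; entries must be passed in chronological order. */
    void apply(String timestamp, String userId, String command, String identifier, String remark) {
        if (command.equals("CREATE CLASS")) {
            classes.putIfAbsent(identifier, new LinkedHashSet<>());
        } else if (command.startsWith("DELETE")) {
            deleted.add(identifier); // keep the object so that past states stay queryable
        } else if (command.equals("REPLACE")) {
            generations.put(identifier, new Generation(identifier, timestamp, userId, remark, false));
        } else if (command.equals("INSERT GENERATION")) {
            String[] parts = identifier.trim().split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected '<generation> <class>': " + identifier);
            }
            // Missing history: create the Generation with the insert timestamp
            // and empty userid and remark values.
            generations.computeIfAbsent(parts[0], id -> new Generation(id, timestamp, "", "", true));
            classes.computeIfAbsent(parts[1], name -> new LinkedHashSet<>()).add(parts[0]);
        }
        // Any other command (for example UNRESERVE) does not affect the model.
    }
}
```

Flagging guessed Generations explicitly, as the guessed field does here, is a design choice of this sketch; it would allow later metric calculations to exclude values that were not taken from the log.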

③ Identify the Release Classes

The RHDB is not designed to contain the complete versioning history, but only the Release snapshots. To be able to map those snapshots from the CMS DB to the RHDB, we first have to identify the Generations (that is, the Revisions) that constitute the Releases.

In the CMS, Classes are used to mark a Release. In every CMS library of the project a Class is created, and the Generations that make up the Release are added to this Class. The Class is named after the Release.

To find the Classes of a Release, we search every library for the Classes whose name contains the specific Release name.

④ Map the Releases into the RHDB

In the previous step we identified which Generations in the CMS are part of a Release. The next task is to translate those Generations and their owning Elements into the equivalent objects of the RHDB model, the Revision and the File. We use the following heuristic:

• For every Generation we query whether its owning Element already has a corresponding File entry in the RHDB. If the File does not exist yet, we create a new one. As its parent Directory we take the Directory with the same name as the CMS library. Again, if it does not exist yet, we create it.

• For the Generation we query whether the corresponding Revision is already contained in the RHDB. If not, we create and store it. We also store the id value of the Generation object as a variable in the Revision object. When calculating metrics it is essential to be able to get back from the Revision in the RHDB to the Generation in the CMS DB. The reason is that the RHDB does not contain the whole history but only snapshots, which might not be enough information depending on the metric we want to calculate.

• Finally we add the Revision object to the collection of Revisions of the Release.
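
The heuristic above amounts to a chain of get-or-create lookups. The following sketch shows this with plain collections; the class ReleaseMapper and its key formats are hypothetical, whereas the real importer would issue Hibernate queries against the RHDB.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the RHDB side of the mapping. */
final class ReleaseMapper {

    static final class Revision {
        final String filePath;
        final int number;
        final long generationId; // link back to the Generation in the CMS DB

        Revision(String filePath, int number, long generationId) {
            this.filePath = filePath;
            this.number = number;
            this.generationId = generationId;
        }
    }

    final Set<String> directories = new LinkedHashSet<>();
    final Set<String> files = new LinkedHashSet<>();
    final Map<String, Revision> revisions = new LinkedHashMap<>();
    final Map<String, Set<Revision>> releases = new LinkedHashMap<>();

    /** Maps one Generation of a Release to a File/Revision pair, creating what is missing. */
    Revision addToRelease(String release, String library, String element,
                          int generationNumber, long generationId) {
        directories.add(library);             // Directory named after the CMS library
        String path = library + "/" + element;
        files.add(path);                      // File for the owning Element
        Revision revision = revisions.computeIfAbsent(path + "#" + generationNumber,
                key -> new Revision(path, generationNumber, generationId));
        releases.computeIfAbsent(release, name -> new LinkedHashSet<>()).add(revision);
        return revision;
    }
}
```

A Generation that belongs to several Releases is therefore stored only once as a Revision and merely referenced from each Release.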


5.4.2 Tracker & Clearquest Importer

We have access to the Issue data of both Tracker and Clearquest in the form of export files. These are CSV text files.

These are the steps of the import process:

① Import the Issues into the intermediate database.

② Find the associated Generations.

③ Associate the Issues with the Releases.

④ Map the Issues to the RHDB.

These steps are explained in the following paragraphs:

① Import the Issue files

The export files of Tracker and of Clearquest are CSV text files. For Clearquest there are three different export files for the three types of Issues: Bugreport, Changerequest, and Featurerequest. All files have a different format, so we have to use a different parser for each.

We use an intermediate database for the Issues similar to the one for the CMS data. For our purpose we only import closed issues. This means our data contains only the final state of the Issues, not their complete history. Every line in the export file corresponds to one object in the Issue model.

Listings 5.5 and 5.6 show excerpts of the Tracker export file and of the Clearquest Changerequest export file. The Clearquest export files for Bugreports and Featurerequests have the same syntax as the listed one for Changerequests, but they are missing some of the fields.

In the Clearquest example you can find the field BuildIntoRelease. This is basically the same information that we calculate with the heuristic described in Section 4.3.2. We do not use the information from the export file because it is often missing or incomplete.

② Find the associated Generations

After importing the Issues into the intermediate database, we investigate which Generations have been created for the resolution of each Issue. We can extract this information from the versioning history in the CMS DB. Two different means have been employed to store this information in the CMS:

Tracker Issues and older Clearquest Issues For the older Issues the Issue number is stored in the remark attribute of the Generations. We have to use tolerant pattern matching to find the Generations, because the remark has been written by hand and the correct format of the remark has not been enforced.

Newer Clearquest Issues To mark the Generations for newer Issues, CMS Classes are used. In every library that contains Generations associated with an Issue, a Class is created. The Class name contains the Issue number.
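
Tolerant matching of hand-written remarks could look like the following sketch. The actual remark conventions of the case study project are not documented here, so the prefixes (SR, CQ, ISSUE), the accepted separators, and the class name IssueNumberMatcher are purely hypothetical; the point is only that case, separators, and leading zeros are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IssueNumberMatcher {
    // Hypothetical remark convention: a prefix, an optional separator, and the number.
    // Leading zeros are skipped; the number itself is limited to 9 digits.
    private static final Pattern ISSUE = Pattern.compile(
            "\\b(?:SR|CQ|ISSUE)\\s*[-#:.]?\\s*0*(\\d{1,9})\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns all Issue numbers mentioned in a Generation remark, in order of appearance. */
    static List<Integer> findIssueNumbers(String remark) {
        List<Integer> numbers = new ArrayList<>();
        Matcher matcher = ISSUE.matcher(remark);
        while (matcher.find()) {
            numbers.add(Integer.parseInt(matcher.group(1)));
        }
        return numbers;
    }
}
```

A pattern this tolerant can also produce false positives, so in practice the matches would have to be checked against the set of Issue numbers that actually exist in the intermediate database.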

"Id";"Systemrapport-Typ";"Prioritaet";"Fehlerhaeufigkeit";"Ursprung/Kategorie";"betroffene Aufgabe (Analysierer)";"Sicherheitskritisch";"gesch. Aufwand in Std";"gesch. Valid.-Aufwand:"
"2";"Fehler";"normal";"mehrmals, reproduzierbar";"Kundenanlage";"MERKUR";"<<None>>";"16";""
"3";"Fehler";"normal";"mehrmals, reproduzierbar";"<<None>>";"STW";"nein";"1";""
"4";"Fehler";"wichtig";"mehrmals, reproduzierbar";"<<None>>";"STW";"nein";"0";""
"5";"Fehler";"wichtig";"einmalig";"<<None>>";"ZL";"nein";"0";""
"6";"Fehler";"wichtig";"mehrmals, reproduzierbar";"Int.- & Sys.-Test ZDA-Redesign";"JZV";"nein";"1";""
"7";"Fehler";"<<None>>";"mehrmals, reproduzierbar";"Int.- & Sys.-Test ZDA-Redesign";"MXD";"nein";"0";""

Listing 5.5: Excerpt of a Tracker export file.

dbid|id|Headline|Submitter|SubmitDate|BuildIntoRelease|Priority|SafetyCritical
33556571|00002139|Erweiterung um Pruefung der Release-Nummer - Aufgabe gegen Shared-Image|abd|20.06.2003|3918.1.39|wichtig|Ja
33556573|00002141|Socket-Options in TCPIP_Connection für VMS nicht implementiert|las|16.02.2004||normal|Nein
33556574|00002142|Anzahl DG-Buffer pro AF-Task|gap|27.03.2004|4002.7.40|wichtig|Nein
33556575|00002143|HYPER_CHECK, Element-Art Fehler bei Luecken|ers|04.05.2004|3962.1.39|normal|Nei
33556579|00002147|Tool: verteil|ert|12.08.2004||normal|
33556584|00002152|Toleranteres Verhalten bei nicht Uebereinstimmung der passiven Bereiche|gab|17.11.2004||normal|
33556585|00002153|Aufgabe ESV|abd|01.12.2004||normal|

Listing 5.6: Excerpt of a Clearquest Changerequest export file. The export files for Bugreports and Featurerequests look similar, but they are missing some of the fields.


③ Associate the Issues with the Releases

We use the data in the CMS DB to calculate, for every Issue in the intermediate database, in which Release it has been resolved. We use the heuristic from Section 4.3.2 to achieve this.

④ Map the Issues to the RHDB

This step is straightforward: we create a corresponding Issue object in the RHDB for every issue in the intermediate database.

5.5 Tree Map Visualization

Our tree map visualization is based on the prefuse visualization toolkit. The prefuse distribution contains an implementation of a squarified tree map layout which we used as a base for our visualization module. The original tree map layout, however, does not support filtering. This is a fundamental capability that we demand of our visualization, because it is used to render tree maps of repositories that contain thousands of files. Without a filtering feature much information remains hidden. We need an option in our tree map layout to filter out the big leaves so that the small ones get more room and become visible.

Prefuse is an open source product and the source code of the original tree map layout is freely available. This allowed us to extend the layout for our needs with comparatively little effort. We modified the layout code so that no space is allocated for leaves that are marked as invisible.
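
The effect of the filter on the space allocation can be illustrated independently of prefuse. The sketch below is not the squarified layout and does not use the prefuse API; the hypothetical class FilteredAreaAllocator only shows the proportional share of the total area that each leaf receives once invisible leaves are left out of the sum.

```java
final class FilteredAreaAllocator {

    /** Marks the leaves whose size value lies within [min, max] as visible. */
    static boolean[] filterBySize(double[] sizes, double min, double max) {
        boolean[] visible = new boolean[sizes.length];
        for (int i = 0; i < sizes.length; i++) {
            visible[i] = sizes[i] >= min && sizes[i] <= max;
        }
        return visible;
    }

    /**
     * Distributes totalArea proportionally to the size values of the visible leaves.
     * Invisible leaves and leaves with a value <= 0 get no area at all.
     */
    static double[] allocate(double[] sizes, boolean[] visible, double totalArea) {
        double sum = 0;
        for (int i = 0; i < sizes.length; i++) {
            if (visible[i] && sizes[i] > 0) {
                sum += sizes[i];
            }
        }
        double[] areas = new double[sizes.length];
        if (sum == 0) {
            return areas; // nothing to render
        }
        for (int i = 0; i < sizes.length; i++) {
            if (visible[i] && sizes[i] > 0) {
                areas[i] = totalArea * sizes[i] / sum;
            }
        }
        return areas;
    }
}
```

With size values 900, 50 and 50, the two small leaves share only 10% of the area as long as all leaves are visible. Once the big leaf is filtered out, each of them receives half of the area.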

The complete visualization module consists of the visualization itself and a number of UI elements that can be used to control the visualization interactively (see Figure 5.2). The following list describes the main control elements:

Size combobox This combobox controls which metric values are used to calculate the leaf size in the tree map layout algorithm. The area is allocated proportionally to the value. Leaves with a value of <= 0 are not rendered.

Color combobox The color shading of the leaves is calculated based on the metric values chosen in this combobox. A leaf with a value of <= 0 is not colored (it stays white). For the other leaves we generate a color map for the value range of the metric. Prefuse provides helper classes for generating such a color map. First we query the data structure for the lowest and the highest value, then we create a new color map for this value range:

    int maxValue = DataLib.max(...);
    int minValue = DataLib.min(...);
    int[] colors = ColorLib.getInterpolatedPalette(maxValue - minValue,
            ColorLib.rgb(80, 255, 80), ColorLib.rgb(255, 80, 80));
    ColorMap colorMap = new ColorMap(colors, minValue, maxValue);

The color map contains a color gradient from green to red. The leaf with the lowest metric value is colored green, the one with the highest is colored red. The colors of the other leaves are allocated according to their value: colorMap.getColor(colorValue);

Size filter With this range slider one can restrict which leaves are rendered based on their size value. Leaves whose size value is outside the selected range are not included in the visualization. When the range slider is adjusted, we mark all leaves with size values outside the new size range as invisible and rerun the layout.
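
What the green-to-red color map computes for a single leaf can be approximated by a plain linear interpolation between the two end colors. The sketch below does not use prefuse; the class GreenRedGradient is hypothetical, and it returns a packed 0xRRGGBB value without an alpha channel, which may differ from the encoding prefuse uses internally.

```java
final class GreenRedGradient {

    /**
     * Interpolates linearly between green (80, 255, 80) at min and red (255, 80, 80) at max.
     * Values <= 0 get no color and are returned as white.
     */
    static int colorFor(double value, double min, double max) {
        if (value <= 0) {
            return 0xFFFFFF;
        }
        double t = max > min ? (value - min) / (max - min) : 0.0;
        t = Math.max(0.0, Math.min(1.0, t)); // clamp values outside the range
        int red = (int) Math.round(80 + t * (255 - 80));
        int green = (int) Math.round(255 + t * (80 - 255));
        int blue = 80;
        return (red << 16) | (green << 8) | blue;
    }
}
```

A precomputed palette, as in the prefuse-based code, yields essentially the same gradient but quantizes it to a fixed number of color steps.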

Because the space in the tree map is limited, it is not feasible to show much more information than what is already encoded in the size and color of the leaves. We do show labels for the names of the subtrees in the first hierarchy layer. There is, however, not enough room to show the names or other information of the individual leaves. Nonetheless it is desirable to be able to query more information about a leaf. After all, color and size allow no precise conclusion about the actual values; they just give an idea of the magnitude.

We solved this problem by using tooltips. When the mouse pointer hovers over a leaf, a tooltip shows the name of the element and the actual numeric values of the metrics used for color and size.

5.6 Summary

In this chapter we have given an overview of the architecture of the system with its core plugin and the extension points. We have described the implementation of the plugin including the integrated RHDB data model and the user interface integration. Furthermore, we have presented the data importers for the case study project. Finally we have introduced the tree map visualization module.

In the next chapter we will demonstrate visualizations based on data imported from the case study project. We will evaluate the effectiveness of those visualizations and propose several improvements.

Chapter 6

Evaluation

6.1 Case Study Project

Our case study system is an integrated control and information system for railways. It enables efficient automatic operations management. It guarantees safety-critical operator control of interlockings and a reliable display of the operating situation.

It is developed in the Ada 83/95 programming language, which has a reputation for being particularly reliable (an important quality for systems used in mass transportation). It was originally designed for the VMS operating system and is now being ported to Windows. Some of the new features of the system are being developed in Java.

The following list shows some key figures of the project:

• At the time of this writing the first parts of the system are about 15 years old. The planning of the project started in 1991 and the first source code was imported into the CMS Libraries a year later.

• The complete system is split up over 17 different CMS Libraries.

• As of May 2006 the Libraries contain 13’625 files with a total of 84’376 revisions. Of these, 5189 files are Ada specifications and 5668 are Ada implementations (bodies). The remaining 2768 are various files such as documentation and user interface declarations. These figures do not include the system parts that are developed in Java and are managed in Clearcase.

• We had access to the reports of the 4540 resolved Issues from the last six years. Of these, 3112 are bugreports. 142 Issues have been marked as safety critical.

• Up to 66 developers have worked on files (this number includes some changed user ids).

• 356 releases have been made between January 2001 and March 2006. Of these releases 200 were maintenance releases, the other 156 were internal development releases.


6.2 Tree Map Visualizations

6.2.1 System Growth

Description We want to highlight the growth of the case study project over time in terms of the number of files contained in a release. We apply no metrics to the visualization, so that all files are rendered with equal size and without any coloring. We choose three different Releases from the development branch: one from the beginning of the analyzed time span, one from the end, and one approximately in the middle.

Evaluation Figure 6.1 shows the three resulting visualizations. The increase in the number of files is clearly visible. At the beginning of the time span the release contains 2938 files. Two and a half years later the number has increased by about 45% to 4257. For the last release the number has more than doubled compared to the first one: this release contains 8193 files. Also visible is the increase in the number of libraries (depicted in these visualizations as the top-level directories). It grows from 10 to 12 and finally to 17 libraries. Not all of the libraries see much growth; several of the smaller ones have a roughly constant number of files. Interestingly, the three biggest libraries grow over the whole period. These libraries were already nearly 10 years old at the beginning of the analyzed period.

Result The visualizations give us an idea of the growth of the system in terms of the number of files and libraries. In each visualization we can see what share of the whole system a library constitutes.

However, we cannot compare the different releases directly. When more files are added, the complete visualization area has to be shared by more files and the absolute size of every file is reduced. This makes it hard to see whether the size of a library has in fact increased, stagnated, or even decreased. It would be desirable to have an option to render the files with a fixed size and use a dynamic visualization size instead.
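The idea of a fixed file size with a dynamic visualization size can be sketched as follows. This is only an illustration under stated assumptions, not code from the plugin: the class FixedLeafCanvas, the parameter pixelsPerFile and the aspect ratio are hypothetical names introduced here. The sketch also recomputes the relative growth between the three releases of Figure 6.1.

```java
// Hypothetical helper, not part of the plugin described in this thesis.
final class FixedLeafCanvas {

    /** Canvas area in square pixels when every file gets the same fixed area. */
    static long canvasArea(int fileCount, int pixelsPerFile) {
        return (long) fileCount * pixelsPerFile;
    }

    /** Width of a canvas with the given area and aspect ratio (width / height). */
    static int canvasWidth(long area, double aspectRatio) {
        return (int) Math.ceil(Math.sqrt(area * aspectRatio));
    }

    /** Smallest height such that width * height covers the requested area. */
    static int canvasHeight(long area, int width) {
        return (int) Math.ceil((double) area / width);
    }

    /** Relative growth from 'before' to 'after' in percent. */
    static double growthPercent(int before, int after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        int[] fileCounts = {2938, 4257, 8193}; // releases (a), (b), (c) of Figure 6.1
        for (int count : fileCounts) {
            long area = canvasArea(count, 64); // assumed: 64 square pixels per file
            int width = canvasWidth(area, 1.0);
            int height = canvasHeight(area, width);
            System.out.printf("%d files: %d x %d pixels, growth since first release: %.1f%%%n",
                    count, width, height, growthPercent(fileCounts[0], count));
        }
    }
}
```

With such a scheme the canvas of the last release would have roughly 2.8 times the area of the first one, so the growth of each library becomes directly comparable across the three maps.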

6.2.2 Modifications

Description In this example we analyze the number of revisions created for the files in a release, both since the last release and since the creation of the file (as specified in Section 4.3.3, only ancestors of the files concerned are counted, not revisions in other branches). We try to find unstable files.

Evaluation Figure 6.2 shows three different visualizations with the revision count values.

• In Figure 6.2(a) we used the total number of commits as the size value and the number of commits since the last release as the color value.
We can see that most modified files have been changed fewer than half a dozen times since the last release.
In the ILPV library, however, three files stand out. Strikingly, they are among the most often modified files of the library. After inspecting those files (by mouse hover and the information tooltip) we found that the two smaller ones are in fact the interface and the implementation of the same Ada type. The big one is the implementation of a distinct type, but its name is nearly identical to that of the other two files. We conclude from the name similarity and from the fact that they have been modified at the same time that there is probably a strong coupling between these files.


(a) Release January 2001 — 10 Libraries — 2938 Files

(b) Release August 2003 — 12 Libraries — 4257 Files

(c) Release January 2006 — 17 Libraries — 8193 Files

Figure 6.1: These three visualizations show the growth of the case study project in terms of the number of files over a period of 5 years. We applied no metric values to the visualization, so that all files are rendered with identical size and no color.


• Figure 6.2(b) shows the same value configuration as the previous one, but with a size filter applied. The filter is configured to mask all files with fewer than 100 revisions. The resulting visualization shows the 18 most often changed files of the project.
About two thirds of these files have been modified again for this release. Two of the files from the ILPV package that we noticed in the previous visualization are included, as expected. Several of the other files are related to the user interface, as we can assume from their file names. It is not unexpected that user interface files change often when new features of the system have to be integrated into the UI. Another file is used for identifier declarations. We presume that this file is used to manage the global identifiers centrally. This would explain why the file has to be modified often: every time a new global identifier is introduced. The names of the remaining modified files are too generic for us to guess their function and character.
The unmodified files might be older ones which have become more stable. We cannot draw a definite conclusion, however, because we only see the change values for this release and for the complete lifetime, but not for intermediate periods. It is quite possible that it is pure coincidence that they were not modified for this release.

• For Figure 6.2(c) we have swapped the values: the change count since the last release now defines the size, and the total change count the color. In this visualization we see only the changed files; the others are not shown since they have a zero value.
Clearly visible now is the addition of the THIRD PARTY package. Since these files appear in a release for the first time, a change count of 1 is assumed (refer to Section 4.3.3). Most of the other files also seem to be quite young; many have been modified fewer than a dozen times during their complete lifetime.
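The size filter applied for Figure 6.2(b) can be illustrated by a short sketch. The types FileLeaf and SizeFilter are hypothetical and introduced only for this example; they do not reflect the plugin's actual data model. The filter keeps a leaf only if its size value (here the total number of revisions) reaches the configured minimum.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical leaf type; the plugin's real data model is not shown in this chapter.
final class FileLeaf {
    final String name;
    final int revisionsTotal;            // size value in Figure 6.2(a) and (b)
    final int revisionsSinceLastRelease; // color value in Figure 6.2(a) and (b)

    FileLeaf(String name, int revisionsTotal, int revisionsSinceLastRelease) {
        this.name = name;
        this.revisionsTotal = revisionsTotal;
        this.revisionsSinceLastRelease = revisionsSinceLastRelease;
    }
}

// Hypothetical filter, assumed to run before the treemap layout is computed.
final class SizeFilter {

    /** Keeps only the leaves whose size value is at least minSize. */
    static List<FileLeaf> apply(List<FileLeaf> leaves, int minSize) {
        List<FileLeaf> visible = new ArrayList<>();
        for (FileLeaf leaf : leaves) {
            if (leaf.revisionsTotal >= minSize) {
                visible.add(leaf);
            }
        }
        return visible;
    }
}
```

Calling SizeFilter.apply(leaves, 100) corresponds to the configuration of Figure 6.2(b), where only files with 100 or more revisions remain visible.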

Result We were able to identify several unstable files. Most of them probably have a legitimate reason for the numerous modifications, such as adapting the user interface or introducing new global identifiers.
The three files from the ILPV library, however, need further investigation. There seems to be a strong coupling between them; they might be candidates for a refactoring.

6.2.3 Developers

Description With the visualizations in Figure 6.3 we examine the relation between the number of revisions of a file and the number of persons that have modified that file.

Evaluation In Figure 6.3(a) we used the total number of revisions for the size value and the total number of authors for the color value. As expected, we can recognize a tendency for often modified files to have many authors.

After applying a size filter, we get Figure 6.3(b). The filter hides all files with fewer than 85 revisions. Two files are especially noticeable. The file in the ILIS library is an Ada specification file and seems to be related to a print functionality. Without further inquiry into the function of the file we cannot reliably determine the reason for modifications by so many authors. The other file that stands out, in the ILSS package, is again the file defining the global identifiers that we already noticed in Figure 6.2(b).

In Figure 6.3(c) we consider only the changes since the last release. Again we map the revision count to the size and the author count to the color. Because of the high release frequency in the case study project, not many files have been modified, and most of them have only one new revision.


(a) size: revisions total, color: revisions since last release

(b) size: revisions total, size filter: >= 100, color: revisions since last release

(c) size: revisions since last release, color: revisions total

Figure 6.2: These visualizations show the revision count per file (since the last release and in total) applied to one release. Figure (b) corresponds to Figure (a) with a size filter applied: only files with 100 or more commits are shown. Figure (c) is Figure (a) with the size and color values swapped.


Result These visualizations have proved to be of marginal value so far. We do not gain much more insight than from the number of modifications alone. It remains to be verified whether this holds for the majority of the other releases.

6.3 Summary and Discussion

With the visualizations we were able to highlight several aspects of the case study system:

• We have demonstrated the continuing growth of the system during the last five years.

• We highlighted several often changing files based on the number of changes and were able to identify files which we suspect to have a high coupling.

• We have seen that there is, not unexpectedly, a strong relation between the number of changes of a file and the number of developers that have modified the file.

We had to realize, however, that our delta values are of limited benefit. Because the case study project sees weekly releases, there are no large differences between two consecutive releases. As we have seen with the developers visualization in Figure 6.3(c), we do not have enough values for a useful visualization. This could be improved with an option to choose the release span for calculating the delta values.
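A configurable release span for the delta values could look like the following sketch. It assumes, purely for illustration, that the cumulative revision count of a file is available per release as an array; the class ReleaseSpanDelta and its method are hypothetical and not part of the plugin.

```java
// Hypothetical helper; assumes cumulative[i] holds the total number of
// revisions of one file up to and including release i.
final class ReleaseSpanDelta {

    /**
     * Number of revisions created during the last 'span' releases up to and
     * including 'release'. A span of 1 gives the delta to the previous release.
     * If the span reaches back before the first release, the full count is returned.
     */
    static int delta(int[] cumulative, int release, int span) {
        if (span < 1) {
            throw new IllegalArgumentException("span must be at least 1");
        }
        if (release < 0 || release >= cumulative.length) {
            throw new IndexOutOfBoundsException("no such release: " + release);
        }
        int from = release - span;
        int base = from >= 0 ? cumulative[from] : 0;
        return cumulative[release] - base;
    }
}
```

With weekly releases, a span of four releases would roughly aggregate one month of changes, which should yield more non-zero values than the comparison with the immediately preceding release.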

We demonstrated that our treemap visualization is an adequate means to visualize release metrics on file level. We have shown that a treemap can look concise even with thousands of leaves. Further, we demonstrated that a filter for the leaf size can be useful, or even necessary.

We did, however, also identify some shortcomings in our implementation which can be corrected with an adapted user interface:

• We provide a filter for the size value. We encountered situations where we wished that we had the same filter feature for the color value.

• The color palette is determined by the span of the color values. When we filter out leaves, it is possible that the extreme colors are no longer included in the visualization. The palette should be recalculated based on the visible leaves. This would enhance the contrast between the values and make differences more visible.

• We missed a further filter facility based on the leaf name. Our case study project consists mainly of Ada source files. In Ada the interface and the implementation of a type are in separate files. We want to be able to hide all interface files in order to see only the implementations. This can be achieved easily by implementing a pattern-matching name filter.

• Our treemap shows the whole tree of the file system structure in a release. This can look overwhelming because of the high number of elements. We would like the capability to navigate in the tree: when we “enter” a subtree, the visualization hides the other subtrees and shows a magnified map of the selected one.
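Two of the improvements listed above, the name filter and the recalculation of the palette on the visible leaves, can be sketched together. The types ColoredLeaf and LeafView are hypothetical and serve only as an illustration. The file name pattern used in the usage note assumes the common .ads/.adb naming of Ada specification and body files, which may differ from the naming used in the case study project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical leaf type carrying only what this sketch needs.
final class ColoredLeaf {
    final String name;
    final double colorValue;

    ColoredLeaf(String name, double colorValue) {
        this.name = name;
        this.colorValue = colorValue;
    }
}

// Hypothetical view helper, not part of the plugin described in this thesis.
final class LeafView {

    /** Hides every leaf whose complete name matches the given regular expression. */
    static List<ColoredLeaf> hideByName(List<ColoredLeaf> leaves, Pattern hidden) {
        List<ColoredLeaf> visible = new ArrayList<>();
        for (ColoredLeaf leaf : leaves) {
            if (!hidden.matcher(leaf.name).matches()) {
                visible.add(leaf);
            }
        }
        return visible;
    }

    /**
     * Maps the color value of each visible leaf to a palette position in [0, 1],
     * using the minimum and maximum of the visible leaves only.
     */
    static double[] paletteIndex(List<ColoredLeaf> visible) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (ColoredLeaf leaf : visible) {
            min = Math.min(min, leaf.colorValue);
            max = Math.max(max, leaf.colorValue);
        }
        double range = max - min;
        double[] index = new double[visible.size()];
        for (int i = 0; i < index.length; i++) {
            // All visible leaves share one value: place them at the start of the palette.
            index[i] = range == 0 ? 0.0 : (visible.get(i).colorValue - min) / range;
        }
        return index;
    }
}
```

For example, hideByName(leaves, Pattern.compile(".*\\.ads")) would hide the specification files under the assumed naming, and paletteIndex applied to the remaining leaves spreads their color values over the full palette again.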

Further, we realized that a tree map is not the optimal visualization method for all kinds of metrics. For example, in Figure 6.1 we used tree maps to show the growth of the system. We succeeded in giving an impression of the growth, but the three tree maps were not directly comparable, and the maps look rather complex for demonstrating a simple value. In this instance a different visualization such as a line chart or a bar chart is more appropriate.


(a) size: revisions total, color: authors total

(b) size: revisions total, size filter: >= 85, color: authors total

(c) size: revisions since last release, color: authors since last release

Figure 6.3: These visualizations show the relation between the number of revisions and the number of authors of a file. All three visualizations concern the same release.

Chapter 7

Conclusion

During this thesis we designed and implemented an Eclipse plugin for visualizing metrics of the history of a software project. The plugin consists of the following parts:

• simplified, generalized data models for versioning data, issue data, and metrics data

• the implementation of an RHDB based on the data models

• interfaces to populate the RHDB which can be accessed from within Eclipse or from stand-alone importers

• a data interface definition for visualization modules to enable drop-in extension by additional modules

• a treemap visualization module suitable for visualizing hierarchical data

• GUI integration in Eclipse for configuration and for operating the visualizations (choosing the metrics and a compatible visualization module)

Furthermore, we implemented a stand-alone data importer which we used to populate the RHDB with the data of a large industrial software project.

We demonstrated that the treemap is an appropriate means to visualize the metrics of hierarchical objects like a directory tree. We did, however, identify several shortcomings of the visualization module which leave room for improvement.

7.1 Future Work

We plan to extend the plugin in several aspects:

• We will enhance the tree map module in two ways: (1) by providing additional filtering capabilities, such as filtering by the color value or by a third leaf attribute, and (2) by implementing a feature that enables navigating into subtrees. Furthermore, we will optimize the handling of the color shading to use the full color range on the visible leaves. For more details refer to Section 6.3.

• We plan to add further visualization modules. In a first step we will implement simple charts such as line charts and bar charts. This allows the presentation of aggregated metric values of modules or of the whole system. Later we will evaluate other visualization algorithms, such as Kiviat graphs, for visualizing the correlation of more than two metrics.

• We will calculate additional metrics. The first candidates will be metrics relating to the issues. This will allow us to identify files that are likely to be affected by future issues. Further, we will investigate how we can calculate change couplings, as introduced by H. Gall et al. in [GJK03], to allow us to detect logical couplings.

• We will examine how we can calculate metrics that are based on the contents of files, such as the lines of code or the number of methods. This will give us additional information about the evolution of the complexity of the files. It means that we will have to interface with the versioning system. Finding a resource-saving solution for a big project like our case study project will be a challenge.

• We plan to extend the handling of issue reports so that we can take the whole issue lifecycle into account. This means that we will also have open issues in our RHDB, and we will have to ensure that they are correctly updated when their state changes. This update also includes all metric values that depend on the specific issue.

• We have been asked to add support for ClearCase, the other versioning system used in the case study project. We will have to write an additional importer based on the model and mappings introduced in Section 3.1.1. It remains to be discussed whether it makes sense to include the ClearCase data in the RHDB where the CMS data is stored, or whether it should rather get its own RHDB.
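As a starting point for the change couplings mentioned in the list above, one could simply count how often two files are modified together. The following sketch shows such a co-change count. It is an assumption made for illustration and not necessarily the method of [GJK03]; the class CoChangeCounter and the grouping of modified files into change sets are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the plugin described in this thesis.
final class CoChangeCounter {

    /**
     * Counts for every unordered pair of files how often both were modified in
     * the same change set. Each inner list holds the distinct names of the files
     * modified together; the key of the result has the form "fileA|fileB" with
     * the two names in lexicographic order.
     */
    static Map<String, Integer> countPairs(List<List<String>> changeSets) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> files : changeSets) {
            for (int i = 0; i < files.size(); i++) {
                for (int j = i + 1; j < files.size(); j++) {
                    String a = files.get(i);
                    String b = files.get(j);
                    String key = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
                    counts.merge(key, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Pairs with a high count relative to the number of revisions of the individual files would be candidates for a closer look, similar to the three ILPV files discussed in Section 6.2.2.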

References

[BEJWKG05] Jennifer Bevan, E. James Whitehead, Jr., Sunghun Kim, and Michael Godfrey. Facilitating software evolution research with Kenyon. In ESEC/FSE-13: Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on Foundations of software engineering, pages 177–186, New York, NY, USA, 2005. ACM Press.

[BHvW00] M. Bruls, K. Huizing, and J. van Wijk. Squarified treemaps, 2000.

[BK04] Christian Bauer and Gavin King. Hibernate in Action (In Action series). Manning Publications Co., Greenwich, CT, USA, 2004.

[DLG05] Marco D’Ambros, Michele Lanza, and Harald Gall. Fractal figures: Visualizing development effort for CVS entities. In Proceedings of the 3rd International Workshop on Visualizing Software For Understanding and Analysis, pages 46–51. IEEE CS Press, 2005.

[FG04] Michael Fischer and Harald Gall. Visualizing feature evolution of large-scale software based on problem and modification report data. Journal of Software Maintenance and Evolution: Research and Practice, 16:385–403, 2004.

[FPG03] Michael Fischer, Martin Pinzger, and Harald Gall. Populating a Release History Database from Version Control and Bug Tracking Systems. In Proceedings of the 19th International Conference on Software Maintenance (ICSM), pages 23–32, Amsterdam, The Netherlands, September 2003. IEEE, IEEE Computer Society.

[GD06] Tudor Gîrba and Stéphane Ducasse. Modeling history to analyze software evolution. International Journal on Software Maintenance: Research and Practice (JSME), 18:207–236, 2006.

[GJK03] Harald Gall, Mehdi Jazayeri, and Jacek Krajewski. CVS release history data for detecting logical couplings. In Proceedings of the International Workshop on Principles of Software Evolution, pages 13–23, Helsinki, Finland, 2003. IEEE Computer Society Press.

[HCL05] Jeffrey Heer, Stuart K. Card, and James A. Landay. prefuse: a toolkit for interactive information visualization. In CHI ’05: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 421–430, New York, NY, USA, 2005. ACM Press.

[Hew05a] Hewlett-Packard Company, Palo Alto, California. Code Management System Reference Manual, July 2005. Order Number: AAQJEVCTK.

[Hew05b] Hewlett-Packard Company, Palo Alto, California. Guide to the Code Management System, July 2005. Order Number: AAKL03HTE.

[JW05] Andreas Jetter and Michael Würsch. Evolizer Base CVS Importer Documentation, October 2005.

[Leh80] Meir M. Lehman. Programs, life cycles and laws of software evolution. Proceedings of the IEEE, 68(9):1060–1076, September 1980.

[Mar06] Dane Marjanovic. Release history meta modeling. Master’s thesis, University of Zurich, January 2006.

[Par94] David Lorge Parnas. Software aging. In ICSE ’94: Proceedings of the 16th international conference on Software engineering, pages 279–287, Los Alamitos, CA, USA, 1994. IEEE Computer Society Press.

[PGFL05] Martin Pinzger, Harald Gall, Michael Fischer, and Michele Lanza. Visualizing multiple evolution metrics. In Proceedings of the ACM Symposium on Software Visualization, pages 67–75, St. Louis, Missouri, 2005. ACM Press.

[Shn92] Ben Shneiderman. Tree visualization with tree-maps: 2-d space-filling approach. ACM Trans. Graph., 11(1):92–99, 1992.

[TDD00] Sander Tichelaar, Stéphane Ducasse, and Serge Demeyer. FAMIX and XMI. In WCRE ’00: Proceedings of the Seventh Working Conference on Reverse Engineering (WCRE’00), page 296, Washington, DC, USA, 2000. IEEE Computer Society.

[Whi00] Brian A. White. Software configuration management strategies and Rational ClearCase: a practical introduction. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.
