program comprehension & software evolution - [lightweight] principles and [real] practice

Program Comprehension & Software Evolution -

[Lightweight] Principles and [real] Practice

Program Comprehension & Software Evolution -

[Lightweight] Principles and [real] Practice

Michele LanzaFaculty of Informatics

University of Lugano

Switzerland

2

Prologue

• Reverse engineer 1’200’000 lines of C++ code in ca. 2300 classes• * 2 = 2’400’000 seconds• / 3600 = 667 hours • 667 hours / 8 = 83 working days• 83 days / 5 = 16 working weeks and 3 days• ~ 4 months

• Questions:– What is the size and the overall structure of the system?– What is the internal structure of the system and its elements?– How did the software system become like that?

Once upon a time…

3

The Life Cycle of Software Systems

Issues• Tool support• Scalability• Flexibility

?

Time

RequirementsAnalysis

Design

Implementation

4

Object-Oriented Reverse Engineering

• Goal: take a (large legacy) software system and “understand” it, i.e., construct a mental model of the system

• Problem: the software system in question is– Unknown, very large, and complex– Domain- and language-specific– Seldom documented or commented– “In bad shape”

?

?

5

Object-Oriented Reverse Engineering (II)

• Constructing a mental model requires information about the system:– Top-down approaches

– Bottom-up approaches

– Mixed Approaches• There is no “silver bullet” methodology• Every reverse engineering situation is unique• Need for flexibility, customizability, scalability, and

simplicity

?

6

Reverse Engineering Approaches

• Reading (source code, documentation, UML diagrams, comments)

• Running the SW and analyze its execution trace

• Interview users and developers (if available)

• Clustering

• Concept Analysis• Software Visualization• Software Metrics• Slicing and Dicing• Querying (Database)• Data Mining• Logic Reasoning• …

?

7

The “Information Crystallization” Problem

• Many approaches generate too much or not enough information

• The reverse engineer must make sense of this information by himself

• We need the right information at the right time

?

8

..take a step back..block the ground..think about it..• The information needed to reverse engineer a legacy

software system resides at various levels• We need to obtain and combine

– Coarse-grained information about the whole system

– Fine-grained information about specific parts

– Evolutionary information about the past of the system

!

9

Contents

• Polymetric Views• Software Visualization vs. Reverse Engineering

– Coarse-grained– Fine-grained– Evolutionary– Dynamic Information

• Discussion• Demos

10

A Solution - The Polymetric View

• A lightweight combination of two approaches:– Software visualization (reduction of complexity,

intuitive)– Software metrics (scalability, assessment)

• Interactivity (iterative process, silver bullet impossible)

• Does not replace other techniques, it complements them:– “Opportunistic code reading”

11

The Polymetric View - Principles

• Visualize software:– entities as rectangles– relationships as edges

• Enrich these visualizations:– Map up to 5 software

metrics on a 2D figure– Map other kinds of

semantic information on nominal colors

width metric

height metric

2 position metrics

Entities

Relationships

color metric

12

The Polymetric View - Example

Nodes = ClassesEdges = Inheritance Relationships

Width = Number of AttributesHeight = Number of MethodsColor = Number of Lines of Code

…

System Complexity View

13

The Polymetric View - Example (II)…

• Get an impression (build a first raw mental model) of the system, know the size, structure, and complexity of the system in terms of classes and inheritance hierarchies• Locate important (domain model) hierarchies, see if there are any deep, nested hierarchies• Locate large classes (standalone, within inheritance hierarchy), locate stateful classes and classes with behaviour

• Count the classes, look at the displayed nodes, count the hierarchies• Search for node hierarchies, look at the size and shape of hierarchies, examine the structure of hierarchies• Search big nodes, note their position, look for tall nodes, look for wide nodes, look for dark nodes, compare their size and shape, “read” their name => opportunistic code reading


Reverse engineering goals View-supported tasks

Nodes = ClassesEdges = Inheritance Relationships

Width = # attributesHeight = # methodsColor = # lines of code

14

Structural SpecificationTarget ......Scope ..........Metrics ....... ...... ......

....... ..... ........Layout ............Description ........................................................

.........................Goals ………………………………………..

……………………………Symptoms ……………………..

……………………………Scenario

Case Study ………………………………………..………………………..

The Polymetric View - Description

• Every polymetric view is described according to a common pattern

• Every view targets specific reverse engineering goals

• The polymetric views are implemented in CodeCrawler


…

15

Coarse-grained Software Visualization

• Reverse engineering question:– What is the size and the overall structure of the system?

• Coarse-grained reverse engineering goals:– Gain an overview in terms of size, complexity, and structure– Asses the overall quality of the system– Locate and understand important (domain model) hierarchies– Identify large classes, exceptional methods, dead code, etc.– …

16

Coarse-grained Polymetric Views - Example

Method Efficiency Correlation View

Nodes: MethodsEdges: -Size: Number of method parametersPosition X: Number of lines of codePosition Y: Number of statements

LOC

NOS

Goals:• Detect overly long methods• Detect “dead” code• Detect badly formatted methods• Get an impression of the system in terms of coding style• Know the size of the system in # methods

17

CodeCrawler Demo

18

Clustering the Polymetric Views

First Contact

System HotspotsSystem ComplexityRoot Class DetectionImplementation Weight Distribution

Candidate Detection

Data Storage Class DetectionMethod Efficiency CorrelationDirect Attribute Access ViewMethod Length Distribution

Inheritance Assessment

Inheritance ClassificationInheritance CarrierIntermediate Abstract

Class Internal

The Class Blueprint

19

Coarse-grained SV - Conclusions

• Benefits– Views are customizable (context…) and easily

modifiable – Simple approach, yet powerful– Scalability

• Limits– Visual language must be learned

20

Fine-grained Software Visualization

• Reverse engineering question:– What is the internal structure of the system and its elements?

• Fine-grained reverse engineering goals:– Understand the internal implementation of classes and class

hierarchies– Detect coding patterns and inconsistencies– Understand class/subclass roles– Identify key methods in a class– …

21

The Class Blueprint - Principles

Invocation Sequence

Initialization External Interface Internal Implementation Accessor Attribute

• The class is divided into 5 layers• Nodes

• Methods, Attributes, Classes• Edges

• Invocation, Access, Inheritance

• The method nodes are positioned according to

• Layer• Invocation sequence

22

The Class Blueprint - Principles (II)

Attribute

Read Accessor

Delegating Method

Constant MethodAbstract Method

Overriding Method

Extending Method

Write Accessor

Method

# invocations

# lines

Attribute

# external accesses

# internal accesses

Direct Attribute AccessMethod Invocation

23

The Class Blueprint - Example

• Delegate:– Delegates functionality to other classes– May act as a “Façade” (DP)

• Large Implementation:– Deep invocation structure– Several methods– High decomposition

• Wide Interface• Direct Access• Sharing Entries

24

The Class Blueprint - A Pattern Language?

• The patterns reveal information about – Coding style– Coding policies– Particularities

• We grouped them according to– Size– Layer distribution– Semantics– Call-flow– State usage

• Moreover…– Inheritance Context– Frequent pattern

combinations– Rare pattern combinations

• They are all part of a pattern language

25

The Class Blueprint - Example (II)

• Call-flow– Double Single Entry – (=> split class?)

• Inheritance– Adder– Interface overriders

• Semantics– Direct Access

• State Usage– Sharing Entries

26

The Class Blueprint - What do we see?

27

CodeCrawler Demo

28

Fine-grained SV - Conclusions

• Benefits– Complexity reduction– Visual code inspection technique– Complements the coarse-grained views

• Limits– Visual language must be learned– Good object-oriented knowledge required– No information about actual functionality =>

opportunistic code reading necessary

29

Evolutionary Software Visualization

• Reverse engineering question:– How did the software system become like that?

• Evolutionary reverse engineering goals:– Understand the evolution of OO systems in terms of size and

growth rate– Understand at which time an element, e.g., a class, has been

added or removed from the system– Understand the evolution of single classes– Detect patterns in the evolution of classes– …

30

The Evolution Matrix - Principles

Last VersionFirst Version

Removed Classes

Time (Versions)

Growth Phase Stagnation Phase

Added Classes

Version 2 .. Version (n - 1)

31

The Evolution Matrix - Principles (II)

• The Evolution Matrix reveals patterns– The evolution of the whole system

(versions, growth and stagnation phases, growth rate, initial and final size)

– The life-time of classes (addition, removal)

• Moreover, we enrich the evolution matrix view with metric information

• This allows us to see patterns in the evolution of classes

Class

# methods

# attributes

32

The Evolution Matrix - Pattern Language

• Repeated Modifications make it grow and shrink. • System Hotspot: Nearly every new system version requires changes.• No “cheap class”

Pulsar

Supernova

• Suddenly increases in size, possible reasons: • Massive shift of functionality towards a class.• Data storage class• Developers knew what to fill in.

Time (Versions)

33

The Evolution Matrix - Pattern Language (II)

• Lost the functionality it had and now trundles along without real meaning. • Possibly dead code.

White Dwarf

Red Giant

• A permanent god class which is always very large

Idle

• Keeps size over several versions. • Possibly dead code,possibly good code. Time (Versions)

34

The Evolution Matrix - Pattern Language (III)

• Exists during only one or two versions. • Perhaps an idea which was tried out and then dropped.

• Has the same lifespan as the whole system. • Part of the original design. • Perhaps holy dead code which no one dares to remove.

Persistent

Dayfly

35

The Evolution Matrix - Example

36

Evolutionary Software Visualization - Demo

37

Evolutionary SV - Conclusions

• Benefits– Complexity reduction

• Limits– Scalability (can be solved)– Rename problem (can be solved)– Relative changes hard to see (can be solved)

38

Run-Time Analysis - Problems and Challenges• RTA and Reverse Engineering - useful (in combination with static information)?• Procedural RTA vs. Object-Oriented RTA• OO RTA - Conceptual problems

– Polymorphism and late-binding– Inheritance and incremental class definition– Functionality (features) spread over the system– Which trace to generate? How?

• Technical challenges and constraints– Instrumentation problem (logging, VM patching, wrapping, ..)– Amount, density, and noise of generated information (Thousands of events in a few seconds..)– Granularity of information (object instantiations, message sends, attribute accesses, ..)– How much can we automate?

39

RTA - Questions

• Can we merge the dynamic information with static information?• Can we use a ‘’successful’’ static technique like polymetric views in RTA?

– What are the most instantiated classes?– Are there any singletons?– Which classes are object factories?– What is the percentage of actually used methods in classes?– Memory consumption?– Speed bottlenecks?– …

40

Case Study and Experiment Setup

• Case Study: Moose, our reengineering environment– Implementation language: Smalltalk– Age: 6 years– Size: >250 classes and >3500 methods and a test suite of more than 280 unit

tests (a veritable legacy system ;-)• Setup

– Code instrumentation using MethodWrappers– Trace Scenario(s) given by the Unit test suite– Wrapping down to method body level

• During trace-time we record events and increase counters• Afterwards we map the counter values as metrics

41

Run-time Measurements

• NCM, the number called methods• NMI, the number of method invocations• NCI, the number of created instances, that is the number

of times a class has been instantiated• NCO, the number of created objects, that is the number of

‘foreign’ objects that a class’s objects instantiated • Condensed information leads to greater scalability• Tradeoff with granularity and sequence of a trace• Interval of the values can be great (logarithmic scaling

useful)

42

Instance Usage Overview

Nodes ClassesEdges InheritanceMetric Scale LogarithmicLayout TreeNode Width # of Created InstancesNode Height # of Called MethodsNode Color # of Method Invocations

SymptomsSmall, light: unusedNarrow, tall: few, but used, instancesFlat, pale: heavily instantiated, seldom usedFlat, dark: heavily instantiated, functionality partially but heavily used

ClassesA: CDIFScannerB: AttributeDescription (3500 instances, 350’000 calls!)C: FAMIX metamodel rootG: Uninstantiated FAMIX classes (!)I: Smalltalk AST Visitor hierarchy

43

Creation Interaction View

Nodes ClassesEdges InstantiationMetric Scale LogarithmicLayout Embedded SpringNode Width # of Created ObjectsNode Height # of Created InstancesNode Color # of Created InstancesEdge Width # of Instantiations

SymptomsUnconnected: uninstantiatedConnected, small: classes with few instancesFlat, light: instance creators, seldom instantiated, possibly factoriesNarrow, dark: heavily instantiated, but do not create many other instancesWide, dark: heavily instantiated and used

Class ExamplesA: AttributeDescription - C: VWImporter (high-level import), D: VWParseTreeEnumerator (low-level import)E: FAMIXClass, ..Method, ..Attribute, etc. F: FAMIXAccess, FAMIXInvocation - G: MSEMeasurement (short-lived objects)

44

Dynamic Information SV - Conclusions

• Pros– Some new views on

software systems– Intuitive and compact way

of presenting very large amounts of information

– Insights into implementation issues

– Side result: assessment of test suite

• Cons– Loss of granularity and order– Suitability for optimization

domain unclear– Probably does not really scale

up for very large systems (but this depends on the viewer and his/her will to interact..)

– The current approach is intrinsically interactive (automatisation would be possible using advanced metrics-based techniques like detection strategies)

45

What about reality?

• Most IDEs have no or limited visualization support• Not an industry “standard”, most developers still have vi &

emacs mentality• Still poor usability

– May be used as “stand-alone” browsing tool, but not as part of a development metholodogy

– Needs much more effort (and people) to be “sexy”• Ongoing work must cope with the “present hypes”, such

as distributed development, eXtreme programming, etc.

46

Epilogue

• Did we succeed after all?• Not completely, but…

– System Hotspots View on 1.200’000 LOC of C++

– System Complexity View on ca. 200 classes of C++

The End

47

Industrial Validation - The Acid Test

• Several large, industrial case studies (NDA)• Different implementation languages• Severe time constraints

System Language Lines of Code Classes

Z C++ 1’200’000 ~2300

Y C++/Java 120’000 ~400

X Smalltalk 600’000 ~2500

W COBOL 40’000 -

Sortie C/C++ 28’000 ~70

Duploc Smalltalk 32’000 ~230

Jun Smalltalk 135’000 ~700

ArgoUML Java 220’000 ~1400

48

Questions and CommentsLet’s do it…

program comprehension & software evolution - [lightweight] principles and [real] practice

Documents

large legacy software

right information

systemfinegrained information

kinds of semantic information

number of lines

raw mental model

number of attributesheight

number of methodscolor