A Language to Script
Refactoring Transformations
Mathieu Verbaere
Wolfson College
Michaelmas Term 2008
Submitted in partial fulfilment of the requirements for
the degree of Doctor of Philosophy
Oxford University Computing Laboratory
Programming Research Group
A Language to Script
Refactoring Transformations
Mathieu Verbaere
Wolfson College
D.Phil. Thesis
Michaelmas Term 2008
Abstract
Refactorings are behaviour-preserving program transformations, typically for improving the
structure of existing code. A few of these transformations have been mechanised in interactive
development environments. Many more refactorings have been proposed, and it would be
desirable for programmers to script their own refactorings. Implementing such source-to-
source transformations, however, is quite complex: even the most sophisticated development
environments contain significant bugs in their refactoring tools.
We introduce a domain-specific language to script refactoring transformations. The language, named JunGL, is a hybrid of a functional language in the style of ML and a logic query language. It allows the computation of static-semantic information, such as name binding and control flow, and the expression of refactoring preconditions as queries on a graph
representation of the program. Borrowing from earlier work on the specification of compiler
optimisations, JunGL notably uses path queries to express dataflow properties.
We have been careful to keep the semantics of all logical features very declarative to
provide a sound basis for rigorous reasoning on the transformations. All constructs translate
to a novel variant of Datalog, a query language originally put forward in the theory of
databases. This variant works on duplicate-free sequences rather than sets, with the rationale
to present logical matches in a meaningful deterministic order. We call it Ordered Datalog.
Ordered Datalog programs, like Datalog programs, can be classified depending on how
nonmonotonic constructs such as negation are used. We identify the new class of partially
stratified programs as sufficiently expressive for our application, and highlight an evaluation
strategy following the Query-Subquery approach. Finally, we describe the current implementation of JunGL, and validate the whole design of the language via a number of complex
refactoring transformations.
Acknowledgements
I would first like to express my gratitude to my supervisor Oege de Moor for his guidance and support, and for giving me the opportunity to return to Oxford for a DPhil after my MSc project in his group and a year away in Paris.
I would also like to thank Microsoft Research for funding my work through its European
PhD Scholarship Programme. I am particularly grateful to Fabien Petitcolas at MSR Cambridge for making sure scholars always get great opportunities to present and discuss their
ongoing work.
Thanks also go to my final examiners, Mike Spivey and Ralf Lämmel, for their comments
and suggestions during the viva which helped me improve this thesis.
The Programming Tools Group in Oxford has been a very pleasant and productive environment to work in. Thanks to all its members. I am especially grateful to close friends Rani
Ettinger and Elnar Hajiyev. It is Rani who introduced me to the research field of refactoring
tools. It is Elnar who later set out with me on the Datalog adventure. I am also grateful
to Arnaud Payement for his enthusiasm while experimenting with JunGL, and to Damien Sereni, who has always been willing to help and share his broad knowledge of computer science. Many thanks to all of them for their highly valuable input at different stages of this
work. I have enjoyed our discussions a lot.
I am also greatly thankful to my family and friends, in France and the UK, for their
support and the happy moments we shared in Oxford, London, Bidford, Martigues, Aix,
Joigny, Marcellaz, Les Contamines, Strasbourg, Lunéville and Paris.
Finally, I want to thank Dorothée for her true love and the great life we have together.
Contents
1 Introduction 1
1.1 The process of refactoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Some refactoring examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 On automating transformations . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Trends and challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 A scripting language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Alternative solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.8 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Design of the language 16
2.1 ML-like features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Pattern matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Logical features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 Lazy edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.3 Path queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Computational model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Other features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5 The toolkit around the language . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5.1 The graph structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.2 The interpreter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.3 Editors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Further examples on While programs . . . . . . . . . . . . . . . . . . . . . . . 37
2.6.1 Binding and definite assignment checks . . . . . . . . . . . . . . . . . 37
2.6.2 Rename Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6.3 Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.7 Summary and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 Datalog 44
3.1 Logic programs and syntax of Datalog . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1 Minimal models and least fixpoints . . . . . . . . . . . . . . . . . . . . 47
3.2.2 Safe Datalog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.3 Mapping predicate calculus to relational algebra . . . . . . . . . . . . 50
3.2.4 Evaluation of strata . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Evaluation strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1 Top-down vs bottom-up . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.2 Query-Subquery and magic sets . . . . . . . . . . . . . . . . . . . . . . 55
3.3.3 Existing implementations . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 General logic programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5 Summary and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4 Ordered semantics of the logical features 63
4.1 Why order matters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Duplicate-free sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.2 Relational operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Stratified Ordered Datalog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 Non-termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.2 Chasing nonmonotonic ordered operators . . . . . . . . . . . . . . . . 71
4.3.3 A refinement of stratified Datalog . . . . . . . . . . . . . . . . . . . . 75
4.4 Data model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Translating predicates, edges and path queries . . . . . . . . . . . . . . . . . . 79
4.5.1 Abstract syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5.2 Relational equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.5.3 Ordered Datalog rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.4 Encoding dynamic edge dispatch . . . . . . . . . . . . . . . . . . . . . 87
4.5.5 A full translation example . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6 Summary and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5 Evaluating more general ordered queries 93
5.1 On accepting more queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Beyond stratified Ordered Datalog . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.1 Partial instantiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.2 Partial stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 Demand-driven evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3.1 Top-down sequence-based evaluation . . . . . . . . . . . . . . . . . . . 99
5.3.2 The issue with first . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.3 Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Generating partial reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5 Back to sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5.2 The orelse operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.6 Summary and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6 Scripting refactorings 113
6.1 Rename Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.1.1 The object language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.1.2 Name lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.1.3 Detecting conflicts and renaming . . . . . . . . . . . . . . . . . . . . . 121
6.1.4 Minimising rejection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.2 Extract Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2.1 The object language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2.2 Name and type lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.2.3 Generating type constraints . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.4 Solving and transforming . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.3 Extract Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.3.1 The object language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.3.2 Control and data flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.3.3 Checking validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.3.4 Inferring parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.3.5 Placing declarations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.3.6 Transforming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.4 Summary and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7 Discussion and future work 147
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
A JunGL grammar 160
B Rename Variable 165
C Extract Method 170
Chapter 1
Introduction
1.1 The process of refactoring
Refactoring is the process of improving the design of a program while preserving its behaviour.
Often the purpose is to correct existing design flaws, to prepare a program for the introduction
of a new functionality, or to take advantage of a new programming language feature such as
generic types.
Although refactoring has been done informally (and manually) for decades, it was first
seriously examined only fifteen years ago by William Opdyke in his PhD dissertation [Opd92].
There, Opdyke presents refactoring as a disciplined technique, along with a classification of useful transformations for improving the design of object-oriented programs. Perhaps
the most obvious and most popular example is renaming: a variable, or any other program
artifact with a name, is given a new name to better reflect its purpose in the code, and
therefore improve the overall readability of the program.
Later, the upsurge of iterative programming methodologies (such as extreme programming
and other agile methodologies), which promote evolutionary change throughout the entire
life-cycle of a project, contributed greatly to increasing the general interest in refactoring.
Martin Fowler’s catalogue [Fow99] long remained the single classical reference for developers,
before the recent publication of more books on the topic, e.g. [Ker05]. All of these present a
remarkable number of different refactoring transformations, more or less complex, tedious
and hence error-prone. To cope with such difficulties, practitioners are often advised to run unit tests after each refactoring to check that the behaviour of the resulting program
is indeed externally similar to the behaviour of the original program.
This also explains the considerable interest in providing automated (or semi-automated)
support for applying refactoring transformations. The Smalltalk Refactoring Browser by
John Brant and Don Roberts was the first tool to provide that kind of automated support
[RBJ97, Rob99]. Since then, a lot of engineering effort has been put into refactoring tools
and most Integrated Development Environments now provide such support, in the form of a
fixed menu of transformations that may be applied, for instance for renaming, extracting a
method, extracting an interface, and so on.
1.2 Some refactoring examples
To better illustrate what a single refactoring transformation is, we shall present two well-
known refactorings, namely Encapsulate Field and Extract Method. We expose each refactoring as it is described in [Fow99], that is with the motivation for it, a tiny example and the
general mechanics to achieve it.
Encapsulate Field In an object-oriented program, a public field should be turned into a
private one and accessors should be provided for it. The rationale is that data and behaviour
are best separated.
The following Java declaration:

public String name;

ought to be refactored to:

private String name;
public String getName() { return name; }
public void setName(String aName) { name = aName; }
The mechanics are described as:
• “Create getting and setting method for the field.
• Find all clients outside the class that reference the field. If the client uses the
value, replace the reference with a call to the getting method. If the client
changes the value, replace the reference with a call to the setting method.
[. . .]
• Compile and test after each change.
• Once all clients are changed, declare the field as private.
• Compile and test.”
That excerpt gives evidence of two natural but important technicalities. First, the mechanics of a transformation are tightly coupled to the object language of the transformation. Indeed, in C# for instance, the support for properties makes the second step unnecessary, as properties are accessed in exactly the same manner as fields. Second, it is clear that another
variant of Encapsulate Field could be derived where references to the field which occur inside
the class of the field would also be updated.
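The core of these mechanics is deciding, for each reference to the field, whether it is a read (to be replaced by a getter call) or a write (to be replaced by a setter call). The following Python sketch illustrates that classification on a toy expression representation; the tuple encoding and the names are ours, purely for illustration, and do not reflect JunGL or any real refactoring engine:

```python
# Toy expression trees, illustrative only:
#   ('read', name)           use of a variable or field's value
#   ('write', name, expr)    assignment to it
#   ('call', fn, *args)      method call
def encapsulate(expr, field, getter, setter):
    """Rewrite reads of `field` into getter calls and writes into setter calls."""
    kind = expr[0]
    if kind == 'read':
        # "if the client uses the value, replace the reference with
        #  a call to the getting method"
        return ('call', getter) if expr[1] == field else expr
    if kind == 'write':
        value = encapsulate(expr[2], field, getter, setter)
        if expr[1] == field:
            # "if the client changes the value, replace the reference with
            #  a call to the setting method"
            return ('call', setter, value)
        return ('write', expr[1], value)
    if kind == 'call':
        args = tuple(encapsulate(a, field, getter, setter) for a in expr[2:])
        return ('call', expr[1]) + args
    return expr

# name = concat(name, suffix)  becomes  setName(concat(getName(), suffix))
before = ('write', 'name', ('call', 'concat', ('read', 'name'), ('read', 'suffix')))
after = encapsulate(before, 'name', 'getName', 'setName')
```

Even this miniature version shows why the transformation is not purely textual: the same identifier must be rewritten differently depending on its syntactic role.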
Extract Method A method that is too long and serves too many purposes should be split
into several single-purpose and well-named methods.
For instance, the following piece of Java code:
void printOwing(double amount) {
    printBanner();
    // print details
    System.out.println("name: " + getName());
    System.out.println("amount: " + amount);
}
is better refactored into:
void printOwing(double amount) {
    printBanner();
    printDetails(amount);
}

void printDetails(double amount) {
    System.out.println("name: " + getName());
    System.out.println("amount: " + amount);
}
This time, the mechanics read as follows:
• “Create a new method, and name it after the intention of the method (name
it by what it does, not how it does it). [. . .]
• Copy the extracted code from the source method into the new target method.
• Scan the extracted code for references to any variables that are local in scope to
the source method. These are local variables and parameters to the method.
• See whether any temporary variables are used only within this extracted code.
If so, declare them in target method as temporary variables.
• Look to see whether any of these local-scope variables are modified by the
extracted code. If one variable is modified, see whether you can treat the
extracted code as a query and assign the result to the variable concerned. If
this is awkward, or if there is more than one such variable, you can’t extract
the method as it stands. [. . .]
• Pass into the target method as parameters local-scope variables that are read
from the extracted code.
• Compile when you have dealt with all the locally-scoped variables.
• Replace the extracted code in the source with a call to the target method. [. . .]
• Compile and test.”
As we see, this is both complex and informal. A key point of the mechanics is the implicit
presence of preconditions: “if there is more than one such variable, you can’t extract the
method as it stands”. Preconditions play an important role in the automation of refactoring
to ensure the transformation will either be completed and behaviour preserving, or rejected.
Note also that preconditions might differ from one language to another. In C#, multiple
variables can be modified by the extracted method and returned to the original method since
the language supports ref and out parameter passing modes [SH04].
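The variable bookkeeping in these mechanics can be stated more precisely: a local variable read in the selection before being written there becomes a parameter; a variable assigned in the selection and still used afterwards must be returned; and, for Java, more than one variable in the latter category blocks the extraction. A rough Python sketch of that classification follows; the per-statement (reads, writes) encoding is a simplification we made up for illustration, and it deliberately ignores control flow inside and after the selection:

```python
def classify(selection, after):
    """selection, after: lists of (reads, writes) pairs, one per statement."""
    params, written, used_after = set(), set(), set()
    for reads, writes in selection:
        params |= set(reads) - written   # read before any write in the selection
        written |= set(writes)
    for reads, _writes in after:
        used_after |= set(reads)
    returns = written & used_after       # assigned here, still needed later
    if len(returns) > 1:
        # "if there is more than one such variable, you can't extract
        #  the method as it stands"
        raise ValueError("cannot extract: more than one variable to return")
    return sorted(params), sorted(returns)

# The selection reads amount and writes i; i is still read after the selection.
params, returns = classify(
    selection=[(['amount'], ['i']), (['i'], [])],
    after=[(['i'], [])])
```

Note that `used_after` is computed here without regard to reachability; a genuine flow-sensitive liveness analysis is needed in general, a point the faulty tool implementations discussed below make vivid.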
1.3 On automating transformations
The quest for serious automated refactoring support is related by Fowler in his article entitled Crossing Refactoring's Rubicon [Fow01]. At an early stage, transformations were often performed only at the level of text, or at best on the Abstract Syntax Tree but purely syntactically. Of course, these were hardly behaviour-preserving. It is only in 2001 that the
Rubicon was crossed with the implementation of Extract Method by a few tools.
The mechanised version of Extract Method allows the programmer to select a contiguous block of code, which is then extracted into a new method. For that kind of automatic extraction, the tools need to perform a deep semantic analysis to determine what parameters
should be passed to the new method, and whether the transformation is at all possible. If
not, the refactoring should be rejected. This indeed ought to happen, for Java programs, if more than one variable is assigned in the block to be extracted, as their values cannot all be returned from the new method, at least not without encapsulating the returned variables in a dedicated wrapper, which is likely to impede the readability of the code.
Unfortunately, although current tools work out the correct solution for most extractions,
they still fail on some corner cases depending on the implementation. Eclipse, IntelliJ IDEA
and Visual Studio provide this refactoring, but we could find correctness issues in all three
implementations [EESV08].
An example of such a flaw in the first release of Visual Studio 2005 is shown in Figure
1.1. On the left is the original program, and the region to be extracted is indicated by the
‘from’ and ‘to’ comments. On the right is the resulting code: note that in the new method,
the variable i is returned without necessarily being assigned. The refactored version does not
compile as it violates the definite assignment rule of C#. In fact, the new method does not
need to return the variable i because it is not live at the end of the selection. We reported
that bug and it has been fixed in the new version of Visual Studio.
Another perhaps more subtle issue has been reported by Ran Ettinger in Eclipse 3.3. In
the artificially constructed Java code of Figure 1.2, one cannot extract the region between
public void F(bool b) {
    int i;
    // from
    if (b) {
        i = 0;
        Console.WriteLine(i);
    }
    // to
    i = 1;
    Console.WriteLine(i);
}

public void F(bool b) {
    int i;
    i = NewMethod(b);
    i = 1;
    Console.WriteLine(i);
}

private static int NewMethod(bool b) {
    int i;
    if (b) {
        i = 0;
        Console.WriteLine(i);
    }
    return i;
}
Figure 1.1: Extract Method bug in Visual Studio 2005.
the ‘from’ and ‘to’ comments. The rejection is accompanied by the following explanation message: “Ambiguous return value: selected block contains more than one assignment to local variable”. In fact, only n is used after the selection. A truly flow-sensitive dataflow analysis would have noticed the effect of the break.
public int g() {
    int n = 10;
    int i = 0;
    while (i < n) {
        // from
        i++;
        n--;
        // to
        break;
    }
    return n;
}
Figure 1.2: Extract Method rejection issue in Eclipse 3.3.
These kinds of bugs go to the heart of the difficulty of implementing new refactorings: it
requires dataflow analysis (in particular variable liveness), of the same kind as in compiler
optimisations. From these and similar examples, we deduce that a framework for refactoring
must provide dataflow analysis facilities as well as other, perhaps more obvious, features such
as pattern matching and mechanisms for variable binding. We shall show the correct way to refactor the Visual Studio example in Chapter 6.
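The liveness information needed to get such cases right can be computed as a backward fixpoint over the control-flow graph. The sketch below is a minimal Python illustration over a hand-built CFG for the Eclipse example of Figure 1.2; the node names and the (uses, defs, successors) encoding are ours, not part of any tool. It shows why a flow-sensitive analysis finds that only n, and not i, is live at the exit of the selected region: the break routes control straight to the return.

```python
def liveness(cfg):
    """cfg maps node -> (uses, defs, successors). Returns live-in sets,
    the least fixpoint of: in[n] = uses[n] | (out[n] - defs[n])."""
    live_in = {n: set() for n in cfg}
    changed = True
    while changed:
        changed = False
        for n, (uses, defs, succs) in cfg.items():
            live_out = set().union(*(live_in[s] for s in succs)) if succs else set()
            new_in = set(uses) | (live_out - set(defs))
            if new_in != live_in[n]:
                live_in[n], changed = new_in, True
    return live_in

# CFG fragment for g(): the selection is i++; n--, and the break that
# follows jumps straight to `return n`, never back to the loop test.
cfg = {
    'i++':      (['i'], ['i'], ['n--']),
    'n--':      (['n'], ['n'], ['break']),
    'break':    ([],    [],    ['return n']),
    'return n': (['n'], [],    []),
}
live = liveness(cfg)
# live['break'] is what is live just after the selection: n only, not i
```

A flow-insensitive scan would instead see two assigned variables and, like Eclipse 3.3, reject the extraction.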
The two flaws presented here are just illustrative of more faulty refactorings documented
in [EESV08]. Another study has also reported many issues in two of the mainstream Java
IDEs [DDGM07]. The authors developed a technique for automated testing of refactoring
engines, based on the iterative generation of structurally complex test inputs. They found
a total of 21 new bugs in Eclipse and 24 in NetBeans. The issues concern refactorings of
different kinds, among them Rename Field, Encapsulate Field and Pull up Method (for moving
a method from a subclass to some superclass).
All these bugs show the inherent complexity of implementing correct program transformations. They also give evidence that most transformations cannot be expressed in purely
syntactic terms without any recourse to compiler-like analyses.
1.4 Trends and challenges
In view of the large number of refactorings that have been proposed and of the complexity in
correctly expressing refactorings, it is natural to think about providing some kind of a toolkit
to facilitate their implementation. Additionally, some current trends in software development
also make a strong case for more numerous, more sophisticated, and more reliable refactoring
features. In order to draw a complete picture for the requirements of a refactoring toolkit,
we briefly expose those trends together with the challenges they present with respect to
refactoring.
A profusion of languages Software developers are faced with a profusion of technologies
and languages when starting a new project. Even on legacy code, the choice of a different
language for developing a new functionality is often considered.
One of the design goals of the Common Language Runtime environment in the .NET
framework was to enable cross-language development, which includes cross-language debugging, cross-language exception handling and even cross-language inheritance. That is, any
.NET compliant language is seamlessly usable with another .NET compliant language. Well-
known examples of mainstream object-oriented .NET languages are C# and VB.NET, but
other languages such as F# [Sym05], a variant of OCaml, can also be compiled to the .NET
intermediate language, known as CIL. Assemblies written in C# and other .NET languages
can be directly accessed from F#, and vice versa. With cross-language development, a developer can choose the language that best suits her needs and still be able to integrate into
a single application. Obviously, cross-language refactorings are expected in that context: renaming a C# method should update any calls present in an F# program.
In parallel, language designers are trying to bridge the gap between certain paradigms
in software development. Designers are notably addressing the so-called O/R (for object-
relational) and X/O (for XML-object) impedance mismatches [LM07] which are encountered
when using a relational database or an XML stream to store objects. The idea is to provide
relation-specific or XML-specific features to raise the manipulation of data, stored in these
respective formats, to the level of objects. Concretely, this is done by integrating support for
native queries into a host object-oriented language like Java or C#.
The XJ project at IBM Watson Research proposes such novel mechanisms for the integration of XML as first-class constructs into Java [BBPR05]. The LINQ project
has taken another, more general, route and provides general-purpose query facilities to the
.NET Framework that apply to all sources of information, not just relational or XML data
[MBB06]. Each project can be seen, however, as the introduction of an embedded Domain-Specific Language (DSL) into a host language: for the purpose of manipulating XML in the case of XJ, and for general-purpose queries in LINQ.
Of course, all these languages and language extensions should be properly supported in
development environments. End-users expect syntax highlighting, on-the-fly semantic analyses and refactoring support. Yet, building sophisticated development environments is a
difficult task. IMP, developed at IBM Watson Research, is an Eclipse-based meta-tooling
framework intended precisely to speed up the creation of rich IDEs [imp07]. IMP aims to provide a set of APIs to help in the implementation of semantic analyses and refactorings.
Such APIs are already useful, but we wish to facilitate the automation of refactoring transformations even further, in order to help refactoring authors manage the growing demand for and complexity of refactoring tools.
In terms of refactoring support, language extensions indeed have two major consequences.
Firstly, existing implementations of refactoring transformations must be updated to support
the new constructs that were not originally present in the host language. Secondly and
perhaps less obviously, developers expect new refactoring tools for migrating their old code
to take advantage of the new constructs. In the context of XJ for instance, it is desirable
to transform code for constructing an XML fragment via calls to the DOM API into safer
and more readable XJ code for constructing the same XML fragment. To illustrate, one may
wish to convert this Java code:
Element region = doc.createElement("region");
Element name = doc.createElement("name");
Text text = doc.createTextNode("central");
name.appendChild(text);
region.appendChild(name);
Element sales = doc.createElement("sales");
text = doc.createTextNode("12");
sales.appendChild(text);
sales.setAttribute("unit", "millions");
region.appendChild(sales);
into that XJ snippet:
region r = new region(
    <region>
        <name>central</name>
        <sales unit="millions">12</sales>
    </region>);
Another, more obvious, example of a language extension that challenges refactoring tools is Java 5. Besides the engineering effort required to make existing refactoring implementations
aware of the new features in Java 5, much research work has been done to automate the
introduction of generic types [DKTE04, vDD04, KETF07] or to convert constants to enums
[KSR07].
User-defined transformations Beyond the emergence of new languages and language
extensions for which it is desirable to provide new refactoring transformations, advanced
developers may wish to author their own transformations. Perhaps the most relevant application of user-defined transformations is the migration of library calls using an old API to a refactored one, which is in a way a less extreme form of language extension.
the complexity of the mechanisation depends on the sophistication of the transformation.
Changing the name of a method at all client call sites is fairly straightforward, but the fully
automatic migration of applications that use legacy library classes to newer, sometimes quite
different, classes is more difficult [BTF05]. Because of these different levels of sophistication,
there is no single silver-bullet solution to user-defined transformations. Support for them is
very diverse in existing systems.
A first solution, available in Eclipse, is to keep a history of refactorings. The provider of the API records the refactorings she applies to the codebase of the API, and later ships the recorded script with the new library. This approach appeals for its simplicity, but it
greatly limits the kind of modifications that can be made to the API. Indeed, complex changes that are likely to affect the sequence of library calls in the client code can often not be expressed as a series of general-purpose refactoring transformations, at least not with the
ones currently proposed in IDEs.
Another approach, still in Eclipse but certainly more heavyweight, is to write a new
plugin. Yet, as with IMP, this requires the mastery of complex APIs offered by the Java
Development Tools (JDT) and the Refactoring Language Toolkit (LTK). Very recently, at the
first workshop on refactoring tools, Robert Fuhrer and other people involved in the Eclipse
refactoring support recounted the history of these APIs and mapped out a roadmap for
refactoring’s future [FKK07]. One of the main challenges presented there is indeed to ease
the development of refactorings. To address the matter, the authors suggest:
• a declarative AST-based transformation language, in which transformations could actually be type-checked to guarantee upfront that a transformation will always result in a valid AST;
• better means of specifying underlying analyses, with clean declarative formulas that
would be mapped onto efficient data representations.
In the meantime, and with the same goal of facilitating the development of refactorings,
the use of Scala [Ode07] for the implementation of user-defined transformations has been
suggested [Fal07]. Scala fully inter-operates with Java, and its functional features, like pattern
matching, are highly desirable in the context of meta-programming, i.e. for writing programs
that manipulate programs.
Halfway between replaying fixed refactorings and writing a full plugin, one can find intuitive, flexible but mostly syntactic solutions. In IntelliJ IDEA, there is indeed a user-friendly
facility called Structural Search and Replace that enables limited transformations by pattern
matching on the syntax tree [Mos06]. Along the same lines, Marat Boshernitsan developed iXj [BGH07], a visual tool that allows programmers to make complex code transformations in an intuitive manner. The strength of iXj is its alignment with programmers' mental models of programming structures. Indeed, it does not require the manipulation of complex source code
representations such as Abstract Syntax Trees. The tool looks very promising but currently
lacks support for accessing more semantic information about the code. Although it already
provides pattern matching of specific variables with a particular static type in a given scope,
most of the interesting transformations for manipulating library calls require control and
dataflow information in addition to bindings. The current visual model of iXj could certainly
be extended to integrate these additional concepts. Nevertheless, visual models often have
inherent limitations, and a top-down approach of extending the model of iXj might hit one
of them.
We believe it is more appropriate to start from a more heavyweight solution and provide
some useful constructs and means of abstraction to ease the development of refactorings. Of
course, in that kind of bottom-up approach, only tool experts can at first author their own
refactorings, but the ultimate aim is to reach end-user developers by providing high-level
building blocks for authoring custom transformations.
1.5 A scripting language
The observations we have made in the previous sections can be summarised in three points.
First, the correct implementation of refactoring transformations is hard and requires some
deep static-semantic information about the programs to transform. Second, there is a growing
demand for refactorings to support new languages or language extensions. Third, advanced
developers may wish to implement their own transformations.
To address these issues, we propose a domain-specific language that enables the concise
description of refactoring transformations as scripts. The notion of scripts usually conveys
the idea of a small piece of code that can be run in an interactive environment. Our wish
is indeed to allow tool authors to quickly prototype and express refactorings within scripts
that could be exchanged and replayed. Scripts should be concise and be as close as possible
to the specifications of refactorings.
What should such a scripting language look like? Looking at the mechanics of some
refactorings, we note there are three common steps in their mechanisation:
• Finding elements of interest;
• Checking preconditions;
• Performing the actual transformation.
In addition, some tool authors may also wish to check postconditions to gain some guarantee
of the transformation's correctness. It is clear from these steps that logical features for finding
elements and checking conditions on the program to refactor are useful. As for the third step
and the actual manipulation of the program, the benefits of functional features for meta-
programming are well known.
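As a rough illustration of this three-step shape, here is a sketch in Python, purely for exposition; the flat node representation and all names are hypothetical, and real renaming must of course also respect scoping:

```python
def rename_variable(nodes, old, new):
    # 1. Find elements of interest: variable nodes named `old`.
    targets = [n for n in nodes if n["kind"] == "var" and n["name"] == old]
    # 2. Check preconditions: `new` must not clash with an existing variable.
    if any(n["kind"] == "var" and n["name"] == new for n in nodes):
        raise ValueError("name '%s' already in use" % new)
    # 3. Perform the actual transformation, as a destructive update.
    for n in targets:
        n["name"] = new

program = [{"kind": "var", "name": "x"}, {"kind": "var", "name": "y"}]
rename_variable(program, "x", "z")
```

The toy version only conveys the find/check/transform structure that a scripting language must support concisely.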
Moreover, in view of the increasingly prevalent mixtures and embeddings of languages,
we wish to target any object language or indeed several languages at once for cross-language
refactoring. Since most refactoring transformations require some knowledge about name and
type lookup, as well as control and dataflow information, it is an absolute requirement to
allow, in the scripting language itself, the description of static-semantic information about
the object languages. Furthermore, there are two other less obvious reasons for such a
requirement. First, we envision the use of similar scripts for other, perhaps simpler, tool
support such as the navigation between artifacts in an IDE. Second and more importantly,
we believe having the computation of that information in a clear formalism will allow us to
reason more precisely about the correctness of the refactorings that build on that information.
We shall turn to that point in Chapter 7 where we discuss future work in detail.
One may rightly wonder, however, how the above requirements differ from those of a
compiler. The main difference, beyond the fact that we wish to perform transformations at
the source level, is in the ability to find program elements of interest, and compute some
properties on these particular elements only. In contrast, compilers perform global analyses
on the complete program. For instance, compilers usually build a complete symbol table
for their input program, in order to resolve any variable reference to its declaration. On
the other hand, refactoring transformations are most of the time fairly local, and even when
they require a global search (for instance, when renaming a global variable), not all static-
semantic information is actually needed. The abilities to query the program structure and
to compute static-semantic information in a demand-driven manner are, in fact, two important
and distinctive requirements for a language to script refactorings. We shall explain throughout
the thesis and in particular in Section 5.3 how we achieve this.
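By way of illustration, demand-driven lookup can be approximated as follows. This is a Python sketch over an assumed table of scopes; none of it is JunGL syntax. The declaring scope of a name is computed only when asked for, and memoised, rather than built into a complete symbol table upfront:

```python
from functools import lru_cache

# Hypothetical scope table: each scope has a parent and declared names.
scopes = {
    "global": {"parent": None, "decls": {"y"}},
    "f":      {"parent": "global", "decls": {"x"}},
}

@lru_cache(maxsize=None)
def resolve(scope, name):
    """Scope declaring `name`, walking outwards only on demand."""
    if scope is None:
        return None          # unresolved reference
    if name in scopes[scope]["decls"]:
        return scope
    return resolve(scopes[scope]["parent"], name)
```

Only the bindings a refactoring actually queries are ever computed; for instance, `resolve("f", "y")` yields `"global"` without touching any other name.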
1.6 Alternative solutions
A wealth of techniques and research tools are closely related to the domain of refactoring.
Although none of them appears to be the right solution for scripting refactoring transforma-
tions, they are all inspirational.
General-purpose transformations A refactoring is just a special kind of code trans-
formation. One might therefore wonder whether general-purpose transformation systems could
elegantly address the issues we have raised earlier and enable the expression of refactoring
transformations in a concise and readable formalism.
An example of such a general-purpose transformation tool is the TXL programming lan-
guage [Cor06]. TXL is a hybrid functional and rule-based language designed to support source
transformation tasks. In particular, it allows rapid prototyping of new language parsers and
new language extensions.
Another example is the ASF+SDF Meta-Environment, a complete toolkit for the im-
plementation of transformations and other language processing, based on a Generalized LR
parser [vdBHdJ+01]. It focuses mostly on syntax definitions with SDF and on syntactical
transformations. Stratego/XT [BKVV06] is a language and toolset for program transfor-
mation that builds on SDF. The Stratego language provides rewrite rules for expressing
transformations, and the XT toolset offers a collection of tools, such as powerful parser and
pretty-printer generators and grammar engineering tools. Its original focus was also on syntax
analysis, but Stratego now supports dynamic rewrite rules for expressing context-sensitive
transformations and more semantic analysis [BvDOV06]. Although they can be used to com-
pute static-semantic information, dynamic rewrite rules are sometimes difficult to use for
that purpose, as the context they capture can only be propagated top-down.
All these systems have primarily focused on more syntactic analyses, and have added support
for more semantic tasks while trying to stay close to their original formalism. As a result, the
computation of contextual information, such as name lookup, is sometimes hard to define
in an intuitive way. On the other hand, these systems support rewrite rules which are very
appealing for specifying the actual transformation steps of a refactoring.
APTS [Pai94] is another general transformation system. It allows sophisticated program
derivation and, in that sense, closely relates to the area of refactoring where behaviour preser-
vation is important. In addition to rewrite rules, it supports inference rules for computing
semantic information. Interestingly, these inference rules are expressed in a language simi-
lar to Datalog, a database query language that we present in Chapter 3. As we shall see,
the logical features of our scripting language are also reminiscent of Datalog. Nevertheless,
the formalism of APTS, though very powerful, is too heavyweight. The expression of a
refactoring script should be more intuitive.
Attribute grammar systems The general-purpose transformation systems cited above
are not well suited to computing the static-semantic information necessary in the im-
plementation of developer tools. Systems based on attribute grammars have proved much
more successful.
The Synthesizer Generator [RT84] demonstrated the use of declarative specifications for
implementing language-based editing environments. Their formalism was that of an at-
tribute grammar tailored to the application domain of language-based editors. The context-
dependent features (i.e. the static-semantic information) of a language were described using
a combination of synthesized and inherited attributes. The former are expressed using infor-
mation from the children of a node, whereas the latter are passed down from parent nodes.
JastAdd is a recent system which also builds on the formalism of attribute grammars
[EH04]. One of its strengths is its integration with a mainstream language, namely Java. In
addition, JastAdd supports circular attributes for fixpoint computations, reference attributes
for relating nodes in the AST, and collection attributes for specifying cross-reference-like
properties such as sets of variable uses. The elegance of JastAdd has notably been demon-
strated with the implementation of JastAddJ, a full Java 5 compiler [EH07]. We shall discuss
attribute grammar systems again in Chapter 7.
Compiler optimisations In many ways, refactoring transformations are similar to com-
piler optimisations. The main difference though is that refactorings are applied at the source
level, rather than at the level of a convenient intermediate representation.
Over the past fifteen years, there has been much activity in the formal specification of
compiler optimisations, and in generating program transformers from such specifications,
e.g. [WS97, KKKS96, LM01, LJVWF02, DdMS02, MLVW03, OV02, LMC03, SdML04,
LMRC05]. All these works contrast with research that seeks to express transformations
only in syntactic terms, and provide foundations for the specification of refactoring trans-
formations. We will mention these works in more detail in Chapter 7. In particular, we
will compare our work with Optimix, an optimiser generator that mixes Datalog and graph
rewriting [Aßm98].
Logic meta-programming We mentioned earlier that our scripting language shall embed
some logical features to find elements in the code and check static conditions on the program.
This is what others have proposed in the context of code queries for spotting refactoring
opportunities [TM03] and for other software engineering tasks. There are many examples of
code querying systems and all of them are inspirational.
JQuery [JV03, MV04] is an Eclipse plugin for querying Java code empowered by a Prolog-
like engine. CodeQuest [HVdM06] is a prototype compiler of code queries expressed in
Datalog to procedural SQL. GraphLog [CMR92] is a query language with enough power to
express path properties on graphs, equivalent to linear Datalog, but with a graphical syntax.
PQL [Jar98] is a representation-independent query language with a syntax close to SQL.
Finally, ASTLog [Cre97] focuses on traversing syntax trees.
The important difference though is that, in all these works, results of code queries are not
directly used to transform the program. In addition, although these systems are expressive
enough to encode the complex preconditions of transformations (except maybe ASTLog which
was really designed for tree queries only), most of them are actually not expressive enough for
the computation of static-semantic information, such as name binding. It is usually assumed
that this kind of information is computed in an earlier pass and made available in some
built-in relations.
1.7 Contributions
In this thesis, we suggest that the techniques which have proved successful in specifying
compiler optimisations form an appropriate basis for scripting refactoring transformations –
with the important difference again that in refactoring one transforms source code, and not
some convenient intermediate representation.
Furthermore, we propose to bridge the gap between such techniques and code queries by
allowing the expression of both complex contextual static-semantic properties (such as name
lookup or dataflow) and more structural code queries (for finding elements of interest) in a
clean uniform formalism that translates to a variant of Datalog.
The principal contributions of this thesis are:
• The identification of the need for a scripting language for refactoring transformations,
and of its requirements. The language must notably allow script authors to:
– easily find program elements of interest;
– describe, for different object languages, static-semantic information, such as name
binding, type analysis and flow analysis;
– concisely express preconditions of refactorings using that static-semantic informa-
tion;
– perform the actual transformation.
• The formulation of features for such a language, in particular:
– functional features (borrowed from ML, such as higher order functions and pattern
matching) for manipulating ASTs;
– logical queries (akin to Datalog) for expressing complex static relationships be-
tween program elements;
– path queries as a convenient shorthand for queries that capture complex static-
semantic properties, such as control and dataflow properties.
• The integration of all these features in a clean, coherent design.
• An implementation of the language on the .NET platform.
• The validation of the language design on a number of non-trivial examples, and the
first, to our knowledge, complete specification of the core part of Extract Method for a
large subset of C#.
• A variant of Datalog where query results are returned in a meaningful order, and whose
semantics is based on duplicate-free sequences rather than sets.
• A class of partially stratified Datalog programs (sufficiently expressive to encode the
computation of static-semantic information), along with a top-down set-based resolu-
tion strategy to evaluate such programs.
1.8 Outline
This thesis assumes the reader has basic knowledge of functional programming, predicate
calculus and relational algebra. It also assumes general background knowledge on the broad
field of meta-programming, and more specifically in the area of compiler construction. The
remainder of the thesis is organised as follows.
In Chapter 2, we introduce the ideas in the design of our language, called JunGL. That
design is illustrated through the implementation of representative analyses and refactorings
on a toy imperative language. We also briefly present the toolkit that we have built around
our language using both C# and F#, a .NET functional language inspired by OCaml and
developed at Microsoft Research.
In Chapter 3, we give an introduction to Datalog in its classical version based on a finite-
set semantics. Several important classes of Datalog programs have been characterised, e.g.
statically stratified Datalog and modularly stratified Datalog. We detail them along with
their common implementation strategies.
In Chapter 4, we explain that the order of results produced during the evaluation of a
JunGL query is important. In that respect, a Prolog-like resolution mechanism seems at first
appropriate for our application, but termination of queries would be hard to guarantee. Instead,
our logical features translate to an ordered variant of Datalog whose semantics are based
on duplicate-free sequences rather than sets. We study this variant of Datalog, which we
call Ordered Datalog, and give a precise translation of predicates, edges and path queries to
Ordered Datalog programs.
In Chapter 5, we introduce a broader class of stratified Datalog programs that appears in
practice sufficiently expressive for the computation of static-semantic information. This class
allows the use of nonmonotonic constructs inside recursion, but remains smaller than the
class of modularly stratified Datalog. Furthermore, we describe the evaluation of Ordered
Datalog programs in a demand-driven manner on a top-down stream-based framework, and
we also address the relationship between Ordered Datalog and normal set-based Datalog by
exploring how to express Ordered Datalog queries in normal Datalog.
In Chapter 6, we put the whole design of JunGL to the test and discuss a number of complex
refactorings for large subsets of languages like Java or C#. We choose to present three well-
known refactorings, Rename Variable, Extract Method and Extract Interface, as we believe
they are representative of three important classes of refactorings. The first class deals with
scoping, the second with control and data flows, and the last one, more specific to object-
oriented programming, consists of refactorings that alter the type hierarchy of a program.
In Chapter 7 finally, we discuss more related work, compare our language with other
approaches, and highlight directions for future work.
Three appendices have been attached to the thesis. Appendix A is a reference for the syn-
tax of our scripting language. Appendices B and C are example scripts of complex refactoring
transformations.
Chapter 2
Design of the language
In this chapter, we introduce informally the features of JunGL — short for Jungle Graph
Language — that make it appealing to the specific domain of scripting refactoring transfor-
mations. JunGL borrows features both from functional ML-like languages and from logic
languages. It differs, however, from earlier approaches to combining these two styles
of programming, such as LogLisp [RS82]. First, our language focuses on querying and ma-
nipulating a representation of a program. Second, the logical features of JunGL are mostly
based on a variant of Datalog, a database query language with a very declarative semantics,
that we shall introduce in the next chapter.
Here, we illustrate most constructs with excerpts from a common JunGL script. That
script describes functions and predicates for manipulating a toy imperative language, called
While.
2.1 ML-like features
JunGL is primarily a functional language in the tradition of ML [MTHM97]. Like ML, it
has important features such as pattern matching and higher-order functions, while allowing
the use of updatable references. The advantages of this type of programming language in
compiler-like tools are well known [App98]. As a very brief illustration of the style of definition,
here is the ‘map’ function that applies a function f to all elements of a list l :
let rec map f l =
  match l with
  | [] → []
  | x::xs → (f x) :: (map f xs)
That is, map is recursively defined: in the body, we examine whether l is empty or whether
l consists of an element x followed by the remaining list xs . In the latter case, we apply f to
x and recurse on xs .
The function map can now be used as in:
let succ = fun x → x + 1 in
map succ [1; 2; 4]
We first define another function succ that takes an integer and returns its successor, and
we then ask for the result of mapping succ to the list [1; 2; 4]: that is [2; 3; 5].
2.1.1 Types
Usual ML types As we see from our previous example, functions are first-class values. A
function can be passed to another function as a parameter, be assigned to a variable (such as
succ above) and be returned as a result of a function. So are lists (e.g. [1; 2; 4]), and tuples
(e.g. (1, 2, 4)).
Obviously, in addition to functions, lists and tuples, JunGL also manipulates basic types:
booleans, numeric values and strings. Numeric values currently consist only of integers. In
the course of designing JunGL, we omitted reals since we could not think of an application
for them. The refactoring research community has proposed, however, to locate refactoring
opportunities using metrics [SSL01]. One could also use heuristics to guide the transforma-
tions that cannot be optimally defined. For those applications, we agree that support for
real values would be convenient. Adding such support would, of course, be straightforward.
Streams In addition to the primitive types common to ML-like languages, JunGL offers
streams as another built-in data type. Streams are lazily evaluated lists and do not come
built-in in strict languages like ML. In JunGL, we use streams exactly like one would use lists
and list comprehensions in a lazy functional language such as Haskell [Bir98]. Indeed, as we
explain in more detail later in this chapter, answers to lazily evaluated predicates are returned
as streams. This often allows us to specify a search problem in a nice, compositional way:
generate a stream of successes, take the first one and no further elements will be computed.
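The effect can be mimicked with Python generators (an analogy only; streams are built into JunGL itself):

```python
produced = []

def successes():
    # Lazily yield multiples of 7, recording how much work was done.
    for n in range(1, 1000000):
        produced.append(n)
        if n % 7 == 0:
            yield n

first = next(successes())   # take the first success
```

Only the elements up to the first match are ever generated: `produced` ends up as `[1, 2, 3, 4, 5, 6, 7]`, not a million entries.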
AST custom data types Another difference lies in custom data types. Traditionally,
in the family of ML languages, there are two ways of building custom types: records and
algebraic data types. A record data type is a user-defined data structure that encapsulates
labeled, possibly mutable, fields. An algebraic data type is an immutable data type each of
whose values is data from other data types wrapped in one of the constructors of the data
type. Algebraic data types can be recursively defined. They are commonly used to represent
abstract classes of a specific kind of data. In meta-programming notably, they allow the
concise definition of Abstract Syntax Tree grammars.
In JunGL we need only one way for constructing custom data types, and that is exactly
for defining the Abstract Syntax Tree structure of a program. Algebraic data types seem
to be the data types of choice. Nonetheless, we also wish to perform destructive updates
on the program in order to apply a transformation without having to rebuild a whole new
copy of the tree, and so we need record types with mutable fields. In fact, we even wish
to manipulate incomplete program trees. Hence we make fields optional, i.e. fields can be
assigned the value null whose type is the bottom of all types.
Algebraic data types force each alternative of the abstract class to be a constructor,
whereas in Abstract Syntax Trees it is common to have an abstract class directly under
another abstract class. For instance, in While, an expression - which is abstract - can either
be a variable with a mutable name or a literal - also abstract - which in turn can be anything
among true, false and the set of all integers. If one were to describe the grammar for these
abstract syntax trees in Caml [LDG+04], one would mix records and constructors and write
something like:
type variable = { mutable name : string }
type integer = { mutable value : int }
type literal =
  | True
  | False
  | Int of integer
type expression =
  | Var of variable
  | Literal of literal
In JunGL, whose design is tuned for the specific manipulation of Abstract Syntax Trees,
we have opted for a more concise notation:
type Expression =
  | Var = { name : string }
  | Literal = (
      | True
      | False
      | Int = { value : int }
    )
Arguably, the hierarchy of AST nodes is more readable when expressed in JunGL. An
abstract Expression is either a concrete record Var, which holds the name of a variable, or
an abstract data type Literal, which in turn is any of the concrete records True, False and
Int. In the remainder of the thesis, we refer to these custom types as AST data types.
We give in Figure 2.1 the full definition for representing abstract syntax trees of While
programs. We shall see in Section 2.5, when we describe the toolkit around JunGL, that
AST data types can be further annotated, for instance with pretty printing instructions.
type Program = { statements : Statement list }

and Statement =
  | WhileLoop = { condition : Expression; body : Statement }
  | If = { condition : Expression; thenBranch : Statement;
           elseBranch : Statement }
  | VarDecl = { typeRef : Type; name : string }
  | Assignment = { var : Var; expression : Expression }
  | Block = { statements : Statement list }
  | Print = { expression : Expression }

and Expression =
  | Var = { name : string }
  | Literal = (
      | True | False
      | Int = { value : int }
    )
  | InfixOperation = { left : Expression; operator : InfixOperator;
                       right : Expression }
  | PrefixOperation = { operator : PrefixOperator;
                        operand : Expression }
  | ParenthesisedExpression = { expression : Expression }

and InfixOperator =
  | And | Or
  | Add | Sub
  | Mul | Div
  | Equal | NotEqual
  | LessThan | GreaterThan

and PrefixOperator =
  | Not | Plus | Minus

and Type =
  | IntType
  | BoolType

Figure 2.1: Data types for Abstract Syntax Trees in While
Constructing AST values Having introduced AST data types, we ought to say a brief
word about how we construct particular values. Again, the syntax is a mix between algebraic
data types and records.
new If {
  condition = new Var { name = "b" },
  thenBranch = new Print {
    expression = new Int { value = 0 }
  }
}
builds the While code for if (b) print(0);. Unlike for traditional record types, we do not
need to specify all fields of a concrete AST data type. Here, we do not assign any value to the
else branch of the if for instance, and so its value is simply the null value. Also, since updates
of the tree are possible, one could build the same value through a sequence of instructions:
let ifStmt = new If {} in
ifStmt.condition ← new Var { name = "b" };
ifStmt.thenBranch ← new Print {
  expression = new Int { value = 0 }
};
ifStmt
As we see, JunGL has no support for code quotations yet. This could be addressed in
future work together with the integration of a GLR parser. We discuss these additions in
Chapter 7.
Summary To wrap up this section on types, we give the grammar of available types in
JunGL:
τ ::= bool | int | string | Node | τ list | τ stream | τ → τ | τ × · · · × τ | unit
The type Node includes all AST data types, → is the built-in constructor for function types,
and × the built-in constructor for tuple types. The unit type is just the type of the empty
tuple (); it is similar to the type void in C-like imperative languages.
2.1.2 Pattern matching
Pattern matching is an important feature of functional and term rewriting languages. It
enables the powerful processing of data based on its structure. In fact, it is the only way of
processing data of a constructed data type. Let us come back to our first map example:
let rec map f l =
  match l with
  | [] → []
  | x::xs → (f x) :: (map f xs)
We process differently the empty list and the list whose head and tail can respectively be
assigned to x and xs . Pattern matching is really the only way to extract the head of a list,
and a JunGL function that does the job for any list would be defined as:
let head l =
  match l with
  | [] → error "empty list"
  | x::_ → x
If head is called on the empty list, then we raise an error. Otherwise, we yield x whose
value comes from the first element of the list. The character ‘_’ denotes a don’t-care pattern
that can match virtually anything.
We use similar pattern matching to deconstruct and process tuples and AST data types.
To illustrate briefly, we give the definition of a function that recursively traverses an expres-
sion to collect a list of encountered variables:
let rec concat l1 l2 =
  match l1 with
  | [] → l2
  | x::xs → x :: (concat xs l2)

let rec collectVariables expr =
  match expr with
  | Var → [ expr ]
  | InfixOperation { left = l, right = r } →
      concat (collectVariables l) (collectVariables r)
  | PrefixOperation { operand = e } →
      collectVariables e
  | ParenthesisedExpression { expression = e } →
      collectVariables e
  | _ → []
Pattern matching is very appealing in the context of term rewriting and program
transformation, so appealing that Scala, which embeds some pattern matching constructs,
was recently suggested as a way to implement some simple refactorings in Eclipse [Fal07].
Another example is Tom [BBK+07], an extension of Java designed to manipulate tree
structures and XML documents. One of its attractions, among many others, is the ability
to do pattern matching in Java. It even provides Associative-Commutative matching, which
not only would be very useful in the context of JunGL, but also would fit nicely with the
logical features we are about to describe.
However, pattern matching in that form is nice but not powerful enough for our appli-
cation of scripting refactoring transformations, where we often need to collect information
about the program tree. Therefore, JunGL also supports generic queries which are more
appropriate for such purposes. As an example, one could express the same earlier function
as the following query:
let predicate descendant(?x, ?y) =
  child(?x, ?y) | local ?z : descendant(?x, ?z) & child(?z, ?y)

let collectVariables expr =
  { ?v | descendant(expr, ?v) & (?v is Var) }
In words, we define a recursive predicate descendant that holds for the two logical variables
?x and ?y if ?y is a child of ?x , or if ?y is a child of an intermediate node ?z , which is itself a
descendant of ?x . Then we use that predicate in a comprehension, on the last line, to search
for all nodes ?v that are descendants of expr and are of type Var .
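For intuition, the effect of such a query can be sketched in Python as an explicit traversal over child edges; the dict-based node shape is our own invention, not JunGL's, and the root is included since the query allows zero child steps:

```python
def collect_variables(expr):
    # Gather every node reachable from `expr` by zero or more child
    # edges, keeping those of kind "Var".
    result, worklist = [], [expr]
    while worklist:
        node = worklist.pop(0)               # breadth-first traversal
        if node["kind"] == "Var":
            result.append(node)
        worklist.extend(node.get("children", []))
    return result

ast = {"kind": "InfixOperation", "children": [
    {"kind": "Var", "name": "a", "children": []},
    {"kind": "ParenthesisedExpression", "children": [
        {"kind": "Var", "name": "b", "children": []},
    ]},
]}
names = [v["name"] for v in collect_variables(ast)]
```

The declarative query spares the script author from spelling out any such traversal strategy by hand.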
Furthermore, JunGL supports path queries as a convenient shorthand for regular queries.
The above program can hence be abbreviated to:
let collectVariables expr =
  { ?v | [expr] child* [?v : Var] }
The path query in the comprehension, recognisable by the use of square brackets around node
variables, should be read as “a path from node expr to a node ?v of type Var following, zero
or more times, a direct child edge in the program tree”. Here, the child edges are built-in,
but as we shall see, new edges can easily be defined.
These logical constructs enable the search for complex patterns using a variety of tree
traversals. They present an alternative to usual solutions for traversing a tree using different
search strategies. In functional programming, different kinds of tree traversal are usually
achieved by the use of combinators [Spi00, LV02]. In Tom or in Stratego [BKVV06], built-in
or constructed strategies are used to control tree traversals. Yet, in order to find complex
patterns in the tree, a context may have to be carried over during these search strategies.
For instance, in [BMR07], Balland et al. parameterise a search strategy with a map of labels
to nodes in order to collect these labels and traverse bytecode instructions based on their
control flow. Another example of context propagation, summarised in [BvDOV06] by the
Stratego people, is the introduction of dynamic rewrite rules for expressing context-sensitive
transformations.
In JunGL, user-defined edges provide a mechanism to turn the tree into a directed graph
by mapping nodes to other nodes in the program tree, thus allowing scripts to refer to contextual
information. That mechanism is very similar to the use of reference attribute grammars,
which has proved very successful for the construction of compilers [EH07]. In the following
section, we introduce the logical features of JunGL for building such a directed graph and
for querying it.
2.2 Logical features
Typically we wish to super-impose some graph structure on top of the object program tree,
run a number of queries on that graph to find out specific information, and then make some
destructive updates to the underlying tree. As we illustrated, a functional language is not
ideal for querying a graph structure; logic languages, in the Datalog tradition, are much
better suited to that task.
2.2.1 Predicates
The notion of predicates in our language effectively makes JunGL a hybrid functional and
logic language. Predicates are built from conjunctions (&), disjunctions (|), negations (!),
a first operator (similar in spirit to the cut operator in Prolog), calls to other predicates,
tests, and path queries, to which we shall dedicate a special section. Furthermore, we allow
recursion inside predicates, though under some conditions, which we shall explain later in
the thesis.
JunGL is therefore akin to early attempts at integrating logic features into functional
languages, such as LogLisp [RS82] or the embedding of Prolog in Haskell proposed by Mike
Spivey and Silvija Seres [SS99]. Importantly, however, we have not found it necessary to
import the full power of a logic language such as Prolog, and in particular there is no use
of unification in the implementation. Our logical features are instead based on Datalog
(essentially Prolog minus data structures as we shall see in Chapter 3), which provides just
the right balance of expressive power with an efficient implementation. With Datalog on finite
structures, in contrast to Prolog, it is impossible to output an infinite stream of successes.
This difference appears to be crucial when it comes to building a graph on top of the program
tree and querying it. In JunGL, we guarantee that all queries terminate even on a cyclic
graph such as the control flow of a program. We shall elaborate on this issue in Section 2.3.
Predicates can be named just like functions, by using the keyword predicate in a let
binding:
let predicate sibling(?x, ?y) =
  [?x : VarDecl] & [?y : VarDecl] & ?x != ?y & ?x.name == ?y.name
This predicate looks for two sibling variables in the whole program, that is, two distinct
variable declarations with equal names.
When integrating a functional and a logic language, the key question is how we use
predicates in functions, and vice versa. In JunGL, one can use predicates in functions via a
stream comprehension. More precisely,
{ ?x | p(?x) }
will return a stream of all x that satisfy the predicate p. For instance, the following expression
returns all pairs of sibling variables in a loaded While program:
{ (?x, ?y) | sibling(?x, ?y) }
Note again that logical variables such as ?x are prefixed by a question mark to distinguish
them from normal variable names. One can use expressions as arguments in predicates, but
obviously all logical variables in such an expression must be bound.
Logical terms do not have to be named if their value is of no interest. As in functional
pattern matching, ‘_’ denotes a used-once free variable that can match anything. It is thus
possible to write:
{ ?x | sibling(?x, _) }
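To make the behaviour of stream comprehensions concrete, here is an illustrative sketch in Python, not in JunGL: the sibling predicate is modelled as a lazy generator over a small hypothetical table of variable declarations, and projections mirror the comprehensions above. All names and data here are invented for illustration.

```python
# Hypothetical mini-model: a variable declaration is a (node_id, name) pair.
var_decls = [(1, "x"), (2, "y"), (3, "x"), (4, "z"), (5, "y")]

def sibling():
    """Yield all ordered pairs of distinct declarations with equal names,
    analogous to { (?x, ?y) | sibling(?x, ?y) } in JunGL."""
    for a in var_decls:
        for b in var_decls:
            if a != b and a[1] == b[1]:
                yield (a, b)

# Like a JunGL stream, the generator is lazy: matches are produced on demand.
pairs = list(sibling())
# Projecting on the first component mirrors { ?x | sibling(?x, _) }.
firsts = sorted({a for (a, _) in sibling()})
```

The wildcard of the JunGL comprehension corresponds here to simply discarding the second component of each match.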
2.2.2 Lazy edges
The tree representing the program we wish to query does not by itself contain enough
information to encode non-naive refactorings, which require information beyond pure syntax. For
instance, one ought to know where a given variable is declared. Similarly, one might expect
to have access to the control-flow successors of a statement.
The solution we have opted for relies on the ability to super-impose contextual semantic
information on top of the tree representation of the program. Initially, that representation
is just a forest of ASTs for all the compilation units, whose edges simply indicate child and
parent relationships. We allow the addition of further relationships via lazy edge definitions.
By “lazy”, we mean that an edge is only evaluated when it is required. Hence the initial tree
is turned into a directed graph in a demand-driven manner.
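The demand-driven mechanism can be sketched as follows in Python. This is a toy model, not the actual JunGL implementation: an edge is a function from a node to its targets, evaluated only on first access and cached on the node thereafter.

```python
# Illustrative sketch of lazy, cached edges (all names are hypothetical).

class Node:
    def __init__(self, label):
        self.label = label
        self._edge_cache = {}   # per-node cache of evaluated edges

EDGE_DEFS = {}     # edge name -> function(node) -> list of target nodes
EVALUATIONS = []   # trace, to make the demand-driven evaluation visible

def edge(name, node):
    """Return the targets of `name` from `node`, evaluating at most once."""
    if name not in node._edge_cache:
        EVALUATIONS.append((name, node.label))
        node._edge_cache[name] = EDGE_DEFS[name](node)
    return node._edge_cache[name]

a, b = Node("a"), Node("b")
EDGE_DEFS["next"] = lambda n: [b] if n is a else []

assert EVALUATIONS == []      # nothing computed until demanded
edge("next", a)
edge("next", a)               # second access hits the cache
```

Under this reading, turning the tree into a graph is simply a matter of which edges are ever demanded.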
To illustrate the definition of edges, we shall describe how the declaration of a variable
in While can be looked up very simply, just by defining an extra lazy edge that relates a
variable reference to its declaration.
First, we create an edge treePred to reflect a special traversal strategy based on tree
predecessors. The definition of an edge always follows the same pattern, that is:
let edge treePred n → ?pred =
  ...
Here, the name of the edge is treePred. The variable n captures the source node of the
edge, and ?pred is a logic variable that is to match the target of a possible edge emanating
from n. The body of an edge is then defined as a relation between n and ?pred . We therefore
complete our example as follows:
let edge treePred n → ?pred =
  first ([n] listPredecessor [?pred] | [n] parent [?pred])
In words, it says that ?pred is the tree predecessor of n either:
1. if n is in a list and ?pred is the direct predecessor of n in that list, or
2. if ?pred is the direct parent of n.
In addition, the operator first is used to select only the first of the two possible matches, thus
returning the parent of n only if n has no list predecessor. If n has neither list predecessor
nor parent, then there is no match for ?pred : it is a failure and there is no treePred edge
emanating from n. On the AST of a program, following transitively treePred edges from a
given node n just builds up a path from n to the root of the AST, where all parents of n and
their list predecessors are visited.
The two alternatives around the union operator ‘|’ are path queries that we shall detail
shortly. Right now, it is enough to understand that the body of an edge definition is just a
predicate that must hold for any target of that edge. This explains the difference in notation
between the source and target nodes in the definition: The question mark in ?pred indicates
a free variable that must be substituted with all possible targets of edges outgoing from the
single node n. Such an asymmetry makes sense in the presence of the operator first. Indeed,
if we were considering a relation treePred with symmetric roles for both the source and target
nodes, then the operator first would apply to the whole relation, and we would get only one
treePred edge outgoing from the first node that has either a left sibling or a parent. Here
first is implicitly parameterised by the variable n. The asymmetry allows us to reason, more
simply, about targets from a single node only.
Armed with the treePred search strategy, it is very easy to define the edge that binds a
variable to its declaration. In our toy language While, it suffices to climb up the tree and
look for the first declaration of a variable whose name matches the name of the variable we
are trying to resolve.
let edge lookup r : Var → ?dec =
  first ([r] treePred+ [?dec : VarDecl] & r.name == ?dec.name)
Interestingly, the source r of the edge is here accompanied by the AST data type Var .
This means the edge lookup will only be defined from nodes that are of type Var , i.e. from
variable references. The body of the edge definition then reads as follows: follow one or more
treePred edges from r until a node of type VarDecl is found, with a name equal to the name
of the variable defined in r . The use of first forces the evaluation of results with respect to
the traversal order, and yields only the first match if there is one.
The edges treePred and lookup will only be constructed when we try to access them from a
specific node. This mechanism of lazy edge construction is very convenient when introducing
new tree nodes, as it often relieves us of the burden to laboriously construct all the auxiliary
information on new nodes. Without it, scripts would quickly become prohibitively complex
because we would have to remember to construct all relevant edges when creating new graph
nodes, and also inefficient. All computed information on the AST is handled in this way, so
for example edges for representing the control flow of a program are also represented as lazy
edges. We shall now describe that example together with another feature, namely attributes.
In some cases, it is useful to enrich a node with some value, rather than linking it to other
existing nodes. For that purpose we use attributes. The value of an attribute may be of
any type, and notably be a freshly created node. Indeed, it is sometimes convenient to add
dummy nodes to the original program tree, especially to make the super-imposition of edges
more natural. To illustrate in a concrete setting, the definition of the control-flow graph of a
program is more readable in the presence of special dummy nodes entry and exit, attached
to the root node of any program.
type Entry
type Exit

let attribute entry p : Program = new Entry {}
let attribute exit p : Program = new Exit {}
Here, we define two new AST data types and two new attributes for representing the
entry and the exit of any node of type Program. The values of these attributes are just a
new Entry node and a new Exit node respectively.
We can now use these attributes to define the control-flow successors of any statement,
again as lazy edges. The following edge definition specifies the control-flow successors of
ordinary statements such as assignments:
let edge defaultCFSucc x : Statement → ?y =
  first ([x] listSuccessor [?y]
       | [x] parent [?y : WhileLoop]
       | [x] parent ; defaultCFSucc [?y]
       | [x] parent ; exit [?y]
       )
The edge listSuccessor is a built-in edge that relates a node present in a list in the original
AST (such as a statement in a block) to its successor in the same list. The default control-flow
successor of a statement is therefore the first match among the following ordered alternatives:
the next statement in the list; otherwise the direct parent, if it is a while loop (to encode
the iteration); otherwise the default successor of the parent (typically when escaping a block
or the branches of a conditional); or otherwise, finally, the dummy exit node of the program.
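The cascade of ordered alternatives can be sketched in Python under an invented node model: the first alternative that applies wins, mirroring the first operator.

```python
# Sketch of the defaultCFSucc cascade (node structure is hypothetical).

class S:
    def __init__(self, kind, parent=None, list_succ=None):
        self.kind, self.parent, self.list_succ = kind, parent, list_succ

EXIT = S("Exit")

def default_cf_succ(x):
    if x.list_succ is not None:     # next statement in the list
        return x.list_succ
    p = x.parent
    if p is None:                   # no enclosing statement left: exit
        return EXIT
    if p.kind == "WhileLoop":       # loop back to the while test
        return p
    return default_cf_succ(p)       # escape a block or a branch

loop = S("WhileLoop")
block = S("Block", parent=loop)
s1 = S("Assignment", parent=block)
s2 = S("Print", parent=block)
s1.list_succ = s2

assert default_cf_succ(s1) is s2    # next in the list
assert default_cf_succ(s2) is loop  # last in the loop body: back to the loop
```

The recursion through the parent is exactly the third alternative, [x] parent ; defaultCFSucc [?y], read operationally.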
We now need to give the exact control-flow successors for each kind of statement, and
we do that via the following definitions:
let edge cfsucc x : Statement → ?y = [x] defaultCFSucc [?y]

let edge cfsucc x : Block → ?y =
  first ([x] firstChild [?y] | [x] defaultCFSucc [?y])

let edge cfsucc x : If → ?y =
  [x] thenBranch [?y]
  | first ([x] elseBranch [?y] | [x] defaultCFSucc [?y])

let edge cfsucc x : WhileLoop → ?y =
  [x] body [?y] | [x] defaultCFSucc [?y]
We see here that overriding is allowed in edge definitions. The cfsucc edge definition for
Block overrides that for Statement. The definition used to compute edges emanating from a
given node x is resolved by inspecting the type of x at runtime. As expected, the most specific
edge definition is always the one used. Hence the control-flow successors of a variable
declaration (a node of type VarDecl), an assignment (Assignment) or a print statement
(Print) are all computed by evaluating the first cfsucc edge definition. The latter three edge
definitions are for specific kinds of statements: the successor of a block is either its first child
or, if the block is empty, its default successor; the successors of a conditional statement are
both the then branch and either the else branch or the default successor of the if; as for a
while loop, its successors are both its body and its default successor.
Note that the definition for if statements is valid for well-formed programs only. Nevertheless,
in JunGL, it would be easy to cope with ill-formed programs too, and handle the
control-flow graph of a program that is not syntactically complete. To illustrate briefly, here
is how we can cope with missing then branches in conditional statements:
let edge cfsucc x : If → ?y =
  first ([x] thenBranch [?y] | [x] defaultCFSucc [?y])
  | first ([x] elseBranch [?y] | [x] defaultCFSucc [?y])
At this stage, we have already given several sample definitions of edges. Looking at the
body of them more closely, we can see quite a few references to edges that were not introduced
through a proper let edge definition. Most of them simply correspond to some labeled field
of an AST data type (e.g. thenBranch, body) or to some additional attribute introduced via
let attribute (e.g. entry). The others, as we have mentioned sometimes, are built-in edges
that relate nodes to their immediate neighbours in all possible directions in the tree. They
are summarised in Table 2.1. A child of a node x is any node directly under x or any node
in a list of nodes directly under x. The order of children is given by the position of fields in
the AST data type and, for fields that are lists of nodes, by the list order. The successor y
of a node x whose parent is p, and whose position is i with respect to all children of p, is the
child of p at position i + 1, if it exists. However, y is a list successor of x only under the
additional constraint that x and y appear in the same list.
Name              Points to
parent            the parent of the node, if any
child             all the children of the node, if any
firstChild        the first child of the node, if any
lastChild         the last child of the node, if any
successor         the right sibling of the node, if any
predecessor       the left sibling of the node, if any
listSuccessor     if the node is present in a list of nodes,
                  the successor of the node in the same list, if any
listPredecessor   if the node is present in a list of nodes,
                  the predecessor of the node in the same list, if any

Table 2.1: Built-in edges in JunGL
In order to understand edge bodies more precisely, we now turn to introducing path
queries.
2.2.3 Path queries
The most common way of constructing predicates is via path queries, also called regular path
queries. Path queries are regular expressions for checking properties about individual paths
(existential queries) or about all paths (universal queries) on a graph representation of a
program. Path queries are of course very well-known in the context of semi-structured data,
but have only been revisited fairly recently for the specific purpose of querying the control flow
of programs by De Moor et al. in [dMLVW03]. Liu et al. then proposed parametric regular
path queries [LRY+04], which slightly increase the expressiveness by allowing additional
information to be collected along single or multiple paths. Even more recently, Liu introduced
an intuitive syntax to use path queries for querying any complex graph [LS06]. Path queries
in JunGL follow the general idea of that syntax. The semantics are however different as our
path queries yield results in a deterministic order.
Path queries are very intuitive and we have already seen many examples in previous edge
definitions. For instance,
let edge treePred n → ?pred =
  first ([n] listPredecessor [?pred] | [n] parent [?pred])
There, we have two simple path queries on both sides of the ‘|’ operator. The path compo-
nents between square brackets are conditions on nodes, whereas listPredecessor and parent
match either:
1. the type of an edge emanating from a node, or
2. the type of an attribute attached to a node, or
3. the name of a field defined in a node.
For simplicity, however, we always call “edge” any component between two node blocks.
In addition, we refer to the first node component as the start node, and to the second node
component as the end node.
Each node component consists of a variable (logical or not) whose type is an AST data
type. It can be annotated with a positive or negative AST data type reference (for instance
[?pred:Statement] or [?pred:!Statement]) to constrain the possible matches to nodes
that are, or are not, of type Statement.
An edge can be a simple label like above, or a more complex expression. Notably, edges
can be sequentially composed using ‘;’. It is also possible to append a ‘+’ or ‘*’ to an edge l .
The former is simply the transitive closure of the edge relation, meaning that the end node
can be reached from the start node by following one or more matches of l . The latter is the
reflexive transitive closure of the edge relation, which on top of the transitive closure allows
the end node and the start node to be identical.
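A cycle-safe operational reading of l+ and l* can be sketched in Python: reachability over an edge relation, visiting each node at most once, which is also why such queries terminate on cyclic graphs. The edge relation below is an invented example.

```python
# Sketch: l+ and l* as cycle-safe reachability over an edge relation.

def plus(edges, start):
    """Nodes reachable by one or more edges (the l+ closure)."""
    seen, frontier = set(), [start]
    while frontier:
        n = frontier.pop()
        for m in edges.get(n, []):
            if m not in seen:          # each node explored once: terminates
                seen.add(m)
                frontier.append(m)
    return seen

def star(edges, start):
    """Reflexive transitive closure: l* also admits the start node itself."""
    return plus(edges, start) | {start}

cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}   # a cycle, as in a CFG
assert plus(cyclic, "a") == {"a", "b", "c"}     # a reaches itself via the cycle
```

Note that on the cyclic example, ‘a’ belongs to its own l+ closure because the cycle provides a non-empty path back to it.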
We often need further expressive power in order to match a complex pattern where each
node on a transitive path has a side condition. For that purpose we allow, like in [LS06], the
use of existential local variables inside an edge expression.
As an illustration, we shall define strict post-dominance between statements in a control-
flow graph, but to better appreciate the definition in JunGL, we first give the precise definition
given by Muchnick in [Muc97]. There, post-dominance and strict post-dominance are defined
as follows:
In the control-flow graph, node p post-dominates node i, written p pdom i, if
every possible execution path from i to exit includes p.
[. . .]
Node p strictly post-dominates node i if p pdom i and p ≠ i.
In JunGL, the edge definition for strict post-dominance reads:
let edge postDominates x : Statement → ?y =
  [?y : Statement] cfsucc+ [x] &
  !([?y] (local ?z : cfsucc [?z] & ?z != x)+ [: Exit])
That is, x post-dominates ?y if x is a transitive successor of ?y in the control-flow graph and
there is no path from ?y to the exit that does not go through x , i.e. whose intervening nodes
?z are all different from x . Note that we assume there is a path from each node to the exit,
which is reasonable.
The key here is the use of the locally scoped variable ?z, which is substituted with a
different node at each step on the path from ?y to the exit. These local variables greatly
improve the expressive power of path queries and, as we see, allow the concise and readable
expression of the complex control-flow and dataflow properties that one finds in compiler or
program analysis books [Muc97, NNH99]. A detailed description of the syntax of path queries can be
found in Appendix A, where the full grammar of JunGL is exposed.
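A direct, if naive, rendering of the strict post-dominance definition can be sketched in Python on an explicitly encoded control-flow graph. The encoding and the small diamond-shaped example are invented for illustration.

```python
# Sketch of the postDominates query: p strictly post-dominates i iff p != i
# and there is no path from i to the exit avoiding p.

def path_avoiding(cfsucc, start, avoid, target):
    """Is there a path start ->* target whose nodes after `start` skip `avoid`?"""
    seen, frontier = set(), [start]
    while frontier:
        n = frontier.pop()
        if n == target:
            return True
        for m in cfsucc.get(n, []):
            if m != avoid and m not in seen:
                seen.add(m)
                frontier.append(m)
    return False

def post_dominates(cfsucc, p, i):
    return p != i and not path_avoiding(cfsucc, i, p, "exit")

# Diamond CFG: i branches to a or b, both rejoin at p, then exit.
cfg = {"i": ["a", "b"], "a": ["p"], "b": ["p"], "p": ["exit"]}
assert post_dominates(cfg, "p", "i")      # every path goes through p
assert not post_dominates(cfg, "a", "i")  # the path via b avoids a
```

The avoid parameter plays the role of the negated path query over the local variable ?z, and the p != i conjunct enforces strictness.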
2.3 Computational model
Now that we have presented the main features of the language, as well as the program tree
structure that is manipulated, one may naturally wonder about the computational model of
JunGL. In this section, we describe how functional and logical features interact with each
other and with the underlying program tree structure.
In particular, we shall highlight the declarative nature of the logical features and explain
how we deal with issues like cycles in the program graph and termination of recursive queries.
We shall also discuss the interaction of lazy queries and destructive updates, a common issue
in query languages with update facilities.
Declarative edge definitions At first sight, some edge definitions in JunGL may seem
to go against a declarative reading. This impression notably comes from the use of the first
operator, as in our earlier example:
let edge defaultCFSucc x : Statement → ?y =
  first ([x] listSuccessor [?y]
       | [x] parent [?y : WhileLoop]
       | [x] parent ; defaultCFSucc [?y]
       | [x] parent ; exit [?y]
       )
The operator first is indeed reminiscent of a cut operator in impure logic programming.
However, the presence of first does not give any insight into the actual evaluation mechanism
of our queries. We shall see in the coming chapters that all our logical features in fact
translate to a variant of Datalog, a database query language with a declarative least fixpoint
semantics. Datalog programs can be evaluated in multiple ways, either top-down or
bottom-up, and authors of JunGL queries do not need to be aware of the precise evaluation
mechanism. The declarative nature of the logic features of JunGL lies in the existence of
such a hidden mechanism for evaluating and optimising logic queries, which we shall describe
in Chapter 4.
Termination for cyclic graphs Non-termination issues may naturally arise when dealing
with edge definitions that introduce cycles in the program graph. This may for instance be the
case in the above example of the defaultCFSucc edge, which is used for building the control
flow graph of a program on top of its AST. At this point, the original tree structure of the
AST is transformed into an arbitrary, possibly cyclic, graph. How can we then guarantee the
termination of queries for lazily constructed graphs? Very simply, we have a finite number of
initial AST nodes, and by ensuring that we only add edges between those nodes and never
retract some, we are guaranteed to compute a stable view of the final graph.
Indeed, edge definitions are part of the logical features of JunGL, and fully translate
to our variant of Datalog. We shall actually give the precise translation of defaultCFSucc
in Section 4.5.5. Hence, there is no way in JunGL to create an arbitrary edge between two
nodes programmatically. Edges are logical relations between nodes, and evaluated as Datalog
predicates. In other words, they are intensional views on the ground facts of the original AST
of the program. As we shall see in Chapters 3 and 4, the termination of queries is therefore
guaranteed by the Datalog framework we build on. However, AST nodes may be modified,
created or deleted via the functional features of the language, which leads us to discuss the
tricky issue of update facilities in a query language.
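The termination guarantee can be illustrated with a naive bottom-up Datalog evaluation, sketched in Python: iterate the rules until no new facts are derivable. On a finite node set the least fixpoint is reached after finitely many rounds, even when the edge facts form a cycle. The fact base is an invented example, and real engines use semi-naive evaluation rather than this naive loop.

```python
# Naive bottom-up evaluation of:  reach(x,y) :- edge(x,y).
#                                 reach(x,z) :- reach(x,y), edge(y,z).

def reach(edge_facts):
    facts = set(edge_facts)
    while True:
        new = {(x, z)
               for (x, y) in facts
               for (y2, z) in edge_facts
               if y == y2}
        if new <= facts:
            return facts        # fixpoint: nothing new is derivable
        facts |= new

cyclic_edges = {("a", "b"), ("b", "c"), ("c", "a")}
result = reach(cyclic_edges)
assert ("a", "a") in result     # reachable around the cycle
```

Because facts only ever accumulate over a finite universe of nodes, the loop must stabilise; this mirrors the monotone, never-retracting construction of edges in JunGL.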
Mixing queries and destructive updates We have shown in the previous sections how
to refer to predicates in stream comprehensions, and how to use functional expressions as
arguments in predicates, or as additional constraints. It may therefore be possible to perform
updates to the underlying AST of the program while evaluating logical edges or predicates.
As can be foreseen, however, mixing declarative queries with such updates is likely to result
in weird evaluation behaviours, including non-termination.
Implementers of relational databases have been aware for more than thirty years of this
issue, which is commonly referred to as the Halloween problem. A precise account of the
history of the problem and an explanation of its name can be found in [Fit02]. The issue
is well illustrated with the following classical example. Say that for every row in a table,
you insert another row in that same table. If no special care is taken, new inserts may
themselves trigger other inserts, thus leading to non-termination. To prevent this, most
databases implement some kind of snapshot semantics where queries are run on a copy of
the structure to be queried. In the above example, instead of working on the current table
that is being updated, the query would be evaluated on a snapshot of the original table.
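The classical example can be rendered in Python terms, with a list playing the role of the table. Iterating over a snapshot (a copy) is what makes the update terminate with one insert per original row; iterating over the live, growing table would keep feeding the loop its own inserts.

```python
# Sketch of the Halloween problem and the snapshot workaround.
# The table contents are an invented example.
table = [("alice", 100), ("bob", 200)]

# Iterating over a snapshot, as databases with snapshot semantics do:
snapshot = list(table)
for (name, salary) in snapshot:
    table.append((name + "_copy", salary))

assert len(table) == 4   # exactly one insert per original row
```

Had the loop iterated over table itself, each appended row would have been visited in turn and triggered yet another insert, and the loop would never terminate.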
These snapshot semantics are also at the root of the recent W3C recommendation for
XQuery Update Facility 1.0 [CFM+08]. There, the XQuery processing model is extended so
that the result of an expression consists of both a normal XQuery result and a pending update
list, which represents node state changes that have not yet been applied. If the outermost
expression in a query returns a non-empty pending update list, all the changes are implicitly
invoked at that point. In effect, XQuery Update Facility therefore defines an entire query as
one snapshot. Such snapshot semantics at the level of the entire query, however, prevent the
results of side effects from being seen during the computation of the query. To overcome
this limitation, an XQuery Scripting Extension has been proposed to define a deterministic
sequential order for XQuery expressions [CEF+08]. The snapshot granularity may hence be
reduced, with later expressions seeing the effects of the expressions that came before them.
In JunGL, we have not yet implemented any snapshot semantics. Currently, it is the
responsibility of the script author to ensure that the functions used in queries are side-effect
free. This is the same approach as in the attribute grammar system JastAdd [EH04] in which
attributes are expressed in Java, and hence may also have undesirable side effects. Another
issue, shared with attribute grammar systems, lies in the fact that any update to the the
underlying AST may invalidate previously constructed edges. So far, in our experiments, we
have always managed to mimic snapshot semantics and delay any update to the tree to the
end of the refactoring script, after which all lazy edges are invalidated. However, it would
be far preferable to maintain edges incrementally on every change. We discuss this
future work in Chapter 7.
Finally, one has to bear in mind that results of stream comprehensions are returned lazily in
JunGL. Therefore, like in other frameworks such as LINQ [MBB06], special care is required
when performing updates on the results of a query. Again, this could be solved with snapshot
semantics, but we have opted until now for a simpler common workaround: results of the
query can be cached with the built-in function toList for converting a stream to a list, thus
forcing its eager evaluation. An example of its use is given in the scripts in the appendices.
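The lazy-stream pitfall and the toList workaround can be sketched in Python, with a generator playing the role of a JunGL stream and list() playing the role of toList. The data is an invented example.

```python
# Sketch: force a lazy stream before updating the structure it queries.
nodes = ["a", "bb", "ccc"]

lazy = (n for n in nodes if len(n) > 1)   # nothing evaluated yet
forced = list(lazy)                       # like toList: cache the results

# It is now safe to update `nodes` while iterating over the cached results.
for n in forced:
    nodes.remove(n)

assert nodes == ["a"]
```

Iterating the lazy generator directly while removing elements would interleave the query with the update, with the same unpredictable behaviour described above.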
2.4 Other features
Namespaces We use namespaces to avoid name conflicts in the presence of many different
functions or data types. In Figure 2.1, the AST data types could have been defined inside
the namespace While.Ast for instance:
namespace While.Ast {
  ...
}
A previously defined namespace can then be imported through the using construct. The
type Program can be referred to as While.Ast .Program from other namespaces or indeed
directly as Program in a scope using the namespace While.Ast as in:
using While.Ast {
  ...
}
Foreach In order to iterate on streams, we have added a foreach construct to JunGL, which
is just syntactic sugar for an iter function on streams. This imperative loop construct still
enables pattern matching on the values of a stream at each iteration step. To illustrate, here
is how we would traverse the stream of sibling variables:
foreach (x, y) in { (?x, ?y) | sibling(?x, ?y) } do
  ...
External calls The problem we often encounter with Domain Specific Languages is that
there is always a need for an interaction we have not envisioned. That is the main practical
difference from embedded DSLs, where you can just make use of the full power of the host
language if required.
In particular, refactoring is an interactive process that often requires guidance from the
users. These interventions range from specifying a name or a set of methods during Rename
or Extract Interface refactorings, to resolving potential conflicts that might occur during a
transformation. The latter case is particularly useful when there is no obvious best solution
to the conflicts, and when we wish to minimise rejection of the transformation. Therefore, we
have added external UI features to JunGL. They are called as normal functions that belong
to particular namespaces.
For the purpose of demonstrating the use of JunGL in a broader context than just
refactoring, we have also added external functions for building up a small IDE for the object
language one wishes to manipulate. It is indeed possible to plug some program analyses
written in JunGL into the editor of the object language. For instance, the function
addErrorFinder in the namespace Editor is used to plug on-the-fly compiler checks into the editor.
To illustrate, we shall now describe the toolkit we have built around JunGL, and show some
further examples for the While object language.
2.5 The toolkit around the language
JunGL is part of a toolkit that aims to be a complete end-to-end solution for prototyping
refactoring transformations on any language. The system consists of four components
implemented on the .NET platform: a graph data structure, an interpreter for the scripts that
manipulate this data structure, and two editors for the object language, not the scripts. For a
rich interactive experience, refactoring tools commonly guide users through ‘wizards’. We do
not support such complex UI components but provide basic support for script authors to ask
for user input. More advanced interaction can be achieved via other calls to external code.
A diagram of the toolkit’s architecture extracted from [VPdM06] is depicted in Figure 2.2.
We briefly describe the main components here.
Figure 2.2: Overview of the toolkit
2.5.1 The graph structure
JunGL manipulates the graph through basic operations defined in a small interface. We
provide a default implementation of this interface in C#. Before the construction of lazy
edges, the graph is a tree whose grammar is defined in JunGL through custom AST data
types. We have given in Figure 2.1 an example of such a grammar. For a better integration
in our toolkit, AST data types can further be annotated. For instance, Figure 2.3 shows the
grammar example of Figure 2.1, this time with annotations.
The basic pretty printing annotation @pretty is used to render newly created nodes as
text. It only provides a basic, yet convenient, pretty-printing mechanism. It is not used to
describe the concrete syntax of the object language. Therefore the @parser annotation is
used to specify which parser needs to be called for building the AST of a program. One can
also create any AST from scratch, or update it programmatically. We enforce at runtime
that each node has one parent at most and that no cycle is introduced accidentally.
ASTs are turned into an actual graph only when lazy edges are evaluated. No edge can be
added imperatively. Each node has a list of edges that relate it to other nodes of the ASTs.
We apply the same caching techniques as the ones found in attribute grammar systems like
JastAdd [EH04]. Once an edge from a node n has been evaluated, it is cached in node n
until further modification of the underlying tree.
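This caching discipline can be sketched in Python with a deliberately coarse whole-graph invalidation scheme. The coarseness is an assumption made for brevity; systems like JastAdd track the consequences of a change more finely.

```python
# Sketch of cached edges with invalidation on tree modification
# (all names are hypothetical; invalidation is coarse-grained on purpose).

class Graph:
    def __init__(self):
        self.version = 0
        self.computed = 0          # counts actual edge evaluations

    def modify_tree(self):
        self.version += 1          # any tree update invalidates all caches

class CachedEdge:
    def __init__(self, graph, compute):
        self.graph, self.compute = graph, compute
        self.cache, self.cached_at = None, -1

    def get(self, node):
        if self.cached_at != self.graph.version:
            self.graph.computed += 1
            self.cache = self.compute(node)
            self.cached_at = self.graph.version
        return self.cache

g = Graph()
e = CachedEdge(g, lambda n: n.upper())
assert e.get("x") == "X" and e.get("x") == "X"
assert g.computed == 1             # the second access hit the cache
g.modify_tree()
e.get("x")
assert g.computed == 2             # the tree change forced re-evaluation
```

The version counter stands in for the "until further modification of the underlying tree" clause: a stale cache entry is simply recomputed on the next demand.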
To work with a different object language from scratch, one simply provides another
grammar via AST data types, along with the new parser. There is no support in JunGL for syntax
definitions but all the work on Generalized LR parsing techniques [Tom87] could be reused
here in order to make JunGL a complete end-to-end solution. However, our architecture
already makes it easy to leverage an existing strongly-typed AST implementation. All one
needs to do is to make the existing AST classes implement the interface that the JunGL
interpreter uses for manipulating trees and graphs. This is particularly convenient if one
wishes to run JunGL in an existing development environment for instance.
2.5.2 The interpreter
The JunGL interpreter follows the usual pattern of an interpreter for a functional language
in a functional language. Indeed, JunGL is implemented in F# [Sym05], a variant of ML that
runs on top of the .NET framework. Because F# is fully integrated in .NET, it allows us to
work across languages. In particular, we can use the C# implementation of the graph in our
F# programs and vice versa. For now JunGL, like most other scripting languages, is only
dynamically typed. In future work, one may want to augment the language with at least
some form of soft typing, to provide more static safety.
The most interesting part of the interpreter is therefore its treatment of logical features,
which we shall explore in detail in the coming chapters.
2.5.3 Editors
In addition to the interpreter itself, we have implemented two editors for programs written
in the object language:
• a text editor to which it is possible to add features implemented as JunGL scripts, and
• a structure editor for visualising the graph that is manipulated by JunGL.
Both editors use the pretty-printing annotations of the AST datatype definitions to render
the AST.
The purpose of the text editor is to demonstrate the use of JunGL in a broader context
than just refactoring. We shall show in the next section a few examples of features that one
can plug in this editor. For instance, definite assignment of variables can be enforced with
a tiny JunGL script and violations marked via red squiggles on the program text. Another
CHAPTER 2. DESIGN OF THE LANGUAGE 36
type
  @pretty("|($statements)|")
  @parser("JunGLAddins:JunGLAddins.Parsers.WhileParser.WhileParser")
  Program = { statements : Statement list }
and
  Statement =
  | @pretty("'while (' $condition ')' \n \t $body")
    WhileLoop = { condition : Expression ; body : Statement }
  | @pretty("'if (' $condition ')' \n \t $thenBranch \n [ 'else' \n \t $elseBranch ]")
    If = { condition : Expression ; thenBranch : Statement ; elseBranch : Statement }
  | @pretty("$typeRef ' ' $name ';'")
    VarDecl = { typeRef : Type ; name : string }
  | @pretty("$var ' = ' $expression ';'")
    Assignment = { var : Var ; expression : Expression }
  | @pretty("'{' \n \t |($statements)| \n '}'")
    Block = { statements : Statement list }
  | @pretty("'print(' $expression ');'")
    Print = { expression : Expression }
and
  Expression =
  | @pretty("$name")
    Var = { name : string }
  | Literal = (
    | @pretty("'true'") True
    | @pretty("'false'") False
    | @pretty("$value") Int = { value : int }
    )
  | @pretty("$left ' ' $operator ' ' $right")
    InfixOperation = { left : Expression ; operator : InfixOperator ; right : Expression }
  | @pretty("$operator $operand")
    PrefixOperation = { operator : PrefixOperator ; operand : Expression }
  | @pretty("'(' $expression ')'")
    ParenthesizedExpression = { expression : Expression }
and
  InfixOperator =
  | @pretty("'&&'") And      | @pretty("'||'") Or
  | @pretty("'+'") Add       | @pretty("'-'") Sub
  | @pretty("'*'") Mul       | @pretty("'/'") Div
  | @pretty("'=='") Equal    | @pretty("'!='") NotEqual
  | @pretty("'<'") LessThan  | @pretty("'>'") GreaterThan
and
  PrefixOperator =
  | @pretty("'!'") Not | @pretty("'+'") Plus | @pretty("'-'") Minus
and
  Type =
  | @pretty("'int'") IntType | @pretty("'bool'") BoolType

Figure 2.3: Data types with annotations
example is to plug in a function that resolves the declaration of a variable reference, and
highlights it in the program.
The structure editor has a different purpose. By selecting blocks or nodes, we can visualise
the connections to other nodes in the graph, that we have added by defining lazy edges
in JunGL. We have found this tool indispensable in the interactive development of new
refactoring scripts.
2.6 Further examples on While programs
Before moving to the precise semantics of the logical features in JunGL, we first illustrate
many of the features we have just introduced, on small concrete applications for While.
2.6.1 Binding and definite assignment checks
One of the most basic but useful compiler checks is to ban the use of variables that have not
been declared:
let checkBinding program =
  toList { ?x | [program] child+ [?x:Var] & ![?x] lookup [] }
in
Editor.addErrorFinder Program checkBinding
  "W01: not declared"
Given a program, the function checkBinding returns a list of such variables. We use a
stream comprehension to collect nodes of type Var that have no outgoing lookup edge. Then
we call the external function addErrorFinder to plug the checkBinding analysis into our editor.
Each undeclared variable will now be underlined with the error message “W01: not declared”.
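For intuition, the same check can be sketched outside JunGL. The following Python fragment is a hypothetical miniature, not part of the JunGL toolkit: plain tuples stand in for AST nodes, and a flat sequence of While statements is scanned for uses of undeclared variables.

```python
def check_binding(statements):
    """Return the names of variables used before any declaration.

    Statements are encoded as tuples: ("decl", name),
    ("assign", name, [variables read]) or ("print", [variables read]).
    """
    declared, errors = set(), []
    for stmt in statements:
        if stmt[0] == "decl":
            declared.add(stmt[1])
        elif stmt[0] == "assign":
            if stmt[1] not in declared:
                errors.append(stmt[1])        # assigned variable undeclared
            errors.extend(v for v in stmt[2] if v not in declared)
        elif stmt[0] == "print":
            errors.extend(v for v in stmt[1] if v not in declared)
    return errors
```

The JunGL version needs no such traversal code: the child+ path query and the lookup edge do the walking and the scoping.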
As a second example, we propose to check a common rule in modern languages: the
definite assignment rule, which requires each local variable to be assigned before it is used.
We start by defining two new edges. The use edges link statements or expressions to the
variables that are read during their execution. Conversely, the def edges relate statements or
expressions to the variables that are written there.
let edge use x:Expression → ?y = [x] child* ; lookup [?y]
let edge use x:Assignment → ?y = [x] expression ; use [?y]
let edge use x:Print → ?y = [x] expression ; use [?y]
let edge use x:If → ?y = [x] condition ; use [?y]
let edge use x:WhileLoop → ?y = [x] condition ; use [?y]
let edge def x:Assignment → ?y = [x] var ; lookup [?y]
In the toy language While, variables can only be written by an Assignment statement. Ex-
pressions are side-effect free: there are no such things as post-increment and post-decrement
operators. There are no function calls with reference parameters either. Therefore, the defi-
nitions of use and def for nodes of type Expression are straightforward. There are simply no def
edges from them, and the use edges point to the declarations of each variable occurring as a
descendant of the expression. Similarly, we also define these edges at the level of statements.
With those edges now defined, we can write the definite assignment rule as just one path
query:
let checkDefiniteAssignment program =
  toList { ?s | [program] child+ [?x:VarDecl]
                (local ?z : [?z] cfsucc & ![?z] def [?x])+ [?s]
                use [?x] }
in
Editor.addErrorFinder Program checkDefiniteAssignment
  "W02: is used without being assigned"
A statement ?s violates the rule if it uses a variable declared at ?x, and there is a control-
flow path from ?x to ?s where no intervening statement ?z defines the variable declared at
?x. Again, that path query is an almost direct translation of a definition one may find in a
compiler textbook.
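The rule also reads naturally as a graph search. As a sketch, here is the check in Python over an explicit control-flow graph rather than JunGL's lazy cfsucc edges; the node identifiers and the defs/uses maps are assumptions of this hypothetical encoding.

```python
from collections import deque

def definite_assignment_violations(cfsucc, decl, defs, uses, var):
    """Statements that may use `var` before it is assigned.

    cfsucc: node -> list of control-flow successors;
    defs/uses: node -> set of variables written/read there;
    decl: the node declaring `var`.
    """
    reached, frontier = set(), deque([decl])
    while frontier:
        node = frontier.popleft()
        for nxt in cfsucc.get(node, []):
            if nxt not in reached:
                reached.add(nxt)
                # a statement defining var kills the path beyond it
                if var not in defs.get(nxt, set()):
                    frontier.append(nxt)
    return sorted(s for s in reached if var in uses.get(s, set()))
```

In JunGL the same search is implicit in the (local ?z : ...)+ path expression, which matches the control-flow paths whose interior statements do not define the variable.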
2.6.2 Rename Variable
We now move on to defining our first refactoring transformation, namely Rename Variable for
the While language. Here, we mix the logical features of JunGL, to find elements of
interest and check preconditions, with the functional ML-like features, to perform destructive
updates on the graph. The full script is just half a page:
let renameVariable program node newName =
  (* Find the element of interest *)
  let dec = pick { ?d | [node] lookup [?d] | equals(node, ?d) } in
  (* Check preconditions *)
  if not dec is VarDecl then
    error "Please select a variable";
  if dec.name == newName then
    error "Please give a different name";
  let findFirst x =
    pick { ?y | [x] treePred+ [?y:VarDecl] &
                (newName == ?y.name | ?y == dec) } in
  let mayBeCaptured =
    { ?x | [program] child+ [?x:Var] & ?x.name == newName } in
  foreach x in mayBeCaptured do
    if findFirst x == dec then error "Variable capture";
  let needRename =
    { ?x | [program] child+ [?x:Var] lookup [dec] } in
  foreach x in needRename do
    if findFirst x != dec then error "Variable capture";
  (* Transform *)
  foreach x in needRename do
    x.name ← newName;
  dec.name ← newName
The renameVariable function takes three arguments: program, the root of the program on
which to perform the transformation; node, a node inside the program representing either a
variable reference or a variable declaration; and a string newName.
The first step of the refactoring is to find the main element of interest, that is the dec-
laration of the variable to rename. We use the built-in function pick that returns the first
element of a stream (or null if the stream is empty). If node is a variable reference with an
outgoing lookup edge, then dec is the declaration of that variable at the end of the lookup
edge. Otherwise, through the use of the binding predicate equals, we assume we were passed
the declaration itself.
We check that assumption on the following line, and raise an error in the case of an
incorrect user selection. We also check that newName differs from the current name of the
declaration.
We then go on to check the main precondition of the refactoring: the renamed decla-
ration should not conflict with any pre-existing declaration of the same name. Indeed,
the way we have defined the lookup edge for the While language allows a variable declaration
to be hidden by a later declaration. Although this is unusual, we
assume the following is a valid While program:
int i;
i = 0;
print(i);
int i;
i = 1;
print(i);
Resolving the variable reference i in the last statement finds the closest declaration of i when
climbing up the program, i.e. the one on the fourth line. Under these circumstances, we
need to be careful when renaming a variable. Consider the following example for instance:
int i;
i = 0;
int j;
j = 1;
print(i);
print(j);
If we simply renamed j to i , our resulting program would still be valid but its semantics
would have changed because name bindings would have changed. Indeed, the i in the first
print statement would now bind to the second declaration of i .
Therefore we need to check that the declaration, once renamed, is not going to capture
any existing variable that is used later, like in the above example. In addition, we also need
to check that no existing declaration will capture the renamed variable.
To handle these two cases of variable capture, we define a single function findFirst that,
given a node x, looks up the first declaration accessible from x that is either the declaration
we wish to rename or an existing declaration already called newName. Then, for the first
case, we consider all variable references that may be captured, because their name is already
the new name we wish to give, and we check that calling findFirst on each of them
returns their original declaration and not the declaration we wish to rename. Similarly, for the
second case, we compute all references to the declaration we are about to rename, and we
check this time that calling findFirst on each of them returns that same declaration (i.e. not
an existing declaration that would capture it).
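These two checks can be mimicked concretely on a flat program, where lookup simply finds the nearest earlier declaration. The Python sketch below is hypothetical (positions stand in for AST nodes) but mirrors findFirst and the two foreach loops of the script.

```python
def find_first(pos, decls, target, new_name):
    """Nearest earlier declaration that is the target or bears new_name."""
    cands = [p for (p, n) in decls if p < pos and (p == target or n == new_name)]
    return max(cands) if cands else None

def capture_free(decls, uses, binds, target, new_name):
    """decls and uses are (position, name) pairs; binds maps each use
    position to the position of its current declaration; target is the
    position of the declaration to rename."""
    for (p, n) in uses:
        # first case: an existing reference named new_name would now
        # bind to the renamed declaration
        if n == new_name and find_first(p, decls, target, new_name) == target:
            return False
        # second case: a renamed reference would bind to some other
        # declaration already named new_name
        if binds[p] == target and find_first(p, decls, target, new_name) != target:
            return False
    return True
```

On the six-line example above (declarations of i and j at positions 0 and 2), renaming j to i is rejected while renaming j to a fresh name is accepted.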
That completes the preconditions, and we can now safely perform the transformation itself.
This part of the code is more operational and hence less interesting. For each variable reference
in the stream needRename, we assign the fresh name newName. Of course, on the last line,
we must not forget to rename the declaration itself. In the end, because of potential variable
conflicts, Rename Variable is not so obvious even for a simple language like While. We shall
see in Chapter 6, however, that the same approach scales very well to much more complex
languages.
2.6.3 Slicing
We conclude this series of illustrative examples with a slightly more ambitious application:
program slicing. The concept of program slicing was originally introduced by Mark Weiser
[Wei84]. He claimed that a slice is the mental abstraction people make when they are debugging
a program. A slice consists of all the statements of a program that may affect the values
of some variables at some location of interest. Many applications were foreseen: debugging,
code understanding, reverse engineering and program testing, to list a few. Yet the use of
slicing for refactoring has only been suggested recently.
The Untangling refactoring we proposed earlier in [EV04, Ett06] indeed uses slicing. It
is like Extract Method, but instead of selecting a contiguous region of code, the programmer
selects a single expression. The tool then extracts the backward slice, namely the statements
that may have contributed to the value of that expression.
Slicing can be expressed elegantly in JunGL. More generally, one can define the Program
Dependence Graph [OO84, HRB90, HR92] via path queries, which in turn allows the correct
mechanisation of many different transformations that require reordering of statements.
A Program Dependence Graph is a graph whose nodes represent the statements of the
program like in the control-flow graph, but whose directed edges represent control depen-
dences, data dependences and structure dependences. We now give the definitions of these
three edges in JunGL.
The control dependence edge builds on the concept of post-dominance we have introduced
earlier, and on control-flow predecessors edges, the dual of cfsucc edges:
let edge cfpred x → ?y =
  first ([?y] cfsucc [x] | [x] parent ; entry [?y])
let edge controlDependentOn x:Statement → ?y =
  [x] postDominates* ; cfpred [?y] & !([x] postDominates [?y])
As we see, cfpred is not defined just as [?y]cfsucc[x]. With such a definition, the
first statement would not have any predecessor because we have not defined any successor
edges emanating from the entry dummy node. Of course, we could have decided to add that
successor edge instead.
In the second edge definition, x is control dependent on ?y if ?y is the control-flow
predecessor of x or of any of the statements x post-dominates, but x does not post-dominate ?y
itself. This typically happens when x is in the body of a while loop ?y (here, we identify
the while loop with its condition expression).
We now move on to data dependencies:
let edge dataDependentOn x:Statement → ?y =
  [x] use [?v] & [?y] def [?v] &
  [?y] (local ?z : cfsucc [?z] & ![?z] def [?v])* ; cfsucc [x]
Statement x is data dependent on ?y if x reads a variable ?v, ?y writes that same variable
?v, and there exists a control-flow path from ?y to x with no intervening definition of ?v.
Then, we define structure dependences such as the possible one between a statement and
its enclosing block, or the dependency between a statement and the declarations of variables
it reads and writes:
let edge structureDependentOn x:Statement → ?y =
  [x] parent [?y:Block]
  | [x] use [?y:VarDecl]
  | [x] def [?y:VarDecl]
Finally, we define an edge that covers all kinds of dependencies. It is just the union of
the three previous ones:
let edge dependentOn x:Statement → ?y =
  [x] controlDependentOn [?y]
  | [x] dataDependentOn [?y]
  | [x] structureDependentOn [?y]
It is now straightforward to obtain a slice of a program from a given statement s , as it is
just a well-known reachability problem on the Program Dependence Graph. The following
stream comprehension yields the set of statements composing the slice:
{ ?x | [s] dependentOn* [?x] }
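Operationally, this closure is an ordinary worklist traversal. A small Python sketch, with the dependence edges given explicitly as a map (an assumption of this encoding):

```python
from collections import deque

def slice_from(depends_on, seed):
    """Statements reachable from `seed` via dependence edges,
    i.e. the backward slice, seed included."""
    seen, frontier = {seed}, deque([seed])
    while frontier:
        node = frontier.popleft()
        for parent in depends_on.get(node, []):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen
```

In JunGL, of course, the traversal is never written out: the dependentOn* path query denotes exactly this reflexive-transitive closure.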
This approach is somewhat reminiscent of the relational approach to slicing explored by
Klint and Vankov in [Kli05, Van05] with RScript, a language based on relational calculus
for querying and analysing source code. In fact, here, we simply use JunGL edges to super-
impose the dependence graph of the program on top of its AST. Relations are thus used
to represent that graph, and transitive closure to compute the statements reachable from
a seed. Klint’s approach is slightly different, as relations are used to represent Kill/Gen sets
for the computation of reaching definitions (as originally proposed by Weiser), and recursion
to find a fixpoint for those relations.
2.7 Summary and references
In this chapter we have introduced all the different features of JunGL: functional features
for manipulating a program tree, lazy edges to superimpose contextual information on that
tree, and logical features to query the graph structure resulting from the combination of these
edges and the initial tree.
The benefit of functional features for the construction of compiler-like tools is well-known
[App98]. We support pattern matching, algebraic data types (in a specialised form), and
strict higher-order functions in the tradition of ML [MTHM97, LDG+04]. In addition, we
provide streams, which are lazily evaluated lists of the kind typically found in Haskell [Bir98].
Mechanising a refactoring requires finding elements in the code and checking static pre-
conditions. A common solution to collect that kind of information is to traverse the pro-
gram tree using different search strategies. That mechanism can either be expressed in
Haskell [Spi00, LV02], or provided as a built-in feature in a transformation system, e.g.
[BKVV06, BBK+07]. Often a context has to be propagated for complex static-semantic anal-
yses though. This is usually achieved by parameterising search strategies, as in [BMR07], or
via dynamic rewrite rules in rewrite systems [BvDOV06].
Our answer is to provide logical features, and to allow the use of predicates in stream com-
prehensions. JunGL therefore resembles earlier approaches to combining functional and logic
programming, e.g. [RS82, SS99]. In JunGL, one may construct predicates as path queries.
Different styles of path queries have been proposed for querying programs [dMLVW03, LRY+04].
Here we mainly reuse the syntax proposed in [LS06]. However, as we shall see in Chapter 4,
the new semantics we assign to them accounts for the order of logical matches.
Path queries are extremely powerful when querying graphs. JunGL provides an original
way of turning the initial syntax tree of a program into a graph that captures static-semantic
information, such as name binding or control flow. One may define lazy edges linking two
nodes in the tree (e.g. a variable reference and its declaration), which will be constructed
automatically when necessary. That mechanism is in a way similar to the use of reference
attribute grammars as in JastAdd [EH04].
Of course, when integrating destructive updates and declarative features, special care
is required to prevent non-termination and other unexpected evaluation behaviours. This issue
with mixing declarative queries and updates, long known as the Halloween problem in the
database community [Fit02], is usually addressed by implementing some kind of snapshot
semantics, as in recent XQuery extension proposals [CFM+08, CEF+08]. In JunGL, we
have not yet implemented such a semantics: it is the responsibility of the script’s author to
adequately combine declarative queries and imperative updates to the
program tree.
In this chapter, we have also briefly presented the toolkit around JunGL and described
its implementation on the .NET platform using both C# [SH04] and F# [Sym05]. Our im-
plementation is workable for quickly prototyping refactorings. One missing feature though is
the support for syntax definitions and GLR parsers [Tom87].
Finally, we have illustrated all the features of JunGL by defining a naive refactoring and
various static analyses for a toy language. In Chapter 6, we will show that our design actually
scales to similar tasks on mainstream languages. We now move on to presenting Datalog,
the database query language on which we have based the logical features of JunGL. As we
shall see, Datalog is an ideal candidate for querying program trees and graphs.
Chapter 3
Datalog
Datalog is a query language originally put forward in the theory of databases [GM78], which
drew a lot of interest in the eighties and early nineties. Datalog programs look syn-
tactically like Prolog programs, and several classes of programs have been characterised. The most
well-known of these classes has a simple declarative semantics, and consists of the safe Datalog
programs.
In this chapter, we introduce logic programs and the syntax of Datalog. We highlight
the requirements for a Datalog program to be safe, notably regarding the use of negation,
and explain the semantics for such programs by giving a simple evaluation strategy for safe
Datalog programs using relational algebra and least fixpoint computations. Then we present
ways of optimising the evaluation and different implementations. Finally, we discuss more
general classes of Datalog where the use of negation is less restricted. These classes admit
more expressive queries which, as we shall see later, are useful in the context of this thesis.
3.1 Logic programs and syntax of Datalog
Logic programs A logic program is a finite set of rules. Each rule has a head and a body.
These are written on both sides of the symbol ‘←’, which stands for reverse implication. A
head consists of one literal, while a body may contain several of them. Literals in a body can
appear either positively or negatively. They are also referred to as atoms or subgoals.
A literal is an n-ary predicate applied to an n-tuple of terms. It is written p(t1, . . . , tn)
or sometimes p(~t) for short.
A term can be a constant, a variable or a compound term (i.e. a function symbol with
other terms as arguments). A term without any variable is called a ground term and pred-
icates applied to tuples of ground terms are called ground atoms or facts. When a variable
occurs as an argument of a predicate in a positive atom, that variable is said to occur posi-
tively on the right-hand side and to be bound by that atom.
In the body of a rule, subgoals are separated with a comma that stands for logical ‘and’.
An empty body is equivalent to true.
To illustrate,
p(X ) ← a(X ), not b(X ).
is a rule in which p(X ) is the head, a(X ) is a positive subgoal, and b(X ) a negative subgoal.
It reads “p(X ) if a(X ) and not b(X )”.
Although all logic programs follow that same structure, there are two rather different
classes of logic programming languages. The first class consists of Turing-complete languages
close to the machine level, where subgoals are regarded as procedure calls and where control
is still very much given by the programmer. Prolog is the most famous representative of that
class [Llo87]. The second class consists of database query languages, such as Datalog, where
programmers have much less control over the execution of their programs. These languages
are therefore often regarded as more declarative.
Datalog programs A Datalog program is a logic program where each term is either a
variable, denoted with X , or a constant (for instance an integer). In contrast with Prolog,
compound terms such as lists are not allowed. It is hence not possible to match directly on
tree patterns.
Another major difference with Prolog is the fact that the order of atoms in the body of
a Datalog clause does not matter, and indeed, answers to a query are not expected to be
given in any deterministic order. We will revise that definition, however, when we introduce
an ordered variant of Datalog in the next chapters. Until then, we use the word ‘Datalog’ to
refer to the standard version where order does not matter.
By “standard version”, we do not mean pure Datalog, which in the literature refers to
definite programs, i.e. programs with Horn clauses only — a Horn clause is a clause with
no negative subgoals. Instead we mean here Datalog with negation, that is most suitable to
express an adequate range of queries.
We should also mention that, in the pure tradition of Datalog, disjunction is achieved
using multiple clauses with the same relation in the head, and the use of ‘;’ for logical ‘or’
is regarded as syntactic sugar. We use it for conciseness though. For instance, we shall
sometimes write
p(X ) ← a(X ); b(X ).
for
p(X ) ← a(X ).
p(X ) ← b(X ).
Similarly, we sometimes apply ‘not’ to a conjunction or a disjunction rather than just a
literal, since these negations can be distributed away.
In addition, we shall allow tests and binding equalities. These are not normally found in
pure Datalog, but can be modelled as special kinds of predicates. We use tests to filter out
logical matches, and binding equalities to bind a variable to a constant (or to a variable that
is already bound). To illustrate, the rule
p(X ) ← X = 0, X < 1.
binds X to the value 0, and tests that X is indeed less than 1.
Finally, all the variables occurring in the head of a rule are implicitly governed by a
universal quantifier. Equivalently, we use the convention that those variables that occur in
the body but not in the head are governed by an implicit existential. For example,
p(X ) ← q(X ,Y ).
is equivalent to
p(X ) ← (∃Y · q(X ,Y )).
or to
p(X ) ← q(X , _).
where ‘_’ stands for a don’t-care variable.
To summarise, Figure 3.1 shows the syntax of Datalog programs we consider.
⟨Program⟩   ::=  ⟨Rule⟩+
⟨Rule⟩      ::=  ⟨Literal⟩ ← [ ⟨Expr⟩ ] .
⟨Expr⟩      ::=  ⟨Literal⟩
             |   ⟨Expr⟩ , ⟨Expr⟩
             |   ⟨Expr⟩ ; ⟨Expr⟩
             |   not ⟨Expr⟩
             |   ∃ ⟨VariableName⟩ · ⟨Expr⟩
             |   ⟨Term⟩ = ⟨Term⟩
             |   ⟨Term⟩ ⟨Operator⟩ ⟨Term⟩
             |   ( ⟨Expr⟩ )
⟨Literal⟩   ::=  ⟨PredicateName⟩ ( ⟨Term⟩ , … , ⟨Term⟩ )
⟨Term⟩      ::=  ⟨VariableName⟩ | ⟨Constant⟩ | _
⟨Operator⟩  ::=  < | ≤ | ≥ | >

Figure 3.1: Syntax of Datalog programs
3.2 Semantics
The semantics of Datalog are explained by regarding predicates as relations defined by enu-
merating the tuples inhabiting them. If p is a predicate, there is a corresponding relation, say
P , such that the fact p(t1, . . . , tn) is true if and only if there is a tuple (t1, . . . , tn) in relation
P . The relation P is sometimes called the extension or interpretation of the predicate p.
In effect, Datalog is usually viewed as a language for defining a larger database from a
smaller one. It defines the contents of new relations based on the contents of the original
relations, in the end producing a single representation. The original relations are often
referred to as ‘extensional database’ or ‘EDB’ predicates, while the new ones are called
‘intensional database’ or ‘IDB’ predicates.
In the remainder of this chapter, we illustrate our explanations with a common example,
namely the transitive closure of a child relation. Its Datalog definition may be as follows:
descendant(X ,Y ) ← child(X ,Y ).
descendant(X ,Y ) ← child(X ,Z ), descendant(Z ,Y ).
In words, Y is a descendant of X if either Y is a direct child of X , or if X has a
direct child Z , which in turn has transitively a descendant Y . Here, descendant is an IDB
predicate, while child is an EDB predicate, whose interpretation shall be the set of pairs
{(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)} as depicted in Figure 3.2.
Figure 3.2: The sample child relation used in our examples
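To make the declarative reading concrete, the minimal model of this program can be computed by naive bottom-up iteration. A Python sketch over the sample relation of Figure 3.2 (set-based, so no particular order of matches):

```python
def descendants(child):
    """Least fixpoint of the two descendant rules over a child relation."""
    desc = set(child)                       # first rule: every child pair
    while True:
        # second rule: child(X, Z), descendant(Z, Y) => descendant(X, Y)
        new = {(x, y) for (x, z) in child for (z2, y) in desc if z == z2}
        if new <= desc:
            return desc                     # no new facts: fixpoint reached
        desc |= new

child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}
```

On the sample relation, the iteration adds (1, 3), (1, 4) and (2, 5) in a first round and (1, 5) in a second, then stabilises. This is exactly the least fixpoint computation described in the next subsection.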
3.2.1 Minimal models and least fixpoints
Interpreting a Datalog program is to assign a collection of facts to that program. Such a
collection is said to be a model for the program if whenever constants are substituted for the
variables, the rules become true. Although the same Datalog program may admit different
models, it is both intuitive and commonly accepted to define the meaning of a Datalog
program through a minimal model.
A model is minimal if any strict subset either is missing an EDB fact or fails to be a
model because there exists a substitution of constants for variables that makes the body of
some rule in the program true but the head false.
A Datalog program should have a well-defined minimal model in order to be assigned a
non-ambiguous meaning. It is well-known that Horn rules have a well-defined minimal model
that is the smallest model which contains all logical consequences of the rules. The existence
of that smallest model is indeed guaranteed by the Knaster-Tarski theorem. Before stating
that theorem, we shall define a few notions.
A fixpoint of a function f is any value x for which f (x ) = x . In the computation of a
program model, the function for which we compute a fixpoint is a step inference function,
commonly known as the immediate consequence operator, which takes an interpretation and
infers a new one with possibly new facts about the program. That function can be expressed
in terms of relational algebra operators; we describe how later in this section. Furthermore,
a fixpoint of a function that is included in every other fixpoint of that function is called the least
fixpoint. The least fixpoint of the inference function for a set of rules corresponds to the minimal
model for that set of rules.
A partially ordered set consists of a set together with a binary relation ⊆ that specifies, for
certain pairs of elements in the set, that one of the elements must precede
the other. A lattice is a partially ordered set in which every pair of elements has a least
upper bound and a greatest lower bound. The set of all interpretations, ordered by inclusion,
forms a complete lattice.
A function f is monotonic with respect to a partial order if, whenever x ⊆ y, f (x ) ⊆ f (y).
Typically, negation in a Datalog program is nonmonotonic but all other operations are. We
can now state the Knaster-Tarski theorem: if L is a complete lattice and f : L → L is a monotonic
mapping, then f has a least fixpoint.
Importantly, the existence of a least fixpoint is also guaranteed for monotonic mappings
on complete partially ordered sets (CPOs) and any finite partially ordered set is a CPO
[DP02]. This is crucial, as in the remainder of the thesis we shall work with finite partially
ordered sets that do not form a complete lattice.
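The resulting computation scheme is plain Kleene iteration: start from the bottom element and apply the monotonic function until the value stops changing. A generic Python sketch, assuming a finite domain so that termination is guaranteed:

```python
def lfp(f, bottom=frozenset()):
    """Least fixpoint of a monotonic function on a finite domain,
    computed by iterating f from the bottom element."""
    x = bottom
    while True:
        y = f(x)
        if y == x:
            return x        # f(x) = x: we have reached the fixpoint
        x = y
```

For a definite Datalog program, f would be the immediate consequence operator and the bottom element the empty interpretation.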
A definite Datalog program (that is, one with no negation) is just a composition of monotonic
operations, and therefore admits a least fixpoint, which is its minimal model. When the
rules of a program include negated subgoals, however, the minimal model of the program is
rarely well-defined.
For Datalog programs with negation, the database community therefore developed some
preferred models, based on the concept of negation as failure [Cla78]. Negation as failure
basically says that if a ground atom p cannot be proved, then it is allowed to infer not p.
Thus, if instantiated rules of a program with negation can be decomposed into modules that
do not mutually depend negatively on themselves, we can evaluate the minimal model of
these modules one at a time and give a precise meaning to a Datalog program with negation.
This condition depends both on the program and on the EDB input data, but it can be
approximated statically to depend on the program only and not on the data. Such a static
approximation is, together with more obvious syntactic constraints, what defines the class of
safe Datalog programs.
3.2.2 Safe Datalog
Datalog programs are safe if and only if the conditions below on range-restriction and strat-
ification are satisfied. Range-restriction is to guarantee that each computed IDB is finite,
while stratification refers to the static approximation mentioned above.
Range-restriction Every variable on the left-hand side of a clause must occur positively
on the right-hand side, and every variable on the right-hand side must occur positively. This
forbids definitions like
p(X ,Y ) ← q(X ).
which leaves Y unconstrained. It also rules out
p(X ) ← X < 0.
where X < 0 is not a literal but a test, and
p(X ) ← not q(X ).
Such queries would be undesirable because, to evaluate them, we would have to enumerate
an infinite set of integers: all those less than 0 in the first case, and all those for which q does
not hold in the second case. By contrast,
p(X ) ← not q(X ), r(X ).
is fine, because while q(X ) is negated, we also have a positive occurrence of X under r .
In the literature on Datalog, queries satisfying this criterion are often called range-
restricted.
Stratification Negation must not be used in recursive cycles. For instance, we wish to
avoid
p(X ) ← not p(X ).
because, again, such recursions do not have a least fixpoint.
Formally, this requirement can be stated in terms of the dependency graph between
predicates. The nodes in this graph are relations defined in the Datalog program. There are
two kinds of edges, positive and negative, defined as follows. When p has a clause where q
appears positively (not under a negation) on the right hand side, there is a positive edge from
p to q. If q appears negatively on the right-hand side of a clause for p, there is a negative
edge from p to q. We require that there are no cycles in the dependency graph that contain
a negative edge. Datalog programs that satisfy this property are called stratified because
there is an algorithm working in terms of ‘layers’ or ‘strata’ for evaluating such programs.
The idea is that when a program is stratified, we can find an order for the predicates so we
can evaluate a predicate p only after we have evaluated all predicates on which p depends
negatively. We shall detail that algorithm in an instant.
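The stratification condition itself is easy to check mechanically. In the Python sketch below, the dependency graph is encoded as explicit (p, q, negated) triples (an assumption of this encoding); a program is stratified when no negative edge lies on a cycle, i.e. when the target of a negative edge never reaches back to its source:

```python
def stratified(edges):
    """edges: (p, q, negated) triples, one per occurrence of predicate q
    in the body of a clause for p. Returns False iff some cycle in the
    dependency graph contains a negative edge."""
    def reaches(src, dst):
        # depth-first search for a path of length >= 1 from src to dst
        seen, stack = set(), [src]
        while stack:
            p = stack.pop()
            for (a, b, _) in edges:
                if a == p and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return dst in seen
    # a negative edge p -> q lies on a cycle iff q reaches p
    return all(not reaches(q, p) for (p, q, neg) in edges if neg)
```

For instance, the self-recursive clause p(X) ← not p(X) yields a negative self-edge and is rejected, while p(X) ← not q(X), together with a non-recursive definition of q, is accepted.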
Benefits of safe Datalog Assuming the primitive relations are finite, safe Datalog has a
number of highly desirable properties:
• All relations defined are finite, whether recursive or not.
• Recursion can be implemented with straightforward fixpoint iteration, and so the
declarative and operational semantics coincide. This fixpoint iteration always termi-
nates.
To appreciate the difference with Prolog, consider
p(X ) ← p(X ).
In Prolog, p(X ) is a non-terminating query. In Datalog, it just defines the empty relation,
because that is the smallest relation satisfying the above clause.
In the same vein, consider a variant of our transitive closure example, where descendant
appears before child in the second disjunct:
descendant(X ,Y ) ← child(X ,Y ).
descendant(X ,Y ) ← descendant(X ,Z ), child(Z ,Y ).
Using the standard goal-oriented SLD resolution [Llo87], the query descendant(1,Y )
would not terminate in Prolog because the left-to-right evaluation would loop forever on
the subgoal descendant(1,Z ). In Datalog, however, the above definition has precisely the
expected meaning, with no unpleasant surprises during the query evaluation. To overcome
this issue in Prolog and correctly evaluate the two examples above, a special technique called
tabled resolution has been proposed. We discuss it later in this chapter.
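The contrast can be illustrated with a small bottom-up evaluator. This Python sketch, using the sample child relation from the tables in this chapter, computes descendant by naive fixpoint iteration, so the query descendant(1, Y) terminates where Prolog's SLD resolution would loop:

```python
# Naive bottom-up evaluation of the descendant rules over the sample
# child relation used in this chapter.
child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}

def descendants():
    desc = set()
    while True:
        step = child | {(x, y) for (x, z) in desc for (z2, y) in child if z == z2}
        if step == desc:          # least fixpoint reached: always terminates
            return desc
        desc = step

desc = descendants()
# The query descendant(1, Y) simply selects tuples with first column 1.
print(sorted(y for (x, y) in desc if x == 1))   # [2, 3, 4, 5, 6]
```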
3.2.3 Mapping predicate calculus to relational algebra
We said earlier that the step inference function of a set of rules can be expressed in terms of
relational algebra operators. We illustrate informally here how such a mapping works. This
is particularly interesting as it highlights (together with the evaluation of strata to follow) a
simple operational semantics of Datalog programs.
The key observation is that, when predicates are represented as relations, each logical op-
erator in predicate calculus has a counterpart in set-based relational algebra. An introduction
to relational algebra can be found in any database book, e.g. [RG02].
For instance, the natural join is a counterpart of logical ‘and’. That is, if relations R and
S are the interpretations of predicates p and q respectively, then the natural join of R and
S , written R ./ S , is the relation representing the interpretation of the predicate p ∧ q. The
natural join is not a primitive operation in relational algebra, as it can be expressed using
cross-product (×), selection (σ), and projection (π). It is, however, a handy counterpart of
conjunction: it can be used even when two relations have no attribute in common; in the
case of p(x , y) ∧ q(z ), R ./ S is simply equivalent to the cross-product R × S .
Predicate calculus      | Relational algebra
p(X ,Y ) ∧ q(Y ,Z )     | R ./ S             (join)
p(X ,Y ) ∨ q(X ,Y )     | R ∪ S              (union)
p(X , 0)                | σ_Y=0(R)           (selection)
p(X ,Y ) ∧ ¬q(X ,Y )    | R − S              (set difference)
∃Y . p(X ,Y )           | π_X(R) ≡ π̄_Y(R)    (projection)

Table 3.1: Logical operators and their relational counterparts
Table 3.1 summarises informally the relational counterparts of each logical operator used
in Datalog. Relations R and S still stand for the interpretation of p and q respectively. Most
of the time, it is convenient to express projection in terms of the fields that are projected
out rather than the fields onto which the relation is projected. As the table shows, we use
the dual operator π̄ for that purpose.
It follows that rules in Datalog can be seen as mathematical functions expressed in terms
of relational algebra operators. For instance, the recursive descendant rule can be defined by
an equation of the following form:
Descendant = Child ∪ π0,2(Child ./1=0 Descendant)
Note that, for simplicity and to avoid issues with renaming, we refer to each column of a
relation via its index in the relation rather than via its name.
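The relational reading can be made concrete with set-based operators. In the Python sketch below (the helper names are invented for illustration), join concatenates whole tuples, so the join column appears twice and the projection uses index 3 where the text's merged-join notation uses index 2:

```python
# Set-based relational algebra, enough to interpret the Descendant
# equation. Columns are addressed by index, as in the text.
def join(r, s, i, j):                  # R ./(i=j) S, join column kept twice
    return {a + b for a in r for b in s if a[i] == b[j]}

def project(r, cols):                  # pi_cols(R)
    return {tuple(t[c] for c in cols) for t in r}

child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}

# Descendant = Child U pi(Child ./(1=0) Descendant)
descendant = set()
while True:
    step = child | project(join(child, descendant, 1, 0), (0, 3))
    if step == descendant:
        break
    descendant = step
print(len(descendant))   # 9 tuples, as in Table 3.2
```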
3.2.4 Evaluation of strata
We shall now describe how to lift the above function to a set of mutually dependent rules in
order to evaluate safe Datalog programs.
Stratification guarantees that, for any rule, every atom references a relation that is either
in a lower stratum or in the same stratum. Furthermore, relations in the same stratum can
only be referenced positively. Consequently, the grammar for the relational algebra expressions
corresponding to a safely stratified Datalog rule in stratum i is:
Ri ::= ∅ empty relation
| U universal relation
| Ri−1 another relation in a lower stratum
| Ri another relation in stratum i
| Ri × Ri cross product
| Ri ∪ Ri union
| πX1,..,Xk(Ri) projection
| σtest(Ri) selection with arbitrary test
| not(Ri−1) negation
Note that we do not mention the natural join in this grammar as it can be expressed with
the other operators. Also, we favour the not operator in place of the set difference, although
the two are interdefinable: R − S = R ./ (not(S )) and not(R) = U −R. The universal
relation U refers to a relation of a desired arity, say n. It contains all possible n-tuples one
can build with the domain D of values found in the EDB relations.
Strata of a safe Datalog program are strongly connected components of the predicate
dependency graph, and each stratum contains a number of mutually dependent predicates,
which we interpret as relational algebra expressions according to the grammar defined above.
A stratified program can hence be modelled as a list of strata [s0, . . . , sN ] sorted in topo-
logical order such that for any i and j , if a relation in si refers to a relation in sj then j ≤ i .
Furthermore, each stratum si consists of the relations {R1, . . . ,Rki}. We denote with ni,j
the arity of the j th relation in si . We can then model each individual Rj as the step function
that takes our current interpretations for all the relations in the stratum, and returns a new
interpretation for Rj :
f_Rj : P(D^ni,1) × · · · × P(D^ni,ki) → P(D^ni,j)
In order to compute f_Rj(X1, . . . , Xki), we simply interpret the relational algebra primitives
in the usual way over sets of tuples of values. Due to stratification, the functions f_Rj are
monotonic.
Now we are in a position to lift this to define the step function fi of the entire stratum
si . For brevity, we write X = (X1, . . . ,Xki).
fi : P(D^ni,1) × · · · × P(D^ni,ki) → P(D^ni,1) × · · · × P(D^ni,ki)
fi(X ) = (f_R1(X ), . . . , f_Rki(X ))
Each step function fi is monotonic since each of its components is monotonic. Moreover,
its domain and codomain coincide. By the Knaster–Tarski theorem, it has a least fixpoint.
Consequently we define
[[si ]] = lfp(fi) (3.1)
However, the value of [[si ]] depends on the values [[sj ]] for previous strata (j < i), so the
computation must start with s0. In other words, we first compute the relations denoted by
the bottom level (containing extensional predicates), and continue upwards, evaluating
[[si ]] = lfp(fi ) only after all [[sj ]] with j < i have been computed, so that the denotations
of relations in lower strata are available to fi . After evaluating this for all strata, we get a
model for our Datalog program, which is its meaning.
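As a sketch of this stratum-by-stratum evaluation, the following Python fragment (a hypothetical example program, not JunGL) computes a lower stratum to a fixpoint before applying negation in the stratum above:

```python
# Stratum-by-stratum evaluation: a relation using negation is computed
# only once the negated relation is complete. Example program:
#   reachable(X) <- source(X).
#   reachable(Y) <- reachable(X), edge(X, Y).
#   unreached(X) <- node(X), not reachable(X).
node = {(1,), (2,), (3,), (4,)}
edge = {(1, 2), (2, 3)}
source = {(1,)}

# Stratum 1: least fixpoint for reachable.
reachable = set()
while True:
    step = source | {(y,) for (x,) in reachable for (x2, y) in edge if x == x2}
    if step == reachable:
        break
    reachable = step

# Stratum 2: negation applied to the now-complete lower stratum.
unreached = {t for t in node if t not in reachable}
print(sorted(unreached))   # [(4,)]
```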
3.3 Evaluation strategies
In the evaluation of strata given above, the different strata of the clause dependency graph are
evaluated one at a time in topological order, starting from the lowest strata — the ones that
refer to extensional predicates only — up to the highest stratum, where the query predicate lies.
Such an evaluation strategy is said to be bottom-up. By contrast, a strategy is said to be
top-down if it starts from the query itself, using for instance a goal-oriented strategy that
resembles the SLD resolution found in Prolog implementations. A benefit of top-down
resolution is that it usually infers only the facts that are actually needed to answer the
query, whereas a bottom-up evaluation usually infers facts that are irrelevant and indeed
ignored in the computation of the final answer. In this section, we detail the two opposing
approaches, illustrate the issue of computing irrelevant facts and describe two techniques to
bridge the gap between them.
3.3.1 Top-down vs bottom-up
Bottom-up iterative computation on relations is the implementation strategy that directly
follows from the least fixpoint semantics of safe Datalog and the observation that each pred-
icate can be represented as a finite relation which simply consists of the tuples satisfying
that predicate. The main strength of the bottom-up approach is precisely that it works
with relations. It can therefore benefit from efficient implementations of relational opera-
tions, notably hash joins. However, it is well-known that the performance of such an
approach is usually impeded by the unnecessary computation of irrelevant facts during the
query evaluation [BMSU86, Vie86].
The unnecessary computational overhead has two origins. The first one is in the com-
putation of a fixpoint for each stratum of the clause dependency graph. Indeed, at each
iteration of a naive fixpoint computation, we do not only infer new facts but also all facts
that have already been inferred in previous iterations. Table 3.2 shows an example of such
an expensive redundancy.
This kind of overhead can easily be overcome for linear recursive rules (i.e. rules that have
at most one recursive call in their body) using a less naive iteration, known as a seminaive
fixpoint computation. At each iteration step, rather than taking the full currently inferred
relation, we can restrict the input of the step function to the only facts that were freshly
Iteration | Input | Output
1 | ∅ | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}
2 | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)} | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5), (1, 3), (1, 4), (2, 5)}
3 | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5), (1, 3), (1, 4), (2, 5)} | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5), (1, 3), (1, 4), (2, 5), (1, 5)}
4 | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5), (1, 3), (1, 4), (2, 5), (1, 5)} | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5), (1, 3), (1, 4), (2, 5), (1, 5)}

Table 3.2: Input and output of the step function at each iteration of the naive fixpoint computation for the rule descendant.
inferred at the previous iteration. While in a naive fixpoint computation, the same facts are
inferred again and again, only new deductions are made at each step of a seminaive one.
Table 3.3 shows the effect of this optimisation. The computation stops when no new fact can
be inferred.
Iteration | Input | Output
1 | ∅ | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}
2 | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)} | {(1, 3), (1, 4), (2, 5)}
3 | {(1, 3), (1, 4), (2, 5)} | {(1, 5)}
4 | {(1, 5)} | ∅

Table 3.3: Input and output of the step function at each iteration of the seminaive fixpoint computation for the rule descendant.
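The seminaive iteration can be sketched in a few lines of Python; the successive delta sets match the outputs of Table 3.3:

```python
# Seminaive fixpoint for descendant: at each step, only tuples newly
# derived in the previous step (the 'delta') feed the recursive rule.
child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}

desc = set(child)          # iteration 1 output
delta = set(child)
while delta:
    new = {(x, y) for (x, z) in delta for (z2, y) in child if z == z2}
    delta = new - desc     # keep only genuinely new facts
    desc |= delta
print(len(desc))           # 9
```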
The second origin of the computational overhead is more inherent to the bottom-up
approach. To illustrate, consider the following query:
q(Y ) ← X = 2, descendant(X ,Y ).
In the bottom-up framework, the whole descendant relation is computed before retaining
only those pairs whose first element is 2. This clearly leads to unnecessary computations
since, in this particular query, it would be sufficient to just compute the descendants of node
2 as Figure 3.3 suggests.
A way to reduce the number of useless computations is to adopt a top-down approach.
One notable solution is to embed Datalog in a Turing-complete logic programming language
like Prolog. We have already mentioned however that the standard goal-oriented SLD reso-
lution does not ensure correct results, as it may lead to non-termination, by trying to solve
recursively the same subgoal again and again. To overcome that problem, it is crucial that
tabled resolution is used [War92].
The main idea of tabled resolution, sometimes also called tabling for short, is to memorise
intermediate subgoals and their answers that have been computed previously. More precisely,
if a subgoal is identical to or subsumed by a previous one, it is solved using answers computed
for the previous subgoal, instead of re-evaluating the rules of the program. This makes
Figure 3.3: Smallest set of nodes that need to be considered for solving the query q(Y ) ← X = 2, descendant(X ,Y ).
the evaluation of Datalog queries finite and avoids redundant computation due to repeated
subgoals in the search space. Such tabling contrasts with a bottom-up approach in the
sense that it is a tuple-at-a-time resolution method while the latter admits a set-at-a-time
resolution. Although in the end the effect is the same as a fixpoint computation, when
the subgoals of a rule are correctly ordered, many fewer irrelevant facts are computed with
tabling. For instance, by applying a left-to-right top-down resolution to our sample query, X
is first bound to node 2 and descendants are then computed from that node only. However,
if the subgoals of a rule are not correctly ordered, the search space may become very large.
As in any top-down approach, the efficiency of tabling is highly sensitive to predicate ordering.
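A simplified memoised evaluator conveys the idea. This is plain memoisation, which is sound here only because child is acyclic; full tabled resolution also handles cyclic relations:

```python
# Memoised top-down evaluation of descendant(2, Y): only subgoals
# reachable from the query are ever solved, unlike a bottom-up run.
# Assumes child is acyclic (a tree); proper tabling would compute a
# fixpoint per subgoal and also cover cyclic relations.
child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}
table = {}                       # subgoal -> set of answers

def descendants_of(x):
    if x in table:               # tabling: reuse previous answers
        return table[x]
    table[x] = set()             # seed entry breaks repeated subgoals
    answers = {y for (a, y) in child if a == x}
    for y in set(answers):
        answers |= descendants_of(y)
    table[x] = answers
    return answers

print(sorted(descendants_of(2)))   # [3, 4, 5]
print(sorted(table))               # [2, 3, 4, 5]: nothing about node 1
```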
In conclusion, the advantage of the bottom-up approach is that it works with sets, while
the advantage of the top-down approach is that it may infer fewer facts, because the context
of the query is propagated inside calls.
Two techniques have been proposed to make the two approaches converge. One, known
as the Query-Subquery approach, is a set-based top-down resolution method. The other,
called the magic-set transformation, mimics the behaviour of the top-down approach in a
bottom-up framework.
3.3.2 Query-Subquery and magic sets
It is conventionally accepted that, in order to optimise a relational algebra query, selection
should be done before joins. Indeed, by pushing selection as early as possible in the evaluation
of a query, the relations that are manipulated later usually get smaller thus leading to better
performance. The Query-Subquery approach and the magic-set transformation technique
follow that idea and extend it to recursively defined programs.
Both methods rely on the choice of a sideways information-passing strategy, also called a
sip strategy. This determines the context of each literal, i.e. the formulas that are evaluated
before that literal. The most common sip strategy is the left-to-right one, where the context
of each literal consists of all the formulas that appeared on the left of the literal. The sip
strategy basically indicates the flow of data between predicates.
The Query-Subquery approach The Query-Subquery approach (QSQ) [Vie86] uses the
framework of SLD resolution, but a set at a time, thus enabling optimised relational algebra
operations. The idea is to constrain each predicate call by propagating bindings from one
atom to the next with respect to a given sip strategy. Each IDB literal of the original
program is adorned with a pattern to indicate which of its variables are considered bound
by its context, i.e. by the part of the rule that is evaluated before the literal. For instance,
following a left-to-right sip strategy, we adorn our sample descendant query as follows:
descendantbf(X ,Y ) ← child(X ,Y ).
descendantbf(X ,Y ) ← descendantbf(X ,Z ), child(Z ,Y ).
qf(Y ) ← X = 2, descendantbf(X ,Y ).
The pattern bf on the rule descendantbf means that it is used in a context where its first
argument is bound and its second argument is free. We shall use the general notation Rγ
where R is the name of the adorned rule, and γ is the actual pattern of the rule, consisting
of as many bs and fs as the arity of R.
From that initial adornment, each rule Rγ is assigned a set of additional temporary
relations, which do not appear in the program but are used during the evaluation. These
supplementary relations, of the form sup Rγ_k, identify for each position k in the rule body
the interesting variable bindings. Interesting variables are the ones already bound at the
respective position, and either used in the remainder of the body or in the head.
To illustrate, the adorned rules above have the following supplementary relations. Note
that we distinguish a supplementary relation of the first disjunct of descendantbf from one
of the second disjunct by adding a prime to the latter:

descendantbf(X ,Y ) ← child(X ,Y ).
  sup descendantbf_0(X ) before child(X ,Y ), and sup descendantbf_1(X ,Y ) after it.

descendantbf(X ,Y ) ← descendantbf(X ,Z ), child(Z ,Y ).
  sup descendantbf_0'(X ), sup descendantbf_1'(X ,Z ) and sup descendantbf_2'(X ,Y ) at positions 0, 1 and 2.

qf(Y ) ← X = 2, descendantbf(X ,Y ).
  sup qf_0, sup qf_1(X ) and sup qf_2(Y ) at positions 0, 1 and 2.
The relations sup descendantbf_0(X ), sup descendantbf_0'(X ) and sup qf_0 represent the known
context at the entry of their corresponding rule body. The relation sup descendantbf_1'(X ,Z )
stores the context between descendantbf(X ,Z ) and child(Z ,Y ). Similarly, sup qf_1(X ) stores
the context between X = 2 (which can be seen as an EDB relation) and descendantbf(X ,Y ).
Finally, sup descendantbf_1(X ,Y ), sup descendantbf_2'(X ,Y ) and sup qf_2(Y ) store the final
context of their corresponding rule. Observe that sup descendantbf_2'(X ,Y ) has no reference
to Z because Z is no longer needed. For the same reason, sup qf_2(Y ) only refers to Y since
X is not an argument of the query.
In addition, each set of rules Rγ is assigned two relation variables: inst Rγ (whose arity
is the number of bs in γ) and ans Rγ (whose arity is the same as R). Relation inst Rγ stores
the global input of Rγ , while ans Rγ stores the final result of Rγ .
A subquery for Rγ is then run as follows. Let T be the tuples in inst Rγ. Add T to
sup Rγ_0. At each position k , if the next atom is an EDB relation E , join sup Rγ_k with E and
store the result in sup Rγ_{k+1}. Otherwise, if the next atom is an IDB relation I δ, add to inst I δ the
new tuples found in sup Rγ_k, then join sup Rγ_k with ans I δ and store the result in sup Rγ_{k+1}. When the
final position n is reached, add sup Rγ_n to ans Rγ.
Initially, all inst Rγ and ans Rγ relations are empty, except for the input of the query
which is set to true. The final answer is computed by running all subqueries in turn until a
fixpoint is reached for all inst Rγ and ans Rγ .
For instance, on our above example with the sample child relation of Figure 3.2, three
iterations are needed. During the first iteration, qf is entered with the context true, so
sup qf_0 (whose arity is 0) is also true. Joining true with the EDB-like relation X = 2 re-
sults in sup qf_1(X ) having a single tuple (2). This tuple is added to inst descendantbf, but
ans descendantbf being so far empty, we conclude that sup qf_2(Y ) is empty, and ans qf too.
In the second iteration, inst descendantbf now contains the tuple (2). Consequently, when we
process the first disjunct of descendantbf, we add (2, 3) and (2, 4) to sup descendantbf_1(X ,Y )
and to ans descendantbf. Then, when we process the second disjunct of descendantbf,
we now also get (2, 3) and (2, 4) in sup descendantbf_1'(X ,Z ), and therefore obtain (2, 5)
in sup descendantbf_2'(X ,Y ), which we add to ans descendantbf as well. Back to the query
rule again, we hence get all tuples (3), (4) and (5) in ans qf. Finally, in the third iteration,
we find that inst Rγ and ans Rγ are stabilised, and that we have therefore computed
the final solution for ans qf.
In this example, and for any other definite program, the scope of the fixpoint iteration can
be the whole set of rules. For stratified programs, however, we need to compute the fixpoint
after each call to a rule in a lower stratum, in order to ensure the evaluation of a relation is
complete when we take its complement. We shall refer to this set-based top-down evaluation
method when we explore the evaluation mechanism of JunGL queries in Chapter 5.
Magic sets The idea of the magic-set transformation [BMSU86, BR87] is to express the
ingredients of a top-down approach in Datalog itself, by rewriting the original program to
make the context explicit. Starting from the same adornment as in the Query-Subquery
approach, the context of each IDB atom is isolated into a magic relation and added to it as
a filter. For instance, the magic version of our example is:
descendantbf(X ,Y ) ← magic descbf(X ), child(X ,Y ).
descendantbf(X ,Y ) ← magic descbf(X ), descendantbf(X ,Z ), child(Z ,Y ).
magic descbf(X ) ← X = 2.
qf(Y ) ← X = 2, descendantbf(X ,Y ).
As in any top-down approach, where the efficiency of a query can be improved by
reordering its subgoals, the efficiency of the magic-set technique depends on the sip strategy
that is used. Although the transformed program contains more joins, it succeeds in restricting
the computation of descendants to those of node 2 only, thus mimicking a top-down resolution.
If we were to find the descendants of a leaf in a deep tree, that transformation would boost
the query considerably.
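Evaluating the transformed program bottom-up, the magic relation filters every derivation, as the following Python sketch shows. The relation names follow the example above; the code is illustrative, not an implementation of the transformation itself:

```python
# Bottom-up evaluation of the magic-set transformed program: the
# magic filter restricts descendant^bf to tuples whose first column
# is a relevant binding (here, only node 2).
child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}
magic = {(2,)}                                  # magic_desc^bf(X) <- X = 2.

desc = set()
while True:
    rule1 = {(x, y) for (x,) in magic for (x2, y) in child if x == x2}
    rule2 = {(x, y) for (x,) in magic
                    for (x2, z) in desc if x == x2
                    for (z2, y) in child if z == z2}
    step = rule1 | rule2
    if step == desc:
        break
    desc = step
print(sorted(desc))   # [(2, 3), (2, 4), (2, 5)] -- no facts about node 1
```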
In fact, magic sets try to achieve statically what the Query-Subquery technique does dy-
namically. The transformation is, however, not always optimal. It may sometimes introduce
unsafe recursion into an originally safely stratified program, and the cost of breaking these
unwanted cycles is the computation of more irrelevant facts. This actually highlights the
expressiveness limitation of safe Datalog. We shall discuss less restrictive classes of Datalog
programs, after briefly mentioning some existing implementations of safe Datalog.
3.3.3 Existing implementations
Safe Datalog can be implemented in a variety of ways. Most implementations either adopt a
bottom-up approach that manipulates sets (thus favouring magic sets over the Query-Subquery
approach), or follow the top-down tuple-at-a-time route with memoization.
Bottom-up implementations Early bottom-up implementations of Datalog were pro-
posed as part of deductive database systems, e.g. LDL [TZ86], Glue-Nail [PDR91], and
CORAL [RSSS94]. These systems have partly focused on complementing purely declarative
languages with some imperative constructs for manipulating relations, but they do support
various program transformations proper to fully declarative languages, such as magic sets and
the pushing of projections, to optimise queries.
Not surprisingly, an important design decision in implementing the bottom-up approach
is the choice of a representation for sets. One obvious route is to delegate that part to a
relational database. That way, we can easily leverage a scalable and persistent backend. EDB
relations are simply stored in the database, and Datalog queries compiled to procedural SQL.
This strategy is the one used in the code querying system CodeQuest [HVMV05, HVdM06].
CodeQuest implements a limited version of the magic-set transformation, named closure
fusion, that aims at optimising transitive closure only.
Another bottom-up implementation strategy is to represent relations via binary decision
diagrams (BDDs). The work of John Whaley and Monica Lam has demonstrated that such
a Datalog implementation is particularly suitable for evaluating queries that correspond to
advanced dataflow analyses [WACL05]. Indeed, the relations involved in a whole-program
dataflow analysis are sometimes so big that they cannot be efficiently manipulated by a
standard database system. By contrast, a BDD is a compressed data structure that can
efficiently represent a large relation and BDD operations take time proportional to the size
of that compressed data structure, not to the number of tuples in the relation.
Top-down implementations XSB is perhaps the best-known example of a logic
programming system that offers this alternative approach to deductive databases [SSW94]. It
extends Prolog's SLD resolution with tabling, and in fact also adds a scheduling
strategy and delay mechanisms. The whole resolution method is known as SLG [CW96] and
can handle not only stratified Datalog but also more general logic programs that we discuss
now.
3.4 General logic programs
The stratification criterion of safe Datalog programs is quite strong, and unfortunately, there are
common and natural examples of queries that one cannot express in safe Datalog.
Perhaps the most celebrated example in the Datalog literature is a predicate inspired
by a stalemate game:
win(X ) ← move(X ,Y ), not win(Y ).
The rule says that X is a winning position if there is a move from X to Y and Y is
a losing position. It is not statically stratified because of the negative literal not win(Y ).
Nonetheless, if the relation move is acyclic, any query about win has a unique least model.
To illustrate, consider the domain of positions V = {1, 2, 3} and let move be the following
acyclic relation:
1 → 2 → 3 (with an additional move from 1 to 3), that is, move = {(1, 2), (1, 3), (2, 3)}
A way to resolve the query win(X ) is to instantiate the rule in all possible ways with the
positions of our domain (and remove known false subgoals). That is:
win(1) ← move(1, 2), not win(2).
win(1) ← move(1, 3), not win(3).
win(2) ← move(2, 3), not win(3).
As we see, this set of instantiated rules is now correctly stratified and admits a well-
defined model. A program that can be instantiated that way into a stratified set of rules
is said to be locally stratified [Prz88]. Whether a program is locally stratified depends on
the data in its EDB predicates and therefore cannot be decided by looking at the program.
In opposition to static stratification, we say that local stratification is a dynamic criterion.
Clearly, if a program is statically stratified, then it is locally stratified.
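Since move is acyclic, the intended model can be computed by settling positions in reverse topological order, as in this Python sketch (the move relation is read off the instantiated rules above):

```python
# win(X) <- move(X, Y), not win(Y): evaluable despite the negative
# recursion because move is acyclic. Positions are processed in
# reverse topological order, so win(Y) is settled before it is negated.
move = {(1, 2), (1, 3), (2, 3)}
positions = [3, 2, 1]                # reverse topological order of move

win = set()
for x in positions:
    if any((x, y) in move and y not in win for y in positions):
        win.add(x)
print(sorted(win))   # [1, 2]: both 1 and 2 can move to the losing position 3
```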
However, it is sometimes possible to guarantee that a program will be locally stratified by
imposing some conditions on the database. Here for instance, the win query would actually
be locally stratified for any data as long as the move relation is acyclic. As we shall show
later in this thesis, it frequently happens that a program is not statically stratified but is
locally stratified given that some of the EDB relations it refers to are well-founded.
The local stratification criterion is a bit fragile though, because it does depend on the
structure of the program. Consider a variant of the above rule that is expected to have the
exact same model:
win(X ) ← play(X ,Y ), not win(Y ).
play(X ,Y ) ← move(X ,Y ).
If we instantiate that program in all possible ways, knowing initially nothing
about play, we end up with a set of instantiated rules, part of which is:
win(1) ← play(1, 2), not win(2).
win(2) ← play(2, 1), not win(1).
That part is not stratified, as win(2) depends negatively on win(1) and vice versa. However,
it suffices to compute first the minimal model of the module that deals with the play rule to
realise that some subgoals, like play(2, 1), are false. These subgoals can then be pruned away
from the rule win in order to obtain a correctly stratified set of instantiated rules again.
Kenneth Ross made this observation and proposed a new class of Datalog programs with
negation, called modularly stratified programs [Ros94]. A program is modularly stratified if
and only if its mutually recursive components are locally stratified. Naturally, if a program
is locally stratified, then it is modularly stratified. Again, like for local stratification, we can
impose some restrictions on the EDB relations to guarantee that a program is modularly
stratified. We shall come back to the concept of modular stratification, which is more robust
than local stratification, in Chapter 5.
To conclude our brief overview of the different classes of general logic programs, we should
mention that other semantics have been proposed to deal with any general logic program with
no restriction whatsoever, notably the well-founded semantics [vRS91] and the stable model
semantics [GL88]. These two semantics are three-valued semantics: literals may be true,
false or undefined. A notable point is that, when a program has a total semantics (i.e. a
model where every fact is either true or false), the well-founded and the stable model semantics
coincide, and that happens for a larger class of programs than the class of modularly stratified
programs. The diagram in Figure 3.4 taken from [Ull94] summarises the containment of all
semantics classes for Datalog programs.
Figure 3.4: Containment of the different classes of Datalog programs: no negation ⊂ statically stratified ⊂ locally stratified ⊂ modularly stratified ⊂ two-valued well-founded semantics, itself contained in both the stable and the well-founded semantics.
3.5 Summary and references
In this chapter, we have presented Datalog [GM78] with an emphasis on the class of (stati-
cally) stratified programs, which has a clear least fixpoint semantics.
Datalog can be embedded in a Turing-complete logic programming system, such as XSB
[SSW94, CW96], where subgoals are regarded as top-down procedure calls and treated one
by one. In such a case, one shall not use the standard SLD resolution of Prolog [Llo87], which
may lead to non-termination and redundant computations, but instead resort to tabled
resolution to avoid the infinite expansion of the search tree [War92].
Another implementation route relies on the fact that predicates can be seen as relations,
and logical operators as relational algebra operations [RG02]. The computation in that
case proceeds bottom-up treating one recursive stratum after another, each stratum being
a set of recursive rules that do not depend negatively on themselves. This is the approach
taken in many systems, e.g. [TZ86, PDR91, RSSS94, HVMV05, WACL05]. That approach,
however, may suffer from the unnecessary computation of irrelevant facts during the query
evaluation. The magic-set transformation is a well-known technique that tries to overcome
that problem [BMSU86, BR87]. The idea is to rewrite Datalog programs to materialise the
querying context of each predicate. The bottom-up resolution method with magic sets was
shown to be more efficient for definite programs than the tuple-at-a-time top-down approach
[Ull89].
An alternative, called the Query-Subquery approach [Vie86], is to evaluate programs top-
down but a set at a time, thus enabling optimised relational algebra operations. The idea is
similar to that of magic sets but the calling context of predicates is propagated at runtime.
We shall actually see in Chapter 5 that logical parts of JunGL scripts are evaluated using a
variant of that technique.
We have also noted in this chapter that some queries cannot be expressed in stratified
Datalog. The database community has introduced larger classes of Datalog programs with
negation [Ull94], namely the class of locally stratified programs [Prz88] and of modularly
stratified programs [Ros94], as well as the more general well-founded semantics [vRS91] and
stable model semantics [GL88]. A relevant class in the context of this thesis is that of modu-
larly stratified programs. Modular stratification can in the general case only be determined
at runtime since it depends on the input of the program. Nonetheless, strong enough con-
ditions on EDB relations, such as acyclicity, can guarantee the modular stratification of a
program.
We now turn to describing precisely how logical features translate to relational equations
reminiscent of the set-based evaluation of Datalog. As we shall see, the rationale of returning
results in a meaningful order will lead us to depart from the usual Datalog semantics
and to introduce an ordered variant of Datalog that works over sequences rather than sets.
Chapter 4
Ordered semantics of the logical
features
Our language is a functional language in the style of ML with embedded logical features.
The functional constructs have the expected ML semantics, but the evaluation of predicates
differs from that of usual logical languages. It is not Prolog-like, since queries are guaranteed to
terminate (unless one calls a non-terminating or impure function as a non-binding test inside
a query). It is not quite normal Datalog either as we have the additional requirement of
maintaining results in a sequential order. We discuss in this chapter the rationale behind
that requirement, and introduce a novel variant of Datalog which operates over duplicate-free
sequences rather than sets. We then give the semantics of the logical features in JunGL by
translating predicates, edges and path queries constructs to this ordered variant of Datalog.
4.1 Why order matters
Programs are ordered trees. A block is a list of statements; a method has a list of parameters.
The order of statements in a block encodes the meaning of the program, and that order
obviously needs to be maintained during behaviour-preserving transformations. For import
clauses or class members, the order does not encode any meaning, and a permutation of such
elements would not change the behaviour of the program. In the context of a source-to-source
transformation tool, however, preserving the layout of the original program as far as possible is crucial.
The order in which elements occur in the code appears in fact to be relevant in almost all
cases.
Nevertheless, that order, also known as the document order in the XML community,
could be reconstructed at the end of each query. It is indeed straightforward to work out an
appropriate indexing scheme to rebuild an ordered tree from a flat set of elements. Each query
could be internally evaluated as a set, and results would then be returned in the document
order.
The problem is that, often, the document order is not the order intended by the user. If
we look at one of our first edge definitions again, we realise the intent there is to find the
‘closest’ matching variable declaration:
let edge lookup r : Var → ?dec =
    first ([r] treePred+ [?dec:VarDecl] & r.name == ?dec.name)
By ‘closest’, we mean the element that is reachable with the minimum number of iteration
steps when navigating along a particular edge (here the treePred edge). In that case, the
indexing scheme approach would not have worked out properly. The whole predicate and
path queries evaluation mechanism ought to preserve results in an order that is intuitive to
the user. In the remainder of this chapter, we explain precisely what the result order is, and
how it is computed.
4.2 Duplicate-free sequences
The idea for encoding the order is to base the semantics of the logical features in JunGL
on relational operations over sequences of tuples that do not contain duplicates. In this
section, we first introduce some notations and functions related to duplicate-free sequences
and formally define the relational operators over these sequences.
4.2.1 Notations
Tuples We consider n-tuples over a finite domain of elements D. Each n-tuple is of the
form t = (x1, . . . , xn) ∈ Dn .
We use the notation X to denote all the columns of an n-tuple, and X.i to refer to its i-th
column. In addition, we shall use {X1, . . . , Xk} to denote an arbitrary set of columns among
X. In that case, each Xi (with i ≤ k ≤ n) is a unique reference to a column in X (e.g. X1
could refer to the last column of a 4-tuple).
Sequences A sequence sn = 〈t0, . . . , tN−1〉 is an ordered set of n-tuples. As in sets,
duplicates are not allowed, and we shall therefore represent a sequence by a total injective
function:
sn : [0 .. N − 1]→ Dn
The arity n of each tuple is the arity of the sequence, while N is the finite length of the
sequence that we also write |sn |. If N = 0, sn is the empty sequence that we shall write ε.
We use X sn to refer to the columns of sn . Naturally, we have |X sn | = n. Furthermore,
we refer to the range of a sequence sn with the usual notation, ran(sn), and by definition,
we also have |ran(sn)| = |sn | = N .
Finally, we use Seq to denote any set of sequences, and we write seq S for the set of all
sequences built from the elements of the set S . Notably, the set of all sequences of any length
over Dn is seq Dn . We also use seq kS to refer to the set of all sequences of at most k elements
in S . In particular, seq 1Dn is the set of all sequences over Dn of at most one tuple.
Some utility functions We introduce, for later use in the thesis, a function to turn a
sequence into a set:
setify : seq Dn → PDn
setify(sn ) = ran(sn)
We shall also need a function head that takes the head element of a non-empty sequence:
head : seq Dn \ {ε} → Dn
head(sn) = sn(0)
Haskell provides a similar function on lists. In fact, Haskell is ideal for expressing manipulation
of lists, and we shall hence use its model in our coming definitions for the sake of
readability. Note also that for brevity we sometimes omit the arity subscript of a sequence
and refer to sequences just with r or s .
4.2.2 Relational operations
For each standard relational operator on sets, we seek to define in Haskell an equivalent
operator on duplicate-free sequences. We assume a type Column for column references and
a type Tuple for tuples, as well as two basic functions to drop some columns of a tuple or to
project a tuple on certain columns:
tupleDrop :: [Column]→ Tuple → Tuple
tupleKeep :: [Column]→ Tuple → Tuple
In contrast to Chapter 3 where relations were sets of tuples, we wish to work now with
an ordered data structure, namely streams:
type Sequence = Stream Tuple
We still need to enforce, however, that no duplicates are present in the sequences we
manipulate. We shall therefore be careful to use the traditional nub function to rule out
duplicates. Its definition is:
nub :: Sequence → Sequence
nub [] = []
nub (x : xs) = x : [ y | y ← nub xs, y ≠ x ]
In the remainder, we simply give Haskell definitions for relational operations over se-
quences.
Union The union of two sequences is the concatenation of the two, in which duplicates
have been removed. In Haskell:
∪seq :: Sequence → Sequence → Sequence
r ∪seq s = nub (r ++ s)
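As a runnable sketch of this operator, we can model tuples as lists of integers and sequences as finite Haskell lists (a simplification for illustration: the thesis uses an abstract domain D and streams, and the name unionSeq is ours):

```haskell
import Data.List (nub)

-- Tuples modelled as integer lists; a Sequence is a duplicate-free list of tuples.
type Tuple = [Int]
type Sequence = [Tuple]

-- Union: concatenate, then remove duplicates, keeping first occurrences.
unionSeq :: Sequence -> Sequence -> Sequence
unionSeq r s = nub (r ++ s)
```

Note that the left-hand sequence dictates the front of the result: `unionSeq [[1,2],[2,1]] [[2,1],[1,1]]` yields `[[1,2],[2,1],[1,1]]`.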
Projection The projection of a sequence of tuples of arity n on some columns X1, . . . ,Xk
(with k ≤ n) is the sequence where all tuples have been projected to these columns and
where duplicates have been discarded:
πseq X1,...,Xk :: Sequence → Sequence
πseq X1,...,Xk = nub · map (tupleKeep [X1, . . . , Xk])

For convenience, we also introduce a projection-out operator π̄seq that projects out some
columns of the tuples:

π̄seq X1,...,Xk :: Sequence → Sequence
π̄seq X1,...,Xk = nub · map (tupleDrop [X1, . . . , Xk])
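Both projection variants can be sketched concretely; here columns are 0-based indices into integer-list tuples, and all names (projSeq, projOutSeq, etc.) are our own illustrative choices:

```haskell
import Data.List (nub)

type Column = Int          -- 0-based column index (an illustrative convention)
type Tuple = [Int]
type Sequence = [Tuple]

tupleKeep, tupleDrop :: [Column] -> Tuple -> Tuple
tupleKeep cs t = [ t !! c | c <- cs ]
tupleDrop cs t = [ v | (c, v) <- zip [0..] t, c `notElem` cs ]

-- Projection keeps the named columns; projection-out drops them.
-- In both cases nub discards the duplicates the projection may create.
projSeq, projOutSeq :: [Column] -> Sequence -> Sequence
projSeq cs = nub . map (tupleKeep cs)
projOutSeq cs = nub . map (tupleDrop cs)
```

For example, `projSeq [0] [[3,4],[1,3],[3,2]]` yields `[[3],[1]]`: the duplicate value 3 in the first column is kept only at its first position.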
Selection We may filter a sequence of arity n in two ways: either by selecting all tuples
for which two columns Xi and Xj (with i , j ≤ n) share identical values, or by keeping only
the tuples in which a column Xi has the value d (with i ≤ n).
In Haskell, the selection with field equality is:
σseq Xi=Xj :: Sequence → Sequence
σseq Xi=Xj sn = filter f sn
  where f x = (tupleKeep [Xi] x == tupleKeep [Xj] x)
Similarly, the selection with an arbitrary column test is defined as:
σseq Xi=d :: Sequence → Sequence
σseq Xi=d sn = filter f sn
  where f x = (tupleKeep [Xi] x == [d])
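Both selections are order-preserving filters. A concrete sketch (with our own 0-based column convention and illustrative names):

```haskell
type Column = Int
type Tuple = [Int]
type Sequence = [Tuple]

-- Selection with field equality: keep tuples whose columns i and j agree.
selEq :: Column -> Column -> Sequence -> Sequence
selEq i j = filter (\t -> t !! i == t !! j)

-- Selection with a constant test: keep tuples whose column i holds d.
selConst :: Column -> Int -> Sequence -> Sequence
selConst i d = filter (\t -> t !! i == d)
```

Since filter never reorders or duplicates elements, no call to nub is needed here.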
Cross product In the cross product, or cartesian product, of two sequences of possibly
different arity rm and sn , the first tuple of rm is mapped to all the elements of sn , then the
second tuple of rm is mapped to all the elements of sn , and so on. Using list comprehensions:
(×seq) :: Sequence → Sequence → Sequence
rm ×seq sn = [ x ++ y | x ← rm, y ← sn ]
Notice that we can omit the call to nub here as it is clear the list comprehension cannot yield
any duplicate if both rm and sn were themselves duplicate-free.
For reasoning later in the thesis, we shall use an equivalent definition in a combinatorial
style based on map and concat , plus the explicit call to nub. That is:
(×seq) :: Sequence → Sequence → Sequence
rm ×seq sn = (nub · concat · map (λx → map (λy → x ++ y) sn)) rm

Finally, note that we shall sometimes use the exponential form rⁿ as a shorthand for
r ×seq · · · ×seq r (n times).
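The combinatorial form of the product runs directly as Haskell (names ours; the redundant nub is kept to match the definition used for reasoning later):

```haskell
import Data.List (nub)

type Tuple = [Int]
type Sequence = [Tuple]

-- Cross product in combinatorial style: each tuple of r is paired, in order,
-- with every tuple of s. nub cannot remove anything when both inputs are
-- duplicate-free, but we keep it for uniformity with the other operators.
crossSeq :: Sequence -> Sequence -> Sequence
crossSeq r s = (nub . concat . map (\x -> map (\y -> x ++ y) s)) r
```

The order is visibly "left argument first": `crossSeq [[1],[2]] [[3],[4]]` gives `[[1,3],[1,4],[2,3],[2,4]]`.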
Negation The negation is expressed in terms of sequence difference. Of course, it may be
that there is no initial sequence to subtract from. In that case, since we work with a closed
world assumption, we can subtract the negated sequence sn from a universe sequence (of
similar arity) built out of our domain D. This implies that elements of the domain are also
ordered into a sequence. We denote that initial sequence with Dseq . Therefore, in the general
case, negation is formally expressed as:
notseq :: Sequence → Sequence
notseq sn = [ x | x ← Dseqⁿ, x /∈ sn ]
If there is a sequence of greater arity (i.e. m ≥ n) to subtract from, it is usually less
expensive to express it directly as:
rm ∩seq (notseq sn) :: Sequence → Sequence → Sequence
rm ∩seq (notseq sn) = [ x | x ← rm, tupleKeep Xsn x /∈ sn ]
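The difference form can be sketched directly. For illustration we assume (our convention, not the thesis's) that the columns of sn correspond to the first n columns of rm:

```haskell
type Tuple = [Int]
type Sequence = [Tuple]

-- Difference form of negation: keep the tuples of r whose projection on the
-- first n columns does not occur in s. The order of r is preserved.
diffSeq :: Int -> Sequence -> Sequence -> Sequence
diffSeq n r s = [ x | x <- r, take n x `notElem` s ]
```

For example, subtracting the unary sequence `[[2]]` removes every tuple of r starting with 2.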
First We introduce an unusual operator that has no counterpart in set-based relational
algebra. Like projection, first is parameterised by some columns X1, . . . ,Xk (with k ≤ n).
The operator groups a sequence sn on these columns and takes the head of each subsequence:
first X1,...,Xk :: Sequence → Sequence
first X1,...,Xk sn = nub [ head (filter (f x) sn) | x ← sn ]
  where f x y = (tupleKeep [X1, . . . , Xk] x == tupleKeep [X1, . . . , Xk] y)
The following small example over a sequence with columns X illustrates that definition:
firstX .1〈(1, 2), (1, 3), (2, 3), (1, 5)〉 = 〈(1, 2), (2, 3)〉
firstX .2〈(1, 2), (1, 3), (2, 3), (1, 5)〉 = 〈(1, 2), (1, 3), (1, 5)〉
Note that, if we do not group on any column, first simply yields the singleton sequence
containing the head of the whole sequence:

first sn = nub [ head sn | x ← sn ] = 〈head sn〉
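A runnable sketch of first, which reproduces the example above with 0-based column indices in place of the X.i notation (firstSeq is our name):

```haskell
import Data.List (nub)

type Column = Int
type Tuple = [Int]
type Sequence = [Tuple]

-- Group the sequence on the given columns and keep the head of each group,
-- preserving the order in which groups first appear.
firstSeq :: [Column] -> Sequence -> Sequence
firstSeq cs s = nub [ head (filter (\y -> key x == key y) s) | x <- s ]
  where key t = [ t !! c | c <- cs ]
```

Grouping on the first column of 〈(1,2), (1,3), (2,3), (1,5)〉 yields 〈(1,2), (2,3)〉, as in the example; with an empty column list, the whole sequence forms one group and only its head survives.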
For completeness, we also define the remaining classical operators of relational algebra in
terms of the above primitives.
Intersection Intersection is still expressed using cartesian product, selection and projection.
We use X and Y as shorthand for Xrn and Ysn:

(∩seq) :: Sequence → Sequence → Sequence
rn ∩seq sn = πseq X.1,...,X.n (σseq X.1=Y.1, ..., X.n=Y.n (rn ×seq sn))
To wit, the preserved order is the order of elements as they appear in the left-hand side
sequence rn . The previous definition is equivalent to the direct one:
rn ∩seq sn = [ x | x ← rn , y ← sn , x == y ]
Natural join Similarly, the natural join operation that combines information from two
sequences into a possibly bigger one can be expressed using cartesian product, selection and
projection. It is parameterised by the indexes of the columns on which to join, more precisely
by k pairs of indexes (X1,Y1), . . . , (Xk ,Yk) where the first and second elements of each pair
refer respectively to a column of rm and sn . The definition of join is:
(./seq) :: Sequence → Sequence → Sequence
rm ./seq (X1,Y1),...,(Xk,Yk) sn = π̄seq Y1,...,Yk (σseq X1=Y1, ..., Xk=Yk (rm ×seq sn))
In words, we select the tuples of the cartesian product of rm and sn that have identical values
for the specified pairs of columns, and project out the redundant columns. In the remainder
of the thesis, we shall often omit the columns on which to join. For conciseness, we indeed
assume that sequences have labeled columns and that we join on columns that share the
same labels.
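A concrete sketch of the join, with our own 0-based column-pair convention (i indexes the left sequence, j the right one):

```haskell
import Data.List (nub)

type Column = Int
type Tuple = [Int]
type Sequence = [Tuple]

-- Join r and s on the given (i, j) column pairs, keeping all columns of r
-- and dropping the matched (redundant) columns of s. The order of the
-- underlying cross product, driven by r, is preserved.
joinSeq :: [(Column, Column)] -> Sequence -> Sequence -> Sequence
joinSeq ps r s =
  nub [ x ++ dropCols (map snd ps) y
      | x <- r, y <- s, all (\(i, j) -> x !! i == y !! j) ps ]
  where dropCols cs t = [ v | (c, v) <- zip [0..] t, c `notElem` cs ]
```

Joining `[[1,2],[3,4]]` with `[[2,5],[4,6]]` on the pair `(1,0)` combines matching rows into `[[1,2,5],[3,4,6]]`.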
Sequential composition Finally, we shall refer to the sequential composition of two se-
quences. It is obtained by joining two sequences on the last column of the first sequence and
the first column of the second sequence, and projecting out the two intermediate columns.
We define it directly as follows:
(;seq) :: Sequence → Sequence → Sequence
rm ;seq sn = π̄seq X.n, Y.1 (σseq X.n=Y.1 (rm ×seq sn))
Again, the order is guided first by the sequence on the left-hand side. For instance:
〈(1, 2), (1, 3)〉;seq 〈(3, 5), (2, 4)〉 = 〈(1, 4), (1, 5)〉
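Sequential composition admits a direct sketch (seqComp is our name), which reproduces the example just given:

```haskell
import Data.List (nub)

type Tuple = [Int]
type Sequence = [Tuple]

-- Join on (last column of r, first column of s), dropping both matched
-- columns; the left-hand sequence guides the order of the results.
seqComp :: Sequence -> Sequence -> Sequence
seqComp r s = nub [ init x ++ tail y | x <- r, y <- s, last x == head y ]
```

Here `seqComp [[1,2],[1,3]] [[3,5],[2,4]]` gives `[[1,4],[1,5]]`: the tuple (1,2) composes with (2,4) before (1,3) composes with (3,5).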
4.3 Stratified Ordered Datalog
We shall now introduce a novel variant of Datalog which works on these duplicate-free se-
quences rather than usual sets to guarantee that results are returned in a deterministic order.
Quite naturally, we refer to this version of Datalog as Ordered Datalog. Ordered Datalog has
the same constructs as normal Datalog plus the operator first. However, just like negation
in normal Datalog, some of our relational operations on sequences (beyond negation itself)
are nonmonotonic and prevent the correct computation of a least fixpoint. In this
section, we explore the stratification restriction that one must impose on Ordered Datalog
to guarantee the existence of a least fixpoint. We shall notably study the monotonicity of
our relational operators over sequences, and see that stratified Ordered Datalog is just a
refinement of stratified Datalog with an additional order on the results.
4.3.1 Non-termination
To illustrate the problem of non-termination, we shall consider a simple example. Take the
domain made of two elements 1 and 2 and the initial sequence r = 〈(1, 2), (2, 1)〉 whose
setified version is depicted by the graph in Figure 4.1. We wish to compute the transitive
Figure 4.1: Setified graph representation of 〈(1, 2), (2, 1)〉 (two nodes, 1 and 2, with an edge in each direction)
closure of r, and can think of four straightforward relational definitions for it:
i. r+ = r ∪ r+; r
ii. r+ = r ∪ r ; r+
iii. r+ = r+; r ∪ r
iv. r+ = r ; r+ ∪ r
For each of these versions, we sketch each step of the fixpoint computation:
First version: r+ = r ∪ r+; r

0: r+0 = ε
1: r+1 = 〈(1, 2), (2, 1)〉
2: r+2 = 〈(1, 2), (2, 1), (1, 1), (2, 2)〉
3: r+3 = 〈(1, 2), (2, 1), (1, 1), (2, 2)〉 = r+2

A fixpoint is reached.
Second version: r+ = r ∪ r ; r+

0: r+0 = ε
1: r+1 = 〈(1, 2), (2, 1)〉
2: r+2 = 〈(1, 2), (2, 1), (1, 1), (2, 2)〉
3: r+3 = 〈(1, 2), (2, 1), (1, 1), (2, 2)〉 = r+2

Again, a fixpoint is reached.
Third version: r+ = r+; r ∪ r

0: r+0 = ε
1: r+1 = 〈(1, 2), (2, 1)〉
2: r+2 = 〈(1, 1), (2, 2), (1, 2), (2, 1)〉
3: r+3 = 〈(1, 2), (2, 1), (1, 1), (2, 2)〉
4: r+4 = 〈(1, 1), (2, 2), (1, 2), (2, 1)〉 = r+2
5: . . . , and so on, no fixpoint being reached.
Fourth version: r+ = r ; r+ ∪ r

0: r+0 = ε
1: r+1 = 〈(1, 2), (2, 1)〉
2: r+2 = 〈(1, 1), (2, 2), (1, 2), (2, 1)〉
3: r+3 = 〈(1, 2), (1, 1), (2, 1), (2, 2)〉
4: r+4 = 〈(1, 1), (1, 2), (2, 2), (2, 1)〉
5: r+5 = 〈(1, 2), (1, 1), (2, 1), (2, 2)〉 = r+3
6: . . . , no fixpoint is reached.
The evaluation does not terminate for the two latter versions. The intuition behind the
non-termination lies in the order in which we yield results. In the former versions, we return
paths of minimal length first: in front, the paths of length 1 ((1, 2) and (2, 1)), then the paths
of length 2 ((1, 1) and (2, 2)), and nothing more, since paths of greater length will already
have been yielded. In the latter versions, however, we wish to yield the paths of maximal
length first, and the evaluation, in the presence of cyclic data, thus enters an infinite loop
while looking for the longest paths.
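These iterations can be replayed mechanically. The sketch below (helper names ours, with the same toy sequence r) confirms that version (i) stabilises while version (iii) oscillates between two sequences forever:

```haskell
import Data.List (nub)

type Sequence = [[Int]]

unionSeq :: Sequence -> Sequence -> Sequence
unionSeq r s = nub (r ++ s)

seqComp :: Sequence -> Sequence -> Sequence
seqComp r s = nub [ init x ++ tail y | x <- r, y <- s, last x == head y ]

r :: Sequence
r = [[1,2],[2,1]]

-- Version (i):   r+ = r ∪ (r+ ; r)  -- shortest paths first, terminates
stepI :: Sequence -> Sequence
stepI t = r `unionSeq` (t `seqComp` r)

-- Version (iii): r+ = (r+ ; r) ∪ r  -- longest paths first, oscillates
stepIII :: Sequence -> Sequence
stepIII t = (t `seqComp` r) `unionSeq` r

-- All iterates of a step function starting from the empty sequence.
iterates :: (Sequence -> Sequence) -> [Sequence]
iterates f = iterate f []
```

For stepI, the third iterate equals the second: a fixpoint. For stepIII, consecutive iterates keep alternating between two orderings of the same four tuples.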
Our domain of elements being finite, such non-termination can only come from a nonmonotonic
relational operator when handling sequences rather than sets. To understand the
restrictions for Ordered Datalog programs to be safely evaluated through fixpoint computa-
tions, we need to pinpoint operators on duplicate-free sequences that are nonmonotonic.
4.3.2 Chasing nonmonotonic ordered operators
Evidently, the monotonicity property of each operator depends on the inclusion order we
choose for sequences. We shall briefly study two alternatives: the subsequence order and
the prefix order. Because union and cross product are binary operators, we introduce
for each of them two unary operators defined by fixing one of the arguments (either on the
left or on the right). Thus, for every sequence y ∈ Seq, we define the left and right union
operators ⊕y and y⊕ such that:
∀x ∈ Seq · ⊕y(x ) = x ∪seq y
∀x ∈ Seq · y⊕(x ) = y ∪seq x
Similarly, for every sequence y ∈ Seq, we define the left and right cross product operators
⊗y and y⊗.
For our study of monotonicity, it is worth noting a few distributive laws that can be
derived from the definitions of the operators given in the previous section. This approach is
similar to the work by Seres and Spivey on the algebra of logic programming [SSH99, Ser01],
but far less complete: we are only interested in proving the monotonicity of our
operators. We note the following distributive laws over union:
πseq(r ∪seq s) = πseq(r) ∪seq πseq(s) (4.1)
σseq(r ∪seq s) = σseq(r) ∪seq σseq(s) (4.2)
y⊕(r ∪seq s) = y⊕(r) ∪seq y⊕(s) (4.3)
⊗y(r ∪seq s) = ⊗y(r) ∪seq ⊗y(s) (4.4)
All these laws are easily shown by using the Haskell model of our relational operations
over sequences. To illustrate, we shall prove here that left cross product distributes over union
(4.4). We first give a few useful laws involving nub, which are easily shown by induction on
lists. For any well-typed function f and any pair of lists r and s , we have:
nub ·map f · nub = nub ·map f (4.5)
nub · concat · nub = nub · concat (4.6)
nub (r ++ s) = nub (nub r ++nub s) (4.7)
Then, to prove the distributive law itself, we reuse the definition of cross product in
combinatorial style that we gave in the previous section:
(r ∪seq s)×seq y
= {definitions of cross product and union}
(nub · concat · map f ) (nub (r ++ s)) where f t = map (λt′ → t ++ t′) y
= {law (4.6)}
(nub · concat · nub ·map f · nub) (r ++ s)
= {law (4.5)}
(nub · concat · nub ·map f ) (r ++ s)
= {distributivity of map over ++ }
(nub · concat · nub) (map f r ++map f s)
= {law (4.6)}
(nub · concat) (map f r ++map f s)
= {distributivity of concat over ++ }
nub (concat (map f r)++ concat (map f s))
= {law (4.7)}
nub ((nub · concat ·map f ) r ++(nub · concat ·map f ) s)
= {definitions of cross product and union}
(r ×seq y) ∪seq (s ×seq y)
All other distributive laws can be proved in the same manner. We now focus on the
monotonicity of our operators on two different partial orders.
Monotonicity under subsequence order The subsequence order is the usual order to
consider on sequences (whether they are duplicate-free or not, infinite or not).
Definition 4.1 (Subsequence) A subsequence of some sequence is a new sequence which
is formed from the original sequence by deleting some of the elements without disturbing the
relative positions of the remaining elements.
Definition 4.2 (Subsequence order) The subsequence order is a binary relation ⊆ on Seq
such that for all r , s ∈ Seq, r ⊆ s if and only if r is a subsequence of s.
For instance, we have:
〈2, 3, 5〉 ⊆ 〈1, 2, 3, 4, 5, 6, 7〉
The subsequence order appears however to be inappropriate for the evaluation of Ordered
Datalog. The counterexample below shows that projection is indeed nonmonotonic with
respect to that order.
Counterexample: Take two sequences r = 〈(3, 4), (1, 3)〉 and s = 〈(1, 2), (3, 4), (1, 3)〉. From
the definition of subsequence, we have r ⊆ s. Now, consider the projection of each sequence
on its first column. Because we rule out duplicates in the projected sequence, we have
πseq1(r) = 〈3, 1〉 and πseq1(s) = 〈1, 3〉, hence πseq1(r) ⊈ πseq1(s).
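The counterexample can be checked mechanically (proj1 and isSubseqOf are our illustrative names; recent versions of base also provide Data.List.isSubsequenceOf):

```haskell
import Data.List (nub)

-- Projection on the first column, with duplicate removal.
proj1 :: [[Int]] -> [[Int]]
proj1 = nub . map (take 1)

-- Subsequence test: every element of the first list occurs in the second,
-- in the same relative order.
isSubseqOf :: Eq a => [a] -> [a] -> Bool
isSubseqOf [] _ = True
isSubseqOf _ [] = False
isSubseqOf (x:xs) (y:ys)
  | x == y    = isSubseqOf xs ys
  | otherwise = isSubseqOf (x:xs) ys

r, s :: [[Int]]
r = [[3,4],[1,3]]
s = [[1,2],[3,4],[1,3]]
```

Indeed r is a subsequence of s, yet proj1 r = [[3],[1]] is not a subsequence of proj1 s = [[1],[3]].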
Projection is implicitly used in almost all Datalog programs, and its use cannot be restricted
as such. This is unfortunate because all the other operators (except first and not, of course)
would appear to be monotonic under subsequence order. We must therefore look at another
inclusion order.
Monotonicity under prefix order Another order that makes sense, and which is perhaps
more intuitive, is the prefix order.
Definition 4.3 (Prefix) A sequence r is a prefix of a sequence s if s consists of the sequence
r followed by zero or more other elements. That is, for all n such that r(n) is well defined,
r(n) = s(n).
Definition 4.4 (Prefix order) The prefix order is a binary relation v on Seq such that for
all r , s ∈ Seq, r v s if and only if r is a prefix of s.
Under this order, the monotonicity of projection, selection, right union and left cross product
follows easily from the distributive laws we gave at the beginning of the section.
Proof: Let the function f represent any unary operator among projection, selection, right
union and left cross product. Suppose r , s ∈ Seq such that r v s . From the definition of
prefix, there is a smallest sequence t such that s = r ∪seq t .
f (s)
= {by definition of prefix}
f (r ∪seq t)
= {by distribution of f over union (4.1), (4.2), (4.3), (4.4)}
f (r) ∪seq f (t)
w {by definition of union}
f (r)
Hence f is monotonic. ∎
On the other hand, the remaining operators negation, left union and right cross product
are nonmonotonic. We note, however, that in the special case where the sequence y is of
length at most one, the corresponding right cross product y⊗ is monotonic. The case is
obvious for y = ε. Using the same proof as above, the case for y = 〈t〉 is also straightforward
if we show that 〈t〉 ×seq (r ∪seq s) = (〈t〉 ×seq r) ∪seq (〈t〉 ×seq s). That equality is apparent
if we reduce its two sides independently.
Finally, in Ordered Datalog, we allow the extra operator first. This operator has no
counterpart in normal Datalog. Nonetheless, if we were to draw an analogy, we would
suggest a non-deterministic operator choose that restricts each relevant group of tuples to
one of its elements. Unlike choose, which would have to be handled like a nonmonotonic
aggregate in normal Datalog, first is similar to projection and hence monotonic (under
prefix order only).
We can hence summarise the monotonicity of our primitive sequence-based relational
operators as follows:
Operator                        Monotonicity
Projection (πseq)               monotonic
Selection (σseq)                monotonic
Right union (y⊕)                monotonic (for all y ∈ Seq)
Left union (⊕y)                 nonmonotonic (for some y ∈ Seq)
Right cross product (y⊗)        nonmonotonic (for some y ∈ Seq with |y| > 1)
Left cross product (⊗y)         monotonic (for all y ∈ Seq)
Negation (not)                  nonmonotonic
First (first)                   monotonic
All other useful operators are derived from these primitive operators. In particular, the
intersection, join and sequence operators are expressed with cross product, projection and
selection. Consequently, their left versions are monotonic but not their right versions.
Following the concept of stratified normal Datalog, we conclude that if an Ordered Datalog
program is stratified in such a way that there is no use of negation, left union and right cross
product inside recursion, then it can be evaluated by computing the fixpoint of each stratum
one after the other in topological order. More formally, an Ordered Datalog program is safely
stratified if every rule Ri in a stratum si complies with the following grammar:
Ri ::= ε                          empty sequence
     | Dseq                       universal sequence
     | Ri−1                       rule in a lower stratum
     | Ri                         rule in the same stratum
     | Ri−1 ∪seq Ri               union
     | Ri ×seq Ri−1               cartesian product
     | πseq X1,...,Xk (Ri)        projection
     | σseq Xi=Xj (Ri)            selection with field equality
     | σseq Xi=d (Ri)             selection with arbitrary test
     | notseq (Ri−1)              negation
     | first (Ri)                 first
In the end, we realise that statically stratified Ordered Datalog allows only a limited form of
recursion. Indeed, the restriction on cartesian product is quite severe: it notably rules out
non-linear recursion. We shall see in the next chapter how to accept more general queries and
overcome the restrictions on union and cross product. We first wish to show that Ordered
Datalog programs are consistent with their counterparts in normal Datalog.
4.3.3 A refinement of stratified Datalog
Interestingly, stratified Ordered Datalog can be seen as a data refinement of normal stratified
Datalog where finite sets are refined to finite duplicate-free sequences. We can prove, as
Figure 4.2 shows, that we get the same results by transforming an Ordered Datalog program
into a normal Datalog program and evaluating it with the set-based semantics, as by eval-
uating the original Ordered Datalog program with respect to the sequence-based semantics
and then removing the order from the results.
Figure 4.2: Data refinement (a commuting square: fseq maps sequences to sequences, f maps
sets to sets, and setify relates the two on each side)
We restrict the proof to stratified Ordered Datalog programs which contain no use of
first, since that operator has no counterpart in normal Datalog, and we reuse the fixpoint
formalism that we have introduced for the evaluation of strata in Chapter 3.
There we defined a step function fRj for each Datalog rule Rj in a stratum si, and
lifted these to a step function fi for the entire stratum si. We could then define in (3.1) the
minimal model of each stratum si as the least fixpoint of its step function. We recall the
definition here:

[[si]] = lfp(fi)
Ordered Datalog programs are evaluated stratum by stratum, like normal stratified Datalog
programs. We write fseqRj for the step function of a corresponding Ordered Datalog rule
Rj, and fseqi for the step function lifted to the whole stratum si. By analogy with [[si]], we
denote the least model of that stratum with:

〈〈si〉〉 = lfp(fseqi)    (4.8)
Our data refinement proof hence reduces to the proof that for any stratum si we have:
[[si ]] = setify(〈〈si 〉〉)
where setify is the function obtained by lifting setify to tuples of sequences:

setify : seq Dn(i,1) × · · · × seq Dn(i,ki) → P Dn(i,1) × · · · × P Dn(i,ki)
setify((X1, . . . , Xk)) = (setify(X1), . . . , setify(Xk))

Similarly, we shall write ∅ for (∅, . . . , ∅) and ε for (ε, . . . , ε).
Proof: We first note that each set-based relational operator ⊕ relates to its counterpart on
sequences ⊕seq with:
⊕ · setify = setify · ⊕seq
This is no surprise as we have defined our relational operators over sequences to follow the
semantics of their set-based counterparts.
Now, by swapping each operator of a rule Rj one by one (e.g. ⊕1 · ⊕2 · setify =
⊕1 · setify · ⊕2seq = setify · ⊕1seq · ⊕2seq), we infer the same equality for the step function
of each Rj:

fRj · setify = setify · fseqRj
Finally, we can lift up the result to any entire stratum si and obtain:

fi · setify = setify · fseqi    (4.9)

The end of the proof is then straightforward. By definition, there exists an n such that
[[si]] = fi^n(∅), and an n′ such that 〈〈si〉〉 = fseqi^n′(ε). The values fi^n(∅) and fseqi^n′(ε)
being fixpoints, we can take N = max(n, n′) so that:

[[si]] = fi^N(∅)
〈〈si〉〉 = fseqi^N(ε)

Finally, we can prove the equality:

[[si]] = fi^N(∅)
       = fi^N(setify(ε))
       = setify(fseqi^N(ε))    {by N applications of (4.9)}
       = setify(〈〈si〉〉) ∎
The fact that stratified Ordered Datalog is a data refinement of stratified Datalog is
important. Indeed, stratified Datalog programs are known to have a very intuitive semantics,
and we now know that Ordered Datalog follows that same intuitive semantics. Furthermore,
if we do not need query results in order, we can simply treat any first-free query as a
normal Datalog query.
We shall now give the semantics of the logical features in JunGL by translating predicates,
edges and path queries constructs to Ordered Datalog.
4.4 Data model
Before we explain the semantics of the logical features, we need to describe precisely the
underlying data structure that is being queried in JunGL. While introducing the design of
JunGL, we have stressed the fact that the representation of the object program is initially a
simple AST (or collection of ASTs). The tree is then further decorated through the definition
of edges, which turns it into a graph. We treat that initial tree as a collection of EDB relations
(i.e. a database representing the program), and we handle the super-imposed graph defined
by the various edges as IDB relations (i.e. a view on top of the program tree).
In this section, we describe the data model of the initial program tree and introduce some
useful functions for querying it.
Domain The database on top of which queries are evaluated consists of a collection of
ASTs that is stored in memory. Hence the values that are manipulated are mostly, but not
exclusively, nodes: a node may admit a field that is not a node. For instance, in the AST of a
While program (whose grammar is given in Figure 2.1), nodes of type Var have a field name
of type string.
We denote the full domain of values that can be queried with D, and the set of all nodes
among it with Node. Although D includes nodes, booleans, strings, integers, lists and tuples,
we must stress that it is finite. It does not include the set of all possible strings, or all possible
lists, but only the elements that are currently held in memory. As we will see shortly, we
ban the creation of new values during query evaluation. Notably, there is no mechanism for
binding a logical variable to a fresh constant. The set D is fixed during each query evaluation.
As we wish to return the results of each query in a deterministic order, the arrangement
of elements in the original domain D obviously matters. We hence need to assign an order
to values in memory. Let Dseq be the sequence over our domain that reflects that order, and
Nodeseq the subsequence that gives the order of node values only.
Types In the presentation of the semantics to follow, we need to refer to the precise data
type of each AST node. We denote the set of all types with Type and the subset of all
AST data types with NodeType. Each node has a type τ ∈ NodeType, and we introduce the
following function to retrieve the type of a node:
type :: Node → NodeType
We shall need a well-founded relation to reflect the type hierarchy of the AST data types.
We write τ ≺ τ ′ when τ is a proper subtype of τ ′, and τ � τ ′ if and only if τ = τ ′ or τ ≺ τ ′.
Fields We also need to refer to the labeled fields of each node. We call FieldName the set
of all field names, and we introduce a function to return all the field names of a node type:
fields :: NodeType → P FieldName
Furthermore, we need a function to retrieve the value of a field for a given node:
fieldValue :: Node → FieldName → D
Finally, we introduce a special function children that, given a field name ℓf and a node
n, returns the sequence of children of n present in field ℓf. Precisely:

children :: FieldName → Node → seq Node
children ℓf n = let v = fieldValue n ℓf in
                if v is a node then 〈v〉
                else if v is a list of nodes then sequenceOf v
                else ε
The function sequenceOf simply turns a list into a sequence while preserving the order of
elements. We use it to make it explicit that we yield a sequence here. Note that the same
node cannot occur twice in a list of children nodes, and therefore we do not need to call nub.
If node n has no field ℓf, or if the field ℓf of n is neither a node nor a list of nodes, then
children ℓf n simply returns the empty sequence.
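A toy model of this lookup can make the three cases explicit. The node representation below is hypothetical (not JunGL's actual data structures), chosen only to exercise the definition:

```haskell
-- Hypothetical field values: a single node, a list of nodes, or a leaf value.
data Value = VNode Node | VList [Node] | VLeaf String
  deriving (Eq, Show)

newtype Node = Node { nodeFields :: [(String, Value)] }
  deriving (Eq, Show)

-- children returns the node children stored under the given field name,
-- or the empty sequence when the field is absent or not node-valued.
children :: String -> Node -> [Node]
children f n = case lookup f (nodeFields n) of
  Just (VNode m)  -> [m]
  Just (VList ms) -> ms
  _               -> []
```

A field holding a list of two statement nodes yields both children in order; a string-valued or missing field yields the empty sequence.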
Built-in tree navigation We also have the following navigation functions that relate
nodes to other nodes in the original tree representation of the program. These functions
correspond to the built-in edges described in Table 2.1.
parent :: Node → seq1 Node
child :: Node → seq Node
firstChild :: Node → seq1 Node
lastChild :: Node → seq1 Node
successor :: Node → seq1 Node
predecessor :: Node → seq1 Node
listSuccessor :: Node → seq1 Node
listPredecessor :: Node → seq1 Node
4.5 Translating predicates, edges and path queries
We shall now explain how we evaluate predicates, edges and path queries over the above
data model. First, we introduce the syntactic constructs for building up predicates, edge
bodies and path queries. Next, we give the semantics of each of the constructs by translating
them to relational equations over duplicate-free sequences that are given, for now, the least
fixpoint interpretation of stratified Ordered Datalog programs. We shall see in Chapter 5
that, in fact, we support more general ordered Datalog programs, but the translation scheme
we give here is general and shall remain the same.
4.5.1 Abstract syntax
For clarity, we do not use the exact parse tree of JunGL presented in Appendix A, but give
a core abstract syntax for predicates and path queries. Differences are minor though, and we
explain them briefly below.
First, we require any existential predicate local to bind a single identifier.
This is not a restriction, since
local ?x ?y : p(?x, ?y)
is just syntactic sugar for
local ?x : local ?y : p(?x, ?y)
Furthermore, we represent simple tests (such as an equality ?x = ?y + 1) as a pure
function from identifiers to booleans. Indeed, tests never bind logical variables; they are just
used to filter tuples. Having only pure functions guarantees that we do not update the tree
structure during query evaluation: facts about the tree structure are extensional only, i.e.
known before the evaluation of any query.
In addition, the only terms we consider are logical variables. It might look like complex
non-ground terms are allowed in JunGL, but that is simply not the case. There is indeed no
unification in our evaluation mechanism and, through the simple use of predicate calls, it is
impossible to bind a logical variable to a freshly built value. Any complex expression, such as
a function call, a list constructor or an arithmetic expression, that appears in the argument
of a predicate call can be replaced by a fresh logical variable. In that case, the replacement
comes with an additional filter conjunct on the side of the predicate call, in order to enforce
the equality of the fresh variable to the expression that has been extracted. To illustrate,
p(?x + 1, ?y)
translates, in our core language of logical features, to
local ?z : p(?z, ?y) & f ?z ?x
where f is resolved in the environment to the following function:
f :: Integer → Integer → Bool
f z x = (z == x + 1)
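The flattening can be replayed directly. The relation p, the active domain DOM, and the concrete tuples below are invented sample data; the point is only that the fresh variable ?z is bound by the call to p and then related to ?x by the pure filter f:

```python
# Invented sample relation p and active domain for ?x.
p = [(3, 10), (5, 20)]
DOM = range(10)

# The pure filter extracted from the argument ?x + 1.
def f(z, x):
    return z == x + 1

# Matches of p(?x + 1, ?y), evaluated as local ?z : p(?z, ?y) & f ?z ?x,
# with ?z projected away from the result.
answers = [(x, y) for (z, y) in p for x in DOM if f(z, x)]
```

Crucially, f never binds anything: it only filters candidate tuples, so the tree structure cannot be updated during query evaluation.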
Finally, we omit namespaces and assume we evaluate attributes beforehand so that we
can simply consider them as additional fields in the AST structure. We also assume we have an
environment ρ for resolving names which is already extended with definitions of all predicates,
edges, functions and AST data types. We shall use the function resolve to lookup a definition
for a given name in that environment. Furthermore, when translating the body of an edge,
we assume we have access to the name of the variable capturing the source node of the edge.
That name is obtained as a singleton by calling the function sourceVar on our environment
ρ. If we are not currently translating an edge body, sourceVar returns the empty set.
In the end, we really focus on the semantics of predicates and path queries and their core
abstract grammar reads as follows:
i : LogicalIdentifier
`p : PredicateName
`e : EdgeName
`f : FieldName
`τ : TypeName
`λ : FunctionName

p : Predicate
p ::= true
    | false
    | p | p
    | p & p
    | ! p
    | local i : p
    | first p
    | `p ( i1, . . . , in )
    | `λ ( i1, . . . , in )
    | pp

pp : PathPredicate
pp ::= np ( ep np )*

np : NodePredicate
np ::= [ i [ : [ ! ] `τ ] ]

ep : EdgePredicate
ep ::= `f
    | `e
    | ( cep )
    | ep ; ep
    | ep +
    | ep *

cep : ComplexEdgePredicate
cep ::= ep [ pp ]
    | pp ep
    | local i : cep
    | cep & p
Note that we distinguish basic edge predicates from more complex ones. In that way we
mirror an important syntactic restriction: a complex edge predicate, which can be a path
predicate without a starting node or without an ending node, needs to be bracketed to serve
as a basic edge predicate. Inside brackets, the complex edge predicate can be further exploited
through its (possibly reflexive) transitive closure. As an illustration, we recall a definition for data
dependency introduced in Chapter 2:
[?y] (local ?z : cfsucc [?z] & ![?z] def [?v])* ; cfsucc [x]
We are now ready to give the semantics of predicates and path queries by translating each
syntactic construct appearing in the above grammar to a relational equation. Each equation
shall be expressed with the relational operators over sequences that we have introduced at
the beginning of the chapter.
4.5.2 Relational equations
We introduce five functions Sp , Spp , Snp , Sep , and Scep to denote, respectively, the sequences
resulting from the evaluation of a predicate, a path predicate, a node predicate, an edge
predicate and a more complex edge predicate:
Sp : Predicate → seq D∗
Spp : PathPredicate → seq D∗
Snp : NodePredicate → seq Node
Sep : EdgePredicate → seq D∗
Scep : ComplexEdgePredicate → seq D∗
We use the notation [[. . . ]] to indicate the syntactic structure to which we give a meaning.
Bits of pure syntax are written in teletype font, whereas terms in italic fonts stand for other
constructs. The meaning of each construct depends on the meaning of these other terms.
We say the semantics are given by induction on the syntactic structure of the program.
To be fully precise, the evaluation functions should all be parameterised by the envi-
ronment ρ but we keep it implicit most of the time. When needed, we will simply write
S[[. . . ]]ρ.
Usual predicate constructs We start by giving the definition of Sp for each usual pred-
icate construct. Not surprisingly, this translation follows the usual mapping of predicate
calculus to relational algebra mentioned in Section 3.2.3 and extends it with the noteworthy first
operator present in JunGL.
We assume, for ease of reading, that each column of a sequence is labelled with the name of
the corresponding variable in its predicate counterpart. In particular, the conjunction of two
predicates p and q is evaluated as a join of their relational interpretations on the columns
that have the same labels. Also note that we omit the usual renaming of columns that is
needed when evaluating a predicate call `p(i1, . . . , in).
Sp [[true]] = 〈()〉
Sp [[false]] = ε
Sp [[p | q]] = Sp [[p]] ∪seq Sp [[q]]
Sp [[p & q]] = Sp [[p]] ./seq Sp [[q]]
Sp [[!p]] = notseq(Sp [[p]])
Sp [[local i : p]] = πseq i(Sp [[p]])
Sp [[first p]] = firstS (Sp [[p]]) where S = sourceVar ρ
Sp [[`p(i1, . . . , in)]] = Sp [[resolve ρ `p ]]
Sp [[`λ(i1, . . . , in)]] = σseq f(Dnseq) where f = resolve ρ `λ
Sp [[pp]] = Spp [[pp]]
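The operators on the right-hand sides above can be read as executable operations on duplicate-free sequences of rows. The following sketch is a simplification for illustration, not the exact model of Chapter 3; a row is represented as a mapping from variable names to values, and the order of the left operand dominates:

```python
def nub(rows):
    """Keep only the first occurrence of each duplicate row."""
    out = []
    for r in rows:
        if r not in out:
            out.append(r)
    return out

def union_seq(xs, ys):
    """∪seq: ordered union, dropping later duplicates."""
    return nub(xs + ys)

def join_seq(xs, ys):
    """⋈seq: natural join on the columns the two rows share."""
    def agree(r, s):
        return all(r[v] == s[v] for v in r.keys() & s.keys())
    return nub([{**r, **s} for r in xs for s in ys if agree(r, s)])

def project_away_seq(i, xs):
    """πseq i: project the column i away (translation of local i : p)."""
    return nub([{v: k for v, k in r.items() if v != i} for r in xs])

def select_seq(f, xs):
    """σseq f: filter rows with a pure boolean function."""
    return [r for r in xs if f(r)]
```

Because every operator ends with nub, each result is again duplicate-free, and the enumeration order of the outer comprehension fixes the deterministic order of matches.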
Path predicates We now focus on the definition of Spp . Importantly, the sequence re-
sulting from the evaluation of a path predicate is not necessarily a binary sequence. Indeed,
some other logical variables may be bound inside a path predicate: the most obvious case
is when one binds some intermediate nodes inside the path. To illustrate briefly, here is a
stream comprehension in JunGL that is based on a path predicate of arity three:
{ (?x, ?y, ?z) | [?x:Program] child* [?y:If] condition [?z] }
We search for paths from a node ?x of type Program to the guard ?z of a conditional state-
ment, and we are also interested in the If node ?y itself.
For path predicates, the translation to relational equations is as follows:
Spp [[np]] = Snp [[np]]
Spp [[pp ep np]] = (Spp [[pp]];seq Sep [[ep]]);seq Snp [[np]]
This last definition is valid since we have defined sequential composition to work on sequences
of arbitrary arity, and not just on binary sequences. In Spp [[pp]];seq Sep [[ep]] for instance, the
join occurs on the last column of Spp [[pp]] and the first column of Sep [[ep]], so that the ending
node of the path predicate pp is equal to the starting node of the edge ep.
Node predicates We shall now define Snp :
Snp [[[i]]] = Nodeseq
Snp [[[i:`τ]]] = σseq f(Nodeseq) where f n = true ⇔ type n ⪯ resolve ρ `τ
Snp [[[i:!`τ]]] = σseq f(Nodeseq) where f n = true ⇔ type n ⋠ resolve ρ `τ
The first construct simply binds the logical variable i to any node. The second construct
binds i to any node whose type is a subtype of the type designated by `τ in our environment.
If this condition were to be translated to Datalog, it could just be a non-binding test.
Similarly, the third construct binds i to any node whose type is not a subtype of the type
designated by `τ .
Edge predicates Here we give the meaning for all different ways of constructing edge
predicates. The first two constructs require all our attention, since it is at that point that
fields and edges are viewed as relations. In both cases the strategy is the same: we build a
binary sequence where each node in Nodeseq is potentially mapped to the target nodes of the
edge.
In the case of fields notably, Sep [[`f ]] is the sequence of pairs that map each node to
its children nodes in field `f : a tuple (x , y) is in the sequence Sep [[`f ]] if and only if y ∈
children `f x . Furthermore, the order of tuples in Sep [[`f ]] is given both by the order of nodes
in Nodeseq , and by the arrangement of children nodes in the ASTs. The precise definition of
the built sequence is:
Sep [[`f ]] = buildSequence (children `f )
The function children retrieves the children nodes of a node — we have defined it in Section
4.4 — and the function buildSequence is defined as follows:
buildSequence :: (Node → seq Node) → seq Node2
buildSequence f = concat (map (λn. map (λv . [n; v ]) (f n)) Nodeseq )
In words, we map over all nodes in the sequence Nodeseq and obtain a sequence of sequences
that we flatten to a single sequence using concat. Note that buildSequence takes a function f that
returns sequences of arity one, but itself returns a binary sequence.
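In executable form, buildSequence is a one-liner; node_seq and child_f below are invented stand-ins for Nodeseq and (children `f ):

```python
# Invented stand-in for Nodeseq.
node_seq = [1, 2, 3, 4]

# Invented stand-in for children `f: node 1 has children 2 and 3,
# node 2 has child 4, the other nodes have no children in this field.
def child_f(n):
    return {1: [2, 3], 2: [4]}.get(n, [])

def build_sequence(f):
    """Pair every node of node_seq with each of its targets, in order."""
    return [(n, v) for n in node_seq for v in f(n)]
```

The resulting order is exactly the one described above: primarily the order of nodes in node_seq, and secondarily the arrangement of children within each node.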
Similarly, we wish to give the meaning of a call to an edge predicate. When resolving
the edge with label `e , there are however two possibilities. Either the edge has been defined
by the user, i.e. declared via a let edge definition, or it is built-in for navigating the tree.
In the latter case, we simply build a sequence from the appropriate tree navigation function
and hence relate each node of Nodeseq to its expected neighbours.
Sep [[‘parent ’]] = buildSequence parent
Sep [[‘child ’]] = buildSequence child
Sep [[‘firstChild ’]] = buildSequence firstChild
Sep [[‘lastChild ’]] = buildSequence lastChild
Sep [[‘successor ’]] = buildSequence successor
Sep [[‘predecessor ’]] = buildSequence predecessor
Sep [[‘listSuccessor ’]] = buildSequence listSuccessor
Sep [[‘listPredecessor ’]] = buildSequence listPredecessor
The definition is a little trickier, however, when `e resolves to a user-defined edge. The
complexity is two-fold. First, the body of an edge definition is actually a predicate that may
refer to other user-defined predicates and edges, and notably to itself recursively. While field
accesses and built-in edges are easily regarded as EDB predicates (which are evaluated by
constructing the binary sequences described above), user-defined edges are inherently IDB
predicates and can be mutually recursive. Second, we have seen in Chapter 2 that one may
actually give different overriding definitions of the same edge for different source node types.
That means predicate dispatch must happen at runtime to determine which edge body should
be evaluated to retrieve the right targets. For now, we leave aside the detailed explanation
on how we encode predicate dispatch and simply write:
Sep [[`e ]] = Sp [[dispatch (resolve ρ `e)]]
It is however important to stress again the potential presence of recursion in the relational
equation we give here. The predicate body given by dispatch (resolve ρ `e) may indirectly
refer back to the edge predicate Sep [[`e ]] for instance. In that case, relational equations
must be solved using the least fixpoint interpretation of Ordered Datalog programs we have
explained earlier.
We now move on to the meaning of the remaining constructs for edge predicates, where
the presence of recursion is explicit:
Sep [[ep ; eq]] = Sep [[ep]];seq Sep [[eq]]
Sep [[ep +]] = µX · Sep [[ep]] ∪seq X ;seq Sep [[ep]]
Sep [[ep *]] = ρseq i=j(Node2seq) ∪seq Sep [[ep +]]
Sep [[(cep)]] = Scep [[cep]]
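The least-fixpoint equation for ep + can be iterated directly: starting from the empty sequence, one applies X ↦ ep ∪seq (X ;seq ep) until the sequence stops changing. A sketch over an invented edge relation:

```python
def nub(pairs):
    out = []
    for t in pairs:
        if t not in out:
            out.append(t)
    return out

def union_seq(xs, ys):
    """∪seq: ordered union, keeping first occurrences."""
    return nub(xs + ys)

def comp_seq(xs, ys):
    """;seq restricted to binary sequences: relational composition."""
    return nub([(a, c) for (a, b) in xs for (b2, c) in ys if b == b2])

def plus_closure(ep):
    """µX · ep ∪seq (X ;seq ep), iterated to the fixpoint."""
    x = []
    while True:
        x2 = union_seq(ep, comp_seq(x, ep))
        if x2 == x:
            return x
        x = x2

# Invented sample edge relation: a chain 1 → 2 → 3 → 4.
edge = [(1, 2), (2, 3), (3, 4)]
```

Because ∪seq keeps first occurrences only, tuples discovered in earlier rounds keep their positions, which is what makes the order of the closure deterministic.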
More complex edge predicates For the sake of expressiveness of our language, we allow
more complex edge predicates which notably provide scopes for local logical variables that
may be bound to multiple different values across successive repetitions of an edge. One
can also name one end of an edge (which then looks like the start or the end of a path) to
further constrain that end across repetitions. Here is how these constructs translate to relational
equations over sequences:
Scep [[ep]] = Sep [[ep]]
Scep [[ep pp]] = Sep [[ep]];seq Spp [[pp]]
Scep [[pp ep]] = Spp [[pp]];seq Sep [[ep]]
Scep [[local i : cep]] = πseq i(Scep [[cep]])
Scep [[cep & p]] = Scep [[cep]] ./seq Sp [[p]]
Binding equality To be complete, we should also mention the presence of two special built-
in predicates. In JunGL ‘==’ can only be used as a non-binding filter and we have therefore
introduced the binary predicate equals to provide binding equality. A naive translation of a
call to equals is given by:
Sp [[equals(i , j)]] = ρseq i=j(D2seq)
In addition, it is often convenient to bind a variable with the values of a pre-computed
sequence. The predicate isIn, whose sequence argument s must be bound, translates to:
Sp [[isIn(i , s)]] = s
4.5.3 Ordered Datalog rules
We have exposed the translation of logical constructs to relational equations over duplicate-
free sequences, which can be interpreted as Ordered Datalog programs. For readability
purposes, we now propose to write these relational equations as Ordered Datalog rules in the
usual syntax of Datalog. We shall consider several EDB predicates for accessing the EDB
relations we have mentioned earlier. Notably, we call node the predicate whose interpretation
is Nodeseq , parent the predicate whose interpretation is (buildSequence parent), and so on for
the other navigation edge predicates. As for field accesses, we shall use field_name to denote
the predicate whose interpretation is (buildSequence (children ‘name’)). Furthermore, we
introduce fresh predicate names for recursively-defined IDB relations. Hence, given our
translation to relational equations, the tiny JunGL query
{ ?c | first ( [1] child+ [?c] ) }
can be written as an Ordered Datalog query where we now adopt lowercase for variable
names, like in the original JunGL program:
child_plus(x, y) ← child(x, y); ∃z · child_plus(x, z), child(z, y).
query(c) ← first(child_plus(1, c)).
Note how the ‘+’ appended to child is translated to the recursive rule child_plus.
Because we work with sequences, order at intermediate steps of the query evaluation is
preserved. If we evaluate this query on our sample child relation of Chapter 3, the result is
always 2.
4.5.4 Encoding dynamic edge dispatch
We have seen in Chapter 2 that edge definitions can be overridden. One can indeed define for
some source type τ an edge `e that is already defined for another source type τ ′. If τ ≺ τ ′,
we say the edge definition `e is overridden for τ . Here we illustrate how predicate dispatch
[EKC98, Mil04] is used for dynamic edge dispatch.
We shall consider a precise example to support our explanation, namely three AST data
types A, B and C such that B ≺ A and C ≺ A, as well as an edge e defined for nodes of all
these types.
type A =
  | B
  | C

let edge e x:A → ?y = p(x, ?y)
let edge e x:B → ?y = q(x, ?y)
let edge e x:C → ?y = r(x, ?y)
The idea is to introduce some special unary predicates node_A, node_B and node_C to
enforce a node variable to be of a specific type. We can then express the fact that p should
be called only if x is of type A but neither of type B nor of type C, whereas q is called only
if x is of type B, and r is called only if x is of type C. To wit, the translation of the predicate
dispatch to an Ordered Datalog disjunct is as follows:
edge_e(x, y) ← node(x), (
      node_A(x), not node_B(x), not node_C(x), p(x, y)
    ; node_B(x), q(x, y)
    ; node_C(x), r(x, y)
  ).
Note the presence of the predicate node(x) at the beginning of the body. This is to force
the tuples of the edge to be returned in an order that follows the order of nodes in Nodeseq.
Otherwise, tuples would be returned in an order following the type of their first element:
first the As that are neither Bs nor Cs, then the Bs and finally the Cs.
The predicate node_A is simply defined with the following rule, where the second conjunct
is a non-binding test on the type of the variable x:
node_A(x) ← node(x), type x ⪯ A.
The predicates node_B and node_C are defined in the same way. It is easy to see that
the interpretation for node_B is a subsequence of the interpretation for node_A, and similarly
for node_C.
Also, because each node has a single precise type at runtime, we know that the interpre-
tations for node_B and node_C are disjoint. Therefore, we are guaranteed that only one of
the calls p(x, y), q(x, y) and r(x, y) is actually relevant for a specific x.
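The dispatch encoding can be mimicked as follows. The runtime types, the hierarchy, and the three bodies p, q and r are all invented sample data; the leading iteration over node_seq plays the role of the node(x) subgoal that fixes the order:

```python
# Invented sample nodes and their runtime types.
node_seq = [1, 2, 3]
TYPE = {1: "A", 2: "B", 3: "C"}

def subtype(t, u):
    """In the example hierarchy, B and C are the subtypes of A."""
    return t == u or u == "A"

def p(x): return [x + 10]   # invented body for plain As
def q(x): return [x + 20]   # invented body for Bs
def r(x): return [x + 30]   # invented body for Cs

def body(x):
    """The guards make exactly one body relevant per node."""
    t = TYPE[x]
    if subtype(t, "A") and t != "B" and t != "C":
        return p(x)
    if t == "B":
        return q(x)
    if t == "C":
        return r(x)
    return []

edge_e = [(x, y) for x in node_seq for y in body(x)]
```

Since the three type tests are mutually exclusive, the union of the three branches never duplicates a tuple, and the result follows the order of node_seq.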
4.5.5 A full translation example
We conclude the section by presenting a more complete example of how the logical parts of
a JunGL program translate to an Ordered Datalog program. We shall draw our example
from Chapter 2. It consists of several ingredients for querying the control-flow graph of a
While program, namely a predicate for checking post-dominance plus some relevant edges.
The hierarchy of the different AST data types is the one given in Figure 2.1.
let edge defaultCFSucc x:Statement → ?y =
  first ( [x] listSuccessor [?y]
        | [x] parent [?y:WhileLoop]
        | [x] parent ; defaultCFSucc [?y]
        | [x] parent ; exit [?y]
        )

let edge cfsucc x:Statement → ?y = [x] defaultCFSucc [?y]

let edge cfsucc x:Block → ?y =
  first ( [x] firstChild [?y] | [x] defaultCFSucc [?y] )

let edge cfsucc x:If → ?y =
  [x] thenBranch [?y]
  | first ( [x] elseBranch [?y] | [x] defaultCFSucc [?y] )

let edge cfsucc x:WhileLoop → ?y =
  [x] body [?y] | [x] defaultCFSucc [?y]

let predicate postDominates (?x, ?y) =
  [?y:Statement] cfsucc+ [?x:Statement] &
  !( [?y] (local ?z : cfsucc [?z] & ?z != ?x)+ [:Exit] )
The full translation to follow is, in our opinion, much less readable than the original
program, but it is a good intermediate representation for building an evaluator. Note that,
for the sake of readability, we have even omitted some useless calls to the predicate node.
edge_defaultCFSucc(x, y) ← node(x), node_Statement(x), firstx(
      listSuccessor(x, y)
    ; parent(x, y), node_WhileLoop(y)
    ; (∃z · parent(x, z), edge_defaultCFSucc(z, y))
    ; (∃z · parent(x, z), field_exit(z, y))
  ).

edge_cfsucc(x, y) ← node(x), (
      node_Statement(x), not node_Block(x),
      not node_If(x), not node_WhileLoop(x),
      edge_defaultCFSucc(x, y)
    ; node_Block(x),
      firstx(firstChild(x, y); edge_defaultCFSucc(x, y))
    ; node_If(x), (
        field_thenBranch(x, y)
      ; firstx(field_elseBranch(x, y); edge_defaultCFSucc(x, y))
      )
    ; node_WhileLoop(x),
      (field_body(x, y); edge_defaultCFSucc(x, y))
  ).
cfsucc_plus(x, y) ← edge_cfsucc(x, y); ∃z · cfsucc_plus(x, z), edge_cfsucc(z, y).

local_cfsucc(i, z, x) ← edge_cfsucc(i, z), z ≠ x.

local_cfsucc_plus(i, j, x) ← local_cfsucc(i, j, x);
    ∃k · local_cfsucc_plus(i, k, x), local_cfsucc(k, j, x).

postDominates(x, y) ← node_Statement(y), cfsucc_plus(y, x),
    node_Statement(x),
    not (∃z · local_cfsucc_plus(y, z, x), node_Exit(z)).
One may wonder at that point where the lazy computation of edges comes in. Indeed, we
have stressed in Chapter 2 that edges are evaluated lazily, and not exhaustively computed
for all nodes in our program tree. Nonetheless, if we were to compute this Ordered Datalog
program with the usual bottom-up approach, we would have to compute all edges for all
nodes. We shall see in the coming chapter that, for this reason and less obvious ones, we
have in fact based the resolution of queries on the Query-Subquery approach.
A trained reader of Datalog programs may have also noticed that the rule local_cfsucc is
not range-restricted, since the variable x is not positively bound in its body. To overcome the
problem in a bottom-up framework, it would be sufficient to append an additional conjunct
node(x) to bind x to all possible nodes. In a top-down framework, however, the program is
just fine as it is, because the third argument of local_cfsucc is bound anyway at all call sites.
It can also be noticed that the subgoal node(x) in edge_defaultCFSucc(x, y) is useless given
the presence of node_Statement(x) afterwards, and it could be optimised away. One may
want to tackle this kind of optimisation in future work.
4.6 Summary and references
In this chapter, we have introduced a novel variant of Datalog, called Ordered Datalog,
whose least fixpoint semantics is based on duplicate-free sequences rather than sets. In order
to state the conditions under which an Ordered Datalog program is stratified, we have first
given a Haskell model of relational operators on such duplicate-free sequences. Then, we have
studied the monotonicity of our new relational operators with respect to a particular partial
order, namely prefix order. For that purpose, we have notably derived useful distributive
laws of our operators from our Haskell model, using an approach similar to the work by Seres
and Spivey on the algebra of logic programming [SSH99, Spi00, Ser01], but in a much less
exhaustive way.
Indeed our modelling of the first and orelse operators with sequences builds on a long
tradition of algebraic approaches to search. For instance, function composition based on a
monad with an extra plus operation and a zero element can be instantiated with either the
Maybe or the List monad, providing different models of nondeterminism: the plus operation
is the logical ‘;’ and it corresponds to ‘if-then-else’ when used with the Maybe monad; our
operator first can be seen as head in the case of using the List monad. To our knowledge, the
first to explain the semantics of functional strategic programming in these terms was Spivey
in [Spi90]. The same ideas were then extended further in the full algebraic account of logic
programming we have just mentioned.
One difference between that pioneering work and our own, however, is the way recursion
is treated. When using a shallow embedding of logic programming via these monads in a
language like Haskell, one inherits the semantics of recursion from the host language. As
we have argued in this chapter, the desired semantics of Datalog is instead one based on
a dedicated partial order on the given monad. The ρ-calculus introduced by Cirstea and
Kirchner [CK01] does not suffer that drawback, as it has an answer-set semantics supporting
various kinds of choice as well as an analogue of first. By contrast, the Stratego language
originally developed by Visser [BKVV06] could be thought of as mostly based on the Maybe
monad, supporting only the simple success/failure-based model.
Next, we have explained how to translate all logical features of JunGL to this novel ordered
variant of Datalog. The translation includes edge definitions, predicates and path queries for
querying the graph representation of a program. One notable feature of the translation is the
use of predicate dispatch to deal with potentially overridden definitions of edges. Predicate
dispatch has been proposed before to naturally unify and generalise several common forms
of dynamic dispatch, including traditional object-oriented dispatch [EKC98, Mil04].
The use of Datalog in software engineering tools has been explored before, both for
expressing precise program analyses [Rep93, DRW96, WACL05] and in the general context
of code queries [CMR92, HVMV05]. Liu et al. also proposed in [LS06] to translate path
queries into Datalog. The crucial difference here, however, is that we have introduced Ordered
Datalog and described a translation to this variant in order to maintain results in a meaningful
order.
One may therefore wonder why we did not simply embed XPath queries [Wad99b] into
JunGL, or even have recourse to XQuery [W3C07] to encode our refactoring transformations. In
XPath-based languages, results have indeed a well-defined order that matters too for the
reconstruction of XML documents. Furthermore, XQuery has been considered before as a
meta-programming language and has proved to be fairly scalable and effective. Magellan,
an open static analysis framework to enable cross-artifact information retrieval, indeed offers
the possibility to write code queries in XQuery [EMOS04, EGM+06].
The semantics of XPath-based languages, however, only refer to the initial document
order [Wad99b, Wad99a]. This allows many optimisations as it is always sufficient to work
out an adequate indexing scheme to tag the position of nodes in the original tree document.
Yet, in our context, the fact that we sometimes wish to retrieve results in an order that is
not the document order rules out the adoption of XPath. In addition, as explained in [LS06],
XPath allows segments of queries to be skipped, but does not allow the expression of repeated
matching segments.
The case is different for XQuery. There, although the result of a path expression is
still returned in document order, the result of a For-Let-Where-Return expression can be
determined both by an optional Order-by clause and by the expressions in its For clauses.
Hence the result of an XQuery query may reflect not only the implicit XML document order
but also the explicit order imposed in the query. In fact, edges in JunGL are quite comparable
to navigator functions in XQuery. Those extend the idea of axes, in the terminology of XPath,
to relate arbitrary nodes in the graph — in XPath, axes are restricted to navigation on the
tree structure only. Thus, it is possible to emulate edges by defining functions in XQuery.
Apart from the syntax that would be particularly verbose in that case, the main issue is
that, unlike XPath expressions, user-defined functions in XQuery admit arbitrary types of
recursion. To handle possibly cyclic queries on graphs, one needs to introduce adequate
guards to prevent the query evaluator from entering an infinite loop.
Instead, we have based the semantics of our logical features on an ordered variant of
Datalog, which like normal Datalog has a clear least fixpoint semantics and enables the
natural expression of complex cyclic queries. So far, we have presented this variant in its
stratified form, which limits the kind of recursion that we can handle. We shall now explain,
however, how to accept more general queries.
Chapter 5
Evaluating more general ordered queries
In the previous chapter, we have explained how to translate logical features in JunGL to
Ordered Datalog, a variant of Datalog that operates over duplicate-free sequences rather
than sets, and studied the precise conditions under which Ordered Datalog programs are safe
— i.e. under which the existence of a least fixpoint is guaranteed for stratified programs.
However, stratified Ordered Datalog turns out not to be expressive enough for our application
of scripting refactoring transformations. In this chapter, we highlight the need for more
general queries, and introduce a broader class of stratified programs that is sufficiently
expressive for our needs but smaller than the class of modularly stratified programs presented
in Chapter 3. Furthermore, we shall describe the evaluation of this broader class in a demand-
driven manner, in a top-down stream-based framework. Finally, in the last part of the chapter,
we shall discuss how to express some of our Ordered Datalog queries in normal Datalog.
5.1 On accepting more queries
Consider the edge definition in JunGL to encode an ancestor relationship between nodes:
let edge ancestor x → ?y =
  [x] parent ; ancestor [?y] | [x] parent [?y]
Following the translation of path queries to relational equations given in Chapter 4, this edge
is equivalent to the Ordered Datalog predicate:
edge_ancestor(x, y) ← node(x), parent(x, z), edge_ancestor(z, y), node(y)
    ; node(x), parent(x, y), node(y).
CHAPTER 5. EVALUATING MORE GENERAL ORDERED QUERIES 94
Unfortunately, although the edge_ancestor predicate is safely stratified in normal Datalog, it
is not in Ordered Datalog, for two reasons: a recursive call to edge_ancestor appears both on
the right-hand side of a cross product and on the left-hand side of a union.
Out of context, it is difficult to understand why the order expressed in edge_ancestor is
important, and why we do not simply rewrite the predicate to a safely stratified rule by
swapping the two disjuncts and moving edge_ancestor(z, y) up in front of all conjuncts:
edge_ancestor(x, y) ← node(x), parent(x, y), node(y)
    ; edge_ancestor(z, y), node(x), parent(x, z), node(y).
We shall therefore consider a more concrete example. In the process of encoding static-
semantic information for different languages, we have found a recurrent scenario where such
a recursion on the left of the union operator is needed. When we try to find the first match
of a series of alternatives, it may actually be the case that one of the disjuncts (and not the
last one) involves a recursion. To illustrate, we recall here an edge that we have defined back
in Chapter 2:
let edge defaultCFSucc x:Statement → ?y =
  first ( [x] listSuccessor [?y]
        | [x] parent [?y:WhileLoop]
        | [x] parent ; defaultCFSucc [?y]
        | [x] parent ; exit [?y]
        )
We have translated it to Ordered Datalog in Chapter 4:
edge_defaultCFSucc(x, y) ← node(x), node_Statement(x), firstx(
      listSuccessor(x, y)
    ; parent(x, y), node_WhileLoop(y)
    ; (∃z · parent(x, z), edge_defaultCFSucc(z, y))
    ; (∃z · parent(x, z), field_exit(z, y))
  ).
The last but one alternative in the definition of the edge defaultCFSucc relies recursively
on defaultCFSucc itself. Such a scenario is very common in the definition of contextual
semantic information for mainstream languages. We will see in Chapter 6 that looking up
entity references in Java for instance is a typical case where we first try to resolve something,
and if that fails we try something else.
Now a relevant question is whether such a query, where we wish to choose the first
matching alternative, could be expressed in stratified normal Datalog. Unfortunately, we
would need to guard the last disjunct with a check that the third disjunct does not have any
proper solution, in order to prevent getting solutions for both the third and last alternatives
in our final result. Yet, if we negate the third disjunct, we end up with negation inside
recursion which is not allowed in stratified Datalog programs either.
Stratified Datalog is not expressive enough to encode the contextual semantic information
that we need for expressing refactoring transformations. And neither is stratified Ordered
Datalog. Therefore we need to look at accepting a more general class of logic programs.
In our introductory overview of Datalog in Chapter 3, we have mentioned the class of
modularly stratified programs, and the example of the win rule that we recall here:
win(X) ← move(X, Y), not win(Y).
Remember that if move is acyclic, by instantiating the rule in every possible way such that
move subgoals are true, we obtain a stratified program. The situation is similar for our
definitions of edge_ancestor and edge_defaultCFSucc: the rules are modularly stratified if the
relation parent is acyclic, which is indeed the case.
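Over an acyclic move relation, the win rule can be evaluated goal-directedly by plain recursion; the move graph below is invented sample data, and tabling is omitted for brevity:

```python
# Invented acyclic move relation: positions 3 and 4 have no moves left.
MOVES = {1: [2, 3], 2: [4]}

def win(x):
    """win(X) ← move(X, Y), not win(Y):
    x is winning iff some move from x reaches a losing position."""
    return any(not win(y) for y in MOVES.get(x, []))
```

The recursion terminates precisely because the move relation is acyclic; the same evaluation would loop on a cyclic graph, which is why more careful resolution strategies are needed in general.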
Unfortunately, modularly stratified programs cannot be evaluated within the usual set-
based bottom-up framework of safe Datalog. To overcome the problem, Ross proposed in
[Ros94] a variant of Datalog with extra operators to track dependencies between atoms.
Modularly stratified programs can be transformed to that extended Datalog and evaluated
through a succession of bottom-up fixpoints. Another solution is to use a goal-oriented
top-down resolution method with tabling and delaying such as SLG [CW96].
Here we wish to suggest an evaluation strategy reminiscent of the Query-Subquery
approach. Unlike Ross's solution, it uses the standard operators of Datalog. Also, it contrasts
with SLG by being set-based, thus leveraging efficient implementations of relational
operations. Therefore, we now introduce the new notions of partial interpretation and partial
stratification, which apply both to normal Datalog programs and Ordered Datalog programs.
5.2 Beyond stratified Ordered Datalog
5.2.1 Partial instantiation
We adopt the same terminology of a complete program component as in [Ros94], except that
we assume that there are never two rules with the same head in a program, since we can
express union with ";" instead.
Definition 5.1 Let F be a program component (i.e. a subset of the rules) of a logic program
P. We say F is a complete component if for every predicate p appearing in the head of a
rule in F , if p is recursive through a predicate q, then the rule in P with head q is in F .
If the predicate p appears in the head of a rule in F then we say p belongs to F. If the
predicate q appears in the body of a rule in F , but does not belong to F, then we say q is used
by F.
Furthermore, we write HeadVars(F ) for the set of head variables found in a program
component F . To avoid name conflicts, we annotate each head variable with the name of the
rule it occurs in. For instance, in the set F of rules below, HeadVars(F ) = {xp , yp , xq , yq}:
p(x , y) ← q(x , y), not (∃y.q(x , y)).
q(x , y) ← r(x , y).
We can now define partial instantiation (which is akin to the idea of partial evaluation of
logic programs, for instance mentioned in [War92]).
Definition 5.2 (Partial instantiation) Let F be a program component, V a subset of
HeadVars(F) and D a domain of values, i.e. a set of constants. The partial instantiation
I^V_D(F) of F with respect to V and D is the set of rules obtained by substituting constants
from D for all variables in V in every possible way.
We rewrite each partially instantiated rule (i.e. rules that have a head variable in V ) by
moving the head variables that have been instantiated to the name of the rule. Furthermore,
for each set of instantiations of the same rule R, we introduce a new rule, called the dispatch
rule of R, whose head is the same as in R and whose body is the union of the instantiated rules
of R in which each disjunct has been amended with a binding equality for the instantiated
head variables.
We illustrate that definition on our Ordered Datalog example of edge ancestor, which
we recall is not statically stratified. With F = {edge ancestor}, D = {1, 2, 3} and V =
{x_edge ancestor}, I^V_D(F) reads as follows:
edge ancestor_1(y) ← node(1), parent(1, z), edge ancestor(z, y), node(y)
; node(1), parent(1, y), node(y).
edge ancestor_2(y) ← node(2), parent(2, z), edge ancestor(z, y), node(y)
; node(2), parent(2, y), node(y).
edge ancestor_3(y) ← node(3), parent(3, z), edge ancestor(z, y), node(y)
; node(3), parent(3, y), node(y).
edge ancestor(x, y) ← x = 1, edge ancestor_1(y)
; x = 2, edge ancestor_2(y)
; x = 3, edge ancestor_3(y).
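The mechanics of the construction are easy to mimic: each constant of the domain produces a specialised copy of the rule, and the dispatch rule reassembles them behind binding equalities. The Python sketch below generates the rules for the edge ancestor example as plain text (the formatting is illustrative, not JunGL's internal representation):

```python
def partial_instantiation(domain):
    """Partially instantiate edge_ancestor(x, y) on x over `domain`,
    producing the specialised rules plus the dispatch rule as text."""
    rules = []
    for c in domain:
        rules.append(
            f"edge_ancestor_{c}(y) <- node({c}), parent({c}, z), "
            f"edge_ancestor(z, y), node(y) "
            f"; node({c}), parent({c}, y), node(y)."
        )
    # Dispatch rule: each disjunct guards a specialised copy with a
    # binding equality for the instantiated head variable.
    dispatch = " ; ".join(f"x = {c}, edge_ancestor_{c}(y)" for c in domain)
    rules.append(f"edge_ancestor(x, y) <- {dispatch}.")
    return rules
```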
In the case of Ordered Datalog, the order of the disjuncts in edge ancestor(x , y) obviously
matters. For our particular application, we shall apply partial instantiation to the first
argument of edge predicates only. We have explained in Section 4.5.4 that the resulting
order of an edge predicate follows the order of nodes in Nodeseq . Furthermore, the law (4.4)
tells us that for any sequence s :
Node_seq ×_seq s = ⋃^seq_{t ∈ Node_seq} (⟨t⟩ ×_seq s)
Therefore it suffices to make the introduced disjunct follow the same order as in Nodeseq to
preserve the general order of the query.
5.2.2 Partial stratification
We now turn to define the partial reduction of a component.
Definition 5.3 (Partial reduction) Let F be a program component, S be the set of predi-
cates used by F, V a subset of HeadVars(F ). Suppose furthermore that S is fully defined by
a model M and that D is the domain of values appearing in M and as constants in F .
Form the partial instantiation I^V_D(F) of F with respect to V and D. Replace any call to
a dispatch rule R in I^V_D(F) by a call to a specialised version of R where all disjuncts that are
known to be irrelevant at the call site with respect to M have been pruned away.
We call the obtained rules R^V_M(F) the partial reduction of F modulo M with respect to
V .
This definition of reduction differs from the definition of reduction in [Ros94] because
we obtain rules that are not fully instantiated (i.e. R^V_M(F) contains some free variables).
Again, we illustrate that definition with the partial reduction of edge ancestor with respect
to D = {1, 2, 3} and V = {x_edge ancestor}. For the example we define M to be:
{node(1),node(2),node(3), parent(1, 2), parent(1, 3)}
We therefore obtain the following set of rules for R^V_M(F):
edge ancestor_1(y) ← node(1), parent(1, z), edge ancestor_{2,3}(z, y), node(y)
; node(1), parent(1, y), node(y).
edge ancestor_2(y) ← node(2), parent(2, z), edge ancestor_∅(z, y), node(y)
; node(2), parent(2, y), node(y).
edge ancestor_3(y) ← node(3), parent(3, z), edge ancestor_∅(z, y), node(y)
; node(3), parent(3, y), node(y).
edge ancestor_{2,3}(x, y) ← x = 2, edge ancestor_2(y)
; x = 3, edge ancestor_3(y).
edge ancestor_∅(x, y) ← false.
Note that the set of Ordered Datalog rules R^V_M(F) is now statically stratified (thanks to the
parent relation being well-founded). This leads to the definition of partial stratification.
Definition 5.4 (Partial stratification) Let ≺ be the dependency relation between com-
plete components. We say the program P is partially stratified with respect to a set of head
variables V if, for every component F of P,
• there is a total well-defined model M for the union of all components F ′ ≺ F, and
• the partial reduction of F modulo M with respect to V is statically stratified.
The class of partially stratified programs is smaller than the class of modularly stratified
programs (i.e. any partially stratified program is modularly stratified), but it highlights an
interesting evaluation mechanism that follows the top-down strategy of the Query-Subquery
approach. We can generate the partial reduction of each component one partial subgoal at a
time, but evaluate each reduction in a set-based framework.
That is exactly the strategy we use in JunGL, and we therefore define the set of JunGL
programs we admit as follows. Let J be a JunGL program and P be the Ordered Datalog
program obtained by translating the predicates, edges and path queries of J as explained
in Chapter 4. Take V the set of all the first head variables of the edge predicates in P .
If P is partially stratified with respect to V , then we accept J as a valid JunGL program.
Less formally, the idea in the case of JunGL is to evaluate edge predicates one source node
at a time in a top-down manner, but to compute the targets of each specific source node
using a set-based evaluation, or rather a sequence-based evaluation in the context of Ordered
Datalog. If no specialised edge predicate (i.e. instantiated for a specific node) depends on
itself through a nonmonotonic construct, then P can be safely evaluated.
Of course, the main difficulty remains in generating the correct reductions of the edge
predicates on the fly. The generation is correct and fairly straightforward when, at each call
site of a dispatch edge rule, the first parameter of the call is already bound. As we are about
to see, it is however more complex if the source parameter is not yet bound.
We shall now explain the importance of laziness in the construction of edges and how
we achieve it using two complementary mechanisms, namely the top-down evaluation of
predicates and the use of streams. We will then come back to the generation of partial
reductions for edge predicates, and see how it fits in the top-down evaluation.
5.3 Demand-driven evaluation
Demand-driven evaluation is crucial for a language that aims at expressing refactorings,
because transformations are often run in an interactive setting. In fact, much of the needed
contextual semantic information and many of the transformations are fairly local. Demand-driven
evaluation takes advantage of that locality and makes it possible to run transformations in
acceptable time, whereas a full analysis of the program would simply be inconceivable. It is
clear enough that we should not adorn the object program tree with all possible edges, given
that most of the time only a few of these will actually be required during a refactoring.
We have explained in Chapter 4 how we translate logical features in JunGL to Ordered
Datalog programs. Edges notably translate to edge predicates using predicate dispatch.
Therefore, in the end, edges are simply seen as pairs of nodes inhabiting the interpretation of
their corresponding edge predicates. A query that is run, for instance, to find some elements
in the program or to check preconditions of a transformation is also an Ordered Datalog
program that refers to a certain number of edge predicates. If we were to evaluate the whole
Ordered Datalog program bottom-up, we would have to compute all the edges of a particular
kind that are referred to from the query. On the other hand, a top-down framework minimises
the computation of irrelevant facts, i.e. of useless edges. In a top-down framework, edges
are evaluated only when their value is needed.
5.3.1 Top-down sequence-based evaluation
We have presented in Chapter 3 two top-down approaches. One is a memoised version of
SLD resolution and works a tuple at a time. The other, the Query-Subquery approach, is
set-based and benefits from efficient algorithms for relational algebra operations, notably
hash joins.
The idea of the latter approach, we recall, is similar to the more popular transformation
of magic sets. The aim is to minimise the computation of irrelevant facts by pushing the
calling context of each predicate inside calls. A sideways information-passing strategy is used
to drive the propagation of the context. It merely consists in an appropriate ordering of the
subgoals in rule bodies. In our setting of Ordered Datalog, we cannot arbitrarily reorder
subgoals. The left-to-right order is the meaningful order and that is the one we use.
We shall apply the Query-Subquery approach adapted to Ordered Datalog (i.e. to work
with sequences) on a small example and show that it indeed reduces the computation of
irrelevant edges. The example in question is taken from Chapter 2 and refers to the abstract
grammar of Figure 2.1. The function assignedVariable returns the declaration of the variable
assigned in a statement a (the function pick actually returns the first element of the stream
or null if the stream is empty):
let edge treePred n → ?pred =
    first([n] listPredecessor [?pred] | [n] parent [?pred])
let edge lookup r : Var → ?dec =
    first([r] treePred+ [?dec : VarDecl] & r.name == ?dec.name)
let edge def x : Assignment → ?y = [x] var; lookup [?y]
let assignedVariable a =
    pick { ?d | [a] def [?d] }
The translation rules of Chapter 4 transform the logical part of this JunGL snippet to the
following Ordered Datalog program that we adorn with binding information to support our
coming explanations.
edge treePredbf (n, pred) ← node(n), firstn(
listPredecessor(n, pred)
; parent(n, pred)
).
treePred plusbf (x , y) ← edge treePredbf (x , y)
; ∃z · treePred plusbf (x , z ), edge treePredbf (z , y).
edge lookupbf (r , dec) ← node(r), node Var(r), firstr (
treePred plusbf (r , dec), node VarDecl(dec),
r .name == dec.name
).
edge def bf (x , y) ← node(x ), node Assignment(x ),
∃z · field var(x , z ), edge lookupbf (z , y).
qbf (a, d) ← edge def bf (a, d).
We shall now walk step by step through the top-down evaluation on the following While
program @p. We annotate nodes to be able to refer to them in our explanations.
[
[ int i; ]@a
[ [ i ]@u = 0; ]@b
[ while ([ [ i ]@v ≤ 10 ]@t )
[ {
[ [ int i; ]@e
[ [ i ]@x = [ i ]@y + 1; ]@f
} ]@d
]@c
[ print([ i ]@z ); ]@g
]@p
As a side remark, the program is incorrect, because i in @y is used before being assigned.
We consider the precise query qbf (@f , d). Hence, the initial calling context for the rule
edge def bf (x , y) is C(x ) = 〈@f 〉. Inside that rule, the same context is propagated down to
field var(x , z ), like in the Query-Subquery approach, by joining it first with node(x ) then
with node Assignment(x ). After field var(x , z ), the context becomes C(x , z ) = 〈(@f , @x )〉
(@x is indeed the field labeled var of node @f ) and we are now faced with a call to
edge lookupbf (z , y).
The process there is similar but with the context C(r) = 〈@x 〉 and we reach the call to
treePred plusbf (r , dec) with an unchanged context again. Now the call is slightly different
because treePred plusbf is recursively defined. We need to introduce a fixpoint computation.
We start with the first iteration. The left-hand disjunct calls edge treePredbf and
returns the sequence 〈(@x , @f )〉 because @x has no list predecessor but has @f as its parent.
In the second disjunct, however, the nested call back to treePred plusbf fails at this stage.
The first iteration of treePred plusbf thus returns 〈(@x , @f )〉 only. Next, the second iteration
evaluates the left-hand side disjunct in the same way, but this time also succeeds on the
right-hand side because treePred plusbf is not empty anymore. We end up with the new
match (@x , @e) where @e is the list predecessor of @f . We continue these iterations until
no new tuple is found. The final resulting sequence for the call treePred plusbf (r , dec) in
edge lookupbf is:
〈(@x , @f ), (@x , @e), (@x , @d), (@x , @c), (@x , @b), (@x , @a), (@x , @p)〉
Each result found for a particular calling context is cached with its context so that no
predicate is evaluated twice with the same context. The cache corresponds to the inst_R^γ
and ans_R^γ relations in the Query-Subquery approach (see Section 3.3.2 for more details).
In particular, we do not evaluate edge treePredbf (n, pred) with the same value for n more
than once during the fixpoint computation.
With the results from treePred plusbf , we continue our computation inside the body of
first in edge lookupbf . This filters the sequence to keep 〈(@x , @e), (@x , @a)〉. Applying first
reduces it to 〈(@x , @e)〉, in effect discarding the farthest match.
The result then propagates back to the top of the program and the final result of the
query is 〈@e〉, indeed the closest declaration of i.
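The walk just described can be replayed concretely. Below is a Python sketch (an illustration, not JunGL's implementation) that hand-encodes the listPredecessor, parent and VarDecl relations of the sample program @p and performs the same nearest-declaration lookup for the reference @x:

```python
# Hand-encoded relations of the sample program; node names are the
# @-annotations from the text.
list_predecessor = {"f": "e", "c": "b", "b": "a"}
parent = {"x": "f", "e": "d", "d": "c", "a": "p"}
var_decls = {"a": "i", "e": "i"}   # VarDecl nodes and their declared names

def tree_pred(n):
    # First matching alternative: list predecessor, or else parent.
    return list_predecessor.get(n) or parent.get(n)

def tree_pred_plus(n):
    """Yield the proper treePred ancestors of n, nearest first."""
    m = tree_pred(n)
    while m is not None:
        yield m
        m = tree_pred(m)

def lookup(ref, name):
    # First VarDecl with the right name along the treePred chain.
    return next((d for d in tree_pred_plus(ref)
                 if var_decls.get(d) == name), None)
```

The chain from @x visits @f, @e, @d, @c, @b, @a, @p in that order, and the lookup stops at the inner declaration @e, matching the evaluation above.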
We see through this detailed description of the evaluation that we have not computed
too many irrelevant edges. Notably, we have computed the lookup of @x only. It is true,
however, that most of the transitive closure of treePred from @x was computed and then discarded
with first. We shall actually see that, when using streams, this does not even happen. But
first, we wish to look at a different query and raise a problematic case in our adoption of the
Query-Subquery approach.
5.3.2 The issue with first
We now turn to the converse JunGL query that finds all the references of a specific declaration:
let references d =
    pick { ?r | [?r] lookup [d] }
This time, the adornment of the equivalent Ordered Datalog program looks like this:
edge treePredbf (n, pred) ← node(n), firstn(
listPredecessor(n, pred)
; parent(n, pred)
).
treePred plusbf (x , y) ← edge treePredbf (x , y)
; ∃z · treePred plusbf (x , z ), edge treePredbf (z , y).
edge treePredbb(n, pred) ← node(n), firstn(
listPredecessor(n, pred)
; parent(n, pred)
).
treePred plusbb(x , y) ← edge treePredbb(x , y)
; ∃z · treePred plusbf (x , z ), edge treePredbb(z , y).
edge lookupfb(r , dec) ← node(r), node Var(r), firstr (
treePred plusbb(r , dec), node VarDecl(dec),
r .name == dec.name
).
q fb(r , d) ← edge lookupfb(r , d).
We now consider the query q fb(r , @a) and focus our attention on the body of first in
edge lookupfb . Both r and dec being bound by our context, the conjunction evaluates to:
〈(@u, @a), (@v , @a), (@x , @a), (@y, @a), (@z , @a)〉
Applying first on the first column of each pair leaves the sequence unchanged. Therefore the
final result of the query is the sequence 〈@u, @v , @x , @y, @z 〉, which is clearly not what we
expect. Only @u, @v and @z resolve to the declaration of i in @a. The references @x and
@y resolve to @e.
The problem comes from an unauthorised step that we take during the Query-Subquery
propagation of the context. Indeed, we are not always allowed to propagate the context inside
first. A conjunct c can be moved inside a first if and only if each variable that is bound by
c is considered by the operator first for grouping. Formally,
c(~x), first_S(p(~x, ~y)) = first_S(c(~x), p(~x, ~y))   iff ~x ⊆ S
In particular, we are not allowed to push the binding for the variable dec in the first of
edge lookupfb :
c(r, dec), first_r(treePred plus(r, dec), node VarDecl(dec), · · · )
≠ first_r(c(r, dec), treePred plus(r, dec), node VarDecl(dec), · · · )
If we do not assume dec to be bound inside the first of edge lookupfb , the query evaluates
correctly.
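The unauthorised step is easy to reproduce with plain lists. In the Python sketch below, pairs holds (reference, candidate declaration) matches, ordered nearest declaration first for each reference; filtering on the declaration before taking the groupwise first wrongly keeps @x, just as in the faulty evaluation above (the data is a cut-down version of the example):

```python
# (reference, candidate declaration) matches, nearest candidate first.
pairs = [("u", "a"), ("x", "e"), ("x", "a")]

def first_by(key, seq):
    """Groupwise first: keep the first tuple for each key value."""
    seen, out = set(), []
    for t in seq:
        k = key(t)
        if k not in seen:
            seen.add(k)
            out.append(t)
    return out

# Correct: take the first candidate per reference, THEN filter on "a".
correct = [t for t in first_by(lambda t: t[0], pairs) if t[1] == "a"]
# Wrong: push the filter inside, i.e. filter first, then take firsts.
wrong = first_by(lambda t: t[0], [t for t in pairs if t[1] == "a"])
```

Only @u actually resolves to @a; the pushed-in filter spuriously reports @x as well.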
This of course has consequences on the demand-driven nature of the evaluation of our
queries. It basically means that, in this case, we have to look up the definitions of all references
in our program. However, it is often possible to considerably reduce the amount of useless
computations. The idea comes from the following observation:
first~x (p(~x , ~y)) = p(~x , ~y), first~x (p(~x , ~y))
If ~y is bound but not ~x , we can first evaluate p to bind ~x , and move that binding inside
the first during the evaluation. This is expressed by the same equation as above but with
binding information and an extra predicate c just to make the context explicit:
c(~y), first_~x(p^ff(~x, ~y)) = c(~y), p^fb(~x, ~y), first_~x(p^bf(~x, ~y))
Although we have not implemented that optimisation, it is clear that it reduces the
number of useless computations in many of our queries. For instance, in our example query
where we search for references to a specific declaration, only references that can reach that
declaration through a chain of treePred edges will be considered in the computation of lookup.
5.3.3 Streams
The use of streams allows us to specify a search problem in a nice compositional way: generate
a stream of successes, and use the operator first on streams to take the first answer — no
further elements will be computed. We employ a technique originally due to Mycroft and
Jones, who were the first to model the operational semantics of logic programs in terms of
streams [JM84]. The same technique was used by Spivey and Seres in their embedding of
Prolog in Haskell [SS99]: there, they used the lazy lists of Haskell to conveniently represent
streams.
In contrast, JunGL is implemented on the .NET platform. For implementing sequences,
we took our inspiration from Cω [BMS05], a language developed at Microsoft Research, where
streams are generated using the same iterator constructs that are available in C# 2.0. An
iterator function is a function that returns an ordered sequence of values by using a yield
statement to return each value in turn. When a value is yielded, the state of the iterator
function is preserved and the caller is allowed to execute. The next time the iterator is
invoked, it continues from the previous state and yields the next value. Iterators are a special
kind of coroutine, a well-known construct that generalises subroutines to allow multiple entry
points and suspension and resumption of execution at certain locations. It is commonly accepted
that coroutines are well-suited for implementing familiar program patterns such as iterators,
infinite lists and pipes.
To illustrate the use of iterators in C#, we give the interface details of the function Union
that takes two source sequences (modelled as IEnumerable<T>) and yields a new sequence
that is the lazy union of the two:
static IEnumerable<T> Union<T>(
    GetKey<T> getKey,
    IEnumerable<T> source1,
    IEnumerable<T> source2
)
The parameter getKey is a delegate to a function that takes a T and returns a key. One may
wonder why we need such a parameter. This is in fact because we do not simply append
the two sequences: we also filter out any duplicates, where two elements are considered
duplicates if they have the same key. We have similarly defined all the sequential relational
operators of Chapter 4.
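In a language with generators, such a lazy, key-deduplicated union takes only a few lines. The following Python sketch mirrors the behaviour described for the C# Union (it is an illustration, not the JunGL source):

```python
import itertools

def union(get_key, source1, source2):
    """Lazy, duplicate-free union of two streams, keyed by get_key.
    Elements are yielded on demand; an element is skipped when an
    element with the same key has already been seen."""
    seen = set()
    for x in itertools.chain(source1, source2):
        k = get_key(x)
        if k not in seen:
            seen.add(k)
            yield x
```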
Another more complex example is Join:
static IEnumerable<V> Join<T, U, V>(
    GetKey<T> getInnerKey,
    GetKey<U> getOuterKey,
    Function<T, U, V> append,
    IEnumerable<T> inner,
    IEnumerable<U> outer
)
There we need three delegates: one to get the key for elements of the inner sequence, another
to get the key for elements of the outer sequence, and a last one to append two source
elements into a result (like ++ in our Haskell definitions). For efficiency reasons, we have
not implemented a nested loop but a hash join. We have also found it useful to have a function
to memorise a stream. The function returns a generator that saves all the elements as they
are first discovered, so that any new iteration on the same sequence will directly return the
elements previously discovered.
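A sketch of such a memoising wrapper in Python (illustrative only): it returns a factory for replayable iterations, saving elements as they are first pulled from the underlying stream.

```python
def memoise(stream):
    """Return a factory of iterators over `stream`: elements are
    computed once, cached, and replayed on every later pass."""
    cache = []
    it = iter(stream)
    def replay():
        i = 0
        while True:
            if i >= len(cache):
                try:
                    cache.append(next(it))   # discover a new element
                except StopIteration:
                    return
            yield cache[i]
            i += 1
    return replay
```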
All these definitions would benefit from the new features of C# 3.0 and LINQ [MBB06]
with very few changes. Our implementation is indeed strikingly similar to some parts of the
LINQ API, which also has its roots in Cω. In LINQ, for instance, Join is declared as:
static IQueryable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IQueryable<TOuter> outer,
    IEnumerable<TInner> inner,
    Expression<Func<TOuter, TKey>> outerKeySelector,
    Expression<Func<TInner, TKey>> innerKeySelector,
    Expression<Func<TOuter, TInner, TResult>> resultSelector
)
The principal difference is that it does not accept functions like we do, but expression trees of
the functions. This is to allow the runtime interpretation of the trees, for instance to generate
SQL code and delegate the query to a database system.
We now turn back to the evaluation of our Ordered Datalog programs. We translate each
query to a pipeline of operations on streams. This pipeline may of course contain recursion
if the query that we represent contains recursively defined predicates. To illustrate, we have
drawn in Figure 5.1 the recursive pipeline of treePred plus .
[Figure: the recursive pipeline for treePred plus — the union operator ∪seq combines
treePred with the output of the join ;seq, whose left input is fed back from the output of
treePred plus and whose right input is treePred.]
Figure 5.1: Example of recursively defined pipeline
In the case of a recursive pipeline, results are yielded before the end of the whole
computation. Calling treePred plus with a specific context returns a sequence. Retrieving one
element of that sequence triggers the first fixpoint iteration. When all the elements of the
first iteration are discovered, asking for a next element triggers the second iteration, and so
on. The benefit of such a pipeline is clear when we use first on a sequence and all elements
we group on are already known (i.e. bound). When we have found all the first tuples for
each of the elements we group on, we do not need to explore the sequence any further. This
reduces the number of irrelevant computations.
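The behaviour of such a recursive pipeline can be sketched with a generator that computes a transitive closure one fixpoint iteration at a time; a consumer may abandon the iteration early, exactly as first does. The encoding below is a simplified illustration, not JunGL's pipeline machinery:

```python
def transitive_closure(edges):
    """Yield tuples of the transitive closure of `edges`, iteration by
    iteration. Asking for the next element after an iteration is
    exhausted triggers the next fixpoint iteration; a consumer may
    stop pulling at any point, leaving later iterations uncomputed."""
    known = set()
    frontier = set(edges)
    while frontier:
        for t in sorted(frontier):   # deterministic order per iteration
            yield t
        known |= frontier
        # Join everything known so far against the base edges.
        frontier = {(x, z) for (x, y) in known
                           for (y2, z) in edges if y == y2} - known
```

For example, retrieving a single element of `transitive_closure({(1, 2), (2, 3)})` runs only the first iteration; draining the generator yields the derived tuple (1, 3) as well.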
5.4 Generating partial reductions
We shall now turn to a different aspect of the implementation, which interestingly relies on
the top-down evaluation mechanism explained in the previous section. The class of partially
stratified programs we have introduced earlier is indeed defined through partial reductions of
strongly connected components, and we explain here how we generate them. We recall that
a partial reduction of a recursive rule R(~x , ~y) with respect to ~x is a partial instantiation of
R for all values of ~x where any recursive call to R in a specialised version Ri(~y) is further
restricted to a context that prevents a call back to Ri .
We shall take again the example of the descendant edge. We make it not statically
stratified (in the context of Ordered Datalog) on purpose, in order to support our explanations. We
give here its translation to Ordered Datalog together with a query and binding information:
edge descendantbf (x , y) ← node(x ), child(x , z ),
edge descendantbf (z , y), node(y)
; node(x ), child(x , y), node(y).
qbf (x , y) ← edge descendantbf (x , y).
Now suppose that we wish to evaluate qbf (@p, y) where @p still refers to our sample program
of the previous section. When the context C(x ) = 〈@p〉 reaches the call to edge descendantbf ,
the rule edge descendantbf (x , y) (because it is not statically stratified) is called with the spe-
cialising context x = @p. This is equivalent to generating on the fly the partial instantiation
of edge descendantbf (x , y) for x = @p:
edge descendant f@p(y) ← node(@p), child(@p, z ),
edge descendantbf (z , y), node(y)
; node(@p), child(@p, y), node(y).
There, the context after the call child(@p, z ) in the first disjunct contains only the children
of @p, that is C(z ) = 〈@a, @b, @c, @g〉. Consequently, we refine the partial instantiation
edge descendant f@p(y) and transform the call to edge descendantbf (z , y) to a union of four
calls in different specialising contexts:
( edge descendant f@a (y); edge descendant f@b(y)
; edge descendant f@c(y); edge descendant f@g(y) )
Fortunately, @p does not appear in that context, so edge descendant@p is safely
stratified. Because child is acyclic, @p will actually never appear in the calling context
of a recursive call. If it did, we would raise an error at runtime. The Query-Subquery
approach thus allows us to generate the partial reductions of edge predicates on the fly as
we propagate down our binding context.
However, it is not always so simple and following the approach just sketched might wrongly
reject some programs that are partially stratified. Take the following variant of our example
where we swap the two middle conjuncts in the first disjunct:
edge descendant f@p(y) ← node(@p), edge descendantff (z , y),
child(@p, z ), node(y)
; node(@p), child(@p, y), node(y).
The issue there is that, before the recursive call to edge descendantff (z , y), the context
for z contains all nodes, and notably @p. We would end up with an error, although any tuple
with z = @p would later be discarded because child(@p, @p) is false.
To overcome that problem, we take inspiration from SLG. In order to handle possible
loops through negation, SLG supports a delaying operation of subgoals to dynamically adjust
a rule, along with a simplification operation to resolve away delayed subgoals when their
truth value becomes known [CW96]. When generating partial reductions, we should allow a
nonmonotonic recursive call to the same specialised version of an edge predicate (for instance
edge descendant f@p(y)), but directly return a singleton sequence with a fake node and a special
marker saying that this ground atom is unsafe (i.e. unknown). We denote such a sequence
by 〈⊥^⊥〉, where the superscript ⊥ is the marker. We could then propagate that marker to any
fact that is inferred using an unsafe fact. At the end of the evaluation, if the result contains
an unsafe fact, then we raise an error.
More formally, we should change the Haskell definition of a sequence we gave in Chapter
4 to:
type Sequence = Stream (Tuple × Bool)
where the second element of each pair is true if and only if the tuple is unsafe. We also need
to change the functions tupleDrop and tupleKeep in the obvious way for preserving the state
of the input tuple, and make the concatenation of two tuples unsafe if one of them is unsafe:
(++) :: (Tuple × Bool) → (Tuple × Bool) → (Tuple × Bool)
(t1, b1) ++ (t2, b2) = (t1 ++ t2, b1 || b2)
The definitions of σseq also need to be modified to look at the value of tuples only:
σ^seq_{Xi=Xj} :: Sequence → Sequence
σ^seq_{Xi=Xj} s = filter f s
    where f (t, b) = (tupleKeep [Xi] t == tupleKeep [Xj] t)

σ^seq_{Xi=d} :: Sequence → Sequence
σ^seq_{Xi=d} s = filter f s
    where f (t, b) = (tupleKeep [Xi] t == [d])
With these new definitions, all but one of the relational operators on sequences now
correctly propagate the unsafe marker. For instance,
〈(1, 2), (1, 3)^⊥〉 ∪seq 〈(1, 2), (1, 3)〉 = 〈(1, 2), (1, 3)^⊥, (1, 3)〉
〈(1, 2), (1, 3)〉 ;seq 〈(3, 5), (2, ⊥)^⊥〉 = 〈(1, ⊥)^⊥, (1, 5)〉
π^seq_{X.1} 〈(1, ⊥)^⊥, (1, 5)〉 = 〈1^⊥, 1〉
σ^seq_{X.2=2} 〈(1, 2), (1, 3)^⊥, (1, 3)〉 = 〈(1, 2)〉
Note that we keep copies of the same tuple if one is unsafe and the other is not. Two tuples
are indeed considered equal if they have the same marker state, except in a filter operation.
Negation as failure, however, needs a more important change:
not^seq :: Sequence → Sequence
not^seq sn = [ (t, marker t) | t ← D^n_seq, (t, false) ∉ sn ]
    where marker t = ((t, true) ∈ sn)
If t is unsafe in sn then we cannot deduce anything about it in the complement of sn and it
is marked as unsafe there too.
Back to our example, the recursive call to edge descendantff (@p, y) now returns 〈(@p, ⊥)^⊥〉.
The context after the union of all specialised calls is then C(x, z, y) = 〈(@p, @p, ⊥)^⊥, · · ·〉,
but the first unsafe tuple is then filtered away when intersecting with the result of child(@p, z),
which does not contain (@p, @p).
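The marker discipline amounts to a small amount of bookkeeping on each tuple. A Python sketch of two of the operators (names and encoding are illustrative, with a pair (t, True) representing an unsafe fact t):

```python
def concat(p1, p2):
    """Concatenate two marked tuples; the result is unsafe if either
    operand is, mirroring the ++ definition above."""
    (t1, u1), (t2, u2) = p1, p2
    return (t1 + t2, u1 or u2)

def union_seq(s1, s2):
    """Duplicate-free union: two pairs are equal only when both the
    tuple and the marker state coincide, so a safe and an unsafe copy
    of the same tuple are both kept."""
    seen, out = set(), []
    for p in s1 + s2:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out
```

Run on the first example above, the union keeps (1, 3) twice: once unsafe, once safe.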
We have described how we adapt the Query-Subquery approach to evaluate partially
stratified programs. Most parts of this evaluation mechanism are actually not specific to
Ordered Datalog, and we shall discuss now whether we could, in fact, have based logical
features on sets.
5.5 Back to sets
5.5.1 Motivation
With partial stratification, we have overcome the restrictions of safe Ordered Datalog.
Ordered Datalog programs, however, have other pitfalls compared to normal Datalog. In
particular, they are not good candidates for optimisation because their subgoals cannot be
arbitrarily permuted. In that sense normal Datalog appears to be more declarative than
Ordered Datalog. Furthermore, having the logical parts of the script expressed in normal
Datalog would allow us to integrate JunGL in a tool chain and notably reuse existing efficient
implementations of Datalog. Finally, Datalog seems to be a better choice for reasoning about
the scripts because it is closer to first-order logic. We therefore discuss in this section how
we could fall back to logical features that are based on sets rather than on sequences but still
support a large range of useful scenarios.
During our experiments with JunGL, we have actually found out that many parts of the
scripts do not rely on any order at all. Notably the order is hardly relevant when checking
the preconditions of a refactoring or when computing dataflow properties. Having said that,
we know from Section 4.3.3 that it is perfectly fine to evaluate these parts as normal Datalog
programs. By simply annotating the queries where order does not matter, we could benefit
from the advantages of Datalog over Ordered Datalog mentioned above.
There are many cases however where the order is of course relevant. The order of an
Ordered Datalog query is expressible as a normal Datalog program if we give up on
stratification. The order of a sequence is indeed just a binary relation on tuples. By flattening
each pair of tuples to tuples of double arity, we can represent the order of a sequence-based
predicate as a set-based predicate. However, we would need to encode the behaviour of our
sequence-based relational operators with set-based relational operators. Beside the fact that
such encoding would be very verbose, the presence of the function nub (which, we recall, en-
forces that no duplicate is present in a sequence) in almost all our definitions is challenging.
It can be encoded with negation, but this may lead to unsafe recursion. In the next section,
we address a recurrent scenario where the order is always needed, namely when the operator
first is used. We notably propose a convenient set-based construct to replace it.
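The flattening mentioned above can be illustrated with a minimal Python sketch (illustrative only, outside JunGL): a duplicate-free sequence is encoded as its underlying set together with a set-based binary ordering relation, where each ordered pair flattens two tuples into one of double arity.

```python
def as_set_with_order(seq):
    """Encode a duplicate-free sequence as a pair (set, order relation).

    The order of the sequence becomes a set of pairs (a, b), meaning
    'a occurs before b' in the sequence."""
    assert len(seq) == len(set(seq)), "input must be duplicate-free (nub)"
    elems = set(seq)
    order = {(a, b) for i, a in enumerate(seq) for b in seq[i + 1:]}
    return elems, order
```

As the text notes, this encoding only captures the order itself; re-expressing the sequence-based operators (and nub) on top of it is the verbose part.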
5.5.2 The B operator
A way to get rid of the construct first is to introduce a new binary operator B (pronounced
orelse), which tries to satisfy its right-hand side predicate if and only if its left-hand side
predicate fails. To illustrate, we update some of our earlier examples in Chapter 2 with that
new operator. The treePred edge definition now reads:
let edge treePred n → ?pred =
  [n] listPredecessor [?pred] B [n] parent [?pred]
In words, if n has a list predecessor, then ?pred is the list predecessor of n, or else ?pred
possibly matches the parent of n. Similarly, for defaultCFSucc, we have:
let edge defaultCFSucc x:Statement → ?y =
    [x] listSuccessor [?y]
  B [x] parent [?y:WhileLoop]
  B [x] parent; defaultCFSucc [?y]
  B [x] parent; exit [?y]
Finally, cfsucc is defined as follows:
let edge cfsucc x:If → ?y =
    [x] thenBranch [?y]
  | ([x] elseBranch [?y] B [x] defaultCFSucc [?y])
All three definitions with B are elegant and even more readable than the original ones.
Note, however, that the equivalence of the definitions is contingent on the kind of edges that
are used. Here, the new definitions are equivalent to the previous ones because the edges
involved have at most one target: a node has at most one list predecessor, at most one
parent, at most one list successor, and so on. Therefore we are guaranteed to match at most
one node, just as if we were using first.
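As a minimal model of the sequence semantics of B, the following Python sketch (illustrative only; the data in list_pred and parent is hypothetical) tries the right-hand side for a given source node if and only if the left-hand side yields no match:

```python
def orelse(left, right, sources):
    """Sequence semantics of the B (orelse) operator, sketched.

    left and right map a source node to the (possibly empty) sequence
    of its targets; for each source, the right-hand side is tried only
    when the left-hand side fails."""
    for n in sources:
        matches = list(left(n))
        yield from matches if matches else right(n)

# treePred example: a node's predecessor is its list predecessor,
# or else (possibly) its parent
list_pred = {"stmt2": ["stmt1"]}
parent = {"stmt1": ["block"], "stmt2": ["block"]}
tree_pred = orelse(lambda n: list_pred.get(n, []),
                   lambda n: parent.get(n, []),
                   ["stmt1", "stmt2"])
```

Here stmt1 has no list predecessor and falls back to its parent, while stmt2 keeps its list predecessor, mirroring the treePred definition above.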
We now propose to give the translation of B to Datalog. As an example, consider the
treePred edge again. We want to match the parent of n only when n has no list predecessor.
We could hence write the edge as follows:
let edge treePred n → ?pred =
    [n] listPredecessor [?pred]
  | (![n] listPredecessor []) & [n] parent [?pred]
Note again the asymmetric role of the two variables n and ?pred. We reflect this asymmetry
in the definition of B at the level of Datalog by annotating the operator, as for first:

a(x, y) Bx b(x, y) = a(x, y) ; not a(x, _), b(x, y)
In JunGL, however, we can omit the annotation: we assume it is implicitly given through
the existing asymmetry between the source and the target variables of an edge. In the end,
the notation is very elegant, and we have decided to add it to our language. There, we have
made it work on sequences, with the idea that it can be evaluated as normal Datalog when
order does not matter. This is a win over first, which itself has no counterpart in normal
Datalog.
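The set-based reading of that translation can be sketched as follows (an illustrative Python model, where a and b are sets of (x, y) pairs rather than JunGL predicates):

```python
def orelse_set(a, b):
    """Set-based reading of a(x,y) Bx b(x,y), as translated to Datalog:
    a(x,y) ; not a(x,_), b(x,y)."""
    has_a = {x for (x, _y) in a}  # sources for which a succeeds
    # keep all of a, plus the b-tuples whose source has no a-match
    return a | {(x, y) for (x, y) in b if x not in has_a}
```

Because the result is a set, no order is imposed; this is exactly what makes the construct evaluable as normal Datalog when order does not matter.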
Unfortunately, the queries that use first to take the first success of a list of alternatives
cannot always be directly expressed with B. For instance, the problem is more complex in
the case of the lookup edge also defined in Chapter 2:
let edge lookup r:Var → ?dec =
  first ([r] treePred+ [?dec:VarDecl] & r.name == ?dec.name)
To use B, the idea is to unroll the transitive closure as in:
let edge lookup r:Var → ?dec =
    [r] treePred [?dec:VarDecl] & r.name == ?dec.name
  B [r] treePred; treePred [?dec:VarDecl] & r.name == ?dec.name
  B ...
Therefore, we would need to introduce an auxiliary recursive predicate:
let predicate lookupFrom (?from, ?r, ?d) =
    [?from] treePred [?d:VarDecl] & ?r.name == ?d.name
  B(?from, ?r) [?from] treePred [?p] & lookupFrom (?p, ?r, ?d)

let edge lookup r:Var → ?dec =
  lookupFrom (r, r, ?dec)
We find this definition harder to express and harder to read than our original definition using
first. A possibility, though, would be to introduce, for such a use pattern, yet another operator
that would translate to the appropriate auxiliary predicate.
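The behaviour of the auxiliary predicate lookupFrom can be modelled procedurally (a Python sketch under the assumption, stated in the text, that each node has at most one tree predecessor): walk the treePred chain and stop at the first matching declaration.

```python
def lookup_from(frm, ref_name, tree_pred, var_decls):
    """Model of lookupFrom: follow the treePred chain from `frm` and
    stop at the first VarDecl whose name matches ref_name.

    tree_pred maps a node to its unique tree predecessor (or None);
    var_decls maps declaration nodes to their declared names."""
    node = tree_pred.get(frm)
    while node is not None:
        if var_decls.get(node) == ref_name:
            return node           # first match wins, as with `first`
        node = tree_pred.get(node)
    return None                   # no visible declaration
```

The recursion in the JunGL definition plays the role of the while loop here; the B operator guarantees that deeper predecessors are only considered when the nearer ones fail.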
5.6 Summary and references
In this chapter, we have seen that stratified Datalog programs (whether ordered or not) are
not expressive enough for our application of scripting refactoring transformations. Indeed,
the conditions on static stratification proved too restrictive to successfully express the
computation of static-semantic information with JunGL. The limited expressiveness of
stratified Datalog has been brought out on many occasions in the Datalog literature, e.g.
[Ull94, Prz88], but more rarely in the context of a particular application.
To augment the expressiveness of JunGL, we have therefore introduced the broader class
of partially stratified Datalog programs. Partial stratification is the idea that a Datalog
program, when partially instantiated and reduced with respect to some of its head variables,
becomes stratified. Partial instantiation is akin to the old idea of partial evaluation of logic
programs [War92]. The class of partially stratified Datalog programs is a subset of the class of modularly
stratified programs [Ros94], but it highlights an interesting evaluation mechanism that follows
the set-based top-down strategy of the Query-Subquery approach [Vie86]. We can indeed
perform the partial reduction of components (that are not initially statically stratified) at
runtime, i.e. when the calling context of each relevant predicate is precisely known. Unlike
the solution proposed in [Ros94] for evaluating modularly stratified programs bottom-up, our
approach uses standard relational operators. Furthermore, in contrast to SLG, it is set-based,
thus allowing us to leverage efficient implementations of relational operations.
When generating the partial reduction of a partially stratified component, however, cycles
through nonmonotonic constructs may still occur. This is due to the fact that the reduction
is sensitive to the order of subgoals. To overcome that issue, we propose to temporarily allow
such cycles but to mark tuples inferred from them as unsafe. This proposal is inspired by
the technique of delaying subgoals in SLG resolution [CW96].
Apart from allowing the evaluation of partially stratified components, the Query-Subquery
approach has two other benefits for the evaluation of JunGL scripts. First, it enables the
demand-driven computation of edges. Second, it allows caching of intermediate results. In-
deed, if a calling context of an edge predicate is identical to or subsumed by a previous one,
the edge predicate is solved using answers already computed. In the end, this is roughly
similar to the caching technique used in attribute grammar systems like JastAdd [EH04].
Another interesting point in the implementation of the logical features is the use of
streams. As we shall see in the next chapter, it is convenient to specify a search prob-
lem in a compositional way, generate a stream of successes, and use the operator first on
streams to take the first answer. The technique of modelling the operational semantics of
logic programs in terms of streams was first proposed by Mycroft and Jones [JM84], and
exploited by Spivey and Seres in their embedding of Prolog in Haskell [SS99].
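In this stream-based style, a search problem is built compositionally from goals that map a partial answer to a lazy stream of extended answers, and first simply takes the head of the resulting stream. The following Python sketch (illustrative only; the goal constructors are hypothetical, not JunGL's API) captures the idea:

```python
from itertools import chain

def conj(goal_a, goal_b):
    """Conjunction: for each answer of goal_a, extend it with goal_b."""
    return lambda env: (e2 for e1 in goal_a(env) for e2 in goal_b(e1))

def disj(goal_a, goal_b):
    """Disjunction: lazily concatenate the two streams of answers."""
    return lambda env: chain(goal_a(env), goal_b(env))

def first(stream):
    """Take the first success of a stream of answers, if any."""
    return next(iter(stream), None)

# toy goals: `bind(tag)` succeeds once, extending the answer; `fail` never succeeds
fail = lambda env: iter(())
bind = lambda tag: (lambda env: iter([env + [tag]]))

goal = disj(conj(bind("a"), fail), conj(bind("b"), bind("c")))
```

Because the streams are lazy, applying first to goal explores only as much of the search space as is needed to produce one answer.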
Our implementation of streams follows the one originally proposed in Cω, where streams
are typically generated using iterators of C# 2.0 [ecm06]. A key aspect in Cω is that streams
are always flattened so as to coincide with XPath and XQuery sequences. Of course, in
Ordered Datalog, we manipulate flat sequences too. Many of the research ideas of Cω have
reappeared in the recent LINQ framework [MBB06]. If we had started the implementation of
JunGL slightly later, we would probably have used the LINQ API rather than implementing
relational operations on streams ourselves.
Finally, we have explored how we could base the semantics of the logical features in JunGL
on sets rather than sequences, notably to facilitate reasoning about the transformations.
Thanks to our formalism, most parts of the scripts — the ones in which order does not
matter — would actually require no change. Indeed, if we ignore the order, we have shown
earlier that the stratified evaluation of an Ordered Datalog program leads to the same results
as the stratified evaluation of its normal Datalog counterpart. Yet, other parts of the scripts,
which rely on the operator first, would benefit from a new set-based operator B. To close the
gap with normal Datalog even more, we have actually introduced that operator in JunGL.
The picture is less clear for the remaining parts. Although it is certainly possible to
express any desired order in (non-stratified) normal Datalog, that approach would be quite
verbose. In future work, one may wish to explore the best translation of an arbitrary Ordered
Datalog program to normal Datalog and its consequences on stratification.
Having exposed the design, the semantics and the implementation of JunGL, we are now
ready to put it to test. We shall show, in the next chapter, that JunGL enables the clear
and concise specification of complex real refactoring transformations.
Chapter 6
Scripting refactorings
In this chapter, we wish to validate the design of our language and show that our approach
scales to the expression of refactorings for mainstream languages. We present the implementation
of three of the most frequently used refactorings, which, moreover, are very different in
nature: Rename Variable, Extract Interface and Extract Method. Rename Variable deals with
name binding and scoping. Extract Interface alters the type hierarchy of a program. Finally,
Extract Method manipulates the control and data flow of a program.
We shall describe these refactorings for subsets of mainstream object-oriented languages
like Java or C#. We do not fully support a single language, but we show how to handle the
language features that present a challenge in the correct mechanisation of these transforma-
tions.
6.1 Rename Variable
The automation of Rename Variable goes far beyond a simple search-and-replace mechanism,
because it requires variable binding information and the ability to detect potentially conflicting
declarations of variables with the same name.
Conflicting declarations To understand more precisely the intricacies of renaming, let
us consider the following Java code:
class A {
  int i;
  public int getI() {
    int j = 0;
    return i;
  }
}
One may want to rename the local variable j to i , although the instance member i is used
in the same context. In Eclipse or Visual Studio, post-transformation checks are performed
to ensure that variable bindings have not changed, and in particular no inadvertent variable
capture occurred. In the example, the transformation would be rejected a posteriori — in
Visual Studio, after the tool has offered a view of how the transformation applied.
In a past version of IntelliJ IDEA (5.0 precisely), the above refactoring resulted without
any warning in code where the occurrence of j had simply been changed to i . In such a case,
the code still compiles but i in the second statement of the method is no longer bound to the
instance member, but to the freshly renamed local variable. This situation is certainly the
worst in a refactoring process since your code remains compilable, but now has a different
meaning. JetBrains fixed that bug in IntelliJ IDEA 5.1 shortly after we reported it.
Aim and outline of the script Using JunGL, we wish to detect such conflicts before
the actual transformation, and also attempt to resolve them. We shall make the reasonable
assumption that Rename Variable is correct if all name bindings are preserved by the trans-
formation. That is, any reference to a declaration d should refer to the same declaration d
after the transformation.
In the specification of the refactoring, we hence aim to ensure that the freshly renamed
declaration does not conflict with any pre-existing declaration and that none of the pre-existing
declarations conflicts with the renamed declaration. We shall allow shadowing of a declaration
only if its references are not endangered or if all of them can be qualified appropriately to
make sure that they still refer to the same shadowed declaration. In the above problematic
example for instance, we could remove the ambiguity by changing i in the return statement
to this .i in order to refer to the instance member, even in the presence of a new local variable
i .
The remainder of this section is organised as follows. First, we present an object language
that is both simple for the clarity of our explanations and challenging for the automation of
Rename Variable. That language indeed follows closely the complex name lookup rules of the
Java language. We then describe how to express in JunGL the computation of name lookup
for that language. Finally, we present two versions of the Rename Variable refactoring. One
checks for conflicts, but rejects the transformation if any variable capture occurs. The other is
an extension of the former that tries to minimise rejection by recomputing a non-ambiguous
access for the captured references.
6.1.1 The object language
We consider a subset of Java inspired by the language used in [EH06]. We support packages,
top-level and nested classes, field declarations, class initialisers, local variable declarations
(as the only kind of statements) and any type or variable reference. In addition, we include
super, this and cast expressions. We call this particular new subset of Java NameJava.
As we did before for the toy language While, we can give the abstract grammar of Name-
Java via the following JunGL data type definitions:
type
  Program = { compUnits: CompUnit list }
and
  CompUnit = { packageName: string; classDecls: ClassDecl list }
and
  BodyDecl =
  | MemberDecl = (
    | ClassDecl = { name: string; super: Name;
                    bodyDecls: BodyDecl list }
    | FieldDecl = { fieldType: Name; name: string; expr: Expr }
    )
  | Initializer = { block: Block }
and
  Block = { stmts: Stmt list }
and
  Stmt = (
    | LocalVariableDecl = { varType: Name; name: string; expr: Expr }
    )
and
  Expr =
  | ThisOrSuperOrName = (
    | Name = (
      | SingleName = { name: string }
      | DotName = { left: Expr; right: ThisOrSuperOrName }
      )
    | This
    | Super
    )
  | ParenthesisedExpr = { expr: Expr }
  | Cast = { castType: Name; expr: Expr }
In words, a program is a list of compilation units. Each compilation unit has a package
declaration and a list of class declarations. A class declaration ClassDecl has a name and an
optional extends clause that refers to the name of its superclass. That optional superclass
name is potentially qualified. The data type ClassDecl has therefore a field labeled super of
type Name which is indeed a simple name or a qualified name.
A ClassDecl has also a list of body declarations, each of which is either a class initialiser
(i.e. a block of local variable declarations) or a member declaration. A member declaration
is in turn either a class declaration (thus allowing nested classes) or a field declaration. Field
and local variable declarations have the same structure: they admit a type name, a variable
name and an initialiser expression.
Finally, an expression is either a parenthesised expression, a cast, a super reference, a this
reference, a simple name reference or a qualified name reference. Note that we only use one
data type DotName to represent qualified names or expressions. This grammar therefore al-
lows programs that are not valid NameJava programs. However, such a single representation
is convenient to treat similar cases at once and we use a less permissive grammar for parsing
NameJava programs anyway.
Naturally, we make NameJava follow the same name lookup rules as in Java [jls05].
NameJava therefore exhibits most of the intricacies of name resolution in Java that present
a challenge in the context of Rename Variable. For instance, in the program of Figure 6.1,
the local variable l is initialised with the value of the field f in A.B . In the initialisation
of m, the different access C .this .f also resolves to the field f in A.B . The reference f in
the initialisation of n refers to the field f of the directly enclosing class D . Finally, one can
package a;

class A {
  class B {
    int f;
  }
  class C extends B {
    int g;
    class B {}
    class D extends A.B {
      int f;
      class x {}
      int x;
      {
        int l = super.f;
        int m = C.this.f;
        int n = f;
        int o = ((A.B)C.this).f;
        int p = x;
        x x;
      }
    }
  }
}

Figure 6.1: A NameJava program
refer to a member of a class via a fairly complex qualifier: ((A.B)C .this).f also refers to the
field f in the superclass A.B of the enclosing class C . Note that we could not simply write
((B)C .this).f in that case as B would resolve to the class B in C . Furthermore, in contrast
to C#, it is possible in Java to give the same name to two members of the same class if one
is a field, and the other a class. In our example, it is perfectly fine to define both the class x
and the field x as members of the class D . The context of a reference is then used to resolve
its correct declaration. In the initialisation of p, x refers to the field x of D . On the following
line, however, the local variable x is declared of type x , which is the class in D .
6.1.2 Name lookup
Now that we have introduced our object language informally, we shall describe how we specify
in JunGL the computation of name bindings. Precisely, we give several edge definitions for
relating a type or variable reference to its declaration. The reference might be of the form of
a simple name or a qualified name. Therefore we give an edge definition for both alternatives:
let edge lookup x:SingleName → ?y =
  first ([x] lookupAll [?y] & getName x == getName ?y)

let edge lookup x:DotName → ?y = [x] right; lookup [?y]
The first edge definition from SingleName node x retrieves all visible declarations ?y in
a precise order and takes the first one with a name that matches the name in x . The second
definition extends the lookup mechanism to DotName nodes. Resolving the declaration
referred to by a qualified name x simply reduces to resolving the declaration from the qualified
right subtree of x. This is because, when looking up a single name, we in fact account for
its surrounding context. Indeed, the declarations visible at a single name x depend on the
specific sort of reference that is expected at the position of x (e.g. a variable reference
or a type reference), and also obviously on the presence of a qualifier for x . The former
constraint is handled in the definition of lookupAll, while the latter is treated in the definition
of lookupAllWithDotContext:
let edge lookupAll x:SingleName → ?y =
  [x] lookupAllWithDotContext [?y] &
  (   isVariableName (x) & ([?y:FieldDecl] | [?y:LocalVariableDecl])
    B isTypeName (x) & [?y:ClassDecl]
    B isPackageOrTypeName (x) & ([?y:ClassDecl] | [?y:CompUnit])
    B isAmbiguous (x) )

let edge lookupAllWithDotContext x:SingleName → ?y =
    onTheRightOfDot (x) &
      [x] parent; left; typeLookup; lookupAllMembers [?y]
  | !onTheRightOfDot (x) & [x] lookupAllDecls [?y]
  | !onTheRightOfDot (x) & [x] lookupAllPackages [?y]
The lookupAll edges of x are computed by filtering the lookupAllWithDotContext edges of x
with the information on the kind of reference that is expected at x. If x is expected to be
a variable, then we keep only field and local variable declarations in the stream of possible
lookups. If x is expected to be a type, we keep class declarations only. If x can be a package
or a type reference, then we keep both class declarations and compilation units (we represent
a package by the set of its compilation units). Finally, if x is in an ambiguous context, then
we keep all declarations.
We do not present here the predicates isVariableName, isTypeName, isPackageOrTypeName
and isAmbiguous. Their definition is straightforward and can be found in Appendix
B. We shall however illustrate their behaviour. In the expression ((A.B)C .this).f , f must
be resolved as a variable name, C as a type name, B as a type name too, and A as package
or type name.
More interesting is the account for context in the definition of lookupAllWithDotContext.
There, the stream of declarations depends on whether the reference is qualified or not. If it
is, we resolve the static type of the receiver and we look up its members. If it is not, we first
return all visible declarations from that unqualified context, and then all packages.
The definition of the edge typeLookup is simple in a language with few kinds of expres-
sions. Details can be found again in the full script for Rename Variable in Appendix B. We
shall rather focus here on the definitions of lookupAllMembers, lookupAllDecls and
lookupAllPackages in turn.
All members The edge lookupAllMembers is defined for ClassDecl and CompUnit nodes:
let edge lookupAllMembers x:ClassDecl → ?y =
  [x] (super; lookup)* [?s] &
  ([?s] bodyDecls [?y:FieldDecl] | [?s] bodyDecls [?y:ClassDecl])

let edge lookupAllMembers x:CompUnit → ?y =
  [x] classDecls [?y:ClassDecl]
In words, the potentially visible members of a class declaration x are the fields and nested
classes of x , or of any of the direct or transitive superclasses of x . Of course, not all these
members are actually visible from class x . The order in which we build the stream of edges
is therefore crucial, since it captures member hiding rules. The free variable ?s will match
in order first x itself, then the parent class of x , then the parent of the parent class of x
and so on. To find the parent class of x , we simply recursively call the lookup edge on the
name reference of the superclass of x . For each ?s in the ordered sequence of parent classes,
we then first look up field declarations and then class declarations. Indeed, if a class has
both a field n and a nested class n, then we need to match the field declaration first, as any
ambiguous reference with name n should resolve to that field.
The edge definition of lookupAllMembers is much simpler for compilation units: we
simply return all top-level classes.
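The ordering captured by lookupAllMembers can be modelled as follows (an illustrative Python sketch with hypothetical data, not the JunGL evaluator): members are produced class by class along the superclass chain, with fields before nested classes within each class, so that taking the first name match respects the hiding rules.

```python
def lookup_all_members(cls, superclass, fields, nested_classes):
    """Model of lookupAllMembers: yield members class by class along
    the superclass chain, fields before nested classes within each
    class, so the first match encodes member hiding."""
    c = cls
    while c is not None:
        yield from fields.get(c, [])         # fields first
        yield from nested_classes.get(c, []) # then nested classes
        c = superclass.get(c)                # then the superclass
```

For a class C extending B, the stream lists C's fields and nested classes before any inherited member of B, exactly the order the free variable ?s induces in the edge definition.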
All declarations We shall now describe lookupAllDecls edges. There are three definitions:
one for all nodes in the program and two overridden definitions for ClassDecl and CompUnit
nodes.
let edge lookupAllDecls x → ?y =
    [x] enclosingStmt; listPredecessor+ [?y:LocalVariableDecl]
  | [x] enclosingScope; lookupAllDecls [?y]

let edge lookupAllDecls x:ClassDecl → ?y =
    [x] equals [?y]
  | [x] lookupAllMembers [?y]
  | [x] enclosingScope; lookupAllDecls [?y]

let edge lookupAllDecls x:CompUnit → ?y =
  [x] parent; compUnits [?cu] lookupAllMembers [?y] &
  ( ?cu.packageName == x.packageName | ?cu.packageName == "" )
For a node x that is neither a class, nor a compilation unit, we first try to find an enclosing
statement of x and search for local variable declarations preceding that statement. Then
we move up to the direct enclosing scope of x (i.e. either its direct enclosing class or its
compilation unit), and search for all declarations potentially visible from that point.
The potentially visible declarations of a class declaration x are first the class x itself, then
all members of x , and finally all declarations visible from the enclosing scope of x . Again,
the order of the disjuncts is significant. This time, it captures the shadowing rules of our
language: a member with name n shadows any declaration with the same name n of an
enclosing class.
Finally, the visible declarations of a compilation unit are all the declarations of compilation
units in the same package, or declarations in the root package. Again, we do not describe
here auxiliary edges like enclosingStmt or enclosingScope. Their full definition is given in
Appendix B.
All packages Finally, we shall define the edge that points to all packages. This is straight-
forward as we represent each package by the compilation units it contains. Therefore, it
suffices to climb up to the program root and find all compilation units:
let edge lookupAllPackages x → ?y =
  [x] parent* [:Program] compUnits [?y]
At this point, one might be concerned about the efficiency of our variable binding mecha-
nism. It would be more efficient to compute bindings in a single pass, like in classical compiler
construction. Nevertheless, it is very convenient for prototyping to declaratively specify the
binding rules like we did, by translating the specifications of the language to concise edge
predicates. Our implementation is workable as it stands, and yet improvements are possible,
for instance by specifying additional edges for storing binding information in intermediate
nodes such as blocks.
We conclude the description of the name lookup rules with a pictorial overview of the
lookup process in Figure 6.2. The declarations potentially visible at a point x are returned
in a meaningful order. We first look at members of the direct enclosing class C0,0 of x . Then,
we inspect all inherited members in the chain of superclasses of C0,0, i.e. in all C0,k , first
with k = 1, then with k = 2, and so on. Finally, we process recursively on the enclosing class
of C0,0 itself, that is C1,0. In our figure, the vertical axes represent the inheritance chains
while the horizontal axis represents the nesting chain. For instance, C1,0 is nested in C2,0.
Note that once we have started moving up in an inheritance chain to look for members, we
cannot move to an enclosing class of a superclass.
Figure 6.2: Ordered stream of declarations following first the chain of inheritance, then that
of nesting. [Diagram: from x in C0,0, lookup climbs the vertical inheritance axis through
C0,1, C0,2, ..., then moves along the horizontal nesting axis to C1,0, C2,0, and repeats.]
NameJava provides no support for access controls and interfaces. One might rightly
wonder how we would cope with these in our style of specification. The different rules of
accessibility can be modelled as filters on the stream of visible declarations as we did to
account for the context of a reference. Interfaces, however, bring in multiple inheritance.
As explained in Section 6.4 of the Java language specification [jls05], a class may have two
or more fields with the same simple name if they are declared in different interfaces and
inherited. In that case, it is not possible to refer to any of these fields by its simple name. In
lookupAllMembers, we would hence be careful not to retrieve members whose simple name
refers to more than one member in all superclasses.
6.1.3 Detecting conflicts and renaming
We may now turn to scripting the Rename Variable refactoring. We shall first limit ourselves
to a basic version of it where we reject the transformation in case of any conflict or variable
capture. Interestingly, that basic version is very similar to the Rename Variable script we
gave in Section 2.6.2, although the name binding rules of NameJava are much more complex
than those of the While language we used back there.
In both cases, we have defined the lookup edge of a variable reference x as the first match
in the flow of declarations potentially visible from x . We recall here the very simple definition
of lookup for While programs:
let edge lookup r:Var → ?dec =
  first ([r] treePred+ [?dec:VarDecl] & r.name == ?dec.name)
This is to compare to the lookup definition for NameJava:
let edge lookup x:SingleName → ?y =
  first ([x] lookupAll [?y] & getName x == getName ?y)
The complexity of the lookup is in fact hidden in the stream of potentially visible declarations.
In While, it suffices to climb up the tree of statements. In NameJava, that stream is defined
by carefully traversing classes along inheritance and nesting axes.
Therefore, we can detect variable captures exactly like we did for While programs, by
checking that the declaration to be renamed is not going to capture any existing variable
and that no existing declaration will capture the renamed variable. The full script reads as
follows:
let renameVariable program node newName =
  let dec = pick { ?d | [node] lookup [?d] B equals (node, ?d) } in
  if not isVariableDeclaration dec then
    error "Please choose a variable";
  if dec.name == newName then
    error "Please give a different name";
  if alreadyExists dec newName then
    error "Declaration already exists";
  let findFirst x =
    pick { ?y | [x] lookupAll [?y] &
                (newName == getName ?y | ?y == dec) } in
  let mayBeCaptured =
    { ?x | [program] child+ [?x:SingleName] &
           ?x.name == newName } in
  let needRename =
    { ?x | [program] child+ [?x:SingleName] lookup [dec] } in
  foreach x in mayBeCaptured do
    if findFirst x == dec then error "Variable capture";
  foreach x in needRename do
    if findFirst x != dec then error "Variable capture";
  foreach x in needRename do
    x.name ← newName;
  dec.name ← newName
The description of the core part for detecting variable capture can be found in Section
2.6.2. The only difference in this version is the additional check that no declaration with the
new name and under the same enclosing class already exists. We do not spell out the details
here. The definition of alreadyExists is also given in Appendix B.
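The two capture checks at the heart of the script can be modelled as follows (an illustrative Python sketch; the representation of references as ordered lists of visible declarations is an assumption for the model, not JunGL's actual data structures):

```python
def capture_errors(refs, dec, new_name):
    """Sketch of the two capture checks in renameVariable.

    refs: list of (ref_name, visible), where visible is the ordered list
    of (decl_name, decl_id) pairs from lookupAll; a reference currently
    binds to the first visible entry whose name matches ref_name.
    dec is the id of the declaration to be renamed to new_name."""
    def find_first(visible):
        # first declaration the reference would resolve to after renaming
        return next((d for (n, d) in visible if n == new_name or d == dec), None)
    errors = []
    for ref_name, visible in refs:
        if ref_name == new_name and find_first(visible) == dec:
            errors.append(ref_name)   # existing reference captured by the rename
        bound = next((d for (n, d) in visible if n == ref_name), None)
        if bound == dec and find_first(visible) != dec:
            errors.append(ref_name)   # renamed reference captured by another decl
    return errors
```

In the getI example, renaming j to i makes the reference i in the return statement resolve first to the renamed local declaration, so the model reports a capture just as the script does.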
We shall now try to resolve variable capture in order to minimise rejection.
6.1.4 Minimising rejection
Consider again the example program of Figure 6.1 and suppose that we wish to rename the
field g in class C to f . By doing so, we are hiding the field f of the superclass B of C . This
is a case of variable capture because f of B is actually referred to deeper in the program with
C .this .f . Let us trace what our previous script would do in that case. The reference to f in
C .this .f would be classified as a mayBeCaptured reference simply because it is named after
the new name we wish to give to g. Then it would be checked that, in the flow of declarations
potentially visible from the qualified reference C .this .f , the declaration of g in C does not
appear before that of f in B . Since this is the case, an error would be raised to prevent
the capture. Indeed, we cannot simply rename g to f in C because that would change the
binding of C .this .f to point to that renamed declaration instead of f in B .
Nonetheless, it is actually possible here to change the reference C .this .f to a more explicit
one, say ((A.B)C .this).f . In the remainder of this section, we describe how to implement
this process in JunGL.
The first thing to notice is that any reference qualified with a this or super access is of the
form ((〈Y 〉)〈X 〉.this).f where 〈X 〉 and 〈Y 〉 are both optional qualified type names, and f is
a variable name. In a surrounding class B that extends A, any qualified access of the form
B .super .f can always be replaced with ((A)B .this).f . In addition, any qualified reference,
whose receiver is a general expression, is of the general form ((〈Y 〉)〈expression〉).f where 〈Y 〉
is again any optional qualified type name, and 〈expression〉 is any access of that same form
or of the previous form.
From this observation, we shall amend our initial script to rewrite any reference that is
endangered with variable capture, instead of rejecting the transformation. There are two
different rewrite cases.
CHAPTER 6. SCRIPTING REFACTORINGS 123
Self references The first kind of rewrite proceeds on any reference that is either unqualified
or qualified by a this or a super access. We call them self references for short. Let d be the
declaration node of a field f , and x a self reference to f (i.e. of the form f or A.this .f for
instance). We shall get rid of the qualifier of f (because it is not explicit enough) and rebuild
a new access of the form ((〈Y 〉)〈X 〉.this). Therefore, we need to instantiate the types X and
Y that allow us to refer to d in the context of x . Figure 6.3 shows how to find such types.
The class Y is the direct enclosing class of d , C is the direct enclosing class of x , and X is
both an enclosing class of x and a subclass of Y .

Figure 6.3: Finding X and Y for building the access ((〈Y 〉)〈X 〉.this).
In JunGL, we can find C , X and Y with the following path query:
[x] enclosingClass [?C] enclosingClass* [?X]
    (super ; lookup)* [?Y] bodyDecls [d]
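Read operationally, this query chains single edge steps with reflexive-transitive closures. The sketch below evaluates the same pattern on an explicit graph encoding; it is only an illustration of the semantics, and all node names ("x", "C", "X", "Y", "d") are hypothetical stand-ins mirroring Figure 6.3.

```python
# Minimal sketch: a program graph as a dict of labeled edges, and an
# enumeration of all (C, X, Y) bindings for the path query above.
from itertools import chain

edges = {
    "x": {"enclosingClass": ["C"]},   # x sits directly in class C
    "C": {"enclosingClass": ["X"]},   # C is nested in X
    "X": {"super": ["Y"]},            # X is a subclass of Y
    "Y": {"bodyDecls": ["d"]},        # Y's body declares d
}

def step(node, label):
    return edges.get(node, {}).get(label, [])

def closure(node, labels):
    """Reflexive-transitive closure along any of the given edge labels."""
    seen, todo = [], [node]
    while todo:
        n = todo.pop(0)
        if n in seen:
            continue
        seen.append(n)
        todo.extend(chain.from_iterable(step(n, l) for l in labels))
    return seen

def find_C_X_Y(x, d):
    """All (C, X, Y) matching
       [x] enclosingClass [?C] enclosingClass* [?X] (super;lookup)* [?Y] bodyDecls [d]."""
    for C in step(x, "enclosingClass"):
        for X in closure(C, ["enclosingClass"]):
            for Y in closure(X, ["super", "lookup"]):
                if d in step(Y, "bodyDecls"):
                    yield (C, X, Y)

print(list(find_C_X_Y("x", "d")))  # [('C', 'X', 'Y')]
```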
Type references We now need to build a name to access the class declarations X and
Y from the context of x . One might think that it is always safe to build the fully qualified
name of the class, but in Java, and also in NameJava, the context might prevent us from
referring to a class with its fully qualified name. Take the example below:
package a;
class C {
  class A {
  }
  class B {
    class a {
    }
    class A {
    }
    {
      C.A x;
    }
  }
}
In the declaration of x , it is not possible to write a.C .A because a would resolve to the class
in that context, not to the package. Similarly, it is not possible to simply write A because
that would reference the closest class A. Thus, we need to be careful when building such
type accesses. We sometimes even have to reject the transformation if no valid access can be
built. Arguably, this is a flaw in the design of Java, and the problem could easily have been
avoided. In C# for instance, one can always refer to a member in the global namespace by
qualifying it with global::.
Suppose we wish to build a type access to the class Y . The idea is again to write a
path query to find an enclosing class E of Y which is itself visible from the context of x .
To test for visibility, we check that the first visible class or package declaration that has the
same name as E is E itself. If we cannot find any E that is visible from x , then we have to
reject the transformation. The function buildTypeReference is as follows. The first part for
checking visibility uses an auxiliary function lookupScopeFrom. The second part for building
the actual access uses the foldr function, which is standard in functional programming:
let lookupScopeFrom x name =
  pick { ?s |
    first ([x] allTypesOrPackages [?s] & getName ?s == name) }

let buildTypeReference x c =
  let es = pick { ?es |
    first ([c] enclosingScope* [?es] &
           ?es == lookupScopeFrom x (getName ?es)) } in
  if es == null then
    error ("Cannot build type access for " + c.name)
  else
    let chain = toList { ?ic |
      [c] enclosingScope* [?ic] enclosingScope+ [es] } in
    let esRef = new SingleName { name = getName es } in
    List.foldr
      (fun node ic → new DotName {
        left = node,
        right = new SingleName { name = getName ic }
      })
      esRef chain
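The final fold deserves a note: starting from the reference to es, it wraps one DotName around the access per intermediate class, yielding a dotted type access such as C.A. The sketch below models that step only, flattening the DotName/SingleName constructors to plain strings for brevity; the names and the chain ordering are illustrative assumptions.

```python
# Hedged sketch of the fold in buildTypeReference: es_name is the name of
# the outermost visible scope, chain the intermediate class names ordered
# from es inwards. The result is the dotted access the script would build
# as a nested DotName AST.
from functools import reduce

def build_type_reference(es_name, chain):
    # one "DotName" per intermediate class, modelled as string concatenation
    return reduce(lambda node, ic: node + "." + ic, chain, es_name)

print(build_type_reference("C", ["A"]))       # C.A
print(build_type_reference("E", ["B", "C"]))  # E.B.C
```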
Note that we are sometimes rejecting too much. Indeed, we might reject the transfor-
mation if we cannot build the type access for Y in the qualified reference ((〈Y 〉)〈X 〉.this).f
although it would have been possible to build a type access for Y ′, a subclass of Y such
that the reference ((〈Y ′〉)〈X 〉.this).f is also valid. In an improved version, we could actually
incorporate the test for building the type access inside the path query that finds X and Y .
Foreign references The second kind of rewrite is simpler. It applies to any qualified refer-
ence whose receiver is a general expression, but not a self reference. We call such a reference
a foreign reference. In that case, we shall cast the original receiver with an appropriate type
name 〈Y 〉 and build ((〈Y 〉)〈expression〉). Let d be the declaration node of a field f , and x a
foreign reference to f (i.e. of the form ((B)a).f or A.this .b.f for instance). We simply have
to cast the receiver of f in x to the enclosing type Y of d and build a type access for Y .
Again, we might reject the transformation if we cannot build such a type access.
Before:

package a;
class A {
  class B {
    C f;
  }
  class C extends B {
    int g; // rename to f
    class B {
    }
    class D {
      {
        C x = f.f;
      }
    }
  }
}

After:

package a;
class A {
  class B {
    C f;
  }
  class C extends B {
    int f;
    class B {
    }
    class D {
      {
        C x = ((A.B)((A.B)C.this).f).f;
      }
    }
  }
}

Figure 6.4: Rename Variable scenario successfully handled by our script.
Concluding example We have explained the mechanisation of Rename Variable with
careful checks and little rejection. The full script in Appendix B (for name lookup and the
version of Rename Variable that tries to minimise rejection) is five pages long. To conclude
the section, we illustrate in Figure 6.4 what our script does on a small but tricky example.
Of course, one could argue that the way we deal with variable hiding is undesirable because
the resulting code might sometimes be much less readable. In our view, this objection comes
more under coding style and best practices, and such concerns could also be checked with
JunGL.
6.2 Extract Interface
We shall now change the focus of our discussion to a different kind of refactoring
transformation: those that alter the type structure of an object-oriented program. Perhaps the
most popular example of such a type-based refactoring is Extract Interface.
In the mechanised version of that refactoring, one selects a class from which to extract
a new interface, chooses a name for the new interface and decides on the members to pull
up there. The tool then automatically creates the new interface with the chosen name and
member signatures and makes the original class implement the new interface. This is fairly
straightforward and available as is in Visual Studio 2005 for instance. To illustrate, we give
in Figure 6.5 an example transformation of a C# program: a new interface IContainer is
created with methods void Put(int) and int Get(), and Singleton is made to implement this
new interface.
Before:

class Singleton {
  private int e;
  public void Put(int i) {
    e = i;
  }
  public void Put(Singleton s) {
    e = s.Get();
  }
  public int Get() {
    return e;
  }
}

After:

interface IContainer {
  void Put(int i);
  int Get();
}
class Singleton : IContainer {
  private int e;
  public void Put(int i) {
    e = i;
  }
  public void Put(Singleton s) {
    e = s.Get();
  }
  public int Get() {
    return e;
  }
}

Figure 6.5: Example of Extract Interface in its simple version.
Generalising declared types Frank Tip et al. have proposed a more advanced variant
of Extract Interface, and a rigorous method for automating it, where they attempt to change
the type of each declaration involving the refactored class to use the newly-created interface
[TKB03]. This enhanced version is motivated by the observation that not updating these
declarations leads to overspecific variable declarations, which is not good object-oriented
design. In our example of Figure 6.5 for instance, it would be safe (and desirable) to change
the type of the parameter s in the second method to IContainer . Indeed, the only method
that is called on s is the method Get which has been pulled up to the type IContainer . The
mechanisation of that process consists of three steps: generate a set of type constraints from
the source code, solve the constraints to find the upper bound of each type variable, and
modify the type references that can be generalised. Eclipse supports that more advanced
version of Extract Interface.
Aim and outline of the script We have shown for Rename Variable how to specify the
name lookup rules of a subset of Java with very few expressions. For Extract Interface,
we are however interested in how JunGL scales up to include many more constructs of a
mainstream language. The aim is to express name and type lookup for all these constructs,
together with the type constraints they imply. During Extract Interface, it then suffices to
collect the relevant type constraints, solve these constraints externally and use the results to
modify the original program.
Our presentation of the script shall be much more succinct than that of Rename Variable.
Extract Interface was indeed implemented in JunGL by Arnaud Payement for a large subset
of C# and one can refer to [Pay06] for full details. In this section, we briefly mention the
key ingredients of the automation. We start by describing informally the object language
and the static-semantic information required for Extract Interface. Then, we illustrate type
constraints and explain how they are collected using JunGL. Finally we discuss very briefly
the technique used to solve the constraints and incorporate back the results in the original
program.
6.2.1 The object language
The object language considered here is a substantial subset of the C# 2.0 language [ecm06].
That subset notably includes non-trivial features such as generics or structs. We do not
spell out all the details of the abstract grammar of C#, but wish to give an overview of the
data types. One particularity is the support for both source code and libraries. Indeed, to handle
realistic programs, we need to have access to namespaces, types and members declared in
external .NET assemblies, which we shall model as compilation units for simplicity. Hence,
a compilation unit shall be either a source file or an assembly:
type
  CompilationUnit =
  | SourceUnit = { usings : Using list;
                   members : NamespaceMemberDecl list }
  | Assembly = { members : NamespaceMemberDecl list }
A source unit and an assembly both encapsulate a list of namespace member declarations.
Such a declaration introduces either a namespace or a type member. A type member decla-
ration is either a type declaration, a field declaration or a callable declaration. In turn a type
is either a class, a struct, an interface or a type parameter. All these kinds of declarations
are represented with the following data types:
NamespaceMemberDecl =
| NamespaceDecl = { name : string;
                    members : NamespaceMemberDecl list }
| MemberDecl = (
  | TypeDecl = (
    | ConcreteTypeDecl = (
      | ClassDecl = ...
      | StructDecl = ...
    )
    | InterfaceDecl = ...
    | TypeParamDecl = ...
  )
  | FieldDecl = ...
  | CallableDecl = ...
)
A callable is either a method or a constructor. We shall come back to them and to
statements when we discuss the automation of Extract Method. For now, we focus on sup-
porting expressions, as they are involved in most of the type constraints required for Extract
Interface. The following data types represent the different kinds of expressions we support:
type
  Expression =
  | ObjectCreateExpr = { typeRef : TypeRef;
                         arguments : MethodArgument list }
  | ArrayCreateExpr = { typeRef : TypeRef;
                        qualifiers : Qualifier list }
  | MethodInvokeExpr = { target : Expression;
                         arguments : MethodArgument list }
  | ArrayAccessExpr = { target : Expression;
                        qualifiers : Qualifier list }
  | MemberAccessExpr = { target : Expression;
                         entityRef : EntityRef }
  | Reference = (
    | EntityRef = { name : string; typeArgs : TypeRef list }
    | ThisRef
    | BaseRef
    | TypeRef = { path : NamespacePath; qualifiers : Qualifier list }
  )
  | AssignExpr = { left : Expression; operator : AssignOperator;
                   right : Expression }
  | BinaryExpr = { left : Expression; operator : BinaryOperator;
                   right : Expression }
  | PrefixExpr = { operator : PrefixOperator;
                   target : Expression }
  | PostfixExpr = { target : Expression;
                    operator : PrefixAndPostfixOperator }
  | ParenthesisExpr = { target : Expression }
  | PrimitiveExpr = (
    | StringLiteral = { value : string }
    | CharLiteral = { value : string }
    | IntegerLiteral = { value : string }
    | RealLiteral = { value : string }
    | Null | False | True
  )
and
  MethodArgument = { direction : ParamDirection; target : Expression }
As one can see, this is a substantial set of constructs. One may wonder what the field direction
in the data type MethodArgument is for. It simply indicates the passing mode of the argument, and
we shall come back to that when we discuss Extract Method. An important construct that is
not apparent here is the ability to cast an expression to a given type. In fact, cast
is defined as a prefix operator:
type
  PrefixOperator =
  | UnaryAdd | UnarySub | Not | OnesComplement
  | Cast = { typeRef : TypeRef }
  | PrefixAndPostfixOperator = (
    | Increment | Decrement
  )
In the end, we support a large subset of the language, and notably features that are often
considered harder to accommodate in the construction of compilers. The principal
features that we do not support are exceptions, labeled statements, multiple variable and
field declarations, parameter arrays, constructor initializers, operator declarations, indexer
declarations, delegates and partial classes. This seems to be a fairly large list, but most of
the constructs we cite here are specific to C# 2.0 and are not present in a language like Java,
apart of course from exceptions, labeled statements and multiple declarations. Omitting
these simplifies our next discussion of Extract Method, and we do not envision any difficulty
in supporting them.
6.2.2 Name and type lookup
The name and type lookup edges are built from about 50 sub-edges and predicates, by
naturally translating into JunGL the ECMA specifications of the language [ecm06]. The
name lookup edge declLookup links an entity reference or a method call to its definition.
Definitions of that edge are therefore similar to what we have expressed for NameJava in
the previous section. On the other hand, the type lookup edge typeLookup links an entity
reference or a method call to the declaration of its type.
To wit, the name lookup edge emanating from a method call points to the most abstract
definition of the method that may be called, while the type of a method call is the return
type of the method that is invoked:
let edge typeLookup e : MethodInvokeExpr → ?t =
  [e] declLookup ; typeRef ; typeLookup [?t]
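The composition declLookup ; typeRef ; typeLookup can be read as chaining three partial maps over graph nodes: resolve the call to its method, follow the method's return-type reference, and resolve that reference to a type declaration. The sketch below models that reading on invented node names; it is an illustration of edge composition, not the JunGL evaluator.

```python
# Each edge is modelled as a partial map from nodes to nodes; ';' chains
# them, so the type of a method call is the type declaration reached via
# the resolved method's return-type reference. All names are hypothetical.
decl_lookup = {"call": "method_decl"}    # call site -> method definition
type_ref    = {"method_decl": "int_ref"} # method -> its return-type reference
type_lookup = {"int_ref": "Int32_decl"}  # type reference -> type declaration

def compose(*edges):
    def follow(node):
        for e in edges:
            node = e.get(node)
            if node is None:
                return None   # composition fails if any edge is missing
        return node
    return follow

call_type = compose(decl_lookup, type_ref, type_lookup)
print(call_type("call"))  # Int32_decl
```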
Another example of typeLookup is for resolving the type of literals. For example:
let edge typeLookup e : IntegerLiteral → ?t =
  [e] root ; systemClasses [?t] & ?t.name == "Int32"

let edge typeLookup e : Null → ?t =
  [e] root ; systemClasses [?t] & ?t.name == "Object"
The edge root climbs up to the root node of the program, that is, the node holding assemblies
and source units. As for the edge systemClasses, it points directly to the system classes in
the namespace System of the core .NET assembly.
There are other interesting aspects of the implementation of name and type lookup for C#.
More details can be found in [Pay06]. In particular, the definition of accessibility domains
follows the wording of the rules given in the specifications of the language [ecm06].
6.2.3 Generating type constraints
We now illustrate what kind of type constraints are generated. A constraint is composed
of two elements and one operator, which is either a strict subtyping, a subtyping, or a type
equality. We distinguish two different types of elements: variables and constants. Variables
are nodes that need to be typed while constants are types of the program.
As a brief example, we shall draw a few type constraints from our example program of
Figure 6.5. We write type variables in square brackets.
Code                              Constraints
class Singleton : IContainer      Singleton < IContainer
e = i                             [i ] ≤ [e]
e = s .Get()                      [s .Get()] ≤ [e]
                                  [s .Get()] = [int ]
                                  [s ] ≤ IContainer
return e;                         [e] ≤ [int ]
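The constraints in the table above could be encoded as plain triples of two elements and an operator. The sketch below is one illustrative encoding (not the script's actual representation), using the table's convention that bracketed elements are type variables and the rest are constant types:

```python
# Each constraint relates two elements by strict subtyping, subtyping,
# or type equality. The string encoding is an assumption for this sketch.
STRICT_SUB, SUB, EQ = "<", "<=", "="

constraints = [
    ("Singleton", STRICT_SUB, "IContainer"),  # class Singleton : IContainer
    ("[i]",       SUB,        "[e]"),         # e = i
    ("[s.Get()]", SUB,        "[e]"),         # e = s.Get()
    ("[s.Get()]", EQ,         "[int]"),
    ("[s]",       SUB,        "IContainer"),
    ("[e]",       SUB,        "[int]"),       # return e
]

def variables(cs):
    """Type variables are the bracketed elements; the rest are constants."""
    return {t for c in cs for t in (c[0], c[2]) if t.startswith("[")}

print(sorted(variables(constraints)))
```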
As we see, constraints are generated on the refactored program. Extract Interface has
indeed two distinct phases. We first create the new interface, pull up selected members and
make the original class implement the new interface. This corresponds to the naive version
of the refactoring. Then we look at type constraints on the refactored program to eventually
find a more general type for each declaration involving the refactored class.
Type constraints can be numerous even for a fairly small program. Therefore, we do
not want to generate all of them. We are only interested in those relevant to declarations
involving the refactored class. We can define an edge declarationPoint that finds all the
declarations whose type refers to a particular class declaration x :
let edge declarationPoint x : ClassDecl → ?s =
  ([?s : FieldDecl] | [?s : MethodDecl] | [?s : ParamDecl]
   | [?s : VariableDeclStmt] | [?s : ForEachStmt]) &
  [?s] typeRef ; typeLookup [x]
Once we have found all the declarations to be potentially generalised, we need to identify
any node in the program whose type constraints may involve these declarations. These are
all the declarations themselves, plus all expressions and statements containing a reference to
them:
let edge constraintPoint x : ClassDecl → ?p =
  [x] declarationPoint [?p]
  | ([?p : Expression] | [?p : Statement]) &
    [?p] child* [: EntityRef] declLookup [?d] &
    [x] declarationPoint [?d]
If x is the class declaration being refactored, we can then get a stream of constraints with:
{ buildConstraint ?s | [x] constraintPoint [?s] }
where buildConstraint would be a function that takes a node that is either a declaration, a
statement or an expression and builds a set of constraints similar to the example constraints
of the above table. One might want to look at the report of Payement [Pay06] for a full
description of type constraints, which are the adaptation for C# of the constraints given in
[TKB03] for Java.
6.2.4 Solving and transforming
The solving process is in the same vein as the work done on Soot, a Java bytecode opti-
misation framework, for efficiently inferring static types at the level of bytecode [GHM00].
The constraints collected using JunGL are turned into a graph whose nodes are elements of
the constraints and whose edges represent constraints themselves. Edges hence correspond
either to strict subtyping, to subtyping or to type equality, and are labeled accordingly. The
graph is then simplified through a succession of operations, namely collapsing of nodes and
transitive reduction, in order to find the upper bound of each type variable. Again, the
full solving process is thoroughly described in [Pay06] together with an optimisation for the
special problem of Extract Interface, where the set of constant types can be simplified to two
elements: one that represents the newly-created interface and another one that represents
any other type.
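To make the shape of this process concrete, here is a heavily simplified sketch: equality constraints collapse nodes (modelled here with union-find rather than the actual graph operations of [Pay06]), and subtype edges then yield, for each collapsed variable, its set of reachable constant upper bounds. All constraint elements are hypothetical.

```python
# Simplified constraint-graph solving: collapse equalities, then walk
# subtype edges to collect each variable's constant upper bounds.
from collections import defaultdict

def solve(eqs, subs, constants):
    parent = {}
    def find(x):                       # union-find with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in eqs:                   # collapse a = b
        parent[find(a)] = find(b)
    succ = defaultdict(set)
    for a, b in subs:                  # edge for a <= b
        succ[find(a)].add(find(b))
    def upper_consts(v):
        seen, todo, out = set(), [find(v)], set()
        while todo:
            n = todo.pop()
            if n in seen:
                continue
            seen.add(n)
            if n in constants:
                out.add(n)
            todo.extend(succ[n])
        return out
    return upper_consts

bounds = solve(eqs=[("[s.Get()]", "int")],
               subs=[("[s]", "IContainer"), ("[i]", "[e]"), ("[e]", "int")],
               constants={"IContainer", "int"})
print(bounds("[s]"))  # {'IContainer'}
```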
Currently, the constraint solver is external to JunGL and interfaced using external calls.
It could have been implemented using the ML features of JunGL, but a more interesting
future work would be to express constraints as predicates in JunGL. The work by Speicher
et al. with GenTL [SAK07] suggests that it can be done elegantly.
Once the constraints are solved, they are exploited as follows. We take all the elements
that have been collapsed into the node containing the freshly-created interface, say I , and
we change the type of their declaration to I . Any other declaration remains unchanged.
Note that we need to be careful when changing the type of a declaration. Indeed, we
cannot simply replace the type reference by the single name of the new interface, as another
member with the same name might be hiding the interface declaration. To illustrate, we
show in Figure 6.6 a flaw in Eclipse 3.3. Types of declarations are correctly generalised, but
type accesses are not properly updated. In the refactored field declaration, I points to the
wrong interface. It should have been a.I . Since we must account for the binding context of
the declaration, the solution is to reuse exactly what has been done in Rename Variable for
generating type references.
Before:

package a;
class A {
  class I {}
  A a = this;
}

After:

package a;
interface I { }
class A implements I {
  class I {}
  I a = this;
}

Figure 6.6: Issue with type references in Eclipse 3.3.
We have presented the main ingredients for automating Extract Interface in JunGL. Nat-
urally, other type-based refactorings can be mechanised, for instance to introduce generic
types [DKTE04, vDD04, FTK+05, KETF07] or to support class library migration [BTF05].
Most of these refactorings involve an analysis of the class hierarchy and require solving type
constraints as we have described here.
6.3 Extract Method
Let us now turn to the Extract Method refactoring. We have already described informally
what this refactoring is about in Chapter 1, notably by quoting the informal recipe that is
typically found in refactoring books, e.g. [Fow99]. We have also reported a few flaws in IDEs
like Visual Studio 2005 or Eclipse 3.3. The problem is either that no true control and data
flow analyses are performed, or that preconditions of the transformation are not correctly
implemented. As we shall see, these preconditions are fairly complex, and being able to give
a clear executable specification of them is one of the main benefits of JunGL.
Aim and outline of the script The aim here is to give a precise, concise and rigorous
specification of Extract Method. Notably, we shall use logical path queries to express the
control and data flow properties that are necessary for the correct automation of the trans-
formation. Our object language shall be the same as for Extract Interface, namely a large
subset of C#, but we focus this time mostly on functions and statements.
The remainder of the section is organised as follows. We first present the abstract
grammar of the statements that we shall consider, and describe how we super-impose the
control-flow graph of a method on its statement nodes. Only then do we turn to the implementation
of the refactoring. Its input is a name for the new method and two statement nodes in the
graph, namely the start and the end of the region to be extracted. There are four major
phases in the implementation, and we shall consider each in turn: checking the validity of
the selection, determining what parameters must be passed, where declarations should be
moved, and finally doing the transformation itself.
6.3.1 The object language
The object language is roughly the same subset of C# as for Extract Interface. Here we give
the data types for callables and statements. We start with callables:
type
  CallableDecl = (
  | MethodDecl = { name : string; modifiers : Modifier list;
                   parameters : ParamDecl list; block : Block }
  | ConstructorDecl = { name : string; modifiers : Modifier list;
                        parameters : ParamDecl list; block : Block }
  )
and
  ParamDecl = { direction : ParamDirection; typeRef : TypeRef;
                name : string }
and
  ParamDirection =
  | Value | Ref | Out
A callable is either a method or a constructor, and indeed we wish to allow the extraction
from a constructor too. A callable has a (possibly empty) list of parameter declarations
ParamDecl. Each of the parameters has of course a reference to a type and a name, but also
a parameter direction that indicates the passing mode of the parameter.
While the default passing mode is by value, C# also allows for two other modes, namely out
and ref. Output and reference parameter passing modes are used to allow a method to alter
variables passed in by the caller. The caller of a method which takes an output parameter
is not required to assign the variable passed as that parameter prior to the call; however,
the callee is required to assign the output parameter before returning. In a way, output
parameters are like additional return values of a method. In contrast, reference parameters
must be initially assigned by the caller, and therefore the callee is not required to assign them
before their use. In effect, reference parameters are passed both in and out of a method.
The presence of these alternative passing modes means that Extract Method
in C# is less likely to be rejected than the same refactoring in Java. Indeed, there are
more opportunities for handling parameters. Yet, the transformation has to account for all
three passing modes. We shall describe how we do that with JunGL in the remainder.
Other data types are important for the transformation, in particular the ones for repre-
senting different kinds of statements:
Statement =
| VariableDeclStmt = { modifiers : Modifier list; typeRef : TypeRef;
                       name : string; initializer : Expression }
| ExprStmt = { target : Expression }
| ReturnStmt = { target : Expression }
| BreakStmt
| ContinueStmt
| IfStmt = { condition : Expression; thenBranch : Statement;
             elseBranch : Statement }
| Loop = (
  | WhileStmt = { condition : Expression; body : Statement }
  | DoWhileStmt = { condition : Expression; body : Statement }
  | ForStmt = { initializer : Statement; condition : Expression;
                iterator : Expression; body : Statement }
  | ForEachStmt = { typeRef : TypeRef; name : string;
                    target : Expression; body : Statement }
)
| Block = (
  | EmptyStmt
  | BlockStmt = { statements : Statement list }
)
Note how we group all different kinds of loop statements under a common abstract data type
Loop. Similarly, we consider that empty statements and block statements are just Blocks.
Semantically, an empty statement is indeed just a block with an empty list of statements.
This allows us to simplify our coming reasoning on the control flow.
There is no support here for exceptions, labeled statements, goto statements, switch
statements, lock statements, using statements and anonymous methods. We do not express
the complete static-semantic rules of C# 2.0 required for the automation of Extract Method,
but we illustrate how one can fully accomplish it.
6.3.2 Control and data flow
For the mechanisation of Extract Method, we primarily rely on three sorts of static-semantic
information: name binding, control flow and data flow. We have addressed name binding
before, and we assume we can look up the declaration of a reference by following its declLookup
edge. We shall focus here on control and data flow only.
We recall that initially we have a raw syntax tree with no static-semantic information.
We must therefore define lazy edges to super-impose control-flow information on that tree.
We proceed like we did in our examples with the While language in Chapter 2. That is, we
first define two dummy attributes for the entry and exit of a callable:
type Entry
type Exit

let attribute callableEntry c : CallableDecl = new Entry {}
let attribute callableExit c : CallableDecl = new Exit {}
Then, we introduce an edge that links a statement to its exit statement:
let edge exit x : Statement → ?y =
  [x] listSuccessor [?y]
  B [x] parent [: Loop] continue [?y]
  B [x] parent ; exit [?y]
  B [x] parent ; callableExit [?y]
In most cases, the exit of a statement is simply the following statement in the list that
contains it. This is handled by the first disjunct of the B alternative. The last statement in
a loop, however, exits to the part of the loop that needs to be evaluated after each iteration.
This is what the second attempt expresses and we shall come back to the edge continue in
an instant. The third disjunct tries to exit from a block to the exit of its parent. Finally, if
none of the previous predicates succeeded, we exit the method or the constructor itself.
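This ordered-default behaviour (written B in the listing) can be modelled as a chain of alternatives where a later one is consulted only when all earlier ones produced no match. The sketch below captures that reading on a hypothetical two-statement example; it mirrors our description of the construct, not its actual implementation.

```python
# Ordered default: try each alternative in turn; a later alternative
# fires only when every earlier one came up empty.
def ordered_default(*alternatives):
    def query(node):
        for alt in alternatives:
            matches = alt(node)
            if matches:
                return matches
        return []
    return query

# Hypothetical mini-AST: "s0" precedes "s" in a list; "s" is the last
# statement of a while loop whose continue edge points to its condition.
list_successor = {"s0": ["s"]}
loop_continue = {"s": ["cond"]}

exit_edge = ordered_default(
    lambda n: list_successor.get(n, []),  # first disjunct: list successor
    lambda n: loop_continue.get(n, []),   # fallback: loop's continue edge
)
print(exit_edge("s0"))  # ['s']
print(exit_edge("s"))   # ['cond']
```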
The edge continue is defined for all different loops as follows:
let edge continue x : WhileStmt → ?y = [x] condition [?y]
let edge continue x : DoWhileStmt → ?y = [x] condition [?y]
let edge continue x : ForStmt → ?y = [x] iterator [?y]
let edge continue x : ForEachStmt → ?y = [x] target [?y]
After each iteration of a while or a do-while loop, the control flows to the condition of the
loop. In the case of a traditional for loop, it is however the iterator expression that should be
evaluated first. Finally, in the case of an enhanced foreach loop, we assimilate the invocation
of MoveNext on the target collection [ecm06] with the target itself.
Again, we only wish to give a taste of how to define the control flow and we do it at
the level of statements only. We give here a few examples. The successor of an expression
statement is the exit node of that statement (as defined above):
let edge cfsucc x : ExprStmt → ?y = [x] exit [?y]
The successor of a return statement is the dummy exit of the callable:
let edge cfsucc x : ReturnStmt → ?y =
  [x] parent+ [: CallableDecl] callableExit [?y]
This is partly because we do not support try-catch-finally clauses. If we were to handle
them, we would have to make any return statement enclosed in a try-catch block exit to the
corresponding finally clause. The successor of an if statement is its guard expression, because
we consider the statement itself as an intermediate meaningless node in the control flow:
let edge cfsucc x : IfStmt → ?y = [x] condition [?y]
The control-flow successor of a break statement is the exit of its enclosing loop:
let edge cfsucc x : BreakStmt → ?y =
  first ([x] parent+ [: Loop] exit [?y])
In contrast, the successor of a continue statement is the part of the loop that needs to
be executed after each iteration:
let edge cfsucc x : ContinueStmt → ?y =
  first ([x] parent+ [: Loop] continue [?y])
To express dataflow properties, we shall need information about variables that are used
and defined in each statement or expression. A use edge links a statement or an expression
to the variables that are read during its execution. Dually, a def edge relates a statement
or an expression to the variables that it writes. We shall give here a couple of definitions
only, for handling the different parameter passing modes. A method argument x uses all the
variables found in its expression if its passing direction is not out:
let edge use x : MethodArgument → ?y =
  ![x] direction [: Out] & [x] target ; use [?y]
On the other hand, a method argument x defines a variable declared in ?y if its passing mode
is either out or ref:
let edge def x : MethodArgument → ?y =
  ([x] direction [: Out] | [x] direction [: Ref])
  & [x] target [: EntityRef] declLookup [?y]
To conclude, we also define a useOrDef edge which is shorthand for the union of the two
former edges:
let edge useOrDef x → ?y = [x] use [?y] | [x] def [?y]
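The use/def rules for method arguments can be summarised in a few lines of ordinary code: an argument reads the variables in its expression unless it is an out argument, and it writes its target variable when passed out or ref. The sketch below is only a model of these two edges; the Arg structure is a hypothetical stand-in for the JunGL graph nodes.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Arg:
    direction: str          # "value", "ref" or "out"
    target_vars: List[str]  # variables occurring in the argument expression

def use(arg):
    # an out argument is write-only, so it uses nothing
    return [] if arg.direction == "out" else arg.target_vars

def defines(arg):
    # only out/ref arguments write their (variable) target
    return arg.target_vars if arg.direction in ("out", "ref") else []

a = Arg("out", ["x"])
b = Arg("value", ["y", "z"])
print(use(a), defines(a))  # [] ['x']
print(use(b), defines(b))  # ['y', 'z'] []
```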
6.3.3 Checking validity
We now turn to specifying preconditions of the transformation. The refactoring will first
need to check that it is a valid selection: for instance, one can only extract a block of code
into a method if it is single-entry single-exit. These are the usual conditions: the start
node dominates the end node, the end node post-dominates the start node, and the set of
cycles containing the start node is equal to the set of cycles containing the end node. These
conditions are easily expressed in terms of path patterns like we did in Chapter 2. For
example, here is the definition of dominates:
let dominates entryNode startNode endNode =
    Stream.isEmpty
        { () | [entryNode]
               (local ?z : cfsucc [?z] &
                ?z != startNode)*
               [endNode] }
It takes three parameters: the entry node of the method or constructor that contains the
block, the start node of the block, and the end node of the block. By definition, the start
node dominates the end node if all paths from the entry node to the end node pass through
the start node. The predicate
[entryNode]
(local ?z : cfsucc [?z] & ?z != startNode)*
[endNode]
signifies a path none of whose elements after the entry node equals the start node. We hence
require that no such path exists, by testing that the above set is empty. The function isEmpty is simply
defined as:
let isEmpty s = pick s == null
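The same check can be phrased as plain graph reachability: startNode dominates endNode exactly when endNode cannot be reached from the entry while avoiding startNode. A Python sketch, assuming the control-flow graph is given as an explicit successor map rather than lazy cfsucc edges:

```python
def path_avoiding(succ, source, target, avoid):
    """Is there a path source ->* target on which every node after
    the source differs from `avoid`? (Mirrors the starred pattern.)"""
    seen, stack = set(), [source]
    while stack:
        n = stack.pop()
        if n == target:
            return True
        if n in seen:
            continue
        seen.add(n)
        for m in succ.get(n, []):
            if m != avoid:      # steps may never land on `avoid`
                stack.append(m)
    return False

def dominates(succ, entry, start, end):
    """start dominates end iff no entry-to-end path avoids start."""
    return not path_avoiding(succ, entry, end, avoid=start)
```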
Other similar checks are required. The control-flow graph indeed lacks some scoping
information, and therefore we also need to check that the selection does not straddle different
scopes.
6.3.4 Inferring parameters
When we have verified that the selection is indeed amenable to method extraction, the next
task is to determine what the parameters of the method should be, and what results must
be returned. We shall consider different sets of parameters for the different passing modes.
Those parameters are chosen among the local variables that are used and defined in the
selection.
We start by describing how we compute that set of local variables. The JunGL script
that follows is an excerpt of the full script given in Appendix C. The node outerEndNode is
the direct successor of endNode, i.e. the first node which follows the selection but is not in
the selection.
let selectionStatements = { ?s |
    [startNode] (local ?z : cfsucc [?z] &
                 ?z != outerEndNode &
                 ?z != exitNode)* [?s] } in
let predicate mayUseOrDefInSelection(?x) =
    isIn(?s, selectionStatements) & [?s] useOrDef [?x] in
let variables = { ?x |
    mayUseOrDefInSelection(?x) &
    ([?x : VariableDeclStmt] | [?x : ParamDecl]) }
As we see, we first select the statements that are contained in the selection, namely the
statements reachable from startNode without going through outerEndNode. Then we define
one local predicate: mayUseOrDefInSelection(?x) holds for variables that are used or defined
inside selection statements. Finally, we compute the stream variables by restricting the
variables for which mayUseOrDefInSelection holds to local variable declarations and parameters.
We now turn to classifying variables. A variable x in variables will become a value
parameter if the following conditions are satisfied:
• x is live upon entry in the extracted block, that is, it may be used in the selection, and
it is not redefined before it is used. The condition that x may be used is obvious; if x
is always redefined before such a use, there is no need to pass it as a parameter, as its
value can be computed locally in the extracted method.
• It is not the case that x may both be redefined in the selection, and used before it is
redefined after the selection. If x is live at the end of the selection, but not redefined
in the selection, it is fine to pass it by value.
We can thus compute the set of value parameters as follows:
let valueParams =
    { ?x | isIn(?x, variables) &
           mayUseBeforeDefInSelection(?x) &
           !(mayDefInSelection(?x) &
             mayUseBeforeDefAfterSelection(?x))
    }
The predicates used here have again an elegant definition in JunGL. To illustrate, consider
mayUseBeforeDefAfterSelection(?x). This predicate holds if there is a path from the end
node to a use of x with no intervening definition of x . A node u uses x if it has a user-defined
lazy edge labeled use to x . Similarly, an intervening node z does not define x if it has no
lazy edge labeled def to x .
let predicate mayUseBeforeDefAfterSelection(?x) =
    [outerEndNode]
    (local ?z : [?z] cfsucc & ![?z] def [?x])*
    [?u] use [?x]
  | [method] parameters [?x] direction [?:!Value]
Note that this definition also deals, thanks to the second disjunct, with the possibility for a
use outside the callable method where the extraction occurs, namely when x is a non-value
parameter. Like outerEndNode, the node method is retrieved at the beginning of the script.
All details are exposed in Appendix C.
We now consider when a variable x should become an output parameter of the extracted
method. Here the specification consists of three conjuncts:
• First, there exists a potential use without prior definition of the variable x after the
selected statements: without such a potential use, there is no point in returning x as a
result of the method.
• Second, there should be no use of x before a definition of x in the selection itself. If
there was such a use, it would not be sufficient to pass x merely as an output parameter:
its initial value is important too.
• Third, x must actually be defined in the selection. If it were not, then the result of
the refactoring would not be compilable, because C# requires all output variables to be
definitely assigned.
In summary, we can define the set of output parameters as follows:
let outParams =
    { ?x |
      isIn(?x, variables) &
      mayUseBeforeDefAfterSelection(?x) &
      !mayUseBeforeDefInSelection(?x) &
      mustDefInSelection(?x)
    }
Again, the definitions of these predicates are all straightforward in JunGL, and the details
can be found in Appendix C.
At this point, we have precisely defined what should be the value and output parameters
of the extracted method. It remains to define the reference parameters. At first glance, one
might say that any variable in the selected block that is not a value or output parameter is
a reference parameter. Such a criterion would however be much too crude. Some variables
will just be local to the selection, and such variables do not need to be passed as parameters
at all. They will become local variables of the extracted method body. A more accurate
definition of the set of reference parameters is therefore as follows:
let refParams =
    { ?x |
      isIn(?x, variables) &
      (mayUseBeforeDefInSelection(?x) |
       (mayDefInSelection(?x) & !mustDefInSelection(?x))) &
      mayUseBeforeDefAfterSelection(?x) &
      !isIn(?x, valueParams) &
      !isIn(?x, outParams)
    }
That is, either x may be used before it is redefined in the selection, or x is only potentially
defined in the selection; moreover, x may be used before it is redefined after the selection,
and it is not already a value or output parameter.
It is interesting to work out the effect of these definitions on an example such as the
one of Figure 1.1. For convenience, we recall here the full method and the selection under
consideration:
public void F(bool b)
{
    int i;
    // from
    if (b)
    {
        i = 0;
        Console.WriteLine(i);
    }
    // to
    i = 1;
    Console.WriteLine(i);
}
Clearly b is classified as a value parameter. But what about i? As explained in the introduc-
tion, the bug in Visual Studio was that i became an output parameter (and being the only
such parameter, in fact the method result). In our definition, that is prevented by the final
conjunct in the definition of out because we have
!mustDefInSelection(i)
Note that we also don’t get i as a value parameter because there is a definition before its
use in the selection. Finally, it does not become a reference parameter because it is defined
before being used after the selection. We conclude that according to our definition, i does
not become a parameter at all.
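To make this classification concrete, here is a Python sketch (not JunGL) that models the predicates as reachability over a hand-built control-flow graph. It uses illustrative node names and omits the extra disjunct for non-value parameters of the enclosing method:

```python
def may_use_before_def(succ, source, x, uses, defs, within=None):
    """Path from source to a use of x such that no node from which a
    step is taken defines x; optionally restricted to `within`."""
    seen, stack = set(), [source]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if within is not None and n not in within:
            continue
        if x in uses.get(n, ()):
            return True
        if x in defs.get(n, ()):
            continue              # a definition blocks the path here
        stack.extend(succ.get(n, []))
    return False

def must_def_in_selection(succ, start, outer_end, x, defs):
    """Every control-flow path from start to outer_end defines x."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n == outer_end:
            return False          # a def-free path escaped the selection
        if n in seen or x in defs.get(n, ()):
            continue
        seen.add(n)
        stack.extend(succ.get(n, []))
    return True

def classify(succ, start, outer_end, selection, variables, uses, defs):
    """Split `variables` into value, output and reference parameters."""
    value, out, ref = set(), set(), set()
    for x in variables:
        use_in = may_use_before_def(succ, start, x, uses, defs, selection)
        use_after = may_use_before_def(succ, outer_end, x, uses, defs)
        may_def = any(x in defs.get(n, ()) for n in selection)
        must_def = must_def_in_selection(succ, start, outer_end, x, defs)
        if use_in and not (may_def and use_after):
            value.add(x)
        elif use_after and not use_in and must_def:
            out.add(x)
        elif (use_in or (may_def and not must_def)) and use_after:
            ref.add(x)
    return value, out, ref
```

Run on a CFG for the method of Figure 1.1, this classifies b as a value parameter and leaves i unclassified, as argued above.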
6.3.5 Placing declarations
Having decided on the parameters of the extracted method, we now turn to placing declara-
tions for its local variables. In doing so, we consider three cases: declarations that must be
moved out of the selection, declarations that must be moved into the selection, and finally
those that need to be duplicated. We discuss each of these in turn.
A declaration needs to be moved out of the selected block if it is declared there, and if it
is used or defined outside the selection:
let needDecMoveOut =
    { ?x |
      decInSelection(?x) &
      mayUseOrDefOutOfSelection(?x)
    }
Conversely, if a declaration does not occur in the selected block, it is defined or used in that
block, and it is not a parameter, then the declaration should be moved into the extracted
method’s body:
let needDecMoveIn =
    { ?x |
      isIn(?x, variables) &
      !decInSelection(?x) &
      !isIn(?x, valueParams) &
      !isIn(?x, outParams) &
      !isIn(?x, refParams)
    }
Finally, there are the declarations that must be duplicated. This can happen because the
use of a variable in the selection is in fact independent of the use of the variable outside the
selection: effectively, we can split the variable into two independent ones. The declarations
in question are defined by:
let needDecDuplication =
    { ?x |
      isIn(?x, needDecMoveIn) &
      mayUseOrDefOutOfSelection(?x)
    | isIn(?x, needDecMoveOut) &
      !isIn(?x, valueParams) &
      !isIn(?x, outParams) &
      !isIn(?x, refParams)
    }
To wit, either the declaration of x needs to be moved into the extracted method's body (as we
have just defined it), but there are also uses and/or definitions outside the selection; or the
declaration of x needs to be moved out, but x is not passed as a parameter to the new method.
Again, let us return to Figure 1.1 and see what happens to the variable i . Because it
is not a parameter of any kind, but it occurs in the selection and it is not declared in the
selection, i will be a member of needDecMoveIn. However, note that because it also occurs
after the selection, it will in fact be classified as a declaration that needs duplication: the
two uses of i , inside and outside the selection, have been correctly separated.
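The three placement sets are just set algebra over the earlier predicates; a Python sketch, with all names illustrative:

```python
def place_declarations(variables, decl_in_selection, used_or_def_outside,
                       value_params, out_params, ref_params):
    """Compute the move-out, move-in and duplication sets of declarations."""
    params = value_params | out_params | ref_params
    # declared in the selection, but also touched outside it
    move_out = {x for x in decl_in_selection if x in used_or_def_outside}
    # touched in the selection, declared elsewhere, and not a parameter
    move_in = {x for x in variables
               if x not in decl_in_selection and x not in params}
    # needs both a declaration inside and one outside the new method
    duplicate = ({x for x in move_in if x in used_or_def_outside} |
                 {x for x in move_out if x not in params})
    return move_out, move_in, duplicate
```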
6.3.6 Transforming
Armed with all the necessary information, we can now actually perform the required trans-
formation of creating a new method. This is, in fact, the least interesting part of the code:
all that needs to be done is to reconstruct the relevant portions of the graph.
As a small example fragment, consider the operation of inserting a new statement before
an existing one:
let insertStatementBefore n s =
    if not Utils.isEmpty { ?b | [s] parent [?b : BlockStmt] } then
        insertBefore n s
    else let block = new BlockStmt in
        replaceWith s block;
        block.statements ← [n; s]
First we check whether s is itself in fact part of a sequence in the AST. If so, we simply add
n as the left-hand sibling of s . If not, however, we first need to create a new block statement,
which replaces s in the AST; both n and s become descendants of this new block statement.
The functions insertBefore and replaceWith are built-in functions of JunGL to manipulate
the syntax tree of the program. There are also insertAfter, detach for detaching a node from
its parent and clone for cloning a subtree. Note that it is not necessary to define control-flow
edges (cfsucc) on the new block statement, because we defined these to be lazy, so they will
CHAPTER 6. SCRIPTING REFACTORINGS 144
public void F(bool b){
int i ;// fromi f (b){
i = 0 ;Console . WriteLine ( i ) ;
}// toi = 1 ;Console . WriteLine ( i ) ;
}
public void F(bool b){
int i ;NewMethod(b ) ;i = 1 ;Console . WriteLine ( i ) ;
}
private void NewMethod(bool b){
int i ;i f (b){
i = 0 ;Console . WriteLine ( i ) ;
}}
Figure 6.7: Correct refactoring of Figure 1.1.
be automatically constructed when necessary. We need, however, to recompute the edges
that may have been invalidated by the transformation. Currently, we do not provide any
support for incremental evaluation, and therefore, we flush all edges after each refactoring.
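The insert-before operation can be sketched in Python over a toy AST; the classes and helpers here are hypothetical stand-ins for JunGL's built-in insertBefore and replaceWith:

```python
class Stmt:
    def __init__(self, label="stmt"):
        self.label, self.parent = label, None

class BlockStmt(Stmt):
    def __init__(self, statements=()):
        super().__init__("block")
        self.statements = []
        for s in statements:
            self.append(s)
    def append(self, s):
        s.parent = self
        self.statements.append(s)

class IfStmt(Stmt):
    def __init__(self, then):
        super().__init__("if")
        self.then = then
        then.parent = self

def replace_with(old, new):
    """Replace old by new in whichever child slot of old's parent holds it."""
    p = old.parent
    for name, val in list(vars(p).items()):
        if val is old:
            setattr(p, name, new)
    new.parent = p

def insert_statement_before(n, s):
    """Insert n before s, wrapping both in a fresh block when s's
    parent is not already a statement sequence."""
    p = s.parent
    if isinstance(p, BlockStmt):
        n.parent = p
        p.statements.insert(p.statements.index(s), n)
    else:
        block = BlockStmt()
        replace_with(s, block)
        block.append(n)
        block.append(s)
```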
We return once again to the example of Figure 1.1. Figure 6.7 shows the result of applying
this refactoring in our own tool. Note that at present, we do not detect whether the selected
block contains any instance references, so as yet we only make the extracted method static if
the original method was itself static; it would, however, be very easy to add that improvement.
In the exposition above we have assumed that the original program compiles without errors.
Of course in practice it is very common to apply refactorings to programs that cannot be
compiled for subtle reasons such as the definite assignment rule of C# (which states that every
local must be initialised before it is used). In such cases, the refactoring should, at best,
preserve the compilation errors in the result of the transformation. By refining our predicates,
it would be fairly easy with JunGL to conservatively transform such slightly faulty input
programs.
6.4 Summary and references
We have discussed JunGL scripts for specifying three refactorings very different in nature:
Rename Variable, Extract Interface and Extract Method.
Automatically renaming a variable requires variable binding information and the ability to
detect potential conflicting declarations of variables with a similar name. We have modelled
name lookup with streams of visible declarations. This allows us to check, via a simple
traversal of a stream, whether a variable reference is in the scope of a variable definition with
a similar name. Whenever variable capture occurs, we try not to reject the transformation
and instead add a more explicit qualifier to variable references so as to avoid their capture.
Properly handling name bindings is crucial in many program transformations. Strangely,
there is little work in formalising name visibility and binding semantics for mainstream
languages in which binding rules are numerous and complex. Perhaps the closest work is
that of Vorthmann on modelling and specifying name binding rules for Ada via visibility
networks [Vor93]. Our streams of potentially visible declarations are akin to such networks.
Of course, compilers for mainstream languages have to implement those complex binding
rules. JastAddJ is a full compiler for Java 5 using the attribute grammar system JastAdd
[EH07]. There, binding rules are expressed as a set of attributes. The general mechanism was
illustrated in [EH06] on a non-trivial subset of Java [jls05]. It is actually that subset plus the
additional support for this , super and casts that we have considered for our implementation
of Rename Variable.
The idea of introducing an access of the complex form ((〈Y 〉)〈X 〉.this).f is due to Schäfer
et al. [SEdM08]. In their work, Y is called the source and X the bend. Their approach for
resolving X and Y is, however, very different. Their framework for renaming is based on
JastAddJ and they express the computation of accesses by inverting lookup attributes in a
systematic way. Consequently, their framework is easily extensible to new constructs with
new lookup rules: one simply needs to define, for each new lookup rule, a corresponding rule
for the access computation.
Extract Interface is a type-based refactoring which alters the type structure of a program.
It consists of two phases: collecting type constraints over the original program and solving
them to find out whether some variables can be given the type of the newly introduced inter-
face. Payement has followed the approach first introduced in [TKB03] for Java and adapted
the type constraints to a large subset of C# 2.0 [ecm06]. A report of the implementation of
Extract Interface with JunGL is available in [Pay06].
Currently, the constraint solver is external to JunGL and it would be interesting to
express constraints as predicates, as has been discussed in [SAK07]. However, the fact that
many other type-based refactorings have been proposed [DKTE04, vDD04, BTF05, KETF07]
suggests that such a constraint solver could also be a built-in functionality.
Finally, Extract Method is a low-level refactoring requiring control and data flow informa-
tion about the program. There are four phases: checking validity of the selection, determining
what parameters must be passed, where declarations should be moved, and finally doing the
transformation itself. Control and data flow information is used in the first two steps. We
gave a first account of the automation of Extract Method using JunGL in [VEdM06]. At that
time, the language was quite different, as we were computing static-semantic information
via lazy functions rather than edge predicates. The preconditions of the refactoring and the
classification of parameters were, however, the same.
Of course we are not the first to attempt a precise description of Extract Method. Griswold
and Notkin in [GN93], and Fowler in his book [Fow99] gave quite detailed recipes, but
unfortunately no precise hint for mechanising the transformation. One noteworthy work is
that of Ralf Lämmel in [Lam02] towards language-parametric refactoring based on the
Strafunski style of functional strategic programming in Haskell [LV03]. There, the refactoring for
extracting an abstraction, such as a method, is phrased in a generic manner and instantiated
for different languages, notably Haskell and Java (or rather JOOS, a subset of Java). The
approach is very appealing for its genericity, but the instantiated version of Extract Method
for JOOS is not precise enough as there is no account for dataflow. It is only checked that
the block to extract does not contain a return statement (since a return will lead to a dif-
ferent control flow once placed in another method), and that there are no assignments to
non-instance variables declared outside the block to extract (since it would be difficult to
propagate these side effects). On the other hand, Juillerat et al. have described how to
better track dataflow dependencies [JH07]. They have implemented in Eclipse, in about 1000
lines of code, an improved version of Extract Method for a large subset of Java. They do not
explain, however, how to place declarations correctly. To our knowledge, we are the first to
give a complete, concise and executable specification of the core part of Extract Method.
Chapter 7
Discussion and future work
We conclude this thesis with a summary and an overview of related work. In particular, we
compare JunGL to existing tools and languages that are most closely related to it. We also
give hints on interesting future work. Some falls into integrating well-understood ideas from
other tools to make JunGL an end-to-end solution beyond a prototype. Other future work
is more challenging, such as the automatic verification of some correctness properties of our
scripts, or the incremental evaluation of edges and predicates.
7.1 Summary
We summarise here the contributions and results of this thesis, from the design of JunGL
and Ordered Datalog to the specification of complex refactoring transformations.
Design of the language We identified the need for a language to script refactoring trans-
formations. New refactorings are proposed all the time, and yet even common examples like
Rename or Extract Method are incorrectly implemented in leading development environments.
We exposed the requirements for such a scripting language. It should provide functional fea-
tures to easily manipulate the AST of the object program and allow the computation of
static-semantic information that is crucial for expressing refactoring preconditions. To fa-
cilitate reasoning on the transformations, scripts should be very declarative. Therefore, we
ought to provide logical features to query the program tree and the static-semantic infor-
mation associated to it. We proposed a concrete, coherent design for such a language. Our
proposal, named JunGL, has three principal features: stream comprehensions, path queries
and lazy edges for seamlessly maintaining static-semantic relationships between program en-
tities. Stream comprehension is the glue between the logical and the functional parts of
JunGL scripts. Path queries are a special kind of predicates to concisely express complex
graph queries. Combined with user-defined lazy edges, they enable the elegant expression of
long-distance relationships in the program tree, such as a type reference to its declaration.
Furthermore, we briefly described our implementation of JunGL on top of the .NET platform
using both C# and F#, as well as the toolkit around the language for quickly prototyping
refactoring transformations and, more generally, semantic-aware editors.
Logical constructs Most parts of our scripts rely on logical constructs. Predicates, edges
and path queries enable the concise expression of
• static-semantic information (e.g. name lookup, type lookup, control flow) which is
computed in a demand-driven manner when a transformation requires it,
• code queries for finding program entities of interest during a refactoring, and
• program analyses as preconditions of a refactoring.
All logical constructs translate to a novel variant of Datalog, called Ordered Datalog, which
returns query results in a deterministic order. Ordered Datalog gives control over the order
of results and preserves the meaningful order of entities in a program. Furthermore, it
enables the expression of computations in an elegant compositional way. We showed, for
instance, how to model name lookup in a Java-like language as a stream of potentially visible
declarations. By taking the first declaration that matches the name of a reference, we get
the declaration for that reference. The approach is elegant, and quite generic. We can model
name lookup for radically different languages in the exact same manner, and hence propose a
generic script for correctly detecting variable capture while renaming a variable. Our Rename
Variable script for the toy language While introduced in Chapter 2 is indeed similar to that
of the more complex language NameJava of Chapter 6 which supports nested classes and
inheritance.
Ordered Datalog We explained in Chapter 3 the least fixpoint semantics of Datalog and
showed that it coincides with a simple operational semantics based on relational algebra,
where each predicate is interpreted as a set relation. Nonmonotonic constructs need to be
handled carefully. The class of safe Datalog programs is defined with the static restriction
that no predicate depends negatively on itself. Such programs can hence be arranged as a
collection of strata that must be evaluated in topological order, each stratum being itself a set
of mutually recursive predicates. In contrast to the classical set-based semantics of Datalog,
Ordered Datalog manipulates sequences, thus encoding a precise order at each intermediate
step of the query. We redefined relational operators to operate on duplicate-free sequences,
and studied the consequences on monotonicity and program stratification. Next, we proved
an important property of Ordered Datalog, namely that it is a refinement of normal Datalog.
Yet, we saw that neither stratified Datalog nor stratified Ordered Datalog is sufficiently
expressive for our needs. In particular, there is a common pattern in the computation of
certain static-semantic information that requires negating a recursive predicate call. To
overcome this issue, we introduced the new class of partially stratified programs. This class of
programs is a subset of the well-known class of modularly stratified programs, but it highlights
an interesting evaluation mechanism inspired from the top-down set-based Query-Subquery
approach. When a call to a non-stratified rule is reached, the context is split to generate
several partial reductions of the called predicate. Those partial reductions being stratified,
they can be evaluated further with a set-based evaluation. Not so incidentally, our top-down
evaluation mechanism enables the computation of edges in a demand-driven manner, i.e.
only when their value is needed for the evaluation of a query. That lazy mechanism is further
enhanced by the fact that duplicate-free sequences are encoded as streams.
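As a rough illustration of the sequence-based relational operators, here is a Python sketch of union and difference on duplicate-free sequences, keeping the first occurrence of each element in order. This is a simplification of the actual Ordered Datalog operators, which work on lazily evaluated streams:

```python
def ordered_union(xs, ys):
    """Union of two duplicate-free sequences; elements keep the
    position of their first occurrence, so xs's order dominates."""
    seen, out = set(), []
    for v in list(xs) + list(ys):
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

def ordered_difference(xs, ys):
    """Remove the elements of ys from xs, preserving xs's order."""
    drop = set(ys)
    return [v for v in xs if v not in drop]
```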
Evaluation We validated the design of JunGL through a number of non-trivial refactoring
scripts on substantial subsets of languages like Java and C#. In particular, we described the
important steps of Rename Variable and Extract Method and demonstrated how some bugs
in mainstream development environments are easily discussed and avoided by expressing
the refactorings in JunGL. The scripts are attached in Appendices B and C. In view of the
complexity of the refactorings they address, we find their small size very encouraging. In fact,
the only verbosity lies in the construction of code fragments and in the destructive updates
of the program tree. Those transformation parts of our scripts are indeed less declarative
than their equivalent in term rewrite systems. Besides rewrite rules, perhaps another missing
feature of our design is quotation for object programs. We discuss these two features in
future work. JunGL has proved very successful, however, for expressing all other parts of
the scripts. In particular, we were able to express concisely and elegantly the computation of
static-semantic information, such as name binding and control flow, which are usually hard
to accommodate within existing transformation systems.
7.2 Related work
Rigorous refactoring We are by no means the first to realise the need for a formal,
precise approach to refactoring. In their PhD theses, both Opdyke and Roberts insisted
on the importance of preconditions and postconditions for refactoring transformations to
ensure that a transformed program is always semantically equivalent to the original [Opd92,
Rob99]. Naturally, one cannot guarantee full behaviour preservation while refactoring real-
world programs, as there are always features that are not tractable (e.g., concurrency or
dynamic class loading).
Therefore, most rigorous specifications of refactorings rely on reasonable assumptions
and focus on certain properties to preserve during the refactoring. For instance, in our
specification of Rename Variable, we assume our transformation to be correct if it preserves
name bindings, i.e. if each variable reference points to the same declaration before and after
the transformation.
Similarly, specifications of type-based refactorings, which alter the type structure of a pro-
gram, mostly focus on maintaining type-correctness and on preserving bindings. For instance,
changes to the declared types of method parameters should account for the static nature of
overloading resolution to ensure that the program behaviour is not affected. Based on such
reasonable assumptions, type-based refactorings have been precisely defined for mainstream
languages like Java [TKB03, BTF05]. Some of them even deal with introducing generic types
[DKTE04, vDD04, FTK+05, KETF07].
Provably-correct refactorings Other works, however, try to formally prove the complete
correctness of refactorings on simpler languages using program refinement calculi. In his
PhD thesis, Cornelio formalises a large collection of refactorings as algebraic refinement rules
[Cor04] for ROOL, a Refinement Object-Oriented Language. In the tradition of refinement
calculi, the formal semantics of ROOL are based on weakest preconditions, from which can
be derived a set of programming laws. These programming laws are then used to prove that
a refactoring transformation is indeed behaviour preserving.
Ettinger takes a similar approach in his PhD thesis, in which he develops a theoretical
framework for slicing-based behaviour-preserving transformations and derives refactorings
that have never been mechanised before [Ett06]. His language also has a formal semantics
based on weakest preconditions. It is, however, restricted to imperative constructs, as the
focus of his work is exclusively on statement-level refactorings that deal with control and
data flow.
Both theses address only simple languages, because any formal development would hardly
be manageable otherwise. However, they are clearly inspirational in specifying refactorings
for mainstream languages. In particular, they give invaluable insight into the correct
preconditions of a refactoring.
An attempt to cope with more complex languages is to mechanise the verification of
refactorings. Sultana and Thompson have shown how to perform the verification of different
refactorings for untyped and typed lambda-calculi in the proof assistant Isabelle/HOL [ST08].
Using an interactive theorem prover has several benefits. First, it keeps track of all the
details to be proved. Second, the formal development can be used to automatically extract
the implementation of the refactoring. As with any fully-formal work, however, verifying
non-trivial refactorings requires discharging a considerable amount of proof obligations. This
again restricts the scope of the work.
Garrido and Meseguer have followed yet another approach. They use Maude, an algebraic
specification language, to specify and verify refactorings for Java with no concurrent features
[GM06]. Their specification builds on previous work in which Maude is used to formalise the
semantics of Java. Their implementation appears to be very concise, but the refactorings
they currently verify are very local. Nonetheless, the approach is very encouraging and,
hopefully, should scale to less local and more complex refactorings. In the same vein, Junior
et al. have built on the work of Cornelio and used CafeOBJ, another algebraic specification
language, to encode the programming laws of ROOL and verify refactorings [JSC07].
Finally, Bannwart and Müller have addressed the problem of proving refactorings correct
by introducing an explicit I/O model to ensure that the original and refactored programs are
externally equivalent [Ban06, BM06]. They specify the preconditions of various refactorings
for a subset of Java and give a formal proof that any application of a refactoring preserves
the external behaviour of the program, provided that the program satisfies the correctness
conditions of that refactoring. A peculiarity of their approach is the way they add correctness
conditions as assertions onto the refactored program. These contracts can then be checked
at runtime or statically using a program verifier such as Boogie or ESC/Java. Interestingly,
such specifications, and more generally any contract specification, ought to be refactored too
when refactorings are applied. This issue is addressed in [GFT06] by Goldstein et al. who
explain how to account for contracts in refactoring. In particular, they show how contracts
should be modified when code changes and how contracts may prevent certain changes.
Composition of refactorings It is widely accepted that complex refactoring transforma-
tions can be built from low-level primitive transformations. Opdyke was the first to make this
observation in his thesis [Opd92] and gave a set of useful primitives for refactoring object-
oriented programs, such as create an empty class or change a member function name. Later,
Roberts introduced additional postconditions for the composition of primitive transforma-
tions into high-level refactorings, and showed how to derive the precondition of a composite
refactoring from the preconditions of its components [Rob99].
More recently, Kniesel and Koch have improved Roberts’ approach by providing a formal
model for the static composition of refactorings [KK04], i.e. in a program-independent way.
In their model, each basic transformation is accompanied by a forward description and its
dual backward description, that act as predicate transformers. A forward description takes a
condition that holds before the transformation and returns the condition that will hold after
it. Conversely, a backward description takes a condition that holds after the transformation
and yields a joint precondition of the transformation. These descriptions are then used to
automatically infer the joint precondition of a chain of refactorings, thus allowing users to
correctly compose arbitrarily many refactorings from basic transformations. To validate their
approach, the authors have implemented a prototype framework where “basic” operations like
RenameField(class, field, newName), AddInterface(name), or Extract Method(class, method,
parameterType, sel, newName) are assumed to be hard-coded. As we have seen in this thesis,
these primitive transformations are in fact very difficult to mechanise correctly.
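The predicate-transformer view of backward descriptions can be illustrated with a small sketch. The following Python toy models conditions as sets of facts and derives the joint precondition of a chain by folding backward descriptions; the encoding of RenameField and all fact names are our own illustrative assumptions, not Kniesel and Koch's actual rules.

```python
# Toy model of backward descriptions as predicate transformers.
# Conditions are sets of facts; the RenameField encoding below is
# an illustrative assumption, not Kniesel and Koch's framework.

def rename_field_backward(old, new):
    """Backward description of RenameField(old -> new): rewrite any
    post-condition fact about the new name into one about the old
    name, and add the transformation's own precondition."""
    def backward(post):
        rewritten = {("exists_field", old) if f == ("exists_field", new) else f
                     for f in post}
        return rewritten | {("exists_field", old), ("not_exists_field", new)}
    return backward

def joint_precondition(chain, post=frozenset()):
    """Derive the joint precondition of a composite refactoring by
    propagating the final condition backwards through the chain."""
    for backward in reversed(chain):
        post = backward(post)
    return post

chain = [rename_field_backward("size", "count"),    # step 1
         rename_field_backward("count", "length")]  # step 2
pre = joint_precondition(chain)
# The composite needs "size" to exist and both other names to be free.
```

Folding backwards makes the intermediate name "count" disappear from the precondition: it is required to be free before the chain runs, but its existence between the two steps is an internal detail.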
Kniesel and Koch’s work is hence complementary to ours. Their framework deals with
the composition of primitive transformations into much bigger refactorings, while JunGL ad-
dresses the complex mechanisation of those primitive transformations that require compiler-
like analyses at source level. Having implemented some of these transformations, we can
confidently say, however, that some of them could in fact be decomposed into yet smaller
primitives, thus promoting their reusability. This is notably the case of Extract Method, which
appears to be the composition of other atomic transformations for encapsulating the selec-
tion into a block, moving variable declarations in and out, and extracting this block to a
new method. By introducing new temporary abstractions into the language, one could even
break the last transformation into a first step that creates an inner method, and a second
step that lifts that inner method to the level of the class.
We have not addressed such decomposition carefully with JunGL, as we do not yet support
the backward propagation of our preconditions. One way to provide such support would
be to adopt Kniesel and Koch’s approach and require that each primitive transformation be
annotated with backward descriptions. A much more challenging route would be to infer
these descriptions from the scripts themselves. This directly relates to the verification of
some correctness properties of our scripts, which we discuss in future work.
Specifications of compiler optimisations We indicated that the design of JunGL heav-
ily borrows from the literature on declarative specifications of compiler optimisations. In
particular, our use of path queries can be traced back to the design of Gospel by Whitfield
and Soffa [WS97]. Gospel has a similar feature, but the dataflow facts are hard-coded in
the implementation, whereas in JunGL they are user-definable via lazy edges. The idea to
achieve that flexibility via a form of logic programming augmented with path expressions
originated in Lacey and De Moor’s work [LM01, DdMS02]. A separate branch of research,
instigated by Lacey, is the formal verification of compiler optimisations that are specified
in this style [LJVWF02]. Lerner et al. have demonstrated how to automate such proofs
[LMC03, LMRC05].
A completely different approach to scripting compiler optimisations was proposed by Ol-
mos and Visser in [OV02]. There, the optimisations are rewrites of the syntax tree expressed
in the term rewriting system Stratego [BKVV06]. Rewrite rules are usually context-free,
meaning that they only have access to the term to which they are applied. Stratego
extends the formalism of term rewriting both with programmable rewriting strategies and
scoped dynamic rewrite rules [BvDOV06]. Programmable rewriting strategies enable the
combination of simple rewrite rules into complex transformations and provide control over
the application of rules, by defining the order in which rewrite rules should be applied.
Such strategies can be used to carry a data structure with contextual information, but they
do not provide any direct answer to the issue of context-sensitivity for the computation of
static-semantic information, such as name binding. A better approach is the use of scoped
dynamic rewrite rules, which allows the definition of new rewrite rules at run-time. These
rewrite rules may indeed access information from the context in which they are defined and
propagate it to the location where they are applied. As a small example, the operation
?Let(x, e1, e2)
; rules( Substitute : Var(x) -> e1 )
matches a Let construct that binds x to e1, and defines a new rewrite rule that replaces any
variable reference to x with e1. Here, the substitution should be valid in the expression e2
only, and Stratego therefore provides constructs for controlling the lifetime of any dynamic
rule. The technique is effective as it has been used to develop a frontend for Java 1.5.
Nonetheless, we feel the specification of static-semantic information is less declarative than
in JunGL and sometimes difficult to express. Furthermore, as soon as a semantic analysis
involves a graph structure such as the control flow graph of a program, it is yet harder to
express. On the other hand, for the description of the transformation itself, i.e. the change to
the object program, the situation is reversed since rewriting strategies provide a much neater
formalism than our destructive updates.
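The behaviour of a scoped dynamic rule like the Substitute rule above can be sketched in a few lines of Python. The AST encoding, the `substitute` traversal, and the decision to inline the binding away are our own illustrative choices, not Stratego's implementation; the point is only that the rule defined at a Let node is valid solely inside its body and is shadowed by inner bindings.

```python
# Sketch of a Stratego-style scoped dynamic rule: visiting Let(x, e1, e2)
# defines a Substitute rule rewriting Var(x) to e1, in scope only while
# traversing e2. The tuple-based AST encoding is illustrative.

def substitute(term, env):
    """Apply the dynamic Substitute rules currently in scope."""
    kind = term[0]
    if kind == "Var":
        return env.get(term[1], term)       # rewrite if a rule is in scope
    if kind == "Let":                       # Let(x, e1, e2)
        _, x, e1, e2 = term
        inner = dict(env)
        inner[x] = substitute(e1, env)      # rule defined at the Let...
        return substitute(e2, inner)        # ...lives only inside e2
    # generic traversal for any other construct
    return (kind,) + tuple(substitute(t, env) if isinstance(t, tuple) else t
                           for t in term[1:])

prog = ("Let", "x", ("Num", 1),
        ("Add", ("Var", "x"), ("Let", "x", ("Num", 2), ("Var", "x"))))
print(substitute(prog, {}))   # the inner Let shadows the outer rule
```

Copying the environment on entry to a Let is what models the controlled lifetime of the dynamic rule: the outer rule is restored automatically once the inner scope is left.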
Graph rewriting for program transformations The idea of declarative specifications
of refactorings via graph transformations was first put forward by Tom Mens in [MDJ02].
The refactorings considered there are different variants of moving class members. Their
specification is purely declarative, as a graph rewrite system. A big advantage of using graph
rewrite systems is that it becomes possible, for example, to detect conflicting refactorings
[MTR05]. The main difference with our work is that none of the refactorings require dataflow
analysis. It would be interesting to see whether Mens’s techniques scale up to the full-blown
refactorings of [TKB03].
An earlier attempt to use graph rewrite systems for specifying program transformations
is the Optimix system by Aßmann [Aßm98]. Optimix can be used to generate program
analyses and transformations. Interestingly, its input language is based both on Datalog and
on two classes of graph rewrite systems: edge addition rewrite systems (EARS) and more
general graph rewrite systems (GRS). EARS rules are used to add new edges to the program
graph. They are therefore quite similar to edge definitions in JunGL, and besides both can
be translated to Datalog. The added edges of an EARS rule correspond to the head of a
Datalog rule, while the tested edges and nodes correspond to the rule body. Aßmann notes
that strong confluence of EARS and fixpoint semantics of Datalog are in fact related. On
the other hand, GRS rules are used for deleting and attaching subgraphs to the original
graph. Each GRS rule has a precondition, the left-hand side of the rule, which is a graph
pattern expressed in Datalog. In Optimix, the transformation of the program tree is hence
performed by repeatedly applying small rewrites to the graph, while in JunGL we have opted
for destructive updates à la ML. Again, our approach for manipulating the graph is hence
less declarative: one may want to include rewriting primitives to streamline that part of our
specifications.
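The correspondence between EARS rules and Datalog that Aßmann notes can be made concrete with a minimal sketch. In the Python toy below, the edge-addition rule "add reaches(x, z) where edge(x, y) and reaches(y, z)" is evaluated bottom-up to a fixpoint; the relation names and the naive evaluation strategy are illustrative assumptions, not Optimix's implementation.

```python
# Sketch of the EARS/Datalog correspondence: the added edge is the head
# of a Datalog rule, the tested edges are its body, and naive bottom-up
# evaluation runs the rule to a fixpoint (mirroring strong confluence).

def fixpoint(edges):
    reaches = set(edges)            # reaches(x, y) :- edge(x, y).
    while True:
        # reaches(x, z) :- edge(x, y), reaches(y, z).
        new = {(x, z) for (x, y) in edges for (y2, z) in reaches if y == y2}
        if new <= reaches:
            return reaches          # no rule can add an edge: fixpoint
        reaches |= new

g = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(fixpoint(g)))          # transitive closure of g
```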
Optimix was used to express various compiler optimisations, along with the information
necessary to perform those optimisations. Its rules are indeed expressive enough to compute
dataflow facts, such as use-def chains, which are crucial for implementing compiler optimisa-
tions. JunGL is therefore quite similar to Optimix, in the sense that it allows the computation
of auxiliary information required for a particular transformation. In Optimix, however, rules
relate to stratified Datalog. As we explained in Chapter 5, stratified Datalog is not expressive
enough for specifying the computation of static-semantic information such as name binding.
Compiler optimisations are typically performed on an intermediate representation output by
the compiler frontend. Therefore it is reasonable to assume symbolic names at that stage.
On the other hand, refactoring transformations are performed at the level of source code,
thus making name lookup a crucial component in their automation.
GraphStratego is another example of graph rewriting for program transformations [KV06].
Kalleberg and Visser observed that some program analyses may be either difficult or unnat-
ural to express in term rewriting systems. Therefore, they extended Stratego with references
to represent structures that are inherently graph-like (typically the control flow of a program)
in a more natural way. One challenge of such an extension is to handle the termination of
graph traversals. In GraphStratego, the answer is the concept of phased traversal to guaran-
tee that each reference is only visited once. Phases are supported through the introduction
of new primitive strategies into the language. It is the programmer’s responsibility to em-
ploy the right strategy to ensure termination. For collecting information on a graph, the
approach is hence less declarative than in JunGL, where the evaluation of Datalog predicates
is guaranteed to terminate in any event.
Logic meta-programming As we said in the introduction, logic programming has been
proposed on many occasions for program analysis and code queries. Many of these proposals
are based on Prolog, and a few among them also address the issue of program transformations.
Perhaps the systems that resemble JunGL most are JTransformer [KHR07] and more
recently GenTL [AK07]. JTransformer, available as a plug-in for Eclipse, combines a query
and a transformation engine for Java. It represents the AST of a program as a Prolog
database, which can then be queried with Prolog queries. The other system, GenTL, extends
JTransformer to allow concrete syntax patterns containing meta-variables. In contrast with
other code querying tools of that sort, e.g. JQuery [JV03], the underlying source code can be
transformed via Conditional Transformations (CTs). By first specifying the pattern to match
and then the transformation, CTs allow a clear separation between the use of pure Prolog
for the querying part, and impure functions for the transformation. This is, in a way, similar
to JunGL where we forbid updates and creation of new values in the querying parts. CTs
are more organised, however, and may be seen as rewrite rules. JunGL differs from GenTL
in being based on Datalog rather than Prolog. Yet, this is just a subtle implementation
detail if GenTL users restrict themselves to pure Prolog in the matching part of CTs and
if tabled resolution is used [War92]. A bigger difference between the two systems is the
support by GenTL of concrete syntax which enhances the readability of code queries. On
the other hand, although GenTL has been used for specifying refactorings, the computation
of static-semantic information does not seem to be part of the scripts. It is certainly possible
to express name bindings in Prolog, but we believe that edges and path queries as sugar for
predicates make the expression of this kind of information much more elegant.
Another example of logic meta-programming for program transformations is DeepWeaver,
a tool supporting cross-cutting program analysis and transformation components [FKI+07].
DeepWeaver operates at the bytecode level and provides a declarative way to access the
internal structure of methods, as well as control flow information. The design of DeepWeaver
is motivated by domain-specific optimisations. One example is the optimisation of database
calls by replacing a query of the form “select * from ...” by a more precise select statement
that retrieves only the columns that are actually accessed later in the execution. Like in
Optimix or GenTL, however, DeepWeaver assumes that some static-semantic information
about the object program is already available. This is, in the end, probably the biggest
difference between those systems and JunGL.
Attribute grammar systems Systems based on attribute grammars have proved very
successful in expressing static-semantic information. We have already mentioned in the intro-
duction the examples of the Synthesizer Generator [RT84] and of JastAdd [EH04]. Another
example is the Eli system for the flexible construction of compilers [GLH+92]. A particu-
larity of Eli is to allow the definition of attribution modules that can easily be reused. As
noted by Kastens and Waite in [KW94], attribute grammars can only be widely accepted
as a viable specification formalism if they can be decomposed into logical modules that can
be treated in isolation. JastAdd builds on the same observation, but one of its main addi-
tional strengths is its integration with a mainstream language. Attribute bodies are indeed
expressed in plain Java, which makes the system more widely applicable. JastAdd also builds
on reference attribute grammars to relate nodes in the program tree, for instance a variable
reference to its declaration. This makes it possible to encode long-distance dependencies in the AST,
pretty much like JunGL does with lazy edges. Finally, JastAdd supports circular attributes,
which comes in very handy to express control and dataflow properties. Ekman, the main
designer of JastAdd, used all these features to implement in a modular and clean formalism
JastAddJ, a full compiler for Java 1.4 and its extension for Java 5 [EH07].
Recently, Schäfer and Ekman have started to express refactorings on top of JastAddJ. In
particular, as we said in Chapter 6, they have designed a framework for sound and extensible
renaming for Java, where they re-qualify ambiguous accesses by inverting lookup attributes
in a systematic way [SEdM08]. Schäfer has also implemented our specifications of Extract
Method, using circular attributes to express dataflow properties. The result is less declarative
than the conditions in JunGL, as attributes in JastAdd are written in plain Java, but the
specification is still concise. In general, we believe the expression of attribute bodies in
Java makes the code less tractable compared to JunGL edges, but it has the advantage of
bringing in more flexibility. For instance, greatest fixpoints can be computed as well, whereas
in JunGL we are limited to least fixpoints. Furthermore, while JunGL is still pretty much
a research prototype, JastAdd is a mature tool that can be used in an industrial setting.
Interesting future work would be to extend JastAdd with logic features.
Perhaps the main advantage of logic programming over attribute grammars is that logic
programs can be run backwards. To illustrate, consider again the part of Rename Variable
where we look for all references to the declaration we wish to rename. JastAdd supports
collection attributes for gathering such uses when they are first encountered, thus minimising
the number of tree traversals. When the lookup edge is expressed as a logic program, however,
the computation can be reversed from the declaration to all uses. We are of course not the
first to express the computation of name bindings and other contextual information as logic
programs. This was indeed the idea behind Pan, an environment generator in the spirit of
the Synthesizer Generator [BGV92]. The formalism of Pan’s semantic descriptions is that of
logic constraint grammars, which combine logic programming and consistency maintenance.
A logic constraint grammar is a context-free grammar with Prolog-based goals attached
to the productions in the grammar. As with normal attribute grammars, however, the
evaluation does not terminate if it encounters circular dependencies. On the other hand,
JunGL guarantees termination by using Datalog, rather than Prolog. Furthermore, JunGL
provides much syntactic sugar to facilitate the expression of predicates, in particular via path
queries.
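The sense in which a logic program "runs backwards" is easy to sketch: once a binding relation is computed, it can be queried by either column. In the Python toy below, the `lookup` facts and all node names are made up for illustration; the point is that the same relation answers both the forward question (which declaration does this use bind to?) and the backward one that Rename Variable needs (which uses bind to this declaration?).

```python
# Sketch of running a logic program backwards: lookup(use, decl),
# computed once, answers queries in both directions. Facts are
# illustrative, not an actual JunGL program graph.

lookup = {("use1", "decl_x"), ("use2", "decl_x"), ("use3", "decl_y")}

def decl_of(use):
    """Forward direction: attribute-grammar style name resolution."""
    return next(d for (u, d) in lookup if u == use)

def uses_of(decl):
    """Backward direction: all references to a declaration."""
    return sorted(u for (u, d) in lookup if d == decl)

print(decl_of("use2"))
print(uses_of("decl_x"))
```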
Path queries The idea of path queries in the context of program transformations is due
to De Moor et al. [DdMS02, dMLVW03, SdML04]. For the version used in JunGL, we drew
inspiration from the syntax in [LS06], which followed on from the design in the work cited
above. A similar style of queries is of course very common in the literature on semi-structured
databases, e.g. [BFS00]. In [LRY+04], Liu et al. proposed parametric regular path queries,
which we have not yet introduced in the design of JunGL. If we were to support such path
queries, we would be able to define a new control-flow successor edge from a statement x and
parameterise it with the variable that is written when executing x:

let edge write x : Statement -(?v)-> ?y =
    [x] cfsucc [?y] & [x] def [?v]
We could then use write to collect, for instance, all variables that are written before exiting
a method:
{ ?v | [] write(?v) [:Exit] }
At first sight, it seems this new feature would have quite an impact on the underlying graph
representation of the program, since it allows for directed hyperedges, i.e. edges that have
more than one source and one target. In fact, in our framework, hyperedges would simply
be translated to relations of greater arity, and parameterised edge labels in path queries
converted to calls to these relations. This last comment naturally leads us to explore future
work.
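The translation of such a hyperedge to a relation of greater arity can be sketched as follows. In this Python toy, write(x, v, y) is derived as a ternary relation from binary control-flow and definition facts, and the query above becomes an ordinary conjunctive query over it; all node and variable names are invented for illustration.

```python
# Sketch: a parameterised edge write -(?v)-> becomes the ternary
# relation write(x, v, y), and the path query { ?v | [] write(?v) [:Exit] }
# becomes a conjunctive query over it. All facts below are made up.

cfsucc = {("s1", "s2"), ("s2", "exit"), ("s3", "exit")}   # control flow
defs = {("s1", "i"), ("s2", "n"), ("s3", "tmp")}          # def(x, v)
exits = {"exit"}

# write(x, v, y) :- cfsucc(x, y), def(x, v).   (hyperedge as 3-ary relation)
write = {(x, v, y) for (x, y) in cfsucc for (x2, v) in defs if x == x2}

# variables written on an edge into an Exit node
written_before_exit = {v for (x, v, y) in write if y in exits}
print(sorted(written_before_exit))
```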
7.3 Future work
Concrete syntax and rewriting We mentioned on several occasions that JunGL would
benefit vastly from concrete syntax. Some parts of our scripts indeed remain quite verbose,
in particular the one for creating new fragments of code. Queries on the structure of the
object program would also be much more readable and easier to write in concrete syntax.
Visser explained in [Vis02] all the benefits of concrete syntax over abstract syntax for meta-
programming. He showed with Stratego [BKVV06] how the syntax definition formalism SDF
[vdBHdJ+01] can be used to extend a language with elements of concrete syntax notation.
Rewrite rules in Stratego may accept concrete syntax patterns, enclosed in semantic brackets
to distinguish them from normal term patterns. Those syntax terms are then expanded
in-place by the Stratego compiler to their equivalent AST term patterns. Employing such
syntax terms results in more concise and much more readable rewrite rules.
The GenTL transformation language by Appeltauer and Kniesel also supports concrete
syntax [AK07]. The foundations of GenTL are not in term rewriting like Stratego, but in
logic programming like JunGL. Concrete syntax may be embedded in any predicate and
employed in any precondition of a Conditional Transformation. As said before, CTs are
quite similar to rewrite rules in the way they are applied. Both can match precise nodes
in the program tree and replace them with new code fragments, possibly embedding some of
the matched nodes. Clearly, such a mechanism would streamline the transformation parts of
our specifications, making them more declarative.
Stratego provides strategies to control the order of rule application and the traversal over
term structures. In our experiments, Ordered Datalog has always provided enough control
for transformations. Indeed, we have always been able to query nodes in an order that
was adequate to safely perform the destructive updates of the underlying tree. We believe,
however, that if JunGL were to support rewriting, strategies would become important, notably
to handle rule application failures.
Proving some correctness properties A more challenging avenue of future work is on
proving some correctness properties of our scripts. The declarative approach we adopt to
express static-semantic information and refactoring preconditions already provides a sound
basis for rigorous reasoning on the transformations. For instance, an important aspect in the
specification of Extract Method is the classification of local variables into different kinds of
parameters, namely those passed by value, those passed by reference and output parameters.
This classification is complex and yet crucial. An important property of the classification is
that no variable will be classified as two different kinds of parameter. It is easy to check, from
the definitions of valueParams, outParams and refParams, that this requirement is indeed
satisfied. Another desirable property, which is again quite easy to check, is that no variable
use will become orphaned, with no declaration to match it.
Of course, one may want to check more complex aspects of the transformations and
mechanise such checks. For instance, as we have already said, one may want to verify
statically that two transformations can be safely composed. We believe our formalism for
defining auxiliary information and expressing preconditions makes our scripts more tractable
than in other solutions, such as JastAdd where attributes are expressed in plain Java. The
path we have started to explore is the application of lightweight verification techniques using
Why, a verification condition generator back-end [Fil03, why07]. Why takes as input an
annotated program in HL, its internal language, and outputs proof obligations to be further
discharged by a proof assistant or an automatic decision procedure. HL is a small ML-like
language with imperative features, such as references, and annotations written in first-order
logic. With an appropriate memory model, we can express in HL the transformation parts of
our scripts. The translation does not require much annotation, except for the loop invariants,
preconditions and postconditions of our very few functions. Indeed, not all program points
need to be annotated, as Why uses a calculus of weakest preconditions to infer annotations
at most intermediate points. As for the logical parts, we have experimented with several
first-order axiomatisations of edge definitions, predicates and path queries.
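The weakest-precondition propagation that Why performs over straight-line code can be sketched very simply. The Python toy below treats conditions as strings and assignments as textual substitution into the postcondition; this is a minimal illustration of the calculus, not Why's actual machinery.

```python
# Minimal sketch of a weakest-precondition calculus: annotations at
# intermediate points are inferred by substituting backwards through
# assignments, so only the endpoints need to be annotated by hand.
import re

def wp_assign(var, expr, post):
    """wp(var := expr, post) = post[expr / var]  (textual substitution)."""
    return re.sub(rf"\b{var}\b", f"({expr})", post)

def wp_seq(stmts, post):
    """Propagate a postcondition backwards through a list of assignments."""
    for var, expr in reversed(stmts):
        post = wp_assign(var, expr, post)
    return post

# wp(x := x + 1; y := x * 2, "y > 4")
print(wp_seq([("x", "x + 1"), ("y", "x * 2")], "y > 4"))
```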
The big advantage of Why is to offer the use of multiple backend provers. Our first
attempt has been with the automatic decision procedure Simplify [Sim07]. Simple proof
obligations are easily discharged with Simplify but we found more complicated obligations,
notably involving transitive closure, much harder to discharge. Simplify may fail to prove an
obligation either because it is not true or because it is too difficult to prove. In both cases, we
have found it difficult to track down the reasons for a failed proof. One may wish to explore
this line of research further and take advantage of Why to generate verification conditions
for a proof assistant such as Coq [coq07].
Incremental evaluation Another challenging area of research is the incremental evalua-
tion of edges and predicates in JunGL. Usual refactoring scenarios occur in an interactive
development environment where the object program changes frequently. In addition to the
user edits, the refactoring transformations themselves may invalidate some of the semantic
information attached to the JunGL graph. As we explained in Chapter 2, lazy edges relieve
us of maintaining semantic information explicitly at every tree node. The information
is computed on-demand when it is required. Currently, however, we do not maintain that
information incrementally on every change. Instead, we flush the whole cache of lazy edges
at the end of each transformation, to ensure that no edge that ought to be invalidated and
recomputed is incorrectly reused in further refactorings.
Lazy edges translate to Ordered Datalog, but they share similarities with reference at-
tribute grammars. This is not too surprising as attribute grammars can be implemented
as logic programs [DM85]. Thus, to address the problem of incrementally maintaining lazy
edges in our program graph, one can build on the work of two research areas: the work done
on incremental evaluation of reference attribute grammars (e.g. [Hed91, Mad98, Boy02]),
and the work done on incremental evaluation of logic programs and maintenance of database
views (e.g. [DT92, SR05]).
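The current flushing strategy can be sketched as follows; this is a minimal illustration of on-demand caching with whole-cache invalidation, with made-up names, not JunGL's actual implementation.

```python
# Sketch of lazy-edge caching with whole-cache flushing: results are
# memoised on demand, and the entire cache is discarded after each
# transformation, the conservative alternative to incremental
# maintenance. Names are illustrative.

class LazyEdgeCache:
    def __init__(self, compute):
        self._compute = compute        # e.g. a Datalog-backed lookup
        self._cache = {}
        self.computations = 0

    def get(self, node):
        if node not in self._cache:    # computed on demand, then reused
            self._cache[node] = self._compute(node)
            self.computations += 1
        return self._cache[node]

    def flush(self):
        """Called at the end of every transformation: sound, but
        recomputes everything, even edges the change did not affect."""
        self._cache.clear()

cache = LazyEdgeCache(lambda n: n.upper())
cache.get("a"); cache.get("a")         # second call hits the cache
cache.flush()                          # end of a transformation
cache.get("a")                         # recomputed after the flush
print(cache.computations)
```

Incremental maintenance would replace `flush` with a targeted invalidation of only the cache entries whose derivations depend on the changed part of the graph.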
More program analyses Finally, one may want to specify more program analyses in
JunGL, in order to implement more complex refactorings that have not been fully automated
so far for mainstream languages. In [Ett06], Ettinger describes how to correctly automate
complex statement-level refactoring based on slicing, such as the Untangling refactoring we
proposed in [EV04]. To implement that kind of refactoring, one would first need to build
a slicing tool for a mainstream language. We have shown in Chapter 2 how to implement
a naive slicer using JunGL, for a toy imperative language with no pointers. One may now
want to encode a points-to analysis in JunGL before implementing a slicer for a Java-like
language.
Such program analyses can be expressed elegantly in JunGL. Logic programming lan-
guages, like Prolog and Datalog, have been proposed on several occasions to express static
program analyses in a natural and concise way [Rep93, DRW96]. Reps et al. showed in
[Rep93] how to use Datalog for on-demand interprocedural slicing. However, one issue in im-
plementing that kind of program analysis as logic programs has been performance. Indeed,
early logic-based implementations did not scale very well compared to traditional
implementations. More recent work on the use of Binary Decision Diagrams for program analysis
[LH04, BNL05] suggest that better performance can be achieved now. In particular, Whaley
et al. showed in [WACL05] how to encode a context-sensitive points-to analysis in Datalog
and evaluate it efficiently with BDDs. Without any change in the design of our language, we
believe such an alternative implementation would be very valuable to a wider applicability
of JunGL.
Appendix A
JunGL grammar
⟨letter⟩ ::= A..Z | a..z

⟨digit⟩ ::= 0..9

⟨Number⟩ ::= ⟨digit⟩+

⟨String⟩ ::= " ⟨any⟩? "

⟨Identifier⟩ ::= ⟨letter⟩ ( ⟨letter⟩ | ⟨digit⟩ | _ | ’ )?

⟨@Identifier⟩ ::= @ ⟨Identifier⟩

⟨?Identifier⟩ ::= ? ⟨Identifier⟩

⟨CompoundName⟩ ::= ⟨Identifier⟩?. ⟨Identifier⟩

Figure A.1: Lexemes, identifiers and compound names
⟨Program⟩ ::= ⟨TopLevelStatement⟩+

⟨TopLevelStatement⟩ ::= using ⟨CompoundName⟩+, { ⟨Statement⟩? }
    | ⟨Statement⟩

⟨Statement⟩ ::= ⟨Declaration⟩
    | do ⟨Block⟩

⟨Declaration⟩ ::= ⟨NamespaceDeclaration⟩
    | ⟨NodeTypeDeclaration⟩
    | ⟨LetDeclaration⟩

⟨NamespaceDeclaration⟩ ::= namespace ⟨CompoundName⟩ { ⟨Declaration⟩+ }

⟨NodeTypeDeclaration⟩ ::= type ⟨NodeTypeFragment⟩ ( and ⟨NodeTypeFragment⟩ )?

⟨NodeTypeFragment⟩ ::= ⟨Annotation⟩? ⟨Identifier⟩ [ = ⟨NodeTypeBody⟩ ]

⟨Annotation⟩ ::= ⟨@Identifier⟩ [ ( ⟨String⟩ ) ]

⟨NodeTypeBody⟩ ::= { ⟨FieldDeclaration⟩+; }
    | ( | ⟨NodeTypeFragment⟩ )+
    | ( ⟨NodeTypeBody⟩ )

⟨FieldDeclaration⟩ ::= ⟨Identifier⟩ : ⟨Type⟩

⟨LetDeclaration⟩ ::= let ⟨Pattern⟩ = ⟨Block⟩
    | let ⟨Identifier⟩ ⟨Pattern⟩+ = ⟨Block⟩
    | let rec ⟨Identifier⟩ ⟨Pattern⟩+ = ⟨Block⟩
    | let predicate ⟨Identifier⟩ ( ⟨?Identifier⟩?, ) = ⟨Predicate⟩
    | let rec predicate ⟨Identifier⟩ ( ⟨?Identifier⟩?, ) = ⟨Predicate⟩
      ( and ⟨Identifier⟩ ( ⟨?Identifier⟩?, ) = ⟨Predicate⟩ )?
    | let edge ⟨Identifier⟩ ⟨Identifier⟩ [ : ⟨CompoundName⟩ ] -> ⟨?Identifier⟩ = ⟨Predicate⟩
    | let attribute ⟨Identifier⟩ ⟨Identifier⟩ [ : ⟨CompoundName⟩ ] = ⟨Expression⟩

Figure A.2: Syntax of JunGL programs
⟨Block⟩ ::= ⟨Expression⟩+;

⟨Expression⟩ ::= ⟨SimpleExpression⟩
    | begin ⟨Block⟩ end
    | let ⟨Pattern⟩ = ⟨Expression⟩ in ⟨Block⟩
    | let ⟨Identifier⟩ ⟨Pattern⟩+ = ⟨Expression⟩ in ⟨Block⟩
    | let predicate ⟨Identifier⟩ ( ⟨?Identifier⟩?, ) = ⟨Predicate⟩ in ⟨Block⟩
    | if ⟨SimpleExpression⟩ then ⟨Expression⟩ [ else ⟨Expression⟩ ]
    | match ⟨SimpleExpression⟩ with ( | ⟨Pattern⟩ -> ⟨Expression⟩ )+
    | foreach ⟨Pattern⟩ in ⟨SimpleExpression⟩ do ⟨Expression⟩
    | ⟨SimpleExpression⟩ . ⟨Identifier⟩ <- ⟨Expression⟩

⟨SimpleExpression⟩ ::= true
    | false
    | null
    | ⟨Number⟩
    | ⟨String⟩
    | ⟨Identifier⟩
    | ⟨SimpleExpression⟩ . ⟨Identifier⟩
    | ⟨SimpleExpression⟩+
    | ⟨SimpleExpression⟩ ⟨InfixOperator⟩ ⟨SimpleExpression⟩
    | ⟨PrefixOperator⟩ ⟨SimpleExpression⟩
    | ⟨SimpleExpression⟩ is ⟨CompoundName⟩
    | fun ⟨Pattern⟩+ -> ⟨Expression⟩
    | new ⟨CompoundName⟩ [ ⟨FieldInitialiser⟩+, ]
    | { ⟨?SimpleExpression⟩ | ⟨Predicate⟩ }
    | ( ⟨Expression⟩?, )
    | ⟨SimpleExpression⟩ :: ⟨Expression⟩
    | [ ⟨Expression⟩?; ]

⟨FieldInitialiser⟩ ::= ⟨Identifier⟩ = ⟨Expression⟩

Figure A.3: Syntax of expressions
⟨?SimpleExpression⟩ ::= true
    | false
    | null
    | ⟨Number⟩
    | ⟨String⟩
    | ⟨Identifier⟩
    | ⟨?Identifier⟩
    | ⟨?SimpleExpression⟩ . ⟨Identifier⟩
    | ⟨?SimpleExpression⟩+
    | ⟨?SimpleExpression⟩ ⟨InfixOperator⟩ ⟨?SimpleExpression⟩
    | ⟨PrefixOperator⟩ ⟨?SimpleExpression⟩
    | ⟨?SimpleExpression⟩ is ⟨CompoundName⟩
    | fun ⟨Pattern⟩+ -> ⟨?SimpleExpression⟩
    | new ⟨CompoundName⟩ [ ⟨?FieldInitialiser⟩+, ]
    | { ⟨?SimpleExpression⟩ | ⟨Predicate⟩ }
    | ( ⟨?SimpleExpression⟩?, )
    | ⟨?SimpleExpression⟩ :: ⟨?SimpleExpression⟩
    | [ ⟨?SimpleExpression⟩?; ]

⟨?FieldInitialiser⟩ ::= ⟨Identifier⟩ = ⟨?SimpleExpression⟩

Figure A.4: Syntax of expressions with logical identifiers
⟨InfixOperator⟩ ::= or | and | == | != | < | <= | > | >= | + | - | * | /

⟨PrefixOperator⟩ ::= not | -

Figure A.5: Operators
⟨Pattern⟩ ::= _
    | true
    | false
    | null
    | ⟨Number⟩
    | ⟨String⟩
    | ⟨Identifier⟩
    | ( ⟨Pattern⟩?, )
    | ⟨Pattern⟩ :: ⟨Pattern⟩
    | [ ⟨Pattern⟩?; ]

Figure A.6: Syntax of patterns
⟨Predicate⟩ ::= true
    | false
    | local ⟨?Identifier⟩+ : ⟨Predicate⟩
    | ⟨Predicate⟩ | ⟨Predicate⟩
    | ⟨Predicate⟩ |> ⟨Predicate⟩
    | ⟨Predicate⟩ & ⟨Predicate⟩
    | ! ⟨Predicate⟩
    | first ⟨Predicate⟩
    | ( ⟨Predicate⟩ )
    | ⟨CompoundName⟩ ( ⟨Term⟩?, )
    | ⟨?SimpleExpression⟩
    | ⟨PathPredicate⟩

⟨Term⟩ ::=
    | ⟨?SimpleExpression⟩

⟨PathPredicate⟩ ::= ⟨NodePredicate⟩ ( ⟨EdgePredicate⟩ ⟨NodePredicate⟩ )?

⟨NodePredicate⟩ ::= [ ⟨Term⟩ [ : [ ! ] ⟨CompoundName⟩ ] ]

⟨EdgePredicate⟩ ::= ⟨CompoundName⟩ [ + | * ]
    | ( ⟨ComplexEdgePredicate⟩ ) [ + | * ]
    | ⟨EdgePredicate⟩ ; ⟨EdgePredicate⟩

⟨ComplexEdgePredicate⟩ ::= ⟨EdgePredicate⟩ [ ⟨PathPredicate⟩ ]
    | ⟨PathPredicate⟩ ⟨EdgePredicate⟩
    | local ⟨?Identifier⟩+ : ⟨ComplexEdgePredicate⟩
    | ⟨ComplexEdgePredicate⟩ & ⟨Predicate⟩

Figure A.7: Syntax of predicates
⟨Type⟩ ::= bool
    | int
    | string
    | ⟨CompoundName⟩
    | ⟨Type⟩ list
    | ⟨Type⟩ stream
    | ⟨Type⟩ -> ⟨Type⟩
    | ⟨Type⟩ * ⟨Type⟩
    | ( ⟨Type⟩ )

Figure A.8: Syntax of type references
Appendix B
Rename Variable
The name binding rules for the object language described in Section 6.1.1:
using NameJava.Ast
{
namespace NameJava.NameResolution
{
  (* main lookup edges *)
  let edge lookup x:SingleName → ?y =
    first ([x] lookupAll [?y] & getName x == getName ?y)
  let edge lookup x:DotName → ?y = [x] right; lookup [?y]
  let getName x =
    if x is CompUnit then x.packageName
    else x.name

  (* static context *)
  let predicate isVariableName(?x) =
    [:FieldDecl] expr [?x:Name] | [:LocalVariableDecl] expr [?x:Name]
    | isVariableName(?z) & [?z:DotName] right [?x:Name]
  let predicate isTypeName(?x) =
    [:ClassDecl] super [?x:Name] | [:FieldDecl] fieldType [?x:Name]
    | [:LocalVariableDecl] varType [?x:Name] | [:Cast] castType [?x:Name]
    | [?x:Name] parent [:DotName] right [:This]
    | isTypeName(?z) & [?z:DotName] right [?x:Name]
  let predicate isPackageOrTypeName(?x) =
    [?z:DotName] left; child* [?x:Name] & isTypeName(?z)
  let predicate isAmbiguous(?x) =
    [?x:Name] & !isVariableName(?x) &
    !isPackageOrTypeName(?x) & !isTypeName(?x)
  let edge exprQualifier x:SingleName → ?y =
    [x] parent; right [x] parent; left; (expr*; right*)* [?y:SingleName]
  let predicate onTheRightOfDot(?x) = [?x] parent [:DotName] right [?x]
  let edge enclosingStmt x → ?y = first ([x] parent* [?y:Stmt])
  let edge enclosingClass x → ?y = first ([x] parent+ [?y:ClassDecl])
  let edge enclosingScope x → ?y =
    first ([x] parent+ [?y:ClassDecl] & ![x] parent+ [?y] super; child* [x])
    |> [x] parent+ [?y:CompUnit]

  (* type lookup *)
  let edge typeLookup x:SingleName → ?y =
    [x] lookup [:FieldDecl] fieldType; lookup [?y:ClassDecl]
    | [x] lookup [?y:ClassDecl]
    | [x] lookup [?y:CompUnit]
  let edge typeLookup x:DotName → ?y = [x] right; typeLookup [?y]
  let edge typeLookup x:This → ?y =
    onTheRightOfDot(x) & [x] parent; left; lookupEnclosingClass [?y]
    | !onTheRightOfDot(x) & [x] enclosingClass [?y]
  let edge typeLookup x:Super → ?y =
    onTheRightOfDot(x) &
      [x] parent; left; lookupEnclosingClass; super; lookup [?y]
    | !onTheRightOfDot(x) & [x] enclosingClass; super; lookup [?y]
  let edge typeLookup x:ParenthesisedExpr → ?y = [x] expr; typeLookup [?y]
  let edge typeLookup x:Cast → ?y = [x] castType; lookup [?y]
  let edge lookupEnclosingClass x:SingleName → ?y =
    onTheRightOfDot(x) &
      [x] parent; left; lookupEnclosingClass; bodyDecls [?y] &
      [x] enclosingClass+ [?y] & getName x == getName ?y
    | !onTheRightOfDot(x) & first ([x] enclosingClass+ [?y] &
        getName x == getName ?y)
  let edge lookupEnclosingClass x:DotName → ?y =
    [x] right; lookupEnclosingClass [?y]

  (* lookup auxiliary edges *)
  let edge lookupAll x:SingleName → ?y =
    [x] lookupAllWithDotContext [?y] &
    ( isVariableName(x) & ([?y:FieldDecl] | [?y:LocalVariableDecl])
      |> isTypeName(x) & [?y:ClassDecl]
      |> isPackageOrTypeName(x) & ([?y:ClassDecl] | [?y:CompUnit])
      |> isAmbiguous(x) )
  let edge lookupAllWithDotContext x:SingleName → ?y =
    onTheRightOfDot(x) & [x] parent; left; typeLookup; lookupAllMembers [?y]
    | !onTheRightOfDot(x) & [x] lookupAllDecls [?y]
    | !onTheRightOfDot(x) & [x] lookupAllPackages [?y]
  let edge lookupAllMembers x:ClassDecl → ?y =
    [x] (super; lookup)* [?s] &
    ([?s] bodyDecls [?y:FieldDecl] | [?s] bodyDecls [?y:ClassDecl])
  let edge lookupAllMembers x:CompUnit → ?y =
    [x] classDecls [?y:ClassDecl]
  let edge lookupAllDecls x → ?y =
    [x] enclosingStmt; listPredecessor+ [?y:LocalVariableDecl]
    | [x] enclosingScope; lookupAllDecls [?y]
  let edge lookupAllDecls x:ClassDecl → ?y =
    [x] equals [?y]
    | [x] lookupAllMembers [?y]
    | [x] enclosingScope; lookupAllDecls [?y]
  let edge lookupAllDecls x:CompUnit → ?y =
    [x] parent; compUnits [?cu] lookupAllMembers [?y] &
    (?cu.packageName == x.packageName | ?cu.packageName == "")
  let edge lookupAllPackages x → ?y =
    [x] parent* [:Program] compUnits [?y]
}
}
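The biased-choice operator |> used above (in enclosingScope and lookupAll) and the first combinator both rely on matches being produced as ordered, duplicate-free sequences. A toy Python model of their behaviour, representing a predicate's matches as a plain list (this is an illustration only, not JunGL's stream-based implementation):

```python
def biased_or(left, right):
    """Model of `p1 |> p2`: all matches of p1 if there are any,
    otherwise all matches of p2."""
    return left if left else right

def first(matches):
    """Model of `first p`: keep only the first match, in order."""
    return matches[:1]

# enclosingScope-style fallback: prefer an enclosing class declaration;
# only when none exists, fall back to the compilation unit.
class_matches, comp_unit_matches = [], ["compUnit"]
assert biased_or(class_matches, comp_unit_matches) == ["compUnit"]
assert biased_or(["classDecl"], comp_unit_matches) == ["classDecl"]
assert first(["inner", "outer"]) == ["inner"]
```

Because the match order is deterministic, `first` is well defined, which is exactly what the lookup edges exploit to pick the innermost declaration.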
And the full script for Rename Variable itself:
using NameJava.Ast, NameJava.NameResolution
{
namespace NameJava.Rename
{
  let isVariableDeclaration d =
    d is FieldDecl or d is LocalVariableDecl
  let alreadyExists dec newName =
    not Utils.isEmpty { ?d |
      [dec:LocalVariableDecl] parent; child [?d:LocalVariableDecl] &
        ?d.name == newName
      | [dec:FieldDecl] parent; child [?d:FieldDecl] & ?d.name == newName }
  let edge allTypesOrPackages x:Name → ?y =
    [x] lookupAllDecls [?y:ClassDecl] | [x] lookupAllPackages [?y]
  let findSelfCrossPoint (x, d) =
    pick { (x, ?c, ?ec, ?sc, d) | [x] enclosingClass [?c] enclosingClass* [?ec]
      (super; lookup)* [?sc] bodyDecls [d] }
  let lookupScopeFrom x name =
    pick { ?s | first ([x] allTypesOrPackages [?s] & getName ?s == name) }
  let buildTypeReference x c =
    let es = pick { ?es | first ([c] enclosingScope* [?es] &
      ?es == lookupScopeFrom x (getName ?es)) } in
    if es == null then
      error ("Cannot build type access for " + c.name)
    else
      let chain = toList { ?ic |
        [c] enclosingScope* [?ic] enclosingScope+ [es] } in
      let esRef = new SingleName { name = getName es } in
      List.foldr
        (fun node ic → new DotName {
          left = node,
          right = new SingleName { name = getName ic }
        })
        esRef chain
  let buildThisReference (x, c, ec, sc, d) =
    let this = new This in
    let ee = if ec == c then this else
      new DotName { left = buildTypeReference x ec, right = this } in
    let se = if sc == ec then ee else
      new ParenthesisedExpr {
        expr = new Cast { castType = buildTypeReference x sc, expr = ee }
      } in
    se
  let getExprRewrite (x, d) =
    let (oldQualifier, oldType, newType) =
      pick { (?q, ?ot, ?nt) |
        [x] parent [:DotName] right [x] parent; left [?q] typeLookup [?ot] &
        [d] enclosingClass [?nt] } in
    if oldType == newType then (fun () → ())
    else let cast = new Cast { castType = buildTypeReference x newType } in
      let rewrite () = begin
        replaceWith oldQualifier (new ParenthesisedExpr { expr = cast });
        cast.expr ← oldQualifier
      end in
      rewrite
  let getThisRewrite (x, d) =
    let oldQualifier = pick { ?q |
      [x] parent [:DotName] right [x] parent; left [?q] } in
    let newQualifier = buildThisReference (findSelfCrossPoint (x, d)) in
    let rewrite () = (
      if oldQualifier == null then begin
        let e = new DotName { left = newQualifier } in
        replaceWith x e;
        e.right ← x
      end else
        replaceWith oldQualifier newQualifier
    ) in
    rewrite
  let renameVariable program node newName =
    let dec = pick { ?d | [node] lookup [?d] |> [node] equals [?d] } in
    if not isVariableDeclaration dec then
      error "Please choose a variable";
    if dec.name == newName then
      error "Please give a different name";
    if alreadyExists dec newName then
      error "Declaration already exists";
    let findFirst x =
      pick { ?y |
        [x] lookupAll [?y] & (newName == getName ?y | ?y == dec) } in
    let needRename =
      { ?x | [program] child+ [?x:SingleName] lookup [dec] } in
    let mayBeCaptured =
      { (?x, ?d) | [program] child+ [?x:SingleName] lookup [?d] &
        ?x.name == newName } in
    let needNewQualifier = List.foldl
      (fun l (x, d) → if findFirst x == dec then (x, d) :: l else l)
      [] (toList mayBeCaptured) in
    let needNewQualifier = List.foldl
      (fun l x → if findFirst x != dec then (x, dec) :: l else l)
      needNewQualifier (toList needRename) in
    foreach (x, d) in needNewQualifier do
      if d is LocalVariableDecl then error "Cannot hide local variable";
    let getRewrite (x, d) =
      if pick { () | [x] exprQualifier [] } != null then
        getExprRewrite (x, d)
      else
        getThisRewrite (x, d) in
    let rewrites = List.map getRewrite needNewQualifier in
    foreach rewrite in rewrites do rewrite ();
    foreach x in needRename do x.name ← newName;
    dec.name ← newName
}
}
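The heart of renameVariable is its capture check: an occurrence needs a new qualifier exactly when the rename would change what it binds to, either because it currently binds to dec but would be captured by another declaration named newName, or vice versa. That condition can be modelled in a few lines of Python, where `bind_before` and `bind_after` are hypothetical stand-ins for evaluating JunGL's lookup edge on the original and on the renamed program:

```python
def occurrences_needing_qualifier(occurrences, bind_before, bind_after):
    """An occurrence must be re-qualified iff renaming changes its binding."""
    return [x for x in occurrences if bind_before(x) != bind_after(x)]

# Occurrence "a" keeps its binding after the rename; occurrence "b" would be
# captured by the renamed declaration and so must be re-qualified.
before = {"a": "decl1", "b": "decl2"}
after = {"a": "decl1", "b": "renamedDecl"}
assert occurrences_needing_qualifier(["a", "b"], before.get, after.get) == ["b"]
```

The script above computes the same information without actually performing the rename twice: findFirst simulates lookup in the renamed program by treating newName and dec as interchangeable targets.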
Appendix C
Extract Method
Extract Method for the object language described in Section 6.3.1:
using CSharp.Ast, CSharp.Binding, CSharp.Flow
{
namespace CSharp.ExtractMethod
{
  (* checks for a well-defined region *)
  let dominates entryNode startNode endNode =
    (startNode == endNode) or
    Utils.isEmpty { () |
      [entryNode] (local ?z : cfsucc [?z] & ?z != startNode)* [endNode] }
  let postDominates startNode endNode exitNode =
    (startNode == endNode) or
    Utils.isEmpty { () |
      [startNode] (local ?z : cfsucc [?z] & ?z != endNode)* [exitNode] }
  let haveSameParent x y =
    not Utils.isEmpty { () | [x] parent; child [y] }

  (* transformation code *)
  let createVoidTypeRef () =
    new TypeRef {
      path = new NamespacePath {
        entityRef = new EntityRef { name = "void" }
      }
    }
  let createParamDecl name typeRef direction =
    new ParamDecl {
      name = name, typeRef = typeRef, direction = direction
    }
  let createArg name direction =
    new MethodArgument {
      target = new EntityRef { name = name }, direction = direction
    }
  let createEntityRef name =
    new EntityRef { name = name }
  let createEmptyPrivateVoidMethod name parameters =
    new MethodDecl {
      name = name, modifiers = [new Private],
      typeRef = createVoidTypeRef (), parameters = parameters,
      block = new BlockStmt
    }
  let createCallSiteStmt methodName arguments =
    new ExprStmt {
      target = new MethodInvokeExpr {
        target = new EntityRef { name = methodName },
        arguments = arguments
      }
    }
  let insertStatementBefore n s =
    if not Utils.isEmpty { ?b | [s] parent [?b:BlockStmt] } then
      insertBefore n s
    else let block = new BlockStmt in
      replaceWith s block;
      block.statements ← [n; s]
  let cloneDecl d =
    new VariableDeclStmt {
      modifiers = List.map clone d.modifiers,
      typeRef = clone d.typeRef,
      name = d.name
    }
  let detachDecl d =
    if d.initializer == null then
      detach d
    else
      replaceWith d (new ExprStmt {
        target = new AssignExpr {
          left = new EntityRef { name = d.name },
          operator = new Assign,
          right = clone d.initializer
        }
      })
  let isStatic x =
    not Utils.isEmpty { () | [x] modifiers [:Static] }
  (* Extract Method *)
  let extractMethod startNode endNode newMethodName =
    let (method, class, entryNode, exitNode) = pick { (?m, ?c, ?entry, ?exit) |
      [startNode] parent+ [?m:CallableDecl] directEnclosingType [?c:TypeDecl]
      & [endNode] parent+ [?m]
      & [?m] callableEntry [?entry]
      & [?m] callableExit [?exit] } in
    let outerEndNode = pick { ?n | [endNode] exit [?n] } in
    if not dominates entryNode startNode endNode then
      error "Not all possible flows go through the start of selection";
    if not postDominates startNode outerEndNode exitNode then
      error "Not all possible flows go through the end of selection";
    if not haveSameParent startNode endNode then
      error "Selected block is not enclosed in a single parent statement";
    let selectionStatements = { ?s |
      [startNode] (local ?z : cfsucc [?z] & ?z != outerEndNode)* [?s] } in
    let predicate mayUseOrDefInSelection(?x) =
      isIn(?s, selectionStatements) & [?s] useOrDef [?x] in
    let variables = { ?x | mayUseOrDefInSelection(?x) &
      ([?x:VariableDeclStmt] | [?x:ParamDecl]) } in
    let predicate mayUseOrDefOutOfSelection(?x) =
      [entryNode] cfsucc+ [?s] cfsucc+ [exitNode] &
        !isIn(?s, selectionStatements) & [?s] useOrDef [?x]
      | [method] parameters [?x] in
    let predicate decInSelection(?x) =
      isIn(?d, selectionStatements) & [?d] dec [?x] in
    let predicate mayUseInSelection(?x) =
      isIn(?u, selectionStatements) & [?u] use [?x] in
    let predicate mayDefInSelection(?x) =
      isIn(?d, selectionStatements) & [?d] def [?x] in
    let predicate mustDefBeforeSelection(?x) =
      !([entryNode] (local ?z : cfsucc [?z] & ![?z] def [?x])+ [startNode]) in
    let predicate mayUseAfterSelection(?x) =
      [outerEndNode] cfsucc* [?d] cfsucc* [exitNode] & [?d] use [?x]
      | [?x:ParamDecl] direction [:!Value] in
    let predicate mustDefInSelection(?x) =
      !([startNode] (local ?z : [?z] cfsucc & ![?z] def [?x])+ [outerEndNode]) in
    let predicate mayUseBeforeDefInSelection(?x) =
      isIn(?u, selectionStatements) &
      [startNode] (local ?z : [?z] cfsucc & ![?z] def [?x])* [?u] use [?x] in
    let predicate mayUseBeforeDefAfterSelection(?x) =
      [outerEndNode] (local ?z : [?z] cfsucc & ![?z] def [?x])* [?u] use [?x]
      | [method] parameters [?x] direction [:!Value] in
    let predicate mayUseOrDefBeforeSelection(?x) =
      [entryNode] cfsucc+ [?s] cfsucc+ [startNode] & [?s] useOrDef [?x]
      | [method] parameters [?x] direction [:!Out] in
    let predicate mayUseOrDefAfterSelection(?x) =
      [outerEndNode] cfsucc* [?s] cfsucc+ [exitNode] & [?s] useOrDef [?x]
      | [method] parameters [?x] direction [:!Value] in
    let predicate decBeforeSelection(?x) =
      [entryNode] cfsucc+ [?d] cfsucc+ [startNode] & [?d] dec [?x]
      | [method] parameters [?x] in
    let valueParams =
      { ?x | isIn(?x, variables) &
        mayUseBeforeDefInSelection(?x) &
        !(mayDefInSelection(?x) &
          mayUseBeforeDefAfterSelection(?x))
      } in
    let outParams =
      { ?x | isIn(?x, variables) &
        mayUseBeforeDefAfterSelection(?x) &
        !mayUseBeforeDefInSelection(?x) &
        mustDefInSelection(?x)
      } in
    let refParams =
      { ?x | isIn(?x, variables) &
        (mayUseBeforeDefInSelection(?x) |
         mayDefInSelection(?x) & !mustDefInSelection(?x)) &
        mayUseBeforeDefAfterSelection(?x) &
        !isIn(?x, valueParams) &
        !isIn(?x, outParams)
      } in
    let needDecMoveOut =
      { ?x | decInSelection(?x) &
        mayUseOrDefOutOfSelection(?x)
      } in
    let needDecMoveIn =
      { ?x | isIn(?x, variables) &
        !decInSelection(?x) &
        !isIn(?x, valueParams) &
        !isIn(?x, outParams) &
        !isIn(?x, refParams)
      } in
    let needDecDuplication =
      { ?x | isIn(?x, needDecMoveIn) &
        mayUseOrDefOutOfSelection(?x) |
        isIn(?x, needDecMoveOut) &
        !isIn(?x, valueParams) &
        !isIn(?x, outParams) &
        !isIn(?x, refParams)
      } in
    let build l d = List.map (fun x → (x, d)) l in
    let parameters = List.concat
      [ build (toList valueParams) (new Value);
        build (toList refParams) (new Ref);
        build (toList outParams) (new Out) ] in
    let paramDecls = List.map
      (fun (x, d) → createParamDecl x.name (clone (x.typeRef)) (clone d))
      parameters in
    let newMethod = createEmptyPrivateVoidMethod newMethodName paramDecls in
    let args = List.map
      (fun (x, d) → createArg x.name (clone d)) parameters in
    let callSite = createCallSiteStmt newMethodName args in
    insertStatementBefore callSite startNode;
    foreach d in needDecMoveOut do
      insertStatementBefore (cloneDecl d) callSite;
    let topStatements = { ?ts | isIn(?ts, selectionStatements) &
      [?ts] (local ?z : parent [?z] & !isIn(?z, selectionStatements))+ [method] } in
    foreach ts in topStatements do detach ts;
    newMethod.block.statements ← List.append
      (List.map clone (toList needDecMoveIn)) (toList topStatements);
    if isStatic method then
      newMethod.modifiers ← List.append newMethod.modifiers [new Static];
    insertAfter newMethod method;
    foreach dec in { ?d |
      isIn(?d, needDecMoveOut) & !isIn(?d, needDecDuplication)
      | isIn(?d, needDecMoveIn) & !isIn(?d, needDecDuplication) } do
      detachDecl dec
}
}
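The region checks dominates and postDominates ask whether any control-flow path can bypass the start or end of the selection: the JunGL query tests for the existence of a path that avoids a given node, and the region is well defined when no such path exists. A small Python sketch of the same idea, assuming the control-flow successor relation cfsucc is given as an adjacency dictionary (an illustration only, not the JunGL evaluator):

```python
def reachable_avoiding(cfsucc, source, target, avoid):
    """Is `target` reachable from `source` along cfsucc edges that never
    pass through `avoid`? This mirrors the path query
    (local ?z : cfsucc [?z] & ?z != avoid)*."""
    seen, stack = set(), [source]
    while stack:
        n = stack.pop()
        if n == target:
            return True
        if n in seen or n == avoid:
            continue
        seen.add(n)
        stack.extend(cfsucc.get(n, []))
    return False

def dominates(cfsucc, entry, start, end):
    """`start` dominates `end`: every path from `entry` to `end`
    passes through `start`."""
    return start == end or not reachable_avoiding(cfsucc, entry, end, start)

# Straight-line flow entry -> s1 -> s2 -> exit: s1 dominates s2,
# but s2 does not dominate s1.
cfg = {"entry": ["s1"], "s1": ["s2"], "s2": ["exit"]}
assert dominates(cfg, "entry", "s1", "s2")
assert not dominates(cfg, "entry", "s2", "s1")
```

postDominates is the same check run in the direction of the exit node, as in the script above.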
Bibliography
[AK07] Malte Appeltauer and Gunter Kniesel. Towards concrete syntax patterns forlogic-based transformation rules. In Eighth International Workshop on Rule-Based Programming (RULE ’07), Paris, France, 2007.
[App98] Andrew W. Appel. Modern Compiler Implementation in ML. CambridgeUniversity Press, 1998.
[Aßm98] Uwe Aßmann. OPTIMIX — a tool for rewriting and optimizing programs.In H. Ehrig, G. Engels, H. J. Kreowski, and G. Rozenberg, editors, Hand-book of Graph Grammars and Computing by Graph Transformation, volume2: Applications, Languages and Tools, pages 307–318. World Scientific, 1998.
[Ban06] Fabian Bannwart. Changing software correctly. Technical Report 509, Depart-ment of Computer Science, ETH Zurich, 2006.
[BBK+07] Emilie Balland, Paul Brauner, Radu Kopetz, Pierre-Etienne Moreau, and An-toine Reilles. Tom: Piggybacking rewriting on Java. In Proceedings of the18th Conference on Rewriting Techniques and Applications (RTA ’07), Lec-ture Notes in Computer Science. Springer-Verlag, 2007.
[BBPR05] Rajesh Bordawekar, Michael Burke, Igor Peshansky, and MukundRaghavachari. Simplify XML processing with XJ.http://www.ibm.com/developerworks/xml/library/x-awxj.html, 2005.
[BFS00] Peter Buneman, Mary Fernandez, and Dan Suciu. UnQL: A query languageand algebra for semistructured data based on structural recursion. VLDBJournal, 9(1):76–110, 2000.
[BGH07] Marat Boshernitsan, Susan L. Graham, and Marti A. Hearst. Aligning devel-opment tools with the way programmers think about code changes. In Pro-ceedings of the SIGCHI conference on Human Factors in Computing Systems(CHI ’07), pages 567–576, New York, NY, USA, 2007. ACM Press.
[BGV92] Robert A. Ballance, Susan L. Graham, and Michael L. Van De Vanter. The Panlanguage-based editing system. ACM Transactions on Software Engineeringand Methodology, 1(1):95–127, 1992.
[Bir98] Richard Bird. Introduction to Functional Programming using Haskell (secondedition). Prentice Hall, New York, USA, 1998.
[BKVV06] Martin Bravenboer, Karl Trygve Kalleberg, Rob Vermaas, and Eelco Visser.Stratego/XT Tutorial, Examples, and Reference Manual (latest). Department
175
BIBLIOGRAPHY 176
of Information and Computing Sciences, Universiteit Utrecht, Utrecht, TheNetherlands, 2006. http://www.strategoxt.org.
[BM06] Fabian Bannwart and Peter Muller. Changing programs correctly: Refactoringwith specifications. In J. Misra, T. Nipkow, and E. Sekerinski, editors, FormalMethods (FM), volume 4085 of Lecture Notes in Computer Science, pages 492–507. Springer-Verlag, 2006.
[BMR07] Emilie Balland, Pierre-Etienne Moreau, and Antoine Reilles. Bytecode rewrit-ing in tom. In Second Workshop on Bytecode Semantics, Verification, Analysisand Transformation (Bytecode ’07), Braga,Portugal, 2007.
[BMS05] Gavin Bierman, Erik Meijer, and Wolfram Schulte. The essence of data accessin Cω - the power is in the dot!, 2005.
[BMSU86] Francois Bancilhon, David Maier, Yehoshua Sagiv, and Jeffrey D Ullman.Magic sets and other strange ways to implement logic programs (extendedabstract). In Proceedings of the fifth ACM SIGACT-SIGMOD symposium onPrinciples of Database Systems (PODS ’86), pages 1–15, New York, NY, USA,1986. ACM Press.
[BNL05] Dirk Beyer, Andreas Noack, and Claus Lewerentz. Efficient relational cal-culation for software analysis. IEEE Transactions on Software Engineering,31(2):137–149, 2005.
[Boy02] John Boyland. Incremental evaluators for remote attribute grammars. Elec-tronic Notes in Theoretical Computer Science, 63(3), 2002.
[BR87] Catriel Beeri and Raghu Ramakrishnan. On the power of magic. In Proceedingsof the sixth ACM SIGACT-SIGMOD symposium on Principles of DatabaseSystems (PODS ’87), pages 269–284, 1987.
[BTF05] Ittai Balaban, Frank Tip, and Robert Fuhrer. Refactoring support for classlibrary migration. In Proceedings of the 20th ACM conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA ’05),pages 265–279, 2005.
[BvDOV06] Martin Bravenboer, Arthur van Dam, Karina Olmos, and Eelco Visser. Pro-gram transformation with scoped dynamic rewrite rules. Fundamenta Infor-maticae, 69(1–2):123–178, 2006.
[CEF+08] Don Chamberlin, Daniel Engovatov, Daniela Florescu, Giorgio Ghelli, JimMelton, and Jerome Simeon. XQuery Scripting Extension 1.0 (W3C workingdraft), 2008. Available at http://www.w3.org/TR/xquery-sx-10/.
[CFM+08] Don Chamberlin, Daniela Florescu, Jim Melton, Jonathan Robie, and JeromeSimeon. XQuery Update Facility 1.0 (W3C candidate recommendation), 2008.Available at http://www.w3.org/TR/xquery-update-10/.
[CK01] Horatiu Cirstea and Claude Kirchner. The rewriting calculus — Part I and II.Logic Journal of the Interest Group in Pure and Applied Logics, 9(3):427–498,May 2001.
[Cla78] Keith L. Clark. Negation as failure. In Herve Gallaire and Jack Minker,editors, Logic and Databases, pages 293–322. Plenum Press, New York, 1978.
BIBLIOGRAPHY 177
[CMR92] Mariano Consens, Alberto Mendelzon, and Arthur Ryman. Visualizing andquerying software structures. In Proceedings of the 14th international con-ference on Software engineering (ICSE ’92), pages 138–156, New York, NY,USA, 1992. ACM Press.
[coq07] The Coq proof assistant. http://coq.inria.fr/, 2007.
[Cor04] Marcio Lopes Cornelio. Refactorings as Formal Refinements. PhD thesis,Universidade de Pernambuco, 2004.
[Cor06] James R. Cordy. The TXL source transformation language. Science of Com-puter Programming, 61(3):190–210, 2006.
[Cre97] Roger F. Crew. ASTLOG: A language for examining abstract syntax trees. InUSENIX Conference on Domain-Specific Languages, pages 229–242, 1997.
[CW96] Weidong Chen and David S. Warren. Tabled evaluation with delaying forgeneral logic programs. Journal of the ACM, 43(1):20–74, 1996.
[DDGM07] Brett Daniel, Danny Dig, Kely Garcia, and Darko Marinov. Automated testingof refactoring engines. In Proceedings of the ACM SIGSOFT Symposium onthe Foundations of Software Engineering (ESEC/FSE ’07), New York, NY,USA, 2007. ACM Press.
[DdMS02] Stephen J. Drape, Oege de Moor, and Ganesh Sittampalam. Transforming the.NET intermediate language using path logic programming. In Principles andPractice of Declarative Programming (PPDP ’02), pages 133–144, 2002.
[DKTE04] Alan Donovan, Adam Kiezun, Matthew S. Tschantz, and Michael D. Ernst.Converting Java programs to use generic libraries. In Proceedings of the 19thACM conference on Object-Oriented Programming, Systems, Languages andApplications (OOPSLA ’04), pages 15–34, 2004.
[DM85] Pierre Deransart and Jan Maluszynski. Relating logic programs and attributegrammars. Journal of Logic Programming, 2(2):119–155, 1985.
[dMLVW03] Oege de Moor, David Lacey, and Eric Van Wyk. Universal regular path queries.Higher-order and Symbolic Computation, 16(1-2):15–35, 2003.
[DP02] Brian A. Davey and Hilary Priestley. Introduction to Lattices and Order (sec-ond edition). Cambridge University Press, 2002.
[DRW96] Stephen Dawson, C. R. Ramakrishnan, and David Scott Warren. Practicalprogram analysis using general purpose logic programming systems. In Pro-ceedings of the ACM Symposium on Programming Language Design and Im-plementation (PLDI ’96), pages 117–126. ACM Press, 1996.
[DT92] Guozhu Dong and Rodney W. Topor. Incremental evaluation of datalogqueries. In Proceedings of the 4th International Conference on Database The-ory (ICDT ’92), pages 282–296, London, UK, 1992. Springer-Verlag.
[ecm06] C# Language Specification. Standard ECMA-334. http://www.ecma-international.org/publications/standards/Ecma-334.htm, 2006.
BIBLIOGRAPHY 178
[EESV08] Torbjorn Ekman, Ran Ettinger, Max Schafer, and Mathieu Verbaere.Refactoring bugs in Eclipse, IntelliJ IDEA and Visual Studio, 2008.http://progtools.comlab.ox.ac.uk/projects/refactoring/bugreports.
[EGM+06] Michael Eichberg, Daniel Germanus, Mira Mezini, Lukas Mrokon, andThorsten Schafer. QScope: an open, extensible framework for measuring soft-ware projects. In Proceedings of the Conference on Software Maintenanceand Reengineering (CSMR ’06), pages 113–122, Washington, DC, USA, 2006.IEEE Computer Society.
[EH04] Torbjorn Ekman and Gorel Hedin. Rewritable reference attributed grammars.In Martin Odersky, editor, Proceedings of the European Conference on Object-Oriented Programming (ECOOP ’04), pages 144–169, 2004.
[EH06] Torbjorn Ekman and Gorel Hedin. Modular name analysis for Java usingJastAdd. In Generative and Transformational Techniques in Software Engi-neering, International Summer School (GTTSE ’05) Braga, Portugal, volume4143 of Lecture Notes in Computer Science, pages 422–436. Springer, 2006.
[EH07] Torbjorn Ekman and Gorel Hedin. The JastAdd extensible Java compiler.In Proceedings of the 22th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA ’07),2007.
[EKC98] Michael D. Ernst, Craig S. Kaplan, and Craig Chambers. Predicate dispatch-ing: A unified theory of dispatch. In Proceedings of the 12th European Confer-ence on Object-Oriented Programming (ECOOP ’98), pages 186–211, Brussels,Belgium, July 20-24, 1998.
[EMOS04] Michael Eichberg, Mira Mezini, Klaus Ostermann, and Thorsten Schafer.XIRC: A kernel for cross-artifact information engineering in software devel-opment environments. In Proceedings of the 11th Working Conference on Re-verse Engineering (WCRE’04), volume 00, pages 182–191, Los Alamitos, CA,USA, 2004. IEEE Computer Society.
[Ett06] Ran Ettinger. Refactoring via Program Slicing and Sliding. PhD thesis, Uni-versity of Oxford, 2006.
[EV04] Ran Ettinger and Mathieu Verbaere. Untangling: a slice extraction refactoring.In Gail C. Murphy and Karl J. Lieberherr, editors, Proceedings of the 3rdinternational conference on Aspect-oriented software development (AOSD ’04),pages 93–101, 2004.
[Fal07] Luis Diego Fallas. Creating Java refactorings with Scala and EclipseLTK. http://langexplr.blogspot.com/2007/07/creating-java-refactorings-with-scala.html, 2007.
[Fil03] Jean-Christophe Filliatre. Why: a multi-language multi-prover verificationtool. Technical Report 1366, LRI, Universite Paris Sud, 2003.
[Fit02] Anne Fitzpatrick. A well-intentioned query and the halloween problem. Annalsof the History of Computing, IEEE, 24(2):86–89, Apr-Jun 2002.
BIBLIOGRAPHY 179
[FKI+07] Henry Falconer, Paul H. J. Kelly, David M. Ingram, Michael R. Mellor, TonyField, and Olav Beckmann. A declarative framework for analysis and opti-mization. In Proceedings of Compiler Construction (CC ’07), pages 218–232.Springer, 2007.
[FKK07] Robert M. Fuhrer, Adam Kiezun, and Markus Keller. Advanced refactoringin Eclipse: Past, present, and future. In Proceedings of the 1st Workshop onRefactoring Tools, pages 30–31, 2007.
[Fow99] Martin Fowler. Refactoring: Improving the Design of Existing Code. AddisonWesley, 1999.
[Fow01] Martin Fowler. Crossing refactoring’s rubicon.http://www.martinfowler.com/articles/refactoringRubicon.html, 2001.
[FTK+05] Robert Fuhrer, Frank Tip, Adam Kiezun, Julian Dolby, and Markus Keller. Ef-ficiently refactoring Java applications to use generic libraries. In Proceedings ofthe 19th European Conference on Object-Oriented Programming (ECOOP ’05),pages 71–96, Glasgow, Scotland, July 27–29, 2005.
[GFT06] Maayan Goldstein, Yishai A. Feldman, and Shmuel Tyszberowicz. Refactoringwith contracts. In Proceedings of the AGILE Conference (AGILE ’06), pages53–64, Washington, DC, USA, 2006. IEEE Computer Society.
[GHM00] Etienne Gagnon, Laurie J. Hendren, and Guillaume Marceau. Efficient infer-ence of static types for Java bytecode. In Proceedings of the 7th InternationalSymposium on Static Analysis (SAS ’00), pages 199–219, London, UK, 2000.Springer-Verlag.
[GL88] Michael Gelfond and Vladimir Lifschitz. The stable model semantics for logicprogramming. In Robert A. Kowalski and Kenneth Bowen, editors, Proceedingsof the Fifth International Conference on Logic Programming (ICLP ’88), pages1070–1080, Cambridge, Massachusetts, 1988. The MIT Press.
[GLH+92] Robert W. Gray, Steven P. Levi, Vincent P. Heuring, Anthony M. Sloane, andWilliam M. Waite. Eli: a complete, flexible compiler construction system.Communications of the ACM, 35(2):121–130, 1992.
[GM78] Herve Gallaire and Jack Minker. Logic and Databases. Plenum Press, NewYork, 1978.
[GM06] Alejandra Garrido and Jose Meseguer. Formal specification and verificationof Java refactorings. In Proceedings of the Sixth IEEE International Work-shop on Source Code Analysis and Manipulation (SCAM ’06), pages 165–174,Washington, DC, USA, 2006. IEEE Computer Society.
[GN93] William G. Griswold and David Notkin. Automated assistance for programrestructuring. ACM Transactions on Software Engineering and Methodology,2(3):228–269, 1993.
[Hed91] Gorel Hedin. Incremental static-semantic analysis for object-oriented lan-guages using door attribute grammars. In Proceedings on Attribute Grammars,Applications and Systems, pages 374–379, London, UK, 1991. Springer-Verlag.
BIBLIOGRAPHY 180
[HR92] Susan Horwitz and Thomas Reps. The use of program dependence graphsin software engineering. In Proceedings of the International Conference onSoftware Engineering (ICSE ’92), pages 392–411, 1992.
[HRB90] Susan Horwitz, Thomas Reps, and David Binkley. Interprocedural slicingusing dependence graphs. ACM Transactions on Programming Languages andSystems, 12(1):26–61, 1990.
[HVdM06] Elnar Hajiyev, Mathieu Verbaere, and Oege de Moor. CodeQuest: scalablesource code queries with Datalog. In Dave Thomas, editor, Proceedings of theEuropean Conference on Object-Oriented Programming (ECOOP ’06), volume4067 of Lecture Notes in Computer Science, pages 2–27. Springer, 2006.
[HVMV05] Elnar Hajiyev, Mathieu Verbaere, Oege de Moor, and Kris de Volder. Code-Quest with Datalog. In Companion to the 20th ACM SIGPLAN conference onObject-Oriented Programming, Systems, Languages and Applications (OOP-SLA ’05), New York, NY, USA, 2005. ACM Press.
[imp07] IMP home page. http://www.eclipse.org/imp/, 2007.
[Jar98] Stan Jarzabek. Design of flexible static program analyzers with PQL. IEEETransactions on Software Engineering, 24(3):197–215, 1998.
[JH07] Nicolas Juillerat and Beat Hirsbrunner. Improving method extraction: Anovel approach to data flow analysis using boolean flags and expressions. InProceedings of the 1st Workshop on Refactoring Tools, pages 48–49, 2007.
[jls05] The Java Language Specification (third edition).http://java.sun.com/docs/books/jls/, 2005.
[JM84] Neil D. Jones and Alan Mycroft. Stepwise development of operational anddenotational semantics for prolog. In Symposium on Logic Programming, pages281–288, 1984.
[JSC07] Antonio Carvalho Junior, Leila Silva, and Marcio Cornelio. Using CafeOBJ tomechanise refactoring proofs and application. Electronic Notes in TheoreticalComputer Science, 184:39–61, 2007.
[JV03] Doug Janzen and Kris De Volder. Navigating and querying code withoutgetting lost. In Proceedings of the 2nd international conference on Aspect-oriented software development (AOSD ’03), pages 178–187, New York, NY,USA, 2003. ACM Press.
[Ker05] Joshua Kerievsky. Refactoring to Patterns. Addison Wesley, 2005.
[KETF07] Adam Kiezun, Michael D. Ernst, Frank Tip, and Robert M. Fuhrer. Refac-toring for parameterizing Java classes. In Proceedings of the 29th Interna-tional Conference on Software Engineering (ICSE ’07), Minneapolis, MN,USA, May 23–25, 2007.
[KHR07] Gunter Kniesel, Jan Hannemann, and Tobias Rho. A comparison of logic-basedinfrastructures for concern detection and extraction. In Proceedings of the 3rdworkshop on Linking aspect technology and evolution (LATE ’07). ACM, 2007.
BIBLIOGRAPHY 181
[KK04] Günter Kniesel and Helge Koch. Static composition of refactorings. Science of Computer Programming, 52(1-3):9–51, 2004.
[KKKS96] Marion Klein, Jens Knoop, Dirk Koschützki, and Bernhard Steffen. DFA & OPT-METAFrame: a toolkit for program analysis and optimization. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS '96), volume 1055 of Lecture Notes in Computer Science, pages 418–421. Springer, 1996.
[Kli05] Paul Klint. A tutorial introduction to RScript. Centrum voor Wiskunde en Informatica, draft, 2005.
[KSR07] Raffi Khatchadourian, Jason Sawin, and Atanas Rountev. Automated refactoring of legacy Java software to enumerated types. In Proceedings of the International Conference on Software Maintenance (ICSM '07), 2007.
[KV06] Karl Trygve Kalleberg and Eelco Visser. Strategic graph rewriting: Transforming and traversing terms with references. In Proceedings of the 6th International Workshop on Reduction Strategies in Rewriting and Programming, Seattle, Washington, August 2006.
[KW94] Uwe Kastens and William M. Waite. Modularity and reusability in attribute grammars. Acta Informatica, 31(7):601–627, 1994.
[Lam02] Ralf Lämmel. Towards Generic Refactoring. In Proceedings of Third ACM SIGPLAN Workshop on Rule-Based Programming (RULE '02), Pittsburgh, USA, 2002. ACM Press.
[LDG+04] Xavier Leroy, Damien Doligez, Jacques Garrigue, Didier Rémy, and Jérôme Vouillon. The Objective Caml System. http://caml.inria.fr/, 2004.
[LH04] Ondřej Lhoták and Laurie Hendren. Jedd: A BDD-based relational extension of Java. In Proceedings of the ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI '04), pages 158–169, 2004.
[LJVWF02] David Lacey, Neil D. Jones, Eric Van Wyk, and Carl Christian Frederiksen. Proving correctness of compiler optimizations by temporal logic. In Proceedings of the 29th ACM symposium on Principles of Programming Languages (POPL '02), pages 283–294, 2002.
[Llo87] John W. Lloyd. Foundations of Logic Programming (second edition). Springer-Verlag, 1987.
[LM01] David Lacey and Oege de Moor. Imperative program transformation by rewriting. In R. Wilhelm, editor, Proceedings of the 10th International Conference on Compiler Construction (CC '01), volume 2027 of Lecture Notes in Computer Science, pages 52–68. Springer Verlag, 2001.
[LM07] Ralf Lämmel and Erik Meijer. Revealing the X/O impedance mismatch (Changing lead into gold). In Roland Backhouse, Jeremy Gibbons, Ralf Hinze, and Johan Jeuring, editors, Datatype-Generic Programming, LNCS. Springer-Verlag, 2007.
[LMC03] Sorin Lerner, Todd Millstein, and Craig Chambers. Automatically proving the correctness of compiler optimizations. In Proceedings of the ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI '03), pages 220–231, 2003.
[LMRC05] Sorin Lerner, Todd Millstein, Erika Rice, and Craig Chambers. Automated soundness proofs for dataflow analyses and transformations via local rules. In Proceedings of the 32nd ACM symposium on Principles of Programming Languages, pages 364–377, 2005.
[LRY+04] Yanhong Annie Liu, Tom Rothamel, Fuxiang Yu, Scott D. Stoller, and Nanjun Hu. Parametric regular path queries. In Proceedings of the ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI '04), pages 219–230, New York, NY, USA, 2004. ACM Press.
[LS06] Yanhong Annie Liu and Scott D. Stoller. Querying complex graphs. In P. Van Hentenryck, editor, Proceedings of the 8th International Symposium on Practical Aspects of Declarative Languages (PADL '06), pages 16–30, 2006.
[LV02] Ralf Lämmel and Joost Visser. Typed Combinators for Generic Traversal. In Proceedings of Practical Aspects of Declarative Programming (PADL '02), volume 2257 of LNCS, pages 137–154. Springer-Verlag, January 2002.
[LV03] Ralf Lämmel and Joost Visser. A Strafunski Application Letter. In Proceedings of Practical Aspects of Declarative Programming (PADL '03), volume 2562 of LNCS, pages 357–375. Springer-Verlag, 2003.
[Mad98] William Maddox. Incremental static semantic analysis. Technical Report UCB/CSD-97-948, University of California, Berkeley, 1998.
[MBB06] Erik Meijer, Brian Beckman, and Gavin Bierman. LINQ: reconciling object, relations and XML in the .NET framework. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data (SIGMOD '06), pages 706–706, New York, NY, USA, 2006. ACM Press.
[MDJ02] Tom Mens, Serge Demeyer, and Dirk Janssens. Formalising behaviour preserving program transformations. In Graph Transformation, volume 2505 of Lecture Notes in Computer Science, pages 286–301, 2002.
[Mil04] Todd Millstein. Practical predicate dispatch. In Proceedings of the 19th ACM conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '04). ACM Press, 2004.
[MLVW03] Oege de Moor, David Lacey, and Eric Van Wyk. Universal regular path queries. Higher-Order and Symbolic Computation, 16(1-2):15–35, 2003.
[Mos06] Maxim Mossienko. Structural search and replace: What, why and how-to. http://www.jetbrains.com/idea/docs/ssr.pdf, 2006.
[MTHM97] Robin Milner, Mads Tofte, Robert Harper, and David MacQueen. The Definition of Standard ML (Revised). MIT Press, May 1997.
[MTR05] Tom Mens, Gabriele Taentzer, and Olga Runge. Detecting structural refactoring conflicts using critical pair analysis. Electronic Notes in Theoretical Computer Science, 127(3):113–128, 2005.
[Muc97] Steven S. Muchnick. Advanced compiler design and implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[MV04] Edward McCormick and Kris De Volder. JQuery: finding your way through tangled code. In Companion to the 19th annual ACM SIGPLAN conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '04), pages 9–10, New York, NY, USA, 2004. ACM Press.
[NNH99] Flemming Nielson, Hanne Riis Nielson, and Chris Hankin. Principles of Program Analysis. Springer, 1999.
[Ode07] Martin Odersky. The Scala Programming Language. http://www.scala-lang.org, 2007.
[OO84] Karl J. Ottenstein and Linda M. Ottenstein. The program dependence graph in a software development environment. Software Development Environments (SDE), pages 177–184, 1984.
[Opd92] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis,University of Illinois at Urbana-Champaign, 1992.
[OV02] Karina Olmos and Eelco Visser. Strategies for source-to-source constant propagation. In B. Gramlich and S. Lucas, editors, Workshop on Reduction Strategies in Rewriting and Programming, volume 70 of Electronic Notes in Theoretical Computer Science. Elsevier Science Publishers, May 2002.
[Pai94] R. Paige. Viewing a program transformation system at work. In Manuel Hermenegildo and Jaan Penjam, editors, Proceedings of the Sixth International Symposium on Programming Language Implementation and Logic Programming, pages 5–24. Springer Verlag, 1994.
[Pay06] Arnaud Payement. Type-based refactoring using JunGL. Master's thesis, University of Oxford, 2006.
[PDR91] Geoffrey Phipps, Marcia A. Derr, and Kenneth A. Ross. Glue-Nail: a deductive database system. In Proceedings of the 1991 ACM SIGMOD international conference on Management of data (SIGMOD '91), pages 308–317, New York, NY, USA, 1991. ACM.
[Prz88] Teodor C. Przymusinski. On the declarative semantics of deductive databases and logic programs. In Foundations of Deductive Databases and Logic Programming, pages 193–216. Morgan Kaufmann, 1988.
[RBJ97] Don Roberts, John Brant, and Ralph Johnson. A refactoring tool for Smalltalk. Theory and Practice of Object Systems, 3(4):253–263, 1997.
[Rep93] Thomas W. Reps. Demand interprocedural program analysis using logic databases. In Proceedings of the Workshop on Programming with Logic Databases, pages 163–196, 1993.
[RG02] Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems (third edition). McGraw-Hill Higher Education, 2002.
[Rob99] Don B. Roberts. Practical Analysis for Refactoring. PhD thesis, University of Illinois at Urbana-Champaign, 1999.
[Ros94] Kenneth A. Ross. Modular stratification and magic sets for Datalog programs with negation. Journal of the ACM, 41(6):1216–1266, 1994.
[RS82] J. Alan Robinson and Ernest E. Sibert. LOGLISP: Motivation, design and implementation. In K. L. Clark and S.-Å. Tärnlund, editors, Logic Programming, pages 299–313. Academic Press, 1982.
[RSSS94] Raghu Ramakrishnan, Divesh Srivastava, S. Sudarshan, and Praveen Seshadri. The CORAL deductive system. The VLDB Journal, 3(2):161–210, 1994.
[RT84] Thomas Reps and Tim Teitelbaum. The Synthesizer Generator. ACM SIGSOFT Software Engineering Notes, 9(3):42–48, 1984.
[SAK07] Daniel Speicher, Malte Appeltauer, and Günter Kniesel. Code analyses for refactoring by source code patterns and logical queries. In Proceedings of the 1st Workshop on Refactoring Tools, pages 17–20, 2007.
[SdML04] Ganesh Sittampalam, Oege de Moor, and Ken Friis Larsen. Incremental execution of transformation specifications. In Proceedings of the 31st ACM symposium on Principles of Programming Languages (POPL '04), pages 26–38, 2004.
[SEdM08] Max Schäfer, Torbjörn Ekman, and Oege de Moor. Sound and extensible renaming for Java. In Proceedings of the 23rd ACM conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '08), 2008. To appear.
[Ser01] Silvija Seres. The Algebra of Logic Programming. PhD thesis, University of Oxford, 2001.
[SH04] Peter Sestoft and Henrik I. Hansen. C# Precisely. MIT Press, 2004.
[Sim07] The Simplify decision procedure. http://kind.ucd.ie/products/opensource/Simplify/, 2007.
[Spi90] Michael Spivey. A functional theory of exceptions. Science of Computer Programming, 14(1):25–42, 1990.
[Spi00] Michael Spivey. Combinators for breadth-first search. Journal of FunctionalProgramming, 10(4):397–408, 2000.
[SR05] Diptikalyan Saha and C. R. Ramakrishnan. Incremental and demand-driven points-to analysis using logic programming. In Proceedings of the 7th ACM SIGPLAN international conference on Principles and Practice of Declarative Programming (PPDP '05), pages 117–128, New York, NY, USA, 2005. ACM.
[SS99] Michael Spivey and Silvija Seres. Embedding Prolog in Haskell. In Haskell '99, Technical Report UU-CS-1999-28, Department of Computer Science, University of Utrecht, 1999.
[SSH99] Silvija Seres, Michael Spivey, and C. A. R. Hoare. Algebra of logic programming. In Proceedings of the International Conference on Logic Programming (ICLP '99), pages 184–199, 1999.
[SSL01] Frank Simon, Frank Steinbrückner, and Claus Lewerentz. Metrics based refactoring. In Proceedings of the Fifth European Conference on Software Maintenance and Reengineering (CSMR '01), page 30, Washington, DC, USA, 2001. IEEE Computer Society.
[SSW94] Konstantinos Sagonas, Terrance Swift, and David S. Warren. XSB as an efficient deductive database engine. In Proceedings of the 1994 ACM SIGMOD international conference on Management of data (SIGMOD '94), pages 442–453, New York, NY, USA, 1994. ACM.
[ST08] Nik Sultana and Simon Thompson. Mechanical verification of refactorings. In Proceedings of the 2008 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation (PEPM '08), pages 51–60, New York, NY, USA, 2008. ACM.
[Sym05] Don Syme. F# Home Page. http://research.microsoft.com/fsharp/fsharp.aspx, 2005.
[TKB03] Frank Tip, Adam Kiezun, and Dirk Bäumer. Refactoring for generalization using type constraints. In Proceedings of the 18th ACM conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '03), pages 13–26, 2003.
[TM03] Tom Tourwé and Tom Mens. Identifying refactoring opportunities using logic meta programming. In Proceedings of the Seventh European Conference on Software Maintenance and Reengineering (CSMR '03), page 91, Washington, DC, USA, 2003. IEEE Computer Society.
[Tom87] Masaru Tomita. An efficient augmented-context-free parsing algorithm. Computational Linguistics, 13(1-2):31–46, 1987.
[TZ86] Shalom Tsur and Carlo Zaniolo. LDL: A logic-based data language. In Proceedings of the 12th International Conference on Very Large Data Bases (VLDB '86), pages 33–41, San Francisco, CA, USA, 1986. Morgan Kaufmann Publishers Inc.
[Ull89] J. D. Ullman. Bottom-up beats top-down for Datalog. In Proceedings of the eighth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems (PODS '89), pages 140–149, New York, NY, USA, 1989. ACM.
[Ull94] Jeffrey D. Ullman. Assigning an appropriate meaning to database logic with negation. Computers as Our Better Partners, pages 216–225, 1994.
[Van05] Ivan Vankov. Relational approach to program slicing. Master's thesis, University of Amsterdam, 2005.
[vdBHdJ+01] Mark van den Brand, Jan Heering, Hayco de Jong, Merijn de Jonge, Tobias Kuipers, Paul Klint, Leon Moonen, Pieter Olivier, Jeroen Scheerder, Jurgen Vinju, Eelco Visser, and Joost Visser. The ASF+SDF Meta-Environment: a Component-Based Language Development Environment. In Proceedings of Compiler Construction (CC '01), LNCS. Springer, 2001.
[vDD04] Daniel von Dincklage and Amer Diwan. Converting Java classes to use generics. In Proceedings of the 19th ACM conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '04), pages 1–14, 2004.
[VEdM06] Mathieu Verbaere, Ran Ettinger, and Oege de Moor. JunGL: a scripting language for refactoring. In Dieter Rombach and Mary Lou Soffa, editors, Proceedings of the 28th International Conference on Software Engineering (ICSE '06), pages 172–181, New York, NY, USA, 2006. ACM Press.
[Vie86] Laurent Vieille. Recursive axioms in deductive databases: The query-subquery approach. In Larry Kerschberg, editor, Proceedings of International Conference on Expert Database Systems, 1986.
[Vis02] Eelco Visser. Meta-programming with concrete object syntax. In Generative programming and component engineering, pages 299–315, 2002.
[Vor93] Scott A. Vorthmann. Modelling and specifying name visibility and binding semantics. Technical Report CMU//CS-93-158, Carnegie Mellon University, 1993.
[VPdM06] Mathieu Verbaere, Arnaud Payement, and Oege de Moor. Scripting refactorings with JunGL. In Companion to the 21st ACM SIGPLAN conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '06), pages 651–652, New York, NY, USA, 2006. ACM Press.
[vRS91] Allen van Gelder, Kenneth Ross, and John S. Schlipf. The well-founded semantics for general logic programs. Journal of the ACM, 38(3):620–650, 1991.
[W3C07] W3C. XQuery 1.0 and XPath 2.0 formal semantics. http://www.w3.org/TR/xquery-semantics/, 2007.
[WACL05] John Whaley, Dzintars Avots, Michael Carbin, and Monica S. Lam. Using Datalog and binary decision diagrams for program analysis. In Kwangkeun Yi, editor, Proceedings of the 3rd Asian Symposium on Programming Languages and Systems (APLAS '05), volume 3780. Springer-Verlag, 2005.
[Wad99a] Philip Wadler. A formal semantics of patterns in XSLT. In Markup Technologies, 1999.
[Wad99b] Philip Wadler. Two semantics for XPath. Available at http://www.cs.bell-labs.com/who/wadler/topics/xml.html, 1999.
[War92] David S. Warren. Memoing for logic programs. Communications of the ACM,35(3):93–111, 1992.
[Wei84] Mark Weiser. Program slicing. IEEE Transactions on Software Engineering, 10:352–357, 1984.
[why07] The Why verification tool. http://why.lri.fr/, 2007.
[WS97] Deborah Whitfield and Mary Lou Soffa. An approach for exploring code-improving transformations. ACM Transactions on Programming Languages and Systems, 19(6):1053–1084, 1997.