A Language to Script
Refactoring Transformations
Mathieu Verbaere
Wolfson College
Michaelmas Term 2008
Submitted in partial fulfilment of the requirements for
the degree of Doctor of Philosophy
Oxford University Computing Laboratory
Programming Research Group
A Language to Script
Refactoring Transformations
Mathieu Verbaere
Wolfson College
D.Phil. Thesis
Michaelmas Term 2008
Abstract
Refactorings are behaviour-preserving program transformations, typically for improving the
structure of existing code. A few of these transformations have been mechanised in interactive
development environments. Many more refactorings have been proposed, and it would be
desirable for programmers to script their own refactorings. Implementing such source-to-
source transformations, however, is quite complex: even the most sophisticated development
environments contain significant bugs in their refactoring tools.
We introduce a domain-specific language to script refactoring transformations. The language, named JunGL, is a hybrid of a functional language in the style of ML and a logic query language. It allows the computation of static-semantic information, such as name binding and control flow, and the expression of refactoring preconditions as queries on a graph
representation of the program. Borrowing from earlier work on the specification of compiler
optimisations, JunGL notably uses path queries to express dataflow properties.
We have been careful to keep the semantics of all logical features very declarative to
provide a sound basis for rigorous reasoning on the transformations. All constructs translate
to a novel variant of Datalog, a query language originally put forward in the theory of
databases. This variant works on duplicate-free sequences rather than sets, with the rationale
to present logical matches in a meaningful deterministic order. We call it Ordered Datalog.
Ordered Datalog programs, like Datalog programs, can be classified depending on how
nonmonotonic constructs such as negation are used. We identify the new class of partially
stratified programs as sufficiently expressive for our application, and highlight an evaluation
strategy following the Query-Subquery approach. Finally, we describe the current implementation of JunGL, and validate the whole design of the language via a number of complex
refactoring transformations.
Acknowledgements
I would first like to express my gratitude to my supervisor Oege de Moor for his guidance and support, and for giving me the opportunity to return to Oxford for a DPhil after my MSc project in his group and a year away in Paris.
I would also like to thank Microsoft Research for funding my work through its European
PhD Scholarship Programme. I am particularly grateful to Fabien Petitcolas at MSR Cambridge for making sure scholars always get great opportunities to present and discuss their
ongoing work.
Thanks also go to my final examiners, Mike Spivey and Ralf Lämmel, for their comments
and suggestions during the viva which helped me improve this thesis.
The Programming Tools Group in Oxford has been a very pleasant and productive environment to work in. Thanks to all its members. I am especially grateful to close friends Rani
Ettinger and Elnar Hajiyev. It is Rani who introduced me to the research field of refactoring
tools. It is Elnar who later set out with me on the Datalog adventure. I am also grateful
to Arnaud Payement for his enthusiasm while experimenting with JunGL, and to Damien Sereni, who has always been willing to help and share his broad knowledge of computer science. Many thanks to all of them for their highly valuable input at different stages of this
work. I have enjoyed our discussions a lot.
I am also greatly thankful to my family and friends, in France and the UK, for their
support and the happy moments we shared in Oxford, London, Bidford, Martigues, Aix,
Joigny, Marcellaz, Les Contamines, Strasbourg, Lunéville and Paris.
Finally, I want to thank Dorothée for her true love and the great life we have together.
Contents
1 Introduction 1
1.1 The process of refactoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Some refactoring examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 On automating transformations . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Trends and challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 A scripting language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Alternative solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.8 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Design of the language 16
2.1 ML-like features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Pattern matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Logical features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 Lazy edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.3 Path queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Computational model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Other features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5 The toolkit around the language . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5.1 The graph structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.2 The interpreter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.3 Editors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Further examples on While programs . . . . . . . . . . . . . . . . . . . . . . . 37
2.6.1 Binding and definite assignment checks . . . . . . . . . . . . . . . . . 37
2.6.2 Rename Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6.3 Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.7 Summary and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 Datalog 44
3.1 Logic programs and syntax of Datalog . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1 Minimal models and least fixpoints . . . . . . . . . . . . . . . . . . . . 47
3.2.2 Safe Datalog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.3 Mapping predicate calculus to relational algebra . . . . . . . . . . . . 50
3.2.4 Evaluation of strata . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Evaluation strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1 Top-down vs bottom-up . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.2 Query-Subquery and magic sets . . . . . . . . . . . . . . . . . . . . . . 55
3.3.3 Existing implementations . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 General logic programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5 Summary and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4 Ordered semantics of the logical features 63
4.1 Why order matters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Duplicate-free sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.2 Relational operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Stratified Ordered Datalog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 Non-termination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.2 Chasing nonmonotonic ordered operators . . . . . . . . . . . . . . . . 71
4.3.3 A refinement of stratified Datalog . . . . . . . . . . . . . . . . . . . . 75
4.4 Data model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Translating predicates, edges and path queries . . . . . . . . . . . . . . . . . . 79
4.5.1 Abstract syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5.2 Relational equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.5.3 Ordered Datalog rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.4 Encoding dynamic edge dispatch . . . . . . . . . . . . . . . . . . . . . 87
4.5.5 A full translation example . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6 Summary and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5 Evaluating more general ordered queries 93
5.1 On accepting more queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Beyond stratified Ordered Datalog . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.1 Partial instantiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.2 Partial stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 Demand-driven evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3.1 Top-down sequence-based evaluation . . . . . . . . . . . . . . . . . . . 99
5.3.2 The issue with first . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.3 Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Generating partial reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5 Back to sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5.2 The orelse operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.6 Summary and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6 Scripting refactorings 113
6.1 Rename Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.1.1 The object language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.1.2 Name lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.1.3 Detecting conflicts and renaming . . . . . . . . . . . . . . . . . . . . . 121
6.1.4 Minimising rejection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.2 Extract Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2.1 The object language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2.2 Name and type lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.2.3 Generating type constraints . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.4 Solving and transforming . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.3 Extract Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.3.1 The object language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.3.2 Control and data flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.3.3 Checking validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.3.4 Inferring parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.3.5 Placing declarations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.3.6 Transforming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.4 Summary and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7 Discussion and future work 147
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
A JunGL grammar 160
B Rename Variable 165
C Extract Method 170
Chapter 1
Introduction
1.1 The process of refactoring
Refactoring is the process of improving the design of a program while preserving its behaviour.
Often the purpose is to correct existing design flaws, to prepare a program for the introduction
of a new functionality, or to take advantage of a new programming language feature such as
generic types.
Although refactoring has been done informally (and manually) for decades, it was first
seriously examined only fifteen years ago by William Opdyke in his PhD dissertation [Opd92].
There, Opdyke presents refactoring as a disciplined technique, along with a classification of useful transformations for improving the design of object-oriented programs. Perhaps
the most obvious and most popular example is renaming: a variable, or any other program
artifact with a name, is given a new name to better reflect its purpose in the code, and
therefore improve the overall readability of the program.
Later, the upsurge of iterative programming methodologies (such as extreme programming
and other agile methodologies), which promote evolutionary change throughout the entire
life-cycle of a project, contributed greatly to increasing the general interest in refactoring.
Martin Fowler’s catalogue [Fow99] long remained the single classical reference for developers,
before the recent publication of more books on the topic, e.g. [Ker05]. All of these present a
remarkable number of different refactoring transformations, more or less complex, tedious
and hence error-prone. To cope with such difficulties, practitioners are often advised to run unit tests after each refactoring to check that the behaviour of the resulting program
is indeed externally similar to the behaviour of the original program.
This also explains the considerable interest in providing automated (or semi-automated)
support for applying refactoring transformations. The Smalltalk Refactoring Browser by
John Brant and Don Roberts was the first tool to provide that kind of automated support
[RBJ97, Rob99]. Since then, a lot of engineering effort has been put into refactoring tools
and most Integrated Development Environments now provide such support, in the form of a
fixed menu of transformations that may be applied, for instance for renaming, extracting a
method, extracting an interface, and so on.
1.2 Some refactoring examples
To better illustrate what a single refactoring transformation is, we shall present two well-
known refactorings, namely Encapsulate Field and Extract Method. We expose each refactoring as it is described in [Fow99], that is with the motivation for it, a tiny example and the
general mechanics to achieve it.
Encapsulate Field In an object-oriented program, a public field should be turned into a
private one and accessors should be provided for it. The rationale is that data and behaviour
are best separated.
The following Java declaration:

public String name;

ought to be refactored to:

private String name;
public String getName() { return name; }
public void setName(String aName) { name = aName; }
The mechanics are described as:
• “Create getting and setting method for the field.
• Find all clients outside the class that reference the field. If the client uses the
value, replace the reference with a call to the getting method. If the client
changes the value, replace the reference with a call to the setting method.
[. . .]
• Compile and test after each change.
• Once all clients are changed, declare the field as private.
• Compile and test.”
That excerpt gives evidence of two natural but important technicalities. First, the mechanics of a transformation are tightly coupled to the object language of the transformation. Indeed, in C# for instance, the support for properties makes the second step unnecessary, as properties are accessed in exactly the same manner as fields. Second, it is clear that another
variant of Encapsulate Field could be derived where references to the field which occur inside
the class of the field would also be updated.
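The core of these mechanics is deciding, for each reference to the field, whether it is a read (to be replaced by a getter call) or a write (to be replaced by a setter call). The following Python sketch illustrates that classification on a toy expression representation; the tuple encoding and the names are ours, purely for illustration, and do not reflect JunGL or any real refactoring engine:

```python
# Toy expression trees, illustrative only:
#   ('read', name)           use of a variable or field's value
#   ('write', name, expr)    assignment to it
#   ('call', fn, *args)      method call
def encapsulate(expr, field, getter, setter):
    """Rewrite reads of `field` into getter calls and writes into setter calls."""
    kind = expr[0]
    if kind == 'read':
        # "if the client uses the value, replace the reference with
        #  a call to the getting method"
        return ('call', getter) if expr[1] == field else expr
    if kind == 'write':
        value = encapsulate(expr[2], field, getter, setter)
        if expr[1] == field:
            # "if the client changes the value, replace the reference with
            #  a call to the setting method"
            return ('call', setter, value)
        return ('write', expr[1], value)
    if kind == 'call':
        args = tuple(encapsulate(a, field, getter, setter) for a in expr[2:])
        return ('call', expr[1]) + args
    return expr

# name = concat(name, suffix)  becomes  setName(concat(getName(), suffix))
before = ('write', 'name', ('call', 'concat', ('read', 'name'), ('read', 'suffix')))
after = encapsulate(before, 'name', 'getName', 'setName')
```

Even this miniature version shows why the transformation is not purely textual: the same identifier must be rewritten differently depending on its syntactic role.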
Extract Method A method that is too long and serves too many purposes should be split
into several single-purpose and well-named methods.
For instance, the following piece of Java code:
void printOwing(double amount) {
    printBanner();
    // print details
    System.out.println("name: " + getName());
    System.out.println("amount: " + amount);
}
is better refactored into:
void printOwing(double amount) {
    printBanner();
    printDetails(amount);
}

void printDetails(double amount) {
    System.out.println("name: " + getName());
    System.out.println("amount: " + amount);
}
This time, the mechanics read as follows:
• “Create a new method, and name it after the intention of the method (name
it by what it does, not how it does it). [. . .]
• Copy the extracted code from the source method into the new target method.
• Scan the extracted code for references to any variables that are local in scope to
the source method. These are local variables and parameters to the method.
• See whether any temporary variables are used only within this extracted code.
If so, declare them in target method as temporary variables.
• Look to see whether any of these local-scope variables are modified by the
extracted code. If one variable is modified, see whether you can treat the
extracted code as a query and assign the result to the variable concerned. If
this is awkward, or if there is more than one such variable, you can’t extract
the method as it stands. [. . .]
• Pass into the target method as parameters local-scope variables that are read
from the extracted code.
• Compile when you have dealt with all the locally-scoped variables.
• Replace the extracted code in the source with a call to the target method. [. . .]
• Compile and test.”
As we see, this is both complex and informal. A key point of the mechanics is the implicit
presence of preconditions: “if there is more than one such variable, you can’t extract the
method as it stands”. Preconditions play an important role in the automation of refactoring
to ensure the transformation will either be completed and behaviour preserving, or rejected.
Note also that preconditions might differ from one language to another. In C#, multiple
variables can be modified by the extracted method and returned to the original method since
the language supports ref and out parameter passing modes [SH04].
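The variable bookkeeping in these mechanics can be stated more precisely: a local variable read in the selection before being written there becomes a parameter; a variable assigned in the selection and still used afterwards must be returned; and, for Java, more than one variable in the latter category blocks the extraction. A rough Python sketch of that classification follows; the per-statement (reads, writes) encoding is a simplification we made up for illustration, and it deliberately ignores control flow inside and after the selection:

```python
def classify(selection, after):
    """selection, after: lists of (reads, writes) pairs, one per statement."""
    params, written, used_after = set(), set(), set()
    for reads, writes in selection:
        params |= set(reads) - written   # read before any write in the selection
        written |= set(writes)
    for reads, _writes in after:
        used_after |= set(reads)
    returns = written & used_after       # assigned here, still needed later
    if len(returns) > 1:
        # "if there is more than one such variable, you can't extract
        #  the method as it stands"
        raise ValueError("cannot extract: more than one variable to return")
    return sorted(params), sorted(returns)

# The selection reads amount and writes i; i is still read after the selection.
params, returns = classify(
    selection=[(['amount'], ['i']), (['i'], [])],
    after=[(['i'], [])])
```

Note that `used_after` is computed here without regard to reachability; a genuine flow-sensitive liveness analysis is needed in general, a point the faulty tool implementations discussed below make vivid.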
1.3 On automating transformations
The quest for serious automated refactoring support is related by Fowler in his article entitled Crossing Refactoring's Rubicon [Fow01]. At an early stage, transformations were often performed only at the level of text, or at best on the Abstract Syntax Tree but purely syntactically. Of course, these were hardly behaviour-preserving. It is only in 2001 that the
Rubicon was crossed with the implementation of Extract Method by a few tools.
The mechanised version of Extract Method allows the programmer to select a contiguous block of code, which is then extracted into a new method. For that kind of automatic extraction, the tools need to perform a deep semantic analysis to determine what parameters
should be passed to the new method, and whether the transformation is at all possible. If
not, the refactoring should be rejected. This indeed ought to happen, for Java programs, if more than one variable is assigned in the block to be extracted, as their values cannot all be returned from the new method, at least not without encapsulating the returned variables in a dedicated wrapper, which is likely to impede the readability of the code.
Unfortunately, although current tools work out the correct solution for most extractions,
they still fail on some corner cases depending on the implementation. Eclipse, IntelliJ IDEA
and Visual Studio provide this refactoring, but we could find correctness issues in all three
implementations [EESV08].
An example of such a flaw in the first release of Visual Studio 2005 is shown in Figure
1.1. On the left is the original program, and the region to be extracted is indicated by the
‘from’ and ‘to’ comments. On the right is the resulting code: note that in the new method,
the variable i is returned without necessarily being assigned. The refactored version does not
compile as it violates the definite assignment rule of C#. In fact, the new method does not
need to return the variable i because it is not live at the end of the selection. We reported
that bug and it has been fixed in the new version of Visual Studio.
Another perhaps more subtle issue has been reported by Ran Ettinger in Eclipse 3.3. In
the artificially constructed Java code of Figure 1.2, one cannot extract the region between
public void F(bool b) {
    int i;
    // from
    if (b) {
        i = 0;
        Console.WriteLine(i);
    }
    // to
    i = 1;
    Console.WriteLine(i);
}

public void F(bool b) {
    int i;
    i = NewMethod(b);
    i = 1;
    Console.WriteLine(i);
}

private static int NewMethod(bool b) {
    int i;
    if (b) {
        i = 0;
        Console.WriteLine(i);
    }
    return i;
}
Figure 1.1: Extract Method bug in Visual Studio 2005.
the ‘from’ and ‘to’ comments. The rejection is accompanied by the following explanation message: “Ambiguous return value: selected block contains more than one assignment to local variable”. In fact, only n is used after the selection. A truly flow-sensitive dataflow analysis would have noticed the effect of the break.
public int g() {
    int n = 10;
    int i = 0;
    while (i < n) {
        // from
        i++;
        n--;
        // to
        break;
    }
    return n;
}
Figure 1.2: Extract Method rejection issue in Eclipse 3.3.
These kinds of bugs go to the heart of the difficulty of implementing new refactorings: it
requires dataflow analysis (in particular variable liveness), of the same kind as in compiler
optimisations. From these and similar examples, we deduce that a framework for refactoring
must provide dataflow analysis facilities as well as other, perhaps more obvious, features such
as pattern matching and mechanisms for variable binding. We shall show the correct way to refactor the Visual Studio example in Chapter 6.
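The liveness information needed to get such cases right can be computed as a backward fixpoint over the control-flow graph. The sketch below is a minimal Python illustration over a hand-built CFG for the Eclipse example of Figure 1.2; the node names and the (uses, defs, successors) encoding are ours, not part of any tool. It shows why a flow-sensitive analysis finds that only n, and not i, is live at the exit of the selected region: the break routes control straight to the return.

```python
def liveness(cfg):
    """cfg maps node -> (uses, defs, successors). Returns live-in sets,
    the least fixpoint of: in[n] = uses[n] | (out[n] - defs[n])."""
    live_in = {n: set() for n in cfg}
    changed = True
    while changed:
        changed = False
        for n, (uses, defs, succs) in cfg.items():
            live_out = set().union(*(live_in[s] for s in succs)) if succs else set()
            new_in = set(uses) | (live_out - set(defs))
            if new_in != live_in[n]:
                live_in[n], changed = new_in, True
    return live_in

# CFG fragment for g(): the selection is i++; n--, and the break that
# follows jumps straight to `return n`, never back to the loop test.
cfg = {
    'i++':      (['i'], ['i'], ['n--']),
    'n--':      (['n'], ['n'], ['break']),
    'break':    ([],    [],    ['return n']),
    'return n': (['n'], [],    []),
}
live = liveness(cfg)
# live['break'] is what is live just after the selection: n only, not i
```

A flow-insensitive scan would instead see two assigned variables and, like Eclipse 3.3, reject the extraction.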
The two flaws presented here are just illustrative of more faulty refactorings documented
in [EESV08]. Another study has also reported many issues in two of the mainstream Java
IDEs [DDGM07]. The authors developed a technique for automated testing of refactoring
engines, based on the iterative generation of structurally complex test inputs. They found
a total of 21 new bugs in Eclipse and 24 in NetBeans. The issues concern refactorings of
different kinds, among them Rename Field, Encapsulate Field and Pull up Method (for moving
a method from a subclass to some superclass).
All these bugs show the inherent complexity of implementing correct program transformations. They also give evidence that most transformations cannot be expressed in purely
syntactic terms without any recourse to compiler-like analyses.
1.4 Trends and challenges
In view of the large number of refactorings that have been proposed and of the complexity in
correctly expressing refactorings, it is natural to think about providing some kind of a toolkit
to facilitate their implementation. Additionally, some current trends in software development
also make a strong case for more numerous, more sophisticated, and more reliable refactoring
features. In order to draw a complete picture for the requirements of a refactoring toolkit,
we briefly expose those trends together with the challenges they present with respect to
refactoring.
A profusion of languages Software developers are faced with a profusion of technologies
and languages when starting a new project. Even on legacy code, the choice of a different
language for developing a new functionality is often considered.
One of the design goals of the Common Language Runtime environment in the .NET
framework was to enable cross-language development, which includes cross-language debugging, cross-language exception handling and even cross-language inheritance. That is, any
.NET compliant language is seamlessly usable with another .NET compliant language. Well-
known examples of mainstream object-oriented .NET languages are C# and VB.NET, but
other languages such as F# [Sym05], a variant of OCaml, can also be compiled to the .NET
intermediate language, known as CIL. Assemblies written in C# and other .NET languages
can be directly accessed from F#, and vice versa. With cross-language development, a developer can choose the language that best suits her needs and still be able to integrate into
a single application. Obviously, cross-language refactorings are expected in that context: renaming a C# method should update any calls present in an F# program.
In parallel, language designers are trying to bridge the gap between certain paradigms
in software development. Designers are notably addressing the so-called O/R (for object-
relational) and X/O (for XML-object) impedance mismatches [LM07] which are encountered
when using a relational database or an XML stream to store objects. The idea is to provide
relation-specific or XML-specific features to raise the manipulation of data, stored in these
respective formats, to the level of objects. Concretely, this is done by integrating support for
native queries into a host object-oriented language like Java or C#.
The XJ project at IBM Watson Research proposes such novel mechanisms for the integration of XML as first-class constructs into Java [BBPR05]. The LINQ project
has taken another, more general, route and provides general-purpose query facilities to the
.NET Framework that apply to all sources of information, not just relational or XML data
[MBB06]. Each project can be seen, however, as the introduction of an embedded Domain-Specific Language (DSL) into a host language: for the purpose of manipulating XML in the case of XJ, and for general-purpose queries in LINQ.
Of course, all these languages and language extensions should be properly supported in
development environments. End-users expect syntax highlighting, on-the-fly semantic analyses and refactoring support. Yet, building sophisticated development environments is a
difficult task. IMP, developed at IBM Watson Research, is an Eclipse-based meta-tooling
framework intended precisely to speed up the creation of rich IDEs [imp07]. IMP aims to provide a set of APIs to help in the implementation of semantic analyses and refactorings.
Such APIs are already useful, but we wish to facilitate the automation of refactoring transformations even further, in order to help refactoring authors manage the growing demand for and complexity of refactoring tools.
In terms of refactoring support, language extensions indeed have two major consequences.
Firstly, existing implementations of refactoring transformations must be updated to support
the new constructs that were not originally present in the host language. Secondly and
perhaps less obviously, developers expect new refactoring tools for migrating their old code
to take advantage of the new constructs. In the context of XJ for instance, it is desirable
to transform code for constructing an XML fragment via calls to the DOM API into safer
and more readable XJ code for constructing the same XML fragment. To illustrate, one may
wish to convert this Java code:
Element region = doc.createElement("region");
Element name = doc.createElement("name");
Text text = doc.createTextNode("central");
name.appendChild(text);
region.appendChild(name);
Element sales = doc.createElement("sales");
text = doc.createTextNode("12");
sales.appendChild(text);
sales.setAttribute("unit", "millions");
region.appendChild(sales);
into that XJ snippet:
region r = new region(
    <region>
        <name>central</name>
        <sales unit="millions">12</sales>
    </region>);
Another, more obvious, example of a language extension that challenges refactoring tools is Java 5. Besides the engineering effort required to make existing refactoring implementations
aware of the new features in Java 5, much research work has been done to automate the
introduction of generic types [DKTE04, vDD04, KETF07] or to convert constants to enums
[KSR07].
User-defined transformations Beyond the emergence of new languages and language
extensions for which it is desirable to provide new refactoring transformations, advanced
developers may wish to author their own transformations. Perhaps the most relevant application of user-defined transformations is the migration of library calls using an old API to a refactored one, which is in a way a less extreme form of language extension.
the complexity of the mechanisation depends on the sophistication of the transformation.
Changing the name of a method at all client call sites is fairly straightforward, but the fully
automatic migration of applications that use legacy library classes to newer, sometimes quite
different, classes is more difficult [BTF05]. Because of these different levels of sophistication,
there is no single silver-bullet solution to user-defined transformations. Support for them is
very diverse in existing systems.
A first solution, available in Eclipse, is to keep a history of refactorings. The provider of the API records the refactorings she applies to the codebase of the API, and later ships the recorded script with the new library. This approach appeals for its simplicity, but it
greatly limits the kind of modifications that can be made to the API. Indeed, complex changes that are likely to affect the sequence of library calls in the client code can often not be expressed as a series of general-purpose refactoring transformations, at least not with the
ones currently proposed in IDEs.
Another approach, still in Eclipse but certainly more heavyweight, is to write a new
plugin. Yet, as with IMP, this requires the mastery of complex APIs offered by the Java
Development Tools (JDT) and the Refactoring Language Toolkit (LTK). Very recently, at the
first workshop on refactoring tools, Robert Fuhrer and other people involved in the Eclipse
refactoring support recounted the history of these APIs and mapped out a roadmap for
refactoring’s future [FKK07]. One of the main challenges presented there is indeed to ease
the development of refactorings. To address the matter, the authors suggest:
• a declarative AST-based transformation language, in which transformations could actually be type-checked to guarantee upfront that a transformation will always result in a valid AST;
• better means of specifying underlying analyses, with clean declarative formulas that
would be mapped onto efficient data representations.
In the meantime, and with the same goal of facilitating the development of refactorings,
the use of Scala [Ode07] for the implementation of user-defined transformations has been
suggested [Fal07]. Scala fully inter-operates with Java, and its functional features, like pattern
matching, are highly desirable in the context of meta-programming, i.e. for writing programs
that manipulate programs.
Halfway between replaying fixed refactorings and writing a full plugin, one can find intuitive, flexible but mostly syntactic solutions. In IntelliJ IDEA, there is indeed a user-friendly
facility called Structural Search and Replace that enables limited transformations by pattern
matching on the syntax tree [Mos06]. Along the same lines, Marat Boshernitsan developed iXj [BGH07], a visual tool that allows programmers to make complex code transformations in an intuitive manner. The strength of iXj is its alignment with programmers' mental models of programming structures. Indeed, it does not require the manipulation of complex source code
representations such as Abstract Syntax Trees. The tool looks very promising but currently
lacks support for accessing more semantic information about the code. Although it already
provides pattern matching of specific variables with a particular static type in a given scope,
most of the interesting transformations for manipulating library calls require control and
dataflow information in addition to bindings. The current visual model of iXj could certainly
be extended to integrate these additional concepts. Nevertheless, visual models often have
inherent limitations, and a top-down approach of extending the model of iXj might hit one
of them.
We believe it is more appropriate to start from a more heavyweight solution and provide
some useful constructs and means of abstraction to ease the development of refactorings. Of
course, in that kind of bottom-up approach, only tool experts can at first author their own
refactorings, but the ultimate aim is to reach end-user developers by providing high-level
building blocks for authoring custom transformations.
1.5 A scripting language
The observations we have made in the previous sections can be summarised in three points.
First, the correct implementation of refactoring transformations is hard and requires some
deep static-semantic information about the programs to transform. Second, there is a growing
demand for refactorings to support new languages or language extensions. Third, advanced
developers may wish to implement their own transformations.
To address these issues, we propose a domain-specific language that enables the concise
description of refactoring transformations as scripts. The notion of scripts usually conveys
the idea of a small piece of code that can be run in an interactive environment. Our wish
is indeed to allow tool authors to quickly prototype and express refactorings within scripts
that could be exchanged and replayed. Scripts should be concise and be as close as possible
to the specifications of refactorings.
What should such a scripting language look like? Looking at the mechanics of some
refactorings, we note there are three common steps in their mechanisation:
• Finding elements of interest;
• Checking preconditions;
• Performing the actual transformation.
In addition, some tool authors may also wish to check postconditions to gain some guarantee
of the transformation's correctness. It is clear from these steps that logical features for finding
elements and checking conditions on the program to refactor are useful. As for the third step
and the actual manipulation of the program, the benefits of functional features for meta-
programming are well known.
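As a rough illustration of this three-step shape, here is a sketch in Python, purely for exposition; the flat node representation and all names are hypothetical, and real renaming must of course also respect scoping:

```python
def rename_variable(nodes, old, new):
    # 1. Find elements of interest: variable nodes named `old`.
    targets = [n for n in nodes if n["kind"] == "var" and n["name"] == old]
    # 2. Check preconditions: `new` must not clash with an existing variable.
    if any(n["kind"] == "var" and n["name"] == new for n in nodes):
        raise ValueError("name '%s' already in use" % new)
    # 3. Perform the actual transformation, as a destructive update.
    for n in targets:
        n["name"] = new

program = [{"kind": "var", "name": "x"}, {"kind": "var", "name": "y"}]
rename_variable(program, "x", "z")
```

The toy version only conveys the find/check/transform structure that a scripting language must support concisely.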
Moreover, in view of the increasingly prevalent mixtures and embeddings of languages,
we wish to target any object language or indeed several languages at once for cross-language
refactoring. Since most refactoring transformations require some knowledge about name and
type lookup, as well as control and dataflow information, it is an absolute requirement to
allow, in the scripting language itself, the description of static-semantic information about
the object languages. Furthermore, there are two other less obvious reasons for such a
requirement. First, we envision the use of similar scripts for other, perhaps simpler, tool
support such as the navigation between artifacts in an IDE. Second and more importantly,
we believe having the computation of that information in a clear formalism will allow us to
reason more precisely about the correctness of the refactorings that build on that information.
We shall turn to that point in Chapter 7 where we discuss future work in detail.
One may rightly wonder, however, how the above requirements differ from those of a
compiler. The main difference, beyond the fact that we wish to perform transformations at
the source level, is in the ability to find program elements of interest, and compute some
properties on these particular elements only. In contrast, compilers perform global analyses
on the complete program. For instance, compilers usually build a complete symbol table
for their input program, in order to resolve any variable reference to its declaration. On
the other hand, refactoring transformations are most of the time fairly local, and even when
they require a global search (for instance, when renaming a global variable), not all static-
semantic information is actually needed. The abilities to query the program structure and
to compute static-semantic information in a demand-driven manner are, in fact, two important
and distinctive requirements for a language to script refactorings. We shall explain throughout
the thesis and in particular in Section 5.3 how we achieve this.
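By way of illustration, demand-driven lookup can be approximated as follows. This is a Python sketch over an assumed table of scopes; none of it is JunGL syntax. The declaring scope of a name is computed only when asked for, and memoised, rather than built into a complete symbol table upfront:

```python
from functools import lru_cache

# Hypothetical scope table: each scope has a parent and declared names.
scopes = {
    "global": {"parent": None, "decls": {"y"}},
    "f":      {"parent": "global", "decls": {"x"}},
}

@lru_cache(maxsize=None)
def resolve(scope, name):
    """Scope declaring `name`, walking outwards only on demand."""
    if scope is None:
        return None          # unresolved reference
    if name in scopes[scope]["decls"]:
        return scope
    return resolve(scopes[scope]["parent"], name)
```

Only the bindings a refactoring actually queries are ever computed; for instance, `resolve("f", "y")` yields `"global"` without touching any other name.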
1.6 Alternative solutions
A wealth of techniques and research tools are closely related to the domain of refactoring.
Although none of them appears to be the right solution for scripting refactoring transforma-
tions, they are all inspirational.
General-purpose transformations A refactoring is just a special kind of code trans-
formation. One might therefore wonder whether general-purpose transformation systems could
elegantly address the issues we have raised earlier and enable the expression of refactoring
transformations in a concise and readable formalism.
An example of such a general-purpose transformation tool is the TXL programming lan-
guage [Cor06]. TXL is a hybrid functional and rule-based language designed to support source
transformation tasks. In particular, it allows rapid prototyping of new language parsers and
new language extensions.
Another example is the ASF+SDF Meta-Environment, a complete toolkit for the im-
plementation of transformations and other language processing, based on a Generalized LR
parser [vdBHdJ+01]. It focuses mostly on syntax definitions with SDF and on syntactical
transformations. Stratego/XT [BKVV06] is a language and toolset for program transfor-
mation that builds on SDF. The Stratego language provides rewrite rules for expressing
transformations, and the XT toolset offers a collection of tools, such as powerful parser and
pretty-printer generators and grammar engineering tools. Its original focus was also on syntax
analysis, but Stratego now supports dynamic rewrite rules for expressing context-sensitive
transformations and more semantic analysis [BvDOV06]. Although they can be used to com-
pute static-semantic information, dynamic rewrite rules are sometimes difficult to use for
that purpose, as the context they capture can only be propagated top-down.
All these systems have primarily focused on more syntactic analyses, and have added support
for more semantic tasks while trying to stay close to their original formalism. As a result, the
computation of contextual information, such as name lookup, is sometimes hard to define
in an intuitive way. On the other hand, these systems support rewrite rules which are very
appealing for specifying the actual transformation steps of a refactoring.
APTS [Pai94] is another general transformation system. It allows sophisticated program
derivation and, in that sense, closely relates to the area of refactoring where behaviour preser-
vation is important. In addition to rewrite rules, it supports inference rules for computing
semantic information. Interestingly, these inference rules are expressed in a language simi-
lar to Datalog, a database query language that we present in Chapter 3. As we shall see,
the logical features of our scripting language are also reminiscent of Datalog. Nevertheless,
the formalism of APTS, though very powerful, is too heavyweight. The expression of a
refactoring script should be more intuitive.
Attribute grammar systems The general-purpose transformation systems cited above
are not well suited to computing the static-semantic information necessary in the im-
plementation of developer tools. Systems based on attribute grammars have proved much
more successful.
The Synthesizer Generator [RT84] demonstrated the use of declarative specifications for
implementing language-based editing environments. Their formalism was that of an at-
tribute grammar tailored to the application domain of language-based editors. The context-
dependent features (i.e. the static-semantic information) of a language were described using
a combination of synthesized and inherited attributes. The former are expressed using infor-
mation from the children of a node, whereas the latter are passed down from parent nodes.
JastAdd is a recent system which also builds on the formalism of attribute grammars
[EH04]. One of its strengths is its integration with a mainstream language, namely Java. In
addition, JastAdd supports circular attributes for fixpoint computations, reference attributes
for relating nodes in the AST, and collection attributes for specifying cross-reference-like
properties such as sets of variable uses. The elegance of JastAdd has notably been demon-
strated with the implementation of JastAddJ, a full Java 5 compiler [EH07]. We shall discuss
attribute grammar systems again in Chapter 7.
Compiler optimisations In many ways, refactoring transformations are similar to com-
piler optimisations. The main difference though is that refactorings are applied at the source
level, rather than at the level of a convenient intermediate representation.
Over the past fifteen years, there has been much activity in the formal specification of
compiler optimisations, and in generating program transformers from such specifications,
e.g. [WS97, KKKS96, LM01, LJVWF02, DdMS02, MLVW03, OV02, LMC03, SdML04,
LMRC05]. All these works contrast with research that seeks to express transformations
only in syntactic terms, and provide foundations for the specification of refactoring trans-
formations. We will mention these works in more detail in Chapter 7. In particular, we
will compare our work with Optimix, an optimiser generator that mixes Datalog and graph
rewriting [Aßm98].
Logic meta-programming We mentioned earlier that our scripting language shall embed
some logical features to find elements in the code and check static conditions on the program.
This is what others have proposed in the context of code queries for spotting refactoring
opportunities [TM03] and for other software engineering tasks. There are many examples of
code querying systems and all of them are inspirational.
JQuery [JV03, MV04] is an Eclipse plugin for querying Java code empowered by a Prolog-
like engine. CodeQuest [HVdM06] is a prototype compiler of code queries expressed in
Datalog to procedural SQL. GraphLog [CMR92] is a query language with enough power to
express path properties on graphs, equivalent to linear Datalog, but with a graphical syntax.
PQL [Jar98] is a representation-independent query language with a syntax close to SQL.
Finally, ASTLog [Cre97] focuses on traversing syntax trees.
The important difference though is that, in all these works, results of code queries are not
directly used to transform the program. In addition, although these systems are expressive
enough to encode the complex preconditions of transformations (except maybe ASTLog which
was really designed for tree queries only), most of them are actually not expressive enough for
the computation of static-semantic information, such as name binding. It is usually assumed
that this kind of information is computed in an earlier pass and made available in some
built-in relations.
1.7 Contributions
In this thesis, we suggest that the techniques which have proved successful in specifying
compiler optimisations form an appropriate basis for scripting refactoring transformations –
with the important difference again that in refactoring one transforms source code, and not
some convenient intermediate representation.
Furthermore, we propose to bridge the gap between such techniques and code queries by
allowing the expression of both complex contextual static-semantic properties (such as name
lookup or dataflow) and more structural code queries (for finding elements of interest) in a
clean uniform formalism that translates to a variant of Datalog.
The principal contributions of this thesis are:
• The identification of the need for a scripting language for refactoring transformations,
and of its requirements. The language must notably allow script authors to:
– easily find program elements of interest;
– describe, for different object languages, static-semantic information, such as name
binding, type analysis and flow analysis;
– concisely express preconditions of refactorings using that static-semantic informa-
tion;
– perform the actual transformation.
• The formulation of features for such a language, in particular:
– functional features (borrowed from ML, such as higher order functions and pattern
matching) for manipulating ASTs;
– logical queries (akin to Datalog) for expressing complex static relationships be-
tween program elements;
– path queries as a convenient shorthand for queries that capture complex static-
semantic properties, such as control and dataflow properties.
• The integration of all these features in a clean, coherent design.
• An implementation of the language on the .NET platform.
• The validation of the language design on a number of non-trivial examples, and the
first, to our knowledge, complete specification of the core part of Extract Method for a
large subset of C#.
• A variant of Datalog where query results are returned in a meaningful order, and whose
semantics is based on duplicate-free sequences rather than sets.
• A class of partially stratified Datalog programs (sufficiently expressive to encode the
computation of static-semantic information), along with a top-down set-based resolu-
tion strategy to evaluate such programs.
1.8 Outline
This thesis assumes the reader has basic knowledge of functional programming, predicate
calculus and relational algebra. It also assumes general background knowledge on the broad
field of meta-programming, and more specifically in the area of compiler construction. The
remainder of the thesis is organised as follows.
In Chapter 2, we introduce the ideas in the design of our language, called JunGL. That
design is illustrated through the implementation of representative analyses and refactorings
on a toy imperative language. We also briefly present the toolkit that we have built around
our language using both C# and F#, a .NET functional language inspired by OCaml and
developed at Microsoft Research.
In Chapter 3, we give an introduction to Datalog in its classical version based on a finite-
set semantics. Several important classes of Datalog programs have been characterised, e.g.
statically stratified Datalog and modularly stratified Datalog. We detail them along with
their common implementation strategies.
In Chapter 4, we explain that the order of results produced during the evaluation of a
JunGL query is important. In that respect, a Prolog-like resolution mechanism seems at first
appropriate for our application, but termination of queries would be hard to guarantee. Instead,
our logical features translate to an ordered variant of Datalog whose semantics are based
on duplicate-free sequences rather than sets. We study this variant of Datalog, which we
call Ordered Datalog, and give a precise translation of predicates, edges and path queries to
Ordered Datalog programs.
In Chapter 5, we introduce a broader class of stratified Datalog programs that appears in
practice sufficiently expressive for the computation of static-semantic information. This class
allows the use of nonmonotonic constructs inside recursion, but remains smaller than the
class of modularly stratified Datalog. Furthermore, we describe the evaluation of Ordered
Datalog programs in a demand-driven manner on a top-down stream-based framework, and
we also address the relationship between Ordered Datalog and normal set-based Datalog by
exploring how to express Ordered Datalog queries in normal Datalog.
In Chapter 6, we put the whole design of JunGL to the test and discuss a number of complex
refactorings for large subsets of languages like Java or C#. We choose to present three well-
known refactorings, Rename Variable, Extract Method and Extract Interface, as we believe
they are representative of three important classes of refactorings. The first class deals with
scoping, the second with control and data flows, and the last one, more specific to object-
oriented programming, consists of refactorings that alter the type hierarchy of a program.
In Chapter 7 finally, we discuss more related work, compare our language with other
approaches, and highlight directions for future work.
Three appendices have been attached to the thesis. Appendix A is a reference for the syn-
tax of our scripting language. Appendices B and C are example scripts of complex refactoring
transformations.
Chapter 2
Design of the language
In this chapter, we introduce informally the features of JunGL — short for Jungle Graph
Language — that make it appealing to the specific domain of scripting refactoring transfor-
mations. JunGL borrows features both from functional ML-like languages and from logic
languages. It differs, however, from earlier approaches to combining these two styles
of programming, such as LogLisp [RS82]. First, our language focuses on querying and ma-
nipulating a representation of a program. Second, the logical features of JunGL are mostly
based on a variant of Datalog, a database query language with a very declarative semantics,
that we shall introduce in the next chapter.
Here, we illustrate most constructs with excerpts from a common JunGL script. That
script describes functions and predicates for manipulating a toy imperative language, called
While.
2.1 ML-like features
JunGL is primarily a functional language in the tradition of ML [MTHM97]. Like ML, it
has important features such as pattern matching and higher-order functions, while allowing
the use of updatable references. The advantages of this type of programming language in
compiler-like tools are well known [App98]. As a very brief illustration of the style of definition,
here is the ‘map’ function that applies a function f to all elements of a list l :
let rec map f l =
  match l with
  | [] → []
  | x::xs → (f x) :: (map f xs)
That is, map is recursively defined: in the body, we examine whether l is empty or whether
l consists of an element x followed by the remaining list xs . In the latter case, we apply f to
x and recurse on xs .
The function map can now be used as in:
let succ = fun x → x + 1 in
map succ [1; 2; 4]
We first define another function succ that takes an integer and returns its successor, and
we then ask for the result of mapping succ to the list [1; 2; 4]: that is [2; 3; 5].
2.1.1 Types
Usual ML types As we see from our previous example, functions are first-class values. A
function can be passed to another function as a parameter, be assigned to a variable (such as
succ above) and be returned as a result of a function. So are lists (e.g. [1; 2; 4]), and tuples
(e.g. (1, 2, 4)).
Obviously, in addition to functions, lists and tuples, JunGL also manipulates basic types:
booleans, numeric values and strings. Numeric values currently consist only of integers. In
the course of designing JunGL, we omitted reals since we could not think of an application
for them. The refactoring research community has proposed, however, to locate refactoring
opportunities using metrics [SSL01]. One could also use heuristics to guide the transforma-
tions that cannot be optimally defined. For those applications, we agree that support for
real values would be convenient. Adding such support would, of course, be straightforward.
Streams In addition to the primitive types common to ML-like languages, JunGL offers
streams as another built-in data type. Streams are lazily evaluated lists and do not come
built-in in strict languages like ML. In JunGL, we use streams exactly like one would use lists
and list comprehensions in a lazy functional language such as Haskell [Bir98]. Indeed, as we
explain in more detail later in this chapter, answers to lazily evaluated predicates are returned
as streams. This often allows us to specify a search problem in a nice, compositional way:
generate a stream of successes, take the first one and no further elements will be computed.
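The effect can be mimicked with Python generators (an analogy only; streams are built into JunGL itself):

```python
produced = []

def successes():
    # Lazily yield multiples of 7, recording how much work was done.
    for n in range(1, 1000000):
        produced.append(n)
        if n % 7 == 0:
            yield n

first = next(successes())   # take the first success
```

Only the elements up to the first match are ever generated: `produced` ends up as `[1, 2, 3, 4, 5, 6, 7]`, not a million entries.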
AST custom data types Another difference lies in custom data types. Traditionally,
in the family of ML languages, there are two ways of building custom types: records and
algebraic data types. A record data type is a user-defined data structure that encapsulates
labeled, possibly mutable, fields. An algebraic data type is an immutable data type each of
whose values is data from other data types wrapped in one of the constructors of the data
type. Algebraic data types can be recursively defined. They are commonly used to represent
abstract classes of a specific kind of data. In meta-programming notably, they allow the
concise definition of Abstract Syntax Tree grammars.
In JunGL we need only one way for constructing custom data types, and that is exactly
for defining the Abstract Syntax Tree structure of a program. Algebraic data types seem
to be the data types of choice. Nonetheless, we also wish to perform destructive updates
on the program in order to apply a transformation without having to rebuild a whole new
copy of the tree, and so we need record types with mutable fields. In fact, we even wish
to manipulate incomplete program trees. Hence we make fields optional, i.e. fields can be
assigned the value null whose type is the bottom of all types.
Algebraic data types force each alternative of the abstract class to be a constructor,
whereas in Abstract Syntax Trees it is common to have an abstract class directly under
another abstract class. For instance, in While, an expression - which is abstract - can either
be a variable with a mutable name or a literal - also abstract - which in turn can be anything
among true, false and the set of all integers. If one were to describe the grammar for these
abstract syntax trees in Caml [LDG+04], one would mix records and constructors and write
something like:
type variable = { mutable name : string }
type integer = { mutable value : int }
type literal =
  | True
  | False
  | Int of integer
type expression =
  | Var of variable
  | Literal of literal
In JunGL, whose design is tuned for the specific manipulation of Abstract Syntax Trees,
we have opted for a more concise notation:
type Expression =
  | Var = { name : string }
  | Literal = (
      | True
      | False
      | Int = { value : int }
    )
Arguably, the hierarchy of AST nodes is more readable when expressed in JunGL. An
abstract Expression is either a concrete record Var, which holds the name of a variable, or
an abstract data type Literal, which in turn is any of the concrete records True, False and
Int. In the remainder of the thesis, we refer to these custom types as AST data types.
We give in Figure 2.1 the full definition for representing abstract syntax trees of While
programs. We shall see in Section 2.5, when we describe the toolkit around JunGL, that
AST data types can be further annotated, for instance with pretty printing instructions.
type Program = { statements : Statement list }

and Statement =
  | WhileLoop = { condition : Expression; body : Statement }
  | If = { condition : Expression; thenBranch : Statement;
           elseBranch : Statement }
  | VarDecl = { typeRef : Type; name : string }
  | Assignment = { var : Var; expression : Expression }
  | Block = { statements : Statement list }
  | Print = { expression : Expression }

and Expression =
  | Var = { name : string }
  | Literal = (
      | True | False
      | Int = { value : int }
    )
  | InfixOperation = { left : Expression; operator : InfixOperator;
                       right : Expression }
  | PrefixOperation = { operator : PrefixOperator;
                        operand : Expression }
  | ParenthesisedExpression = { expression : Expression }

and InfixOperator =
  | And | Or
  | Add | Sub
  | Mul | Div
  | Equal | NotEqual
  | LessThan | GreaterThan

and PrefixOperator =
  | Not | Plus | Minus

and Type =
  | IntType
  | BoolType

Figure 2.1: Data types for Abstract Syntax Trees in While
Constructing AST values Having introduced AST data types, we ought to say a brief
word about how we construct particular values. Again, the syntax is a mix between algebraic
data types and records.
new If {
  condition = new Var { name = "b" },
  thenBranch = new Print {
    expression = new Int { value = 0 }
  }
}
builds the While code for if (b) print(0);. Unlike for traditional record types, we do not
need to specify all fields of a concrete AST data type. Here, we do not assign any value to the
else branch of the if for instance, and so its value is simply the null value. Also, since updates
of the tree are possible, one could build the same value through a sequence of instructions:
let ifStmt = new If {} in
ifStmt.condition ← new Var { name = "b" };
ifStmt.thenBranch ← new Print {
  expression = new Int { value = 0 }
};
ifStmt
As we see, JunGL has no support for code quotations yet. This could be addressed in
future work together with the integration of a GLR parser. We discuss these additions in
Chapter 7.
Summary To wrap up this section on types, we give the grammar of available types in
JunGL:
τ ::= bool | int | string | Node | τ list | τ stream | τ → τ | τ × · · · × τ | unit
The type Node includes all AST data types, → is the built-in constructor for function types,
and × the built-in constructor for tuple types. The unit type is just the type of the empty
tuple (); it is similar to the type void in C-like imperative languages.
2.1.2 Pattern matching
Pattern matching is an important feature of functional and term rewriting languages. It
enables the powerful processing of data based on its structure. In fact, it is the only way of
processing data of a constructed data type. Let us come back to our first map example:
let rec map f l =
  match l with
  | [] → []
  | x::xs → (f x) :: (map f xs)
We process differently the empty list and the list whose head and tail can respectively be
assigned to x and xs . Pattern matching is really the only way to extract the head of a list,
and a JunGL function that does the job for any list would be defined as:
let head l =
  match l with
  | [] → error "empty list"
  | x::_ → x
If head is called on the empty list, then we raise an error. Otherwise, we yield x whose
value comes from the first element of the list. The character ‘_’ denotes a don’t-care pattern
that can match virtually anything.
We use similar pattern matching to deconstruct and process tuples and AST data types.
To illustrate briefly, we give the definition of a function that recursively traverses an expres-
sion to collect a list of encountered variables:
let rec concat l1 l2 =
  match l1 with
  | [] → l2
  | x::xs → x :: (concat xs l2)

let rec collectVariables expr =
  match expr with
  | Var → [ expr ]
  | InfixOperation { left = l, right = r } →
      concat (collectVariables l) (collectVariables r)
  | PrefixOperation { operand = e } →
      collectVariables e
  | ParenthesisedExpression { expression = e } →
      collectVariables e
  | _ → []
Pattern matching is very appealing in the context of term rewriting and program
transformation, so appealing that Scala, which embeds some pattern matching constructs,
was recently suggested as a way to implement some simple refactorings in Eclipse [Fal07].
Another example is Tom [BBK+07], an extension of Java designed to manipulate tree
structures and XML documents. One of its attractions, among many others, is the ability
to do pattern matching in Java. It even provides Associative-Commutative matching, which
not only would be very useful in the context of JunGL, but also would fit nicely with the
logical features we are about to describe.
However, pattern matching in that form is nice but not powerful enough for our appli-
cation of scripting refactoring transformations, where we often need to collect information
about the program tree. Therefore, JunGL also supports generic queries which are more
appropriate for such purposes. As an example, one could express the same earlier function
as the following query:
let predicate descendant(?x, ?y) =
  child(?x, ?y) | local ?z : descendant(?x, ?z) & child(?z, ?y)

let collectVariables expr =
  { ?v | descendant(expr, ?v) & (?v is Var) }
In words, we define a recursive predicate descendant that holds for the two logical variables
?x and ?y if ?y is a child of ?x , or if ?y is a child of an intermediate node ?z , which is itself a
descendant of ?x . Then we use that predicate in a comprehension, on the last line, to search
for all nodes ?v that are descendants of expr and are of type Var .
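For intuition, the effect of such a query can be sketched in Python as an explicit traversal over child edges; the dict-based node shape is our own invention, not JunGL's, and the root is included since the query allows zero child steps:

```python
def collect_variables(expr):
    # Gather every node reachable from `expr` by zero or more child
    # edges, keeping those of kind "Var".
    result, worklist = [], [expr]
    while worklist:
        node = worklist.pop(0)               # breadth-first traversal
        if node["kind"] == "Var":
            result.append(node)
        worklist.extend(node.get("children", []))
    return result

ast = {"kind": "InfixOperation", "children": [
    {"kind": "Var", "name": "a", "children": []},
    {"kind": "ParenthesisedExpression", "children": [
        {"kind": "Var", "name": "b", "children": []},
    ]},
]}
names = [v["name"] for v in collect_variables(ast)]
```

The declarative query spares the script author from spelling out any such traversal strategy by hand.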
Furthermore, JunGL supports path queries as a convenient shorthand for regular queries.
The above program can hence be abbreviated to:
let collectVariables expr =
  { ?v | [expr] child* [?v : Var] }
The path query in the comprehension, recognisable by the use of square brackets around node
variables, should be read as “a path from node expr to a node ?v of type Var following, zero
or more times, a direct child edge in the program tree”. Here, the child edges are built-in,
but as we shall see, new edges can easily be defined.
These logical constructs enable the search for complex patterns using a variety of tree
traversals. They present an alternative to usual solutions for traversing a tree using different
search strategies. In functional programming, different kinds of tree traversal are usually
achieved by the use of combinators [Spi00, LV02]. In Tom or in Stratego [BKVV06], built-in
or constructed strategies are used to control tree traversals. Yet, in order to find complex
patterns in the tree, a context may have to be carried over during these search strategies.
For instance, in [BMR07], Balland et al. parameterise a search strategy with a map of labels
to nodes in order to collect these labels and traverse bytecode instructions based on their
control flow. Another example of context propagation, summarised in [BvDOV06] by the
Stratego people, is the introduction of dynamic rewrite rules for expressing context-sensitive
transformations.
In JunGL, user-defined edges provide a mechanism to turn the tree into a directed graph
by mapping nodes to other nodes in the program tree, thus allowing scripts to refer to contextual
information. That mechanism is very similar to the use of reference attribute grammars,
which has proved very successful for the construction of compilers [EH07]. In the following
section, we introduce the logical features of JunGL for building such a directed graph and
for querying it.
2.2 Logical features
Typically we wish to super-impose some graph structure on top of the object program tree,
run a number of queries on that graph to find out specific information, and then make some
destructive updates to the underlying tree. As we illustrated, a functional language is not
ideal for querying a graph structure; logic languages, in the Datalog tradition, are much
better suited to that task.
2.2.1 Predicates
The notion of predicates in our language effectively makes JunGL a hybrid functional and
logic language. Predicates are built from conjunctions (&), disjunctions (|), negations (!),
a first operator (similar in spirit to the cut operator in Prolog), calls to other predicates,
tests, and path queries, to which we shall dedicate a special section. Furthermore, we allow
recursion inside predicates, though under some conditions, which we shall explain later in
the thesis.
JunGL is therefore akin to early attempts at integrating logic features into functional
languages, such as LogLisp [RS82] or the embedding of Prolog in Haskell proposed by Mike
Spivey and Silvija Seres [SS99]. Importantly, however, we have not found it necessary to
import the full power of a logic language such as Prolog, and in particular there is no use
of unification in the implementation. Our logical features are instead based on Datalog
(essentially Prolog minus data structures as we shall see in Chapter 3), which provides just
the right balance of expressive power with an efficient implementation. With Datalog on finite
structures, in contrast to Prolog, it is impossible to output an infinite stream of successes.
This difference appears to be crucial when it comes to building a graph on top of the program
tree and querying it. In JunGL, we guarantee that all queries terminate even on a cyclic
graph such as the control flow of a program. We shall elaborate on this issue in Section 2.3.
Predicates can be named just like functions, by using the keyword predicate in a let
binding:
let predicate sibling(?x, ?y) =
  [?x : VarDecl] & [?y : VarDecl] & ?x != ?y & ?x.name == ?y.name
This predicate looks for two sibling variables in the whole program, that is, two distinct
variable declarations with equal names.
When integrating a functional and a logic language, the key question is how we use
predicates in functions, and vice versa. In JunGL, one can use predicates in functions via a
stream comprehension. More precisely,
{ ?x | p(?x) }
will return a stream of all x that satisfy the predicate p. For instance, the following expression
returns all pairs of sibling variables in a loaded While program:
{ (?x, ?y) | sibling(?x, ?y) }
Note again that logical variables such as ?x are prefixed by a question mark to distinguish
them from normal variable names. One can use expressions as arguments in predicates, but
obviously all logical variables in such an expression must be bound.
Logical terms do not have to be named if their value is of no interest. As in functional
pattern matching, ‘_’ denotes a used-once free variable that can match anything. It is thus
possible to write:
{ ?x | sibling(?x, _) }
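To make the behaviour of stream comprehensions concrete, here is an illustrative sketch in Python, not in JunGL: the sibling predicate is modelled as a lazy generator over a small hypothetical table of variable declarations, and projections mirror the comprehensions above. All names and data here are invented for illustration.

```python
# Hypothetical mini-model: a variable declaration is a (node_id, name) pair.
var_decls = [(1, "x"), (2, "y"), (3, "x"), (4, "z"), (5, "y")]

def sibling():
    """Yield all ordered pairs of distinct declarations with equal names,
    analogous to { (?x, ?y) | sibling(?x, ?y) } in JunGL."""
    for a in var_decls:
        for b in var_decls:
            if a != b and a[1] == b[1]:
                yield (a, b)

# Like a JunGL stream, the generator is lazy: matches are produced on demand.
pairs = list(sibling())
# Projecting on the first component mirrors { ?x | sibling(?x, _) }.
firsts = sorted({a for (a, _) in sibling()})
```

The wildcard of the JunGL comprehension corresponds here to simply discarding the second component of each match.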
2.2.2 Lazy edges
The tree representing the program we wish to query does not by itself contain enough
information to encode non-naive refactorings, which require information beyond pure syntax. For
instance, one ought to know where a given variable is declared. Similarly, one might expect
to have access to the control-flow successors of a statement.
The solution we have opted for relies on the ability to super-impose contextual semantic
information on top of the tree representation of the program. Initially, that representation
is just a forest of ASTs for all the compilation units, whose edges simply indicate child and
parent relationships. We allow the addition of further relationships via lazy edge definitions.
By “lazy”, we mean that an edge is only evaluated when it is required. Hence the initial tree
is turned into a directed graph in a demand-driven manner.
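The demand-driven mechanism can be sketched as follows in Python. This is a toy model, not the actual JunGL implementation: an edge is a function from a node to its targets, evaluated only on first access and cached on the node thereafter.

```python
# Illustrative sketch of lazy, cached edges (all names are hypothetical).

class Node:
    def __init__(self, label):
        self.label = label
        self._edge_cache = {}   # per-node cache of evaluated edges

EDGE_DEFS = {}     # edge name -> function(node) -> list of target nodes
EVALUATIONS = []   # trace, to make the demand-driven evaluation visible

def edge(name, node):
    """Return the targets of `name` from `node`, evaluating at most once."""
    if name not in node._edge_cache:
        EVALUATIONS.append((name, node.label))
        node._edge_cache[name] = EDGE_DEFS[name](node)
    return node._edge_cache[name]

a, b = Node("a"), Node("b")
EDGE_DEFS["next"] = lambda n: [b] if n is a else []

assert EVALUATIONS == []      # nothing computed until demanded
edge("next", a)
edge("next", a)               # second access hits the cache
```

Under this reading, turning the tree into a graph is simply a matter of which edges are ever demanded.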
To illustrate the definition of edges, we shall describe how the declaration of a variable
in While can be looked up very simply, just by defining an extra lazy edge that relates a
variable reference to its declaration.
First, we create an edge treePred to reflect a special traversal strategy based on tree
predecessors. The definition of an edge always follows the same pattern, that is:
let edge treePred n → ?pred =
  ...
Here, the name of the edge is treePred. The variable n captures the source node of the
edge, and ?pred is a logic variable that is to match the target of a possible edge emanating
from n. The body of an edge is then defined as a relation between n and ?pred . We therefore
complete our example as follows:
let edge treePred n → ?pred =
  first ([n] listPredecessor [?pred] | [n] parent [?pred])
In words, it says that ?pred is the tree predecessor of n either:
1. if n is in a list and ?pred is the direct predecessor of n in that list, or
2. if ?pred is the direct parent of n.
In addition, the operator first is used to select only the first of the two possible matches, thus
returning the parent of n only if n has no list predecessor. If n has neither list predecessor
nor parent, then there is no match for ?pred : it is a failure and there is no treePred edge
emanating from n. On the AST of a program, following transitively treePred edges from a
given node n just builds up a path from n to the root of the AST, where all parents of n and
their list predecessors are visited.
The two alternatives around the union operator ‘|’ are path queries that we shall detail
shortly. Right now, it is enough to understand that the body of an edge definition is just a
predicate that must hold for any target of that edge. This explains the difference in notation
between the source and target nodes in the definition: The question mark in ?pred indicates
a free variable that must be substituted with all possible targets of edges outgoing from the
single node n. Such an asymmetry makes sense in the presence of the operator first. Indeed,
if we were considering a relation treePred with symmetric roles for both the source and target
nodes, then the operator first would apply to the whole relation, and we would get only one
treePred edge outgoing from the first node that has either a left sibling or a parent. Here
first is implicitly parameterised by the variable n. The asymmetry allows us to reason, more
simply, about targets from a single node only.
Armed with the treePred search strategy, it is very easy to define the edge that binds a
variable to its declaration. In our toy language While, it suffices to climb up the tree and
look for the first declaration of a variable whose name matches the name of the variable we
are trying to resolve.
let edge lookup r : Var → ?dec =
  first ([r] treePred+ [?dec : VarDecl] & r.name == ?dec.name)
Interestingly, the source r of the edge is here accompanied by the AST data type Var .
This means the edge lookup will only be defined from nodes that are of type Var , i.e. from
variable references. The body of the edge definition then reads as follows: follow one or more
treePred edges from r until a node of type VarDecl is found, with a name equal to the name
of the variable defined in r . The use of first forces the evaluation of results with respect to
the traversal order, and yields only the first match if there is one.
The edges treePred and lookup will only be constructed when we try to access them from a
specific node. This mechanism of lazy edge construction is very convenient when introducing
new tree nodes, as it often relieves us of the burden to laboriously construct all the auxiliary
information on new nodes. Without it, scripts would quickly become prohibitively complex
because we would have to remember to construct all relevant edges when creating new graph
nodes, and also inefficient. All computed information on the AST is handled in this way, so
for example edges for representing the control flow of a program are also represented as lazy
edges. We shall now describe that example together with another feature, namely attributes.
In some cases, it is useful to enrich a node with some value, rather than linking it to other
existing nodes. For that purpose we use attributes. The value of an attribute may be of
any type, and notably be a freshly created node. Indeed, it is sometimes convenient to add
dummy nodes to the original program tree, especially to make the super-imposition of edges
more natural. To illustrate in a concrete setting, the definition of the control-flow graph of a
program is more readable in the presence of special dummy nodes entry and exit, attached
to the root node of any program.
type Entry
type Exit

let attribute entry p : Program = new Entry {}
let attribute exit p : Program = new Exit {}
Here, we define two new AST data types and two new attributes for representing the
entry and the exit of any node of type Program. The values of these attributes are just a
new Entry node and a new Exit node respectively.
We can now use these attributes to define the control-flow successors of any statement,
again as lazy edges. The following edge definition specifies the control-flow successors of
ordinary statements such as assignments:
let edge defaultCFSucc x : Statement → ?y =
  first ([x] listSuccessor [?y]
       | [x] parent [?y : WhileLoop]
       | [x] parent ; defaultCFSucc [?y]
       | [x] parent ; exit [?y]
       )
The edge listSuccessor is a built-in edge that relates a node present in a list in the original
AST (such as a statement in a block) to its successor in the same list. The default control-flow
successor of a statement is therefore the first match among the following ordered alternatives:
the next statement in the list; otherwise the direct parent, if it is a while loop (to encode
the iteration); otherwise the default successor of the parent (typically when escaping a block
or the branches of a conditional); or otherwise, finally, the dummy exit node of the program.
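The cascade of ordered alternatives can be sketched in Python under an invented node model: the first alternative that applies wins, mirroring the first operator.

```python
# Sketch of the defaultCFSucc cascade (node structure is hypothetical).

class S:
    def __init__(self, kind, parent=None, list_succ=None):
        self.kind, self.parent, self.list_succ = kind, parent, list_succ

EXIT = S("Exit")

def default_cf_succ(x):
    if x.list_succ is not None:     # next statement in the list
        return x.list_succ
    p = x.parent
    if p is None:                   # no enclosing statement left: exit
        return EXIT
    if p.kind == "WhileLoop":       # loop back to the while test
        return p
    return default_cf_succ(p)       # escape a block or a branch

loop = S("WhileLoop")
block = S("Block", parent=loop)
s1 = S("Assignment", parent=block)
s2 = S("Print", parent=block)
s1.list_succ = s2

assert default_cf_succ(s1) is s2    # next in the list
assert default_cf_succ(s2) is loop  # last in the loop body: back to the loop
```

The recursion through the parent is exactly the third alternative, [x] parent ; defaultCFSucc [?y], read operationally.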
We now need to give the exact control-flow successors for each kind of statement, and
we do that via the following definitions:
let edge cfsucc x : Statement → ?y = [x] defaultCFSucc [?y]

let edge cfsucc x : Block → ?y =
  first ([x] firstChild [?y] | [x] defaultCFSucc [?y])

let edge cfsucc x : If → ?y =
  [x] thenBranch [?y]
  | first ([x] elseBranch [?y] | [x] defaultCFSucc [?y])

let edge cfsucc x : WhileLoop → ?y =
  [x] body [?y] | [x] defaultCFSucc [?y]
We see here that overriding is allowed in edge definitions. The cfsucc edge definition for
Block overrides that for Statement. The definition used to compute edges emanating from a
given node x is resolved by inspecting the type of x at runtime. As expected, the most specific
edge definition is always the one used. Hence the control-flow successors of a variable
declaration (a node of type VarDecl), an assignment (Assignment) or a print statement
(Print) are all computed by evaluating the first cfsucc edge definition. The latter three edge
definitions are for specific kinds of statements: the successor of a block is either its first child
or, if the block is empty, its default successor; the successors of a conditional statement are
both the then branch and either the else branch or the default successor of the if; as for a
while loop, its successors are both its body and its default successor.
Note that the definition for if statements is valid for well-formed programs only. Nevertheless,
in JunGL, it would be easy to cope with ill-formed programs too, and handle the
control-flow graph of a program that is not syntactically complete. To illustrate briefly, here
is how we can cope with missing then branches in conditional statements:
let edge cfsucc x : If → ?y =
  first ([x] thenBranch [?y] | [x] defaultCFSucc [?y])
  | first ([x] elseBranch [?y] | [x] defaultCFSucc [?y])
At this stage, we have already given several sample definitions of edges. Looking at the
body of them more closely, we can see quite a few references to edges that were not introduced
through a proper let edge definition. Most of them simply correspond to some labeled field
of an AST data type (e.g. thenBranch, body) or to some additional attribute introduced via
let attribute (e.g. entry). The others, as we have mentioned sometimes, are built-in edges
that relate nodes to their immediate neighbours in all possible directions in the tree. They
are summarised in Table 2.1. A child of a node x is any node directly under x or any node
in a list of nodes directly under x. The order of children is given by the position of fields in
the AST data type and, for fields that are lists of nodes, by the list order. The successor y
of a node x whose parent is p, and whose position is i with respect to all children of p, is the
child of p at position i + 1, if it exists. However, y is a list successor of x only under the
additional constraint that x and y appear in the same list.
Name              Points to
parent            the parent of the node, if any
child             all the children of the node, if any
firstChild        the first child of the node, if any
lastChild         the last child of the node, if any
successor         the right sibling of the node, if any
predecessor       the left sibling of the node, if any
listSuccessor     if the node is present in a list of nodes,
                  the successor of the node in the same list, if any
listPredecessor   if the node is present in a list of nodes,
                  the predecessor of the node in the same list, if any

Table 2.1: Built-in edges in JunGL
In order to understand edge bodies more precisely, we now turn to introducing path
queries.
2.2.3 Path queries
The most common way of constructing predicates is via path queries, also called regular path
queries. Path queries are regular expressions for checking properties about individual paths
(existential queries) or about all paths (universal queries) on a graph representation of a
program. Path queries are of course very well-known in the context of semi-structured data,
but have only been revisited fairly recently for the specific purpose of querying the control flow
of programs by De Moor et al. in [dMLVW03]. Liu et al. then proposed parametric regular
path queries [LRY+04], which slightly increase the expressiveness by allowing additional
information to be collected along single or multiple paths. Even more recently, Liu introduced
an intuitive syntax to use path queries for querying any complex graph [LS06]. Path queries
in JunGL follow the general idea of that syntax. The semantics are however different as our
path queries yield results in a deterministic order.
Path queries are very intuitive and we have already seen many examples in previous edge
definitions. For instance,
let edge treePred n → ?pred =
  first ([n] listPredecessor [?pred] | [n] parent [?pred])
There, we have two simple path queries on both sides of the ‘|’ operator. The path compo-
nents between square brackets are conditions on nodes, whereas listPredecessor and parent
match either:
1. the type of an edge emanating from a node, or
2. the type of an attribute attached to a node, or
3. the name of a field defined in a node.
For simplicity, however, we always call “edge” any component between two node blocks.
In addition, we refer to the first node component as the start node, and to the second node
component as the end node.
Each node component consists of a variable (logical or not) whose type is an AST data
type. It can be annotated with a positive or negative AST data type reference (for instance
[?pred:Statement] or [?pred:!Statement]) to constrain the possible matches to nodes
that are, or are not, of type Statement.
An edge can be a simple label like above, or a more complex expression. Notably, edges
can be sequentially composed using ‘;’. It is also possible to append a ‘+’ or ‘*’ to an edge l .
The former is simply the transitive closure of the edge relation, meaning that the end node
can be reached from the start node by following one or more matches of l . The latter is the
reflexive transitive closure of the edge relation, which on top of the transitive closure allows
the end node and the start node to be identical.
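A cycle-safe operational reading of l+ and l* can be sketched in Python: reachability over an edge relation, visiting each node at most once, which is also why such queries terminate on cyclic graphs. The edge relation below is an invented example.

```python
# Sketch: l+ and l* as cycle-safe reachability over an edge relation.

def plus(edges, start):
    """Nodes reachable by one or more edges (the l+ closure)."""
    seen, frontier = set(), [start]
    while frontier:
        n = frontier.pop()
        for m in edges.get(n, []):
            if m not in seen:          # each node explored once: terminates
                seen.add(m)
                frontier.append(m)
    return seen

def star(edges, start):
    """Reflexive transitive closure: l* also admits the start node itself."""
    return plus(edges, start) | {start}

cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}   # a cycle, as in a CFG
assert plus(cyclic, "a") == {"a", "b", "c"}     # a reaches itself via the cycle
```

Note that on the cyclic example, ‘a’ belongs to its own l+ closure because the cycle provides a non-empty path back to it.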
We often need further expressive power in order to match a complex pattern where each
node on a transitive path has a side condition. For that purpose we allow, like in [LS06], the
use of existential local variables inside an edge expression.
As an illustration, we shall define strict post-dominance between statements in a control-
flow graph, but to better appreciate the definition in JunGL, we first give the precise definition
given by Muchnick in [Muc97]. There, post-dominance and strict post-dominance are defined
as follows:
In the control-flow graph, node p post-dominates node i, written p pdom i, if
every possible execution path from i to exit includes p.
[. . .]
Node p strictly post-dominates node i if p pdom i and p ≠ i.
In JunGL, the edge definition for strict post-dominance reads:
let edge postDominates x : Statement → ?y =
  [?y : Statement] cfsucc+ [x] &
  !([?y] (local ?z : cfsucc [?z] & ?z != x)+ [: Exit])
That is, x post-dominates ?y if x is a transitive successor of ?y in the control-flow graph and
there is no path from ?y to the exit that does not go through x , i.e. whose intervening nodes
?z are all different from x . Note that we assume there is a path from each node to the exit,
which is reasonable.
The key here is the use of the locally scoped variable ?z, which is substituted with a
different node at each step on the path from ?y to the exit. These local variables greatly
improve the expressive power of path queries and, as we see, allow the concise and readable
expression of the complex control-flow and dataflow properties that one finds in compiler or
program analysis books [Muc97, NNH99]. A detailed description of the syntax of path queries can be
found in Appendix A, where the full grammar of JunGL is exposed.
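A direct, if naive, rendering of the strict post-dominance definition can be sketched in Python on an explicitly encoded control-flow graph. The encoding and the small diamond-shaped example are invented for illustration.

```python
# Sketch of the postDominates query: p strictly post-dominates i iff p != i
# and there is no path from i to the exit avoiding p.

def path_avoiding(cfsucc, start, avoid, target):
    """Is there a path start ->* target whose nodes after `start` skip `avoid`?"""
    seen, frontier = set(), [start]
    while frontier:
        n = frontier.pop()
        if n == target:
            return True
        for m in cfsucc.get(n, []):
            if m != avoid and m not in seen:
                seen.add(m)
                frontier.append(m)
    return False

def post_dominates(cfsucc, p, i):
    return p != i and not path_avoiding(cfsucc, i, p, "exit")

# Diamond CFG: i branches to a or b, both rejoin at p, then exit.
cfg = {"i": ["a", "b"], "a": ["p"], "b": ["p"], "p": ["exit"]}
assert post_dominates(cfg, "p", "i")      # every path goes through p
assert not post_dominates(cfg, "a", "i")  # the path via b avoids a
```

The avoid parameter plays the role of the negated path query over the local variable ?z, and the p != i conjunct enforces strictness.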
2.3 Computational model
Now that we have presented the main features of the language, as well as the program tree
structure that is manipulated, one may naturally wonder about the computational model of
JunGL. In this section, we describe how functional and logical features interact with each
other and with the underlying program tree structure.
In particular, we shall highlight the declarative nature of the logical features and explain
how we deal with issues like cycles in the program graph and termination of recursive queries.
We shall also discuss the interaction of lazy queries and destructive updates, a common issue
in query languages with update facilities.
Declarative edge definitions At first sight, some edge definitions in JunGL may seem
to go against a declarative reading. This impression notably comes from the use of the first
operator, as in our earlier example:
let edge defaultCFSucc x : Statement → ?y =
  first ([x] listSuccessor [?y]
       | [x] parent [?y : WhileLoop]
       | [x] parent ; defaultCFSucc [?y]
       | [x] parent ; exit [?y]
       )
The operator first is indeed reminiscent of a cut operator in impure logic programming.
However, the presence of first does not give any insight into the actual evaluation mechanism
of our queries. We shall see in the coming chapters that all our logical features in fact
translate to a variant of Datalog, a database query language with a declarative least fixpoint
semantics. Datalog programs can be evaluated in multiple ways, either top-down or
bottom-up, and authors of JunGL queries do not need to be aware of the precise evaluation
mechanism. The declarative nature of the logic features of JunGL lies in the existence of
such a hidden mechanism for evaluating and optimising logic queries, which we shall describe
in Chapter 4.
Termination for cyclic graphs Non-termination issues may naturally arise when dealing
with edge definitions that introduce cycles in the program graph. This may for instance be the
case in the above example of the defaultCFSucc edge, which is used for building the control
flow graph of a program on top of its AST. At this point, the original tree structure of the
AST is transformed into an arbitrary, possibly cyclic, graph. How can we then guarantee the
termination of queries for lazily constructed graphs? Very simply, we have a finite number of
initial AST nodes, and by ensuring that we only add edges between those nodes and never
retract some, we are guaranteed to compute a stable view of the final graph.
Indeed, edge definitions are part of the logical features of JunGL, and fully translate
to our variant of Datalog. We shall actually give the precise translation of defaultCFSucc
in Section 4.5.5. Hence, there is no way in JunGL to create an arbitrary edge between two
nodes programmatically. Edges are logical relations between nodes, and evaluated as Datalog
predicates. In other words, they are intensional views on the ground facts of the original AST
of the program. As we shall see in Chapters 3 and 4, the termination of queries is therefore
guaranteed by the Datalog framework we build on. However, AST nodes may be modified,
created or deleted via the functional features of the language, which leads us to discuss the
tricky issue of update facilities in a query language.
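The termination guarantee can be illustrated with a naive bottom-up Datalog evaluation, sketched in Python: iterate the rules until no new facts are derivable. On a finite node set the least fixpoint is reached after finitely many rounds, even when the edge facts form a cycle. The fact base is an invented example, and real engines use semi-naive evaluation rather than this naive loop.

```python
# Naive bottom-up evaluation of:  reach(x,y) :- edge(x,y).
#                                 reach(x,z) :- reach(x,y), edge(y,z).

def reach(edge_facts):
    facts = set(edge_facts)
    while True:
        new = {(x, z)
               for (x, y) in facts
               for (y2, z) in edge_facts
               if y == y2}
        if new <= facts:
            return facts        # fixpoint: nothing new is derivable
        facts |= new

cyclic_edges = {("a", "b"), ("b", "c"), ("c", "a")}
result = reach(cyclic_edges)
assert ("a", "a") in result     # reachable around the cycle
```

Because facts only ever accumulate over a finite universe of nodes, the loop must stabilise; this mirrors the monotone, never-retracting construction of edges in JunGL.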
Mixing queries and destructive updates We have shown in the previous sections how
to refer to predicates in stream comprehensions, and how to use functional expressions as
arguments in predicates, or as additional constraints. It may therefore be possible to perform
updates to the underlying AST of the program while evaluating logical edges or predicates.
As can be foreseen, however, mixing declarative queries with such updates is likely to result
in weird evaluation behaviours, including non-termination.
Implementers of relational databases have been aware for more than thirty years of this
issue, which is commonly referred to as the Halloween problem. A precise account of the
history of the problem and an explanation of its name can be found in [Fit02]. The issue
is well illustrated with the following classical example. Say that for every row in a table,
you insert another row in that same table. If no special care is taken, new inserts may
themselves trigger other inserts, thus leading to non-termination. To prevent this, most
databases implement some kind of snapshot semantics where queries are run on a copy of
the structure to be queried. In the above example, instead of working on the current table
that is being updated, the query would be evaluated on a snapshot of the original table.
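The classical example can be rendered in Python terms, with a list playing the role of the table. Iterating over a snapshot (a copy) is what makes the update terminate with one insert per original row; iterating over the live, growing table would keep feeding the loop its own inserts.

```python
# Sketch of the Halloween problem and the snapshot workaround.
# The table contents are an invented example.
table = [("alice", 100), ("bob", 200)]

# Iterating over a snapshot, as databases with snapshot semantics do:
snapshot = list(table)
for (name, salary) in snapshot:
    table.append((name + "_copy", salary))

assert len(table) == 4   # exactly one insert per original row
```

Had the loop iterated over table itself, each appended row would have been visited in turn and triggered yet another insert, and the loop would never terminate.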
These snapshot semantics are also at the root of the recent W3C recommendation for
XQuery Update Facility 1.0 [CFM+08]. There, the XQuery processing model is extended so
that the result of an expression consists of both a normal XQuery result and a pending update
list, which represents node state changes that have not yet been applied. If the outermost
expression in a query returns a non-empty pending update list, all the changes are implicitly
invoked at that point. In effect, XQuery Update Facility therefore defines an entire query as
one snapshot. Such snapshot semantics at the level of the entire query, however, prevent the
results of side effects from being seen during the computation of the query. To overcome
this limitation, an XQuery Scripting Extension has been proposed to define a deterministic
sequential order for XQuery expressions [CEF+08]. The snapshot granularity may hence be
reduced, with later expressions seeing the effects of the expressions that came before them.
In JunGL, we have not yet implemented any snapshot semantics. Currently, it is the
responsibility of the script author to ensure that the functions used in queries are side-effect
free. This is the same approach as in the attribute grammar system JastAdd [EH04] in which
attributes are expressed in Java, and hence may also have undesirable side effects. Another
issue, shared with attribute grammar systems, lies in the fact that any update to the the
underlying AST may invalidate previously constructed edges. So far, in our experiments, we
have always managed to mimic snapshot semantics and delay any update to the tree to the
end of the refactoring script, after which all lazy edges are invalidated. However, it would
be far preferable to maintain edges incrementally on every change. We discuss this
future work in Chapter 7.
Finally, one has to bear in mind that results of stream comprehensions are returned lazily in
JunGL. Therefore, like in other frameworks such as LINQ [MBB06], special care is required
when performing updates on the results of a query. Again, this could be solved with snapshot
semantics, but we have opted until now for a simpler common workaround: results of the
query can be cached with the built-in function toList for converting a stream to a list, thus
forcing its eager evaluation. An example of its use is given in the scripts in the appendices.
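The lazy-stream pitfall and the toList workaround can be sketched in Python, with a generator playing the role of a JunGL stream and list() playing the role of toList. The data is an invented example.

```python
# Sketch: force a lazy stream before updating the structure it queries.
nodes = ["a", "bb", "ccc"]

lazy = (n for n in nodes if len(n) > 1)   # nothing evaluated yet
forced = list(lazy)                       # like toList: cache the results

# It is now safe to update `nodes` while iterating over the cached results.
for n in forced:
    nodes.remove(n)

assert nodes == ["a"]
```

Iterating the lazy generator directly while removing elements would interleave the query with the update, with the same unpredictable behaviour described above.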
2.4 Other features
Namespaces We use namespaces to avoid name conflicts in the presence of many different
functions or data types. In Figure 2.1, the AST data types could have been defined inside
the namespace While.Ast for instance:
namespace While.Ast {
  ...
}
A previously defined namespace can then be imported through the using construct. The
type Program can be referred to as While.Ast .Program from other namespaces or indeed
directly as Program in a scope using the namespace While.Ast as in:
using While.Ast {
  ...
}
Foreach In order to iterate on streams, we have added a foreach construct to JunGL, which
is just syntactic sugar for an iter function on streams. This imperative loop construct still
enables pattern matching on the values of a stream at each iteration step. To illustrate, here
is how we would traverse the stream of sibling variables:
foreach (x, y) in { (?x, ?y) | sibling(?x, ?y) } do
  ...
External calls The problem we often encounter with Domain Specific Languages is that
there is always a need for an interaction we have not envisioned. That is the main practical
difference from embedded DSLs, where you can just make use of the full power of the host
language if required.
In particular, refactoring is an interactive process that often requires guidance from the
users. These interventions range from specifying a name or a set of methods during Rename
or Extract Interface refactorings, to resolving potential conflicts that might occur during a
transformation. The latter case is particularly useful when there is no obvious best solution
to the conflicts, and when we wish to minimise rejection of the transformation. Therefore, we
have added external UI features to JunGL. They are called as normal functions that belong
to particular namespaces.
For the purpose of demonstrating the use of JunGL in a broader context than just
refactoring, we have also added external functions for building up a small IDE for the object
language one wishes to manipulate. It is indeed possible to plug some program analyses
written in JunGL into the editor of the object language. For instance, the function
addErrorFinder in the namespace Editor is used to plug on-the-fly compiler checks into the editor.
To illustrate, we shall now describe the toolkit we have built around JunGL, and show some
further examples for the While object language.
2.5 The toolkit around the language
JunGL is part of a toolkit that aims to be a complete end-to-end solution for prototyping
refactoring transformations on any language. The system consists of four components
implemented on the .NET platform: a graph data structure, an interpreter for the scripts that
manipulate this data structure, and two editors for the object language, not the scripts. For a
rich interactive experience, refactoring tools commonly guide users through ‘wizards’. We do
not support such complex UI components but provide basic support for script authors to ask
for user input. More advanced interaction can be achieved via other calls to external code.
A diagram of the toolkit’s architecture extracted from [VPdM06] is depicted in Figure 2.2.
We briefly describe the main components here.
Figure 2.2: Overview of the toolkit
2.5.1 The graph structure
JunGL manipulates the graph through basic operations defined in a small interface. We
provide a default implementation of this interface in C#. Before the construction of lazy
edges, the graph is a tree whose grammar is defined in JunGL through custom AST data
types. We have given in Figure 2.1 an example of such a grammar. For a better integration
in our toolkit, AST data types can further be annotated. For instance, Figure 2.3 shows the
grammar example of Figure 2.1, this time with annotations.
The basic pretty printing annotation @pretty is used to render newly created nodes as
text. It only provides a basic, yet convenient, pretty-printing mechanism. It is not used to
describe the concrete syntax of the object language. Therefore the @parser annotation is
used to specify which parser needs to be called for building the AST of a program. One can
also create any AST from scratch, or update it programmatically. We enforce at runtime
that each node has one parent at most and that no cycle is introduced accidentally.
ASTs are turned into an actual graph only when lazy edges are evaluated. No edge can be
added imperatively. Each node has a list of edges that relate it to other nodes of the ASTs.
We apply the same caching techniques as the ones found in attribute grammar systems like
JastAdd [EH04]. Once an edge from a node n has been evaluated, it is cached in node n
until further modification of the underlying tree.
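This caching discipline can be sketched in Python with a deliberately coarse whole-graph invalidation scheme. The coarseness is an assumption made for brevity; systems like JastAdd track the consequences of a change more finely.

```python
# Sketch of cached edges with invalidation on tree modification
# (all names are hypothetical; invalidation is coarse-grained on purpose).

class Graph:
    def __init__(self):
        self.version = 0
        self.computed = 0          # counts actual edge evaluations

    def modify_tree(self):
        self.version += 1          # any tree update invalidates all caches

class CachedEdge:
    def __init__(self, graph, compute):
        self.graph, self.compute = graph, compute
        self.cache, self.cached_at = None, -1

    def get(self, node):
        if self.cached_at != self.graph.version:
            self.graph.computed += 1
            self.cache = self.compute(node)
            self.cached_at = self.graph.version
        return self.cache

g = Graph()
e = CachedEdge(g, lambda n: n.upper())
assert e.get("x") == "X" and e.get("x") == "X"
assert g.computed == 1             # the second access hit the cache
g.modify_tree()
e.get("x")
assert g.computed == 2             # the tree change forced re-evaluation
```

The version counter stands in for the "until further modification of the underlying tree" clause: a stale cache entry is simply recomputed on the next demand.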
To work with a different object language from scratch, one simply provides another
grammar via AST data types, along with the new parser. There is no support in JunGL for syntax
definitions but all the work on Generalized LR parsing techniques [Tom87] could be reused
here in order to make JunGL a complete end-to-end solution. However, our architecture
already makes it easy to leverage an existing strongly-typed AST implementation. All one
needs to do is to make the existing AST classes implement the interface that the JunGL
interpreter uses for manipulating trees and graphs. This is particularly convenient if one
wishes to run JunGL in an existing development environment for instance.
2.5.2 The interpreter
The JunGL interpreter follows the usual pattern of an interpreter for a functional language
in a functional language. Indeed, JunGL is implemented in F# [Sym05], a variant of ML that
runs on top of the .NET framework. Because F# is fully integrated in .NET, it allows us to
work across languages. In particular, we can use the C# implementation of the graph in our
F# programs and vice versa. For now JunGL, like most other scripting languages, is only
dynamically typed. In future work, one may want to augment the language with at least
some form of soft typing, to provide more static safety.
The most interesting part of the interpreter is therefore its treatment of logical features,
which we shall explore in detail in the coming chapters.
2.5.3 Editors
In addition to the interpreter itself, we have implemented two editors for programs written
in the object language:
• a text editor to which it is possible to add features implemented as JunGL scripts, and
• a structure editor for visualising the graph that is manipulated by JunGL.
Both editors use the pretty-printing annotations of the AST datatype definitions to render
the AST.
The purpose of the text editor is to demonstrate the use of JunGL in a broader context
than just refactoring. We shall show in the next section a few examples of features that one
can plug in this editor. For instance, definite assignment of variables can be enforced with
a tiny JunGL script and violations marked via red squiggles on the program text. Another
CHAPTER 2. DESIGN OF THE LANGUAGE 36
type
  @pretty("|($statements)|")
  @parser("JunGLAddins:JunGLAddins.Parsers.WhileParser.WhileParser")
  Program = { statements : Statement list }
and
  Statement =
  | @pretty("'while (' $condition ')' \n \t $body")
    WhileLoop = { condition : Expression ; body : Statement }
  | @pretty("'if (' $condition ')' \n \t $thenBranch \n [ 'else' \n \t $elseBranch ]")
    If = { condition : Expression ; thenBranch : Statement ; elseBranch : Statement }
  | @pretty("$typeRef ' ' $name ';'")
    VarDecl = { typeRef : Type ; name : string }
  | @pretty("$var ' = ' $expression ';'")
    Assignment = { var : Var ; expression : Expression }
  | @pretty("'{' \n \t |($statements)| \n '}'")
    Block = { statements : Statement list }
  | @pretty("'print(' $expression ');'")
    Print = { expression : Expression }
and
  Expression =
  | @pretty("$name")
    Var = { name : string }
  | Literal = (
    | @pretty("'true'") True
    | @pretty("'false'") False
    | @pretty("$value") Int = { value : int }
    )
  | @pretty("$left ' ' $operator ' ' $right")
    InfixOperation = { left : Expression ; operator : InfixOperator ; right : Expression }
  | @pretty("$operator $operand")
    PrefixOperation = { operator : PrefixOperator ; operand : Expression }
  | @pretty("'(' $expression ')'")
    ParenthesizedExpression = { expression : Expression }
and
  InfixOperator =
  | @pretty("'&&'") And      | @pretty("'||'") Or
  | @pretty("'+'") Add       | @pretty("'-'") Sub
  | @pretty("'*'") Mul       | @pretty("'/'") Div
  | @pretty("'=='") Equal    | @pretty("'!='") NotEqual
  | @pretty("'<'") LessThan  | @pretty("'>'") GreaterThan
and
  PrefixOperator =
  | @pretty("'!'") Not | @pretty("'+'") Plus | @pretty("'-'") Minus
and
  Type =
  | @pretty("'int'") IntType | @pretty("'bool'") BoolType

Figure 2.3: Data types with annotations
example is to plug in a function that resolves the declaration of a variable reference, and
highlights it in the program.
The structure editor has a different purpose. By selecting blocks or nodes, we can visualise
the connections to other nodes in the graph, that we have added by defining lazy edges
in JunGL. We have found this tool indispensable in the interactive development of new
refactoring scripts.
2.6 Further examples on While programs
Before moving to the precise semantics of the logical features in JunGL, we first illustrate
many of the features we have just introduced, on small concrete applications for While.
2.6.1 Binding and definite assignment checks
One of the most basic but useful compiler checks is to ban the use of variables that have not
been declared:
let checkBinding program =
  toList { ?x | [program] child+ [?x:Var] & ![?x] lookup [] }
in
Editor.addErrorFinder Program checkBinding
  "W01: not declared"
Given a program, the function checkBinding returns a list of such variables. We use a
stream comprehension to collect nodes of type Var that have no outgoing lookup edge. Then
we call the external function addErrorFinder to plug the checkBinding analysis into our editor.
Each undeclared variable will now be underlined with the error message “W01: not declared”.
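For intuition, the same check can be sketched outside JunGL. The following Python fragment is a hypothetical miniature, not part of the JunGL toolkit: plain tuples stand in for AST nodes, and a flat sequence of While statements is scanned for uses of undeclared variables.

```python
def check_binding(statements):
    """Return the names of variables used before any declaration.

    Statements are encoded as tuples: ("decl", name),
    ("assign", name, [variables read]) or ("print", [variables read]).
    """
    declared, errors = set(), []
    for stmt in statements:
        if stmt[0] == "decl":
            declared.add(stmt[1])
        elif stmt[0] == "assign":
            if stmt[1] not in declared:
                errors.append(stmt[1])        # assigned variable undeclared
            errors.extend(v for v in stmt[2] if v not in declared)
        elif stmt[0] == "print":
            errors.extend(v for v in stmt[1] if v not in declared)
    return errors
```

The JunGL version needs no such traversal code: the child+ path query and the lookup edge do the walking and the scoping.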
As a second example, we propose to check a common rule in modern languages: the
definite assignment rule, which requires each local variable to be assigned before it is used.
We start by defining two new edges. The use edges link statements or expressions to the
variables that are read during their execution. Conversely, the def edges relate statements or
expressions to the variables that are written there.
let edge use x:Expression → ?y = [x] child* ; lookup [?y]
let edge use x:Assignment → ?y = [x] expression ; use [?y]
let edge use x:Print → ?y = [x] expression ; use [?y]
let edge use x:If → ?y = [x] condition ; use [?y]
let edge use x:WhileLoop → ?y = [x] condition ; use [?y]
let edge def x:Assignment → ?y = [x] var ; lookup [?y]
In the toy language While, variables can only be written by an Assignment statement. Ex-
pressions are side-effect free: there are no such things as post-increment and post-decrement
operators. There are no function calls with reference parameters either. Therefore, the defi-
nitions of use and def for nodes of type Expression are straightforward. There are simply no def
edges from them, and the use edges point to the declarations of each variable occurring as a
descendant of the expression. Similarly, we also define these edges at the level of statements.
With those edges now defined, we can write the definite assignment rule as just one path
query:
let checkDefiniteAssignment program =
  toList { ?s | [program] child+ [?x:VarDecl]
                (local ?z : [?z] cfsucc & ![?z] def [?x])+ [?s]
                use [?x] }
in
Editor.addErrorFinder Program checkDefiniteAssignment
  "W02: is used without being assigned"
A statement ?s violates the rule if it uses a variable declared at ?x, and there is a control-
flow path from ?x to ?s where no intervening statement ?z defines the variable declared at
?x. Again, that path query is an almost direct translation of a definition one may find in a
compiler textbook.
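The rule also reads naturally as a graph search. As a sketch, here is the check in Python over an explicit control-flow graph rather than JunGL's lazy cfsucc edges; the node identifiers and the defs/uses maps are assumptions of this hypothetical encoding.

```python
from collections import deque

def definite_assignment_violations(cfsucc, decl, defs, uses, var):
    """Statements that may use `var` before it is assigned.

    cfsucc: node -> list of control-flow successors;
    defs/uses: node -> set of variables written/read there;
    decl: the node declaring `var`.
    """
    reached, frontier = set(), deque([decl])
    while frontier:
        node = frontier.popleft()
        for nxt in cfsucc.get(node, []):
            if nxt not in reached:
                reached.add(nxt)
                # a statement defining var kills the path beyond it
                if var not in defs.get(nxt, set()):
                    frontier.append(nxt)
    return sorted(s for s in reached if var in uses.get(s, set()))
```

In JunGL the same search is implicit in the (local ?z : ...)+ path expression, which matches the control-flow paths whose interior statements do not define the variable.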
2.6.2 Rename Variable
We now move on to defining our first refactoring transformation, namely Rename Variable for
the While language. Here, we mix the logical features of JunGL, to find elements of
interest and check preconditions, with the functional ML-like features, to perform destructive
updates on the graph. The full script is just half a page:
let renameVariable program node newName =
  (* Find the element of interest *)
  let dec = pick { ?d | [node] lookup [?d] | equals(node, ?d) } in
  (* Check preconditions *)
  if not dec is VarDecl then
    error "Please select a variable";
  if dec.name == newName then
    error "Please give a different name";
  let findFirst x =
    pick { ?y | [x] treePred+ [?y:VarDecl] &
                (newName == ?y.name | ?y == dec) } in
  let mayBeCaptured =
    { ?x | [program] child+ [?x:Var] & ?x.name == newName } in
  foreach x in mayBeCaptured do
    if findFirst x == dec then error "Variable capture";
  let needRename =
    { ?x | [program] child+ [?x:Var] lookup [dec] } in
  foreach x in needRename do
    if findFirst x != dec then error "Variable capture";
  (* Transform *)
  foreach x in needRename do
    x.name ← newName;
  dec.name ← newName
The renameVariable function takes three arguments: program, the root of the program on
which to perform the transformation; node, a node inside the program representing either a
variable reference or a variable declaration; and a string newName.
The first step of the refactoring is to find the main element of interest, that is the dec-
laration of the variable to rename. We use the built-in function pick that returns the first
element of a stream (or null if the stream is empty). If node is a variable reference with an
outgoing lookup edge, then dec is the declaration of that variable at the end of the lookup
edge. Otherwise, through the use of the binding predicate equals, we assume we were passed
the declaration itself.
We check that assumption on the following line, and raise an error in the case of an
incorrect user selection. We also check that newName differs from the current name of the
declaration.
We then go on to check the main precondition of the refactoring: the renamed decla-
ration should not conflict with any pre-existing declaration of the same name. Indeed,
the way we have defined the lookup edge for the While language allows a variable declaration
to be hidden by a later declaration. Although this is unusual, we
assume the following is a valid While program:
int i;
i = 0;
print(i);
int i;
i = 1;
print(i);
Resolving the variable reference i in the last statement finds the closest declaration of i when
climbing up the program, i.e. the one on the fourth line. Under these circumstances, we
need to be careful when renaming a variable. Consider the following example for instance:
int i;
i = 0;
int j;
j = 1;
print(i);
print(j);
If we simply renamed j to i , our resulting program would still be valid but its semantics
would have changed because name bindings would have changed. Indeed, the i in the first
print statement would now bind to the second declaration of i .
Therefore we need to check that the declaration, once renamed, is not going to capture
any existing variable that is used later, like in the above example. In addition, we also need
to check that no existing declaration will capture the renamed variable.
To handle these two cases of variable capture, we define a single function findFirst that,
given a node x, looks up the first declaration accessible from x that is either the declaration
we wish to rename or an existing declaration already called newName. Then, for the first
case, we consider all variable references that may be captured, because their name is already
the new name we wish to give, and we check that calling findFirst on each of them
returns their original declaration and not the declaration we wish to rename. Similarly, for the
second case, we compute all references to the declaration we are about to rename, and we
check this time that calling findFirst on each of them returns that same declaration (i.e. not
an existing declaration that would capture it).
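These two checks can be mimicked concretely on a flat program, where lookup simply finds the nearest earlier declaration. The Python sketch below is hypothetical (positions stand in for AST nodes) but mirrors findFirst and the two foreach loops of the script.

```python
def find_first(pos, decls, target, new_name):
    """Nearest earlier declaration that is the target or bears new_name."""
    cands = [p for (p, n) in decls if p < pos and (p == target or n == new_name)]
    return max(cands) if cands else None

def capture_free(decls, uses, binds, target, new_name):
    """decls and uses are (position, name) pairs; binds maps each use
    position to the position of its current declaration; target is the
    position of the declaration to rename."""
    for (p, n) in uses:
        # first case: an existing reference named new_name would now
        # bind to the renamed declaration
        if n == new_name and find_first(p, decls, target, new_name) == target:
            return False
        # second case: a renamed reference would bind to some other
        # declaration already named new_name
        if binds[p] == target and find_first(p, decls, target, new_name) != target:
            return False
    return True
```

On the six-line example above (declarations of i and j at positions 0 and 2), renaming j to i is rejected while renaming j to a fresh name is accepted.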
That completes the preconditions, and we can now safely perform the transformation itself.
This part of the code is more operational and hence less interesting. For each variable reference
in the stream needRename, we assign the fresh name newName. Of course, on the last line,
we must not forget to rename the declaration itself. In the end, because of potential variable
conflicts, Rename Variable is not so obvious even for a simple language like While. We shall
see in Chapter 6, however, that the same approach scales very well to much more complex
languages.
2.6.3 Slicing
We conclude this series of illustrative examples with a slightly more ambitious application:
program slicing. The concept of program slicing was originally introduced by Mark Weiser
[Wei84]. He claimed that a slice is the mental abstraction people make when they are debugging
a program. A slice consists of all the statements of a program that may affect the values
of some variables at some location of interest. Many applications were foreseen: debugging,
code understanding, reverse engineering and program testing, to list a few. Yet the use of
slicing for refactoring has only been suggested recently.
The Untangling refactoring we proposed earlier in [EV04, Ett06] indeed uses slicing. It
is like Extract Method, but instead of selecting a contiguous region of code, the programmer
selects a single expression. The tool then extracts the backward slice, namely the statements
that may have contributed to the value of that expression.
Slicing can be expressed elegantly in JunGL. More generally, one can define the Program
Dependence Graph [OO84, HRB90, HR92] via path queries, which in turn allows the correct
mechanisation of many different transformations that require reordering of statements.
A Program Dependence Graph is a graph whose nodes represent the statements of the
program like in the control-flow graph, but whose directed edges represent control depen-
dences, data dependences and structure dependences. We now give the definitions of these
three edges in JunGL.
The control dependence edge builds on the concept of post-dominance we have introduced
earlier, and on control-flow predecessors edges, the dual of cfsucc edges:
let edge cfpred x → ?y =
  first ([?y] cfsucc [x] | [x] parent ; entry [?y])
let edge controlDependentOn x:Statement → ?y =
  [x] postDominates* ; cfpred [?y] & !([x] postDominates [?y])
As we see, cfpred is not defined just as [?y]cfsucc[x]. With such a definition, the
first statement would not have any predecessor because we have not defined any successor
edges emanating from the entry dummy node. Of course, we could have decided to add that
successor edge instead.
In the second edge definition, x is control dependent on ?y if ?y is the control-flow
predecessor of x or of any of the statements x post-dominates, but x does not post-dominate ?y
itself. This typically happens when x is in the body of a while loop ?y (here, we identify
the while loop with its condition expression).
We now move on to data dependencies:
let edge dataDependentOn x:Statement → ?y =
  [x] use [?v] & [?y] def [?v] &
  [?y] (local ?z : cfsucc [?z] & ![?z] def [?v])* ; cfsucc [x]
Statement x is data dependent on ?y if x reads a variable ?v, ?y writes that same variable
?v, and there exists a control-flow path from ?y to x with no intervening definition of ?v.
Then, we define structure dependences such as the possible one between a statement and
its enclosing block, or the dependency between a statement and the declarations of variables
it reads and writes:
let edge structureDependentOn x:Statement → ?y =
  [x] parent [?y:Block]
  | [x] use [?y:VarDecl]
  | [x] def [?y:VarDecl]
Finally, we define an edge that covers all kinds of dependencies. It is just the union of
the three previous ones:
let edge dependentOn x:Statement → ?y =
  [x] controlDependentOn [?y]
  | [x] dataDependentOn [?y]
  | [x] structureDependentOn [?y]
It is now straightforward to obtain a slice of a program from a given statement s , as it is
just a well-known reachability problem on the Program Dependence Graph. The following
stream comprehension yields the set of statements composing the slice:
{ ?x | [s] dependentOn* [?x] }
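Operationally, this closure is an ordinary worklist traversal. A small Python sketch, with the dependence edges given explicitly as a map (an assumption of this encoding):

```python
from collections import deque

def slice_from(depends_on, seed):
    """Statements reachable from `seed` via dependence edges,
    i.e. the backward slice, seed included."""
    seen, frontier = {seed}, deque([seed])
    while frontier:
        node = frontier.popleft()
        for parent in depends_on.get(node, []):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen
```

In JunGL, of course, the traversal is never written out: the dependentOn* path query denotes exactly this reflexive-transitive closure.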
This approach is somewhat reminiscent of the relational approach to slicing explored by
Klint and Vankov in [Kli05, Van05] with RScript, a language based on relational calculus
for querying and analysing source code. In fact, here, we simply use JunGL edges to super-
impose the dependence graph of the program on top of its AST. Relations are thus used
to represent that graph, and transitive closure to compute the statements reachable from
a seed. Klint’s approach is slightly different, as relations are used to represent Kill/Gen sets
for the computation of reaching definitions (as originally proposed by Weiser), and recursion
to find a fixpoint for those relations.
2.7 Summary and references
In this chapter we have introduced all the different features of JunGL: functional features
for manipulating a program tree, lazy edges to superimpose contextual information on that
tree, and logical features to query the graph structure resulting from the combination of these
edges and the initial tree.
The benefit of functional features for the construction of compiler-like tools is well-known
[App98]. We support pattern matching, algebraic data types (in a specialised form), and
strict higher-order functions in the tradition of ML [MTHM97, LDG+04]. In addition, we
provide streams, which are lazily evaluated lists of the kind typically found in Haskell [Bir98].
Mechanising a refactoring requires finding elements in the code and checking static pre-
conditions. A common solution to collect that kind of information is to traverse the pro-
gram tree using different search strategies. That mechanism can either be expressed in
Haskell [Spi00, LV02], or provided as a built-in feature in a transformation system, e.g.
[BKVV06, BBK+07]. Often a context has to be propagated for complex static-semantic anal-
yses though. This is usually achieved by parameterising search strategies, as in [BMR07], or
via dynamic rewrite rules in rewrite systems [BvDOV06].
Our answer is to provide logical features, and to allow the use of predicates in stream com-
prehensions. JunGL therefore resembles earlier approaches to combining functional and logic
programming, e.g. [RS82, SS99]. In JunGL, one may construct predicates as path queries.
Different styles of path queries have been proposed for querying programs [dMLVW03, LRY+04].
Here we mainly reuse the syntax proposed in [LS06]. However, as we shall see in Chapter 4,
the new semantics we assign to them accounts for the order of logical matches.
Path queries are extremely powerful when querying graphs. JunGL provides an original
way of turning the initial syntax tree of a program into a graph that captures static-semantic
information, such as name binding or control flow. One may define lazy edges linking two
nodes in the tree (e.g. a variable reference and its declaration), which will be constructed
automatically when necessary. That mechanism is in a way similar to the use of reference
attribute grammars as in JastAdd [EH04].
Of course, when integrating destructive updates and declarative features, special care
is required to prevent non-termination and other unexpected evaluation behaviours. This issue
with mixing declarative queries and updates, long known as the Halloween problem in the
database community [Fit02], is usually addressed by implementing some kind of snapshot
semantics, as in recent XQuery extension proposals [CFM+08, CEF+08]. In JunGL, we
have not yet implemented such a semantics: it is the responsibility of the script’s author to
adequately combine declarative queries and imperative updates to the
program tree.
In this chapter, we have also briefly presented the toolkit around JunGL and described
its implementation on the .NET platform using both C# [SH04] and F# [Sym05]. Our im-
plementation is workable for quickly prototyping refactorings. One missing feature though is
the support for syntax definitions and GLR parsers [Tom87].
Finally, we have illustrated all the features of JunGL by defining a naive refactoring and
various static analyses for a toy language. In Chapter 6, we will show that our design actually
scales to similar tasks on mainstream languages. We now move on to presenting Datalog,
the database query language on which we have based the logical features of JunGL. As we
shall see, Datalog is an ideal candidate for querying program trees and graphs.
Chapter 3
Datalog
Datalog is a query language originally put forward in the theory of databases [GM78], which
drew a lot of interest in the eighties and early nineties. Datalog programs look syn-
tactically like Prolog programs, and several classes of programs have been characterised. The most
well-known of these classes has a simple declarative semantics, and consists of the safe Datalog
programs.
In this chapter, we introduce logic programs and the syntax of Datalog. We highlight
the requirements for a Datalog program to be safe, notably regarding the use of negation,
and explain the semantics for such programs by giving a simple evaluation strategy for safe
Datalog programs using relational algebra and least fixpoint computations. Then we present
ways of optimising the evaluation and different implementations. Finally, we discuss more
general classes of Datalog where the use of negation is less restricted. These classes admit
more expressive queries which, as we shall see later, are useful in the context of this thesis.
3.1 Logic programs and syntax of Datalog
Logic programs A logic program is a finite set of rules. Each rule has a head and a body.
These are written on both sides of the symbol ‘←’, which stands for reverse implication. A
head consists of one literal, while a body may contain several of them. Literals in a body can
appear either positively or negatively. They are also referred to as atoms or subgoals.
A literal is an n-ary predicate applied to an n-tuple of terms. It is written p(t1, . . . , tn)
or sometimes p(~t) for short.
A term can be a constant, a variable or a compound term (i.e. a function symbol with
other terms as arguments). A term without any variable is called a ground term and pred-
icates applied to tuples of ground terms are called ground atoms or facts. When a variable
occurs as an argument of a predicate in a positive atom, that variable is said to occur posi-
tively on the right-hand side and to be bound by that atom.
In the body of a rule, subgoals are separated with a comma that stands for logical ‘and’.
An empty body is equivalent to true.
To illustrate,
p(X ) ← a(X ), not b(X ).
is a rule in which p(X ) is the head, a(X ) is a positive subgoal, and b(X ) a negative subgoal.
It reads “p(X ) if a(X ) and not b(X )”.
Although all logic programs follow that same structure, there are two rather different
classes of logic programming languages. The first class consists of Turing-complete languages
close to the machine level, where subgoals are regarded as procedure calls and where control
is still very much given by the programmer. Prolog is the most famous representative of that
class [Llo87]. The second class consists of database query languages, such as Datalog, where
programmers have much less control over the execution of their programs. These languages
are therefore often regarded as more declarative.
Datalog programs A Datalog program is a logic program where each term is either a
variable, denoted with X , or a constant (for instance an integer). In contrast with Prolog,
compound terms such as lists are not allowed. It is hence not possible to match directly on
tree patterns.
Another major difference with Prolog is the fact that the order of atoms in the body of
a Datalog clause does not matter, and indeed, answers to a query are not expected to be
given in any deterministic order. We will revise that definition, however, when we introduce
an ordered variant of Datalog in the next chapters. Until then, we use the word ‘Datalog’ to
refer to the standard version where order does not matter.
By “standard version”, we do not mean pure Datalog, which in the literature refers to
definite programs, i.e. programs with Horn clauses only — a Horn clause is a clause with
no negative subgoals. Instead we mean here Datalog with negation, that is most suitable to
express an adequate range of queries.
We should also mention that, in the pure tradition of Datalog, disjunction is achieved
using multiple clauses with the same relation in the head, and the use of ‘;’ for logical ‘or’
is regarded as syntactic sugar. We use it for conciseness though. For instance, we shall
sometimes write
p(X ) ← a(X ); b(X ).
for
p(X ) ← a(X ).
p(X ) ← b(X ).
Similarly, we sometimes apply ‘not’ to a conjunction or a disjunction rather than just a
literal, since these negations can be distributed away.
In addition, we shall allow tests and binding equalities. These are not normally found in
pure Datalog, but can be modelled as special kinds of predicates. We use tests to filter out
logical matches, and binding equalities to bind a variable to a constant (or to a variable that
is already bound). To illustrate, the rule
p(X ) ← X = 0, X < 1.
binds X to the value 0, and tests that X is indeed less than 1.
Finally, all the variables occurring in the head of a rule are implicitly governed by a
universal quantifier. Equivalently, we use the convention that those variables that occur in
the body but not in the head are governed by an implicit existential. For example,
p(X ) ← q(X ,Y ).
is equivalent to
p(X ) ← (∃Y · q(X ,Y )).
or to
p(X ) ← q(X , _).
where ‘_’ stands for a don’t-care variable.
To summarise, Figure 3.1 shows the syntax of Datalog programs we consider.
⟨Program⟩   ::=  ⟨Rule⟩+
⟨Rule⟩      ::=  ⟨Literal⟩ ← [ ⟨Expr⟩ ] .
⟨Expr⟩      ::=  ⟨Literal⟩
             |   ⟨Expr⟩ , ⟨Expr⟩
             |   ⟨Expr⟩ ; ⟨Expr⟩
             |   not ⟨Expr⟩
             |   ∃ ⟨VariableName⟩ · ⟨Expr⟩
             |   ⟨Term⟩ = ⟨Term⟩
             |   ⟨Term⟩ ⟨Operator⟩ ⟨Term⟩
             |   ( ⟨Expr⟩ )
⟨Literal⟩   ::=  ⟨PredicateName⟩ ( ⟨Term⟩ , … , ⟨Term⟩ )
⟨Term⟩      ::=  ⟨VariableName⟩ | ⟨Constant⟩ | _
⟨Operator⟩  ::=  < | ≤ | ≥ | >

Figure 3.1: Syntax of Datalog programs
3.2 Semantics
The semantics of Datalog are explained by regarding predicates as relations defined by enu-
merating the tuples inhabiting them. If p is a predicate, there is a corresponding relation, say
P , such that the fact p(t1, . . . , tn) is true if and only if there is a tuple (t1, . . . , tn) in relation
P . The relation P is sometimes called the extension or interpretation of the predicate p.
In effect, Datalog is usually viewed as a language for defining a larger database from a
smaller one. It defines the contents of new relations based on the contents of the original
relations, in the end producing a single representation. The original relations are often
referred to as ‘extensional database’ or ‘EDB’ predicates, while the new ones are called
‘intensional database’ or ‘IDB’ predicates.
In the remainder of this chapter, we illustrate our explanations with a common example,
namely the transitive closure of a child relation. Its Datalog definition may be as follows:
descendant(X ,Y ) ← child(X ,Y ).
descendant(X ,Y ) ← child(X ,Z ), descendant(Z ,Y ).
In words, Y is a descendant of X if either Y is a direct child of X , or if X has a
direct child Z , which in turn has transitively a descendant Y . Here, descendant is an IDB
predicate, while child is an EDB predicate, whose interpretation shall be the set of pairs
{(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)} as depicted in Figure 3.2.
Figure 3.2: The sample child relation used in our examples
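To make the declarative reading concrete, the minimal model of this program can be computed by naive bottom-up iteration. A Python sketch over the sample relation of Figure 3.2 (set-based, so no particular order of matches):

```python
def descendants(child):
    """Least fixpoint of the two descendant rules over a child relation."""
    desc = set(child)                       # first rule: every child pair
    while True:
        # second rule: child(X, Z), descendant(Z, Y) => descendant(X, Y)
        new = {(x, y) for (x, z) in child for (z2, y) in desc if z == z2}
        if new <= desc:
            return desc                     # no new facts: fixpoint reached
        desc |= new

child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}
```

On the sample relation, the iteration adds (1, 3), (1, 4) and (2, 5) in a first round and (1, 5) in a second, then stabilises. This is exactly the least fixpoint computation described in the next subsection.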
3.2.1 Minimal models and least fixpoints
Interpreting a Datalog program is to assign a collection of facts to that program. Such a
collection is said to be a model for the program if whenever constants are substituted for the
variables, the rules become true. Although the same Datalog program may admit different
models, it is both intuitive and commonly accepted to define the meaning of a Datalog
program through a minimal model.
A model is minimal if any strict subset either is missing an EDB fact or fails to be a
model because there exists a substitution of constants for variables that makes the body of
some rule in the program true but the head false.
A Datalog program should have a well-defined minimal model in order to be assigned a
non-ambiguous meaning. It is well-known that Horn rules have a well-defined minimal model
that is the smallest model which contains all logical consequences of the rules. The existence
of that smallest model is indeed guaranteed by the Knaster-Tarski theorem. Before stating
that theorem, we shall define a few notions.
A fixpoint of a function f is any value x for which f (x ) = x . In the computation of a
program model, the function for which we compute a fixpoint is a step inference function,
commonly known as the immediate consequence operator, which takes an interpretation and
infers a new one with possibly new facts about the program. That function can be expressed
in terms of relational algebra operators; we describe how later in this section. Furthermore,
a fixpoint of a function that is included in every other fixpoint of that function is called the least
fixpoint. The least fixpoint of the inference function for a set of rules corresponds to the minimal
model for that set of rules.
A partially ordered set consists of a set together with a binary relation ⊆ that specifies, for
certain pairs of elements in the set, that one of the elements must precede
the other. A lattice is a partially ordered set in which every pair of elements has a least
upper bound and a greatest lower bound. The set of all interpretations, ordered by inclusion,
forms a complete lattice.
A function f is monotonic with respect to a partial order if, whenever x ⊆ y, f (x ) ⊆ f (y).
Typically, negation in a Datalog program is nonmonotonic but all other operations are. We
can now state the Knaster-Tarski theorem: if L is a complete lattice and f : L → L is a monotonic
mapping, then f has a least fixpoint.
Importantly, the existence of a least fixpoint is also guaranteed for monotonic mappings
on complete partially ordered sets (CPOs) and any finite partially ordered set is a CPO
[DP02]. This is crucial, as in the remainder of the thesis we shall work with finite partially
ordered sets that do not form a complete lattice.
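The resulting computation scheme is plain Kleene iteration: start from the bottom element and apply the monotonic function until the value stops changing. A generic Python sketch, assuming a finite domain so that termination is guaranteed:

```python
def lfp(f, bottom=frozenset()):
    """Least fixpoint of a monotonic function on a finite domain,
    computed by iterating f from the bottom element."""
    x = bottom
    while True:
        y = f(x)
        if y == x:
            return x        # f(x) = x: we have reached the fixpoint
        x = y
```

For a definite Datalog program, f would be the immediate consequence operator and the bottom element the empty interpretation.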
A definite Datalog program (that is, one with no negation) is just a composition of monotonic
operations, and therefore admits a least fixpoint, which is its minimal model. When the
rules of a program include negated subgoals, however, the minimal model of the program is
rarely well-defined.
For Datalog programs with negation, the database community therefore developed some
preferred models, based on the concept of negation as failure [Cla78]. Negation as failure
basically says that if a ground atom p cannot be proved, then it is allowed to infer not p.
Thus, if instantiated rules of a program with negation can be decomposed into modules that
do not mutually depend negatively on themselves, we can evaluate the minimal model of
these modules one at a time and give a precise meaning to a Datalog program with negation.
This condition depends both on the program and on the EDB input data, but it can be
approximated statically to depend on the program only and not on the data. Such a static
approximation is, together with more obvious syntactic constraints, what defines the class of
safe Datalog programs.
3.2.2 Safe Datalog
Datalog programs are safe if and only if the conditions below on range-restriction and strat-
ification are satisfied. Range-restriction is to guarantee that each computed IDB is finite,
while stratification refers to the static approximation mentioned above.
Range-restriction Every variable on the left-hand side of a clause must occur positively
on the right-hand side, and every variable on the right-hand side must occur positively. This
forbids definitions like
p(X ,Y ) ← q(X ).
which leaves Y unconstrained. It also rules out
p(X ) ← X < 0.
where X < 0 is not a literal but a test, and
p(X ) ← not q(X ).
Such queries would be undesirable because, to evaluate them, we would have to enumerate
an infinite set of integers: all those less than 0 in the first case, and all those for which q does
not hold in the second case. By contrast,
p(X ) ← not q(X ), r(X ).
is fine, because while q(X ) is negated, we also have a positive occurrence of X under r .
In the literature on Datalog, queries satisfying this criterion are often called range-
restricted.
Stratification Negation must not be used in recursive cycles. For instance, we wish to
avoid
p(X ) ← not p(X ).
because, again, such recursions do not have a least fixpoint.
Formally, this requirement can be stated in terms of the dependency graph between
predicates. The nodes in this graph are relations defined in the Datalog program. There are
two kinds of edges, positive and negative, defined as follows. When p has a clause where q
appears positively (not under a negation) on the right hand side, there is a positive edge from
p to q. If q appears negatively on the right-hand side of a clause for p, there is a negative
edge from p to q. We require that there are no cycles in the dependency graph that contain
a negative edge. Datalog programs that satisfy this property are called stratified because
there is an algorithm working in terms of ‘layers’ or ‘strata’ for evaluating such programs.
The idea is that when a program is stratified, we can find an order for the predicates so we
can evaluate a predicate p only after we have evaluated all predicates on which p depends
negatively. We shall detail that algorithm in an instant.
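The stratification condition itself is easy to check mechanically. In the Python sketch below, the dependency graph is encoded as explicit (p, q, negated) triples (an assumption of this encoding); a program is stratified when no negative edge lies on a cycle, i.e. when the target of a negative edge never reaches back to its source:

```python
def stratified(edges):
    """edges: (p, q, negated) triples, one per occurrence of predicate q
    in the body of a clause for p. Returns False iff some cycle in the
    dependency graph contains a negative edge."""
    def reaches(src, dst):
        # depth-first search for a path of length >= 1 from src to dst
        seen, stack = set(), [src]
        while stack:
            p = stack.pop()
            for (a, b, _) in edges:
                if a == p and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return dst in seen
    # a negative edge p -> q lies on a cycle iff q reaches p
    return all(not reaches(q, p) for (p, q, neg) in edges if neg)
```

For instance, the self-recursive clause p(X) ← not p(X) yields a negative self-edge and is rejected, while p(X) ← not q(X), together with a non-recursive definition of q, is accepted.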
Benefits of safe Datalog Assuming the primitive relations are finite, safe Datalog has a
number of highly desirable properties:
• All relations defined are finite, whether recursive or not.
• Recursion can be implemented with straightforward fixpoint iteration, and so the
declarative and operational semantics coincide. This fixpoint iteration always termi-
nates.
To appreciate the difference with Prolog, consider
p(X ) ← p(X ).
In Prolog, p(X ) is a non-terminating query. In Datalog, it just defines the empty relation,
because that is the smallest relation satisfying the above clause.
In the same vein, consider a variant of our transitive closure example, where descendant
appears before child in the second disjunct:
descendant(X ,Y ) ← child(X ,Y ).
descendant(X ,Y ) ← descendant(X ,Z ), child(Z ,Y ).
Using the standard goal-oriented SLD resolution [Llo87], the query descendant(1,Y )
would not terminate in Prolog because the left-to-right evaluation would loop forever on
the subgoal descendant(1,Z ). In Datalog, however, the above definition has precisely the
expected meaning, with no unpleasant surprises during the query evaluation. To overcome
this issue in Prolog and correctly evaluate the two examples above, a special technique called
tabled resolution has been proposed. We discuss it later in this chapter.
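The contrast can be illustrated with a small bottom-up evaluator. This Python sketch, using the sample child relation from the tables in this chapter, computes descendant by naive fixpoint iteration, so the query descendant(1, Y) terminates where Prolog's SLD resolution would loop:

```python
# Naive bottom-up evaluation of the descendant rules over the sample
# child relation used in this chapter.
child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}

def descendants():
    desc = set()
    while True:
        step = child | {(x, y) for (x, z) in desc for (z2, y) in child if z == z2}
        if step == desc:          # least fixpoint reached: always terminates
            return desc
        desc = step

desc = descendants()
# The query descendant(1, Y) simply selects tuples with first column 1.
print(sorted(y for (x, y) in desc if x == 1))   # [2, 3, 4, 5, 6]
```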
3.2.3 Mapping predicate calculus to relational algebra
We said earlier that the step inference function of a set of rules can be expressed in terms of
relational algebra operators. We illustrate informally here how such a mapping works. This
is particularly interesting as it highlights (together with the evaluation of strata to follow) a
simple operational semantics of Datalog programs.
The key observation is that, when predicates are represented as relations, each logical op-
erator in predicate calculus has a counterpart in set-based relational algebra. An introduction
to relational algebra can be found in any database book, e.g. [RG02].
For instance, the natural join is a counterpart of logical ‘and’. That is, if relations R and
S are the interpretations of predicates p and q respectively, then the natural join of R and
S , written R ./ S , is the relation representing the interpretation of the predicate p ∧ q. The
natural join is not a primitive operation in relational algebra, as it can be expressed using
cross-product (×), selection (σ), and projection (π). It is, however, a handy counterpart of
conjunction: it can be used even when two relations have no attribute in common; in the
case of p(x , y) ∧ q(z ), R ./ S is simply equivalent to the cross-product R × S .
Predicate calculus      | Relational algebra
p(X ,Y ) ∧ q(Y ,Z )     | R ./ S             (join)
p(X ,Y ) ∨ q(X ,Y )     | R ∪ S              (union)
p(X , 0)                | σ_Y=0(R)           (selection)
p(X ,Y ) ∧ ¬q(X ,Y )    | R − S              (set difference)
∃Y . p(X ,Y )           | π_X(R) ≡ π̄_Y(R)    (projection)

Table 3.1: Logical operators and their relational counterparts
Table 3.1 summarises informally the relational counterparts of each logical operator used
in Datalog. Relations R and S still stand for the interpretation of p and q respectively. Most
of the time, it is convenient to express projection in terms of the fields that are projected
out rather than the fields onto which the relation is projected. As the table shows, we use
the dual operator π̄ for that purpose.
It follows that rules in Datalog can be seen as mathematical functions expressed in terms
of relational algebra operators. For instance, the recursive descendant rule can be defined by
an equation of the following form:
Descendant = Child ∪ π0,2(Child ./1=0 Descendant)
Note that, for simplicity and to avoid issues with renaming, we refer to each column of a
relation via its index in the relation rather than via its name.
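The relational reading can be made concrete with set-based operators. In the Python sketch below (the helper names are invented for illustration), join concatenates whole tuples, so the join column appears twice and the projection uses index 3 where the text's merged-join notation uses index 2:

```python
# Set-based relational algebra, enough to interpret the Descendant
# equation. Columns are addressed by index, as in the text.
def join(r, s, i, j):                  # R ./(i=j) S, join column kept twice
    return {a + b for a in r for b in s if a[i] == b[j]}

def project(r, cols):                  # pi_cols(R)
    return {tuple(t[c] for c in cols) for t in r}

child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}

# Descendant = Child U pi(Child ./(1=0) Descendant)
descendant = set()
while True:
    step = child | project(join(child, descendant, 1, 0), (0, 3))
    if step == descendant:
        break
    descendant = step
print(len(descendant))   # 9 tuples, as in Table 3.2
```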
3.2.4 Evaluation of strata
We shall now describe how to lift the above function to a set of mutually dependent rules in
order to evaluate safe Datalog programs.
Stratification guarantees that, for any rule, every atom references a relation that is either
in a lower stratum or in the same stratum. Furthermore, relations in the same stratum can
only be referenced positively. Consequently, the grammar for the relational algebra expressions
corresponding to a safely stratified Datalog rule in stratum i is:
Ri ::= ∅ empty relation
| U universal relation
| Ri−1 another relation in a lower stratum
| Ri another relation in stratum i
| Ri × Ri cross product
| Ri ∪ Ri union
| πX1,..,Xk(Ri) projection
| σtest(Ri) selection with arbitrary test
| not(Ri−1) negation
Note that we do not mention the natural join in this grammar as it can be expressed with
the other operators. Also, we favour the not operator in place of the set difference, although
the two are interdefinable: R − S = R ./ (not(S )) and not(R) = U −R. The universal
relation U refers to a relation of a desired arity, say n. It contains all possible n-tuples one
can build with the domain D of values found in the EDB relations.
Strata of a safe Datalog program are strongly connected components of the predicate
dependency graph, and each stratum contains a number of mutually dependent predicates,
which we interpret as relational algebra expressions according to the grammar defined above.
A stratified program can hence be modelled as a list of strata [s0, . . . , sN ] sorted in topo-
logical order such that for any i and j , if a relation in si refers to a relation in sj then j ≤ i .
Furthermore, each stratum si consists of the relations {R1, . . . ,Rki}. We denote with ni,j
the arity of the j th relation in si . We can then model each individual Rj as the step function
that takes our current interpretations for all the relations in the stratum, and returns a new
interpretation for Rj :
f_Rj : P(D^ni,1) × · · · × P(D^ni,ki) → P(D^ni,j)
In order to compute f_Rj(X1, . . . , Xki), we simply interpret the relational algebra primitives
in the usual way over sets of tuples of values. Due to stratification, the functions f_Rj are
monotonic.
Now we are in a position to lift this to define the step function fi of the entire stratum
si . For brevity, we write X = (X1, . . . ,Xki).
fi : P(D^ni,1) × · · · × P(D^ni,ki) → P(D^ni,1) × · · · × P(D^ni,ki)
fi(X ) = (f_R1(X ), . . . , f_Rki(X ))
Each step function fi is monotonic since each of its components is monotonic. Moreover,
its domain and codomain coincide. By the Knaster–Tarski theorem, it has a least fixpoint.
Consequently we define
[[si ]] = lfp(fi) (3.1)
However, the value of [[si ]] depends on the values [[sj ]] for previous strata (j < i), so the
computation must start with s0. In other words, we first compute the relations denoted by
the bottom level (containing extensional predicates), and continue upwards, evaluating
[[si ]] = lfp(fi ) only after all [[sj ]] with j < i have been computed, so that the denotations
of relations in lower strata are available to fi . After evaluating this for all strata, we get a
model for our Datalog program, which is its meaning.
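As a sketch of this stratum-by-stratum evaluation, the following Python fragment (a hypothetical example program, not JunGL) computes a lower stratum to a fixpoint before applying negation in the stratum above:

```python
# Stratum-by-stratum evaluation: a relation using negation is computed
# only once the negated relation is complete. Example program:
#   reachable(X) <- source(X).
#   reachable(Y) <- reachable(X), edge(X, Y).
#   unreached(X) <- node(X), not reachable(X).
node = {(1,), (2,), (3,), (4,)}
edge = {(1, 2), (2, 3)}
source = {(1,)}

# Stratum 1: least fixpoint for reachable.
reachable = set()
while True:
    step = source | {(y,) for (x,) in reachable for (x2, y) in edge if x == x2}
    if step == reachable:
        break
    reachable = step

# Stratum 2: negation applied to the now-complete lower stratum.
unreached = {t for t in node if t not in reachable}
print(sorted(unreached))   # [(4,)]
```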
3.3 Evaluation strategies
In the evaluation of strata given above, the different strata of the clause dependency graph are
evaluated one at a time in topological order, starting from the lowest strata — the ones that
refer to extensional predicates only — up to the highest stratum, where the query predicate lies.
Such an evaluation strategy is said to be bottom-up. By contrast, a strategy is said to be
top-down if it starts from the query itself, using for instance a goal-oriented strategy that
resembles the SLD resolution found in Prolog implementations. A benefit of top-down
resolution is that it usually infers only the facts that are actually needed to answer the
query, whereas a bottom-up evaluation usually infers facts that are irrelevant and indeed
ignored in the computation of the final answer. In this section, we detail the two opposing
approaches, illustrate the issue of computing irrelevant facts and describe two techniques to
bridge the gap between them.
3.3.1 Top-down vs bottom-up
Bottom-up iterative computation on relations is the implementation strategy that directly
follows from the least fixpoint semantics of safe Datalog and the observation that each pred-
icate can be represented as a finite relation which simply consists of the tuples satisfying
that predicate. The main strength of the bottom-up approach is precisely that it works
with relations. It can therefore benefit from efficient implementations of relational opera-
tions, notably hash joins. However, it is well-known that the performance of such an
approach is usually impeded by the unnecessary computation of irrelevant facts during the
query evaluation [BMSU86, Vie86].
The unnecessary computational overhead has two origins. The first one is in the com-
putation of a fixpoint for each stratum of the clause dependency graph. Indeed, at each
iteration of a naive fixpoint computation, we do not only infer new facts but also all facts
that have already been inferred in previous iterations. Table 3.2 shows an example of such
an expensive redundancy.
This kind of overhead can easily be overcome for linear recursive rules (i.e. rules that have
at most one recursive call in their body) using a less naive iteration, known as a seminaive
fixpoint computation. At each iteration step, rather than taking the full currently inferred
relation, we can restrict the input of the step function to the only facts that were freshly
Iteration | Input | Output
1 | ∅ | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}
2 | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)} | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5), (1, 3), (1, 4), (2, 5)}
3 | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5), (1, 3), (1, 4), (2, 5)} | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5), (1, 3), (1, 4), (2, 5), (1, 5)}
4 | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5), (1, 3), (1, 4), (2, 5), (1, 5)} | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5), (1, 3), (1, 4), (2, 5), (1, 5)}

Table 3.2: Input and output of the step function at each iteration of the naive fixpoint computation for the rule descendant.
inferred at the previous iteration. While in a naive fixpoint computation, the same facts are
inferred again and again, only new deductions are made at each step of a seminaive one.
Table 3.3 shows the effect of this optimisation. The computation stops when no new fact can
be inferred.
Iteration | Input | Output
1 | ∅ | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}
2 | {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)} | {(1, 3), (1, 4), (2, 5)}
3 | {(1, 3), (1, 4), (2, 5)} | {(1, 5)}
4 | {(1, 5)} | ∅

Table 3.3: Input and output of the step function at each iteration of the seminaive fixpoint computation for the rule descendant.
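The seminaive iteration can be sketched in a few lines of Python; the successive delta sets match the outputs of Table 3.3:

```python
# Seminaive fixpoint for descendant: at each step, only tuples newly
# derived in the previous step (the 'delta') feed the recursive rule.
child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}

desc = set(child)          # iteration 1 output
delta = set(child)
while delta:
    new = {(x, y) for (x, z) in delta for (z2, y) in child if z == z2}
    delta = new - desc     # keep only genuinely new facts
    desc |= delta
print(len(desc))           # 9
```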
The second origin of the computational overhead is more inherent to the bottom-up
approach. To illustrate, consider the following query:
q(Y ) ← X = 2, descendant(X ,Y ).
In the bottom-up framework, the whole descendant relation is computed before retaining
only those pairs whose first element is 2. This clearly leads to unnecessary computations
since, in this particular query, it would be sufficient to just compute the descendants of node
2 as Figure 3.3 suggests.
A way to reduce the number of useless computations is to adopt a top-down approach.
One notable solution is to embed Datalog in a Turing-complete logic programming language
like Prolog. We have already mentioned however that the standard goal-oriented SLD reso-
lution does not ensure correct results, as it may lead to non-termination, by trying to solve
recursively the same subgoal again and again. To overcome that problem, it is crucial that
tabled resolution is used [War92].
The main idea of tabled resolution, sometimes also called tabling for short, is to memorise
intermediate subgoals and their answers that have been computed previously. More precisely,
if a subgoal is identical to or subsumed by a previous one, it is solved using answers computed
for the previous subgoal, instead of re-evaluating the rules of the program. This makes
Figure 3.3: Smallest set of nodes that need to be considered for solving the query q(Y ) ← X = 2, descendant(X ,Y ).
the evaluation of Datalog queries finite and avoids redundant computation due to repeated
subgoals in the search space. Such tabling contrasts with a bottom-up approach in the
sense that it is a tuple-at-a-time resolution method while the latter admits a set-at-a-time
resolution. Although in the end the effect is the same as a fixpoint computation, when
the subgoals of a rule are correctly ordered, many fewer irrelevant facts are computed with
tabling. For instance, by applying a left-to-right top-down resolution to our sample query, X
is first bound to node 2 and descendants are then computed from that node only. However,
if the subgoals of a rule are not correctly ordered, the search space may become very large.
As in any top-down approach, the efficiency of tabling is highly sensitive to predicate ordering.
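A simplified memoised evaluator conveys the idea. This is plain memoisation, which is sound here only because child is acyclic; full tabled resolution also handles cyclic relations:

```python
# Memoised top-down evaluation of descendant(2, Y): only subgoals
# reachable from the query are ever solved, unlike a bottom-up run.
# Assumes child is acyclic (a tree); proper tabling would compute a
# fixpoint per subgoal and also cover cyclic relations.
child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}
table = {}                       # subgoal -> set of answers

def descendants_of(x):
    if x in table:               # tabling: reuse previous answers
        return table[x]
    table[x] = set()             # seed entry breaks repeated subgoals
    answers = {y for (a, y) in child if a == x}
    for y in set(answers):
        answers |= descendants_of(y)
    table[x] = answers
    return answers

print(sorted(descendants_of(2)))   # [3, 4, 5]
print(sorted(table))               # [2, 3, 4, 5]: nothing about node 1
```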
In conclusion, the advantage of the bottom-up approach is that it works with sets, while
the advantage of the top-down approach is that it may infer fewer facts, because the context
of the query is propagated inside calls.
Two techniques have been proposed to make the two approaches converge. One, known
as the Query-Subquery approach, is a set-based top-down resolution method. The other,
called the magic-set transformation, mimics the behaviour of the top-down approach in a
bottom-up framework.
3.3.2 Query-Subquery and magic sets
It is conventionally accepted that, in order to optimise a relational algebra query, selection
should be done before joins. Indeed, by pushing selection as early as possible in the evaluation
of a query, the relations that are manipulated later usually get smaller thus leading to better
performance. The Query-Subquery approach and the magic-set transformation technique
follow that idea and extend it to recursively defined programs.
Both methods rely on the choice of a sideways information-passing strategy, also called a
sip strategy. This determines the context of each literal, i.e. the formulas that are evaluated
before that literal. The most common sip strategy is the left-to-right one, where the context
of each literal consists of all the formulas that appeared on the left of the literal. The sip
strategy basically indicates the flow of data between predicates.
The Query-Subquery approach The Query-Subquery approach (QSQ) [Vie86] uses the
framework of SLD resolution, but a set at a time, thus enabling optimised relational algebra
operations. The idea is to constrain each predicate call by propagating bindings from one
atom to the next with respect to a given sip strategy. Each IDB literal of the original
program is adorned with a pattern to indicate which of its variables are considered bound
by its context, i.e. by the part of the rule that is evaluated before the literal. For instance,
following a left-to-right sip strategy, we adorn our sample descendant query as follows:
descendantbf(X ,Y ) ← child(X ,Y ).
descendantbf(X ,Y ) ← descendantbf(X ,Z ), child(Z ,Y ).
qf(Y ) ← X = 2, descendantbf(X ,Y ).
The pattern bf on the rule descendantbf means that it is used in a context where its first
argument is bound and its second argument is free. We shall use the general notation Rγ
where R is the name of the adorned rule, and γ is the actual pattern of the rule, consisting
of as many bs and fs as the arity of R.
From that initial adornment, each rule Rγ is assigned a set of additional temporary
relations, which do not appear in the program but are used during the evaluation. These
supplementary relations, of the form sup Rγ_k, identify for each position k in the rule body
the interesting variable bindings. Interesting variables are the ones already bound at the
respective position, and either used in the remainder of the body or in the head.
To illustrate, the adorned rules above have the following supplementary relations. Note
that we distinguish a supplementary relation of the first disjunct of descendantbf from one
of the second disjunct by adding a prime to the latter:

descendantbf(X ,Y ) ← child(X ,Y ).
  sup descendantbf_0(X ) before child(X ,Y ), and sup descendantbf_1(X ,Y ) after it.

descendantbf(X ,Y ) ← descendantbf(X ,Z ), child(Z ,Y ).
  sup descendantbf_0'(X ), sup descendantbf_1'(X ,Z ) and sup descendantbf_2'(X ,Y ) at positions 0, 1 and 2.

qf(Y ) ← X = 2, descendantbf(X ,Y ).
  sup qf_0, sup qf_1(X ) and sup qf_2(Y ) at positions 0, 1 and 2.
The relations sup descendantbf_0(X ), sup descendantbf_0'(X ) and sup qf_0 represent the known
context at the entry of their corresponding rule body. The relation sup descendantbf_1'(X ,Z )
stores the context between descendantbf(X ,Z ) and child(Z ,Y ). Similarly, sup qf_1(X ) stores
the context between X = 2 (which can be seen as an EDB relation) and descendantbf(X ,Y ).
Finally, sup descendantbf_1(X ,Y ), sup descendantbf_2'(X ,Y ) and sup qf_2(Y ) store the final
context of their corresponding rule. Observe that sup descendantbf_2'(X ,Y ) has no reference
to Z because Z is no longer needed. For the same reason, sup qf_2(Y ) only refers to Y since
X is not an argument of the query.
In addition, each set of rules Rγ is assigned two relation variables: inst Rγ (whose arity
is the number of bs in γ) and ans Rγ (whose arity is the same as R). Relation inst Rγ stores
the global input of Rγ , while ans Rγ stores the final result of Rγ .
A subquery for Rγ is then run as follows. Let T be the tuples in inst Rγ. Add T to
sup Rγ_0. At each position k , if the next atom is an EDB relation E , join sup Rγ_k with E and
store the result in sup Rγ_{k+1}. Otherwise, if the next atom is an IDB relation I δ, add to inst I δ the
new tuples found in sup Rγ_k, then join sup Rγ_k with ans I δ and store the result in sup Rγ_{k+1}. When the
final position n is reached, add sup Rγ_n to ans Rγ.
Initially, all inst Rγ and ans Rγ relations are empty, except for the input of the query
which is set to true. The final answer is computed by running all subqueries in turn until a
fixpoint is reached for all inst Rγ and ans Rγ .
For instance, on our above example with the sample child relation of Figure 3.2, three
iterations are needed. During the first iteration, qf is entered with the context true, so
sup qf_0 (whose arity is 0) is also true. Joining true with the EDB-like relation X = 2 re-
sults in sup qf_1(X ) having a single tuple (2). This tuple is added to inst descendantbf, but
ans descendantbf being so far empty, we conclude that sup qf_2(Y ) is empty, and ans qf too.
In the second iteration, inst descendantbf now contains the tuple (2). Consequently, when we
process the first disjunct of descendantbf, we add (2, 3) and (2, 4) to sup descendantbf_1(X ,Y )
and to ans descendantbf. Then, when we process the second disjunct of descendantbf,
we now also get (2, 3) and (2, 4) in sup descendantbf_1'(X ,Z ), and therefore obtain (2, 5)
in sup descendantbf_2'(X ,Y ), which we add to ans descendantbf as well. Back to the query
rule again, we hence get all tuples (3), (4) and (5) in ans qf. Finally, in the third iteration,
we find that inst Rγ and ans Rγ are stabilised, and that we have therefore computed
the final solution for ans qf.
In this example, and for any other definite program, the scope of the fixpoint iteration can
be the whole set of rules. For stratified programs, however, we need to compute the fixpoint
after each call to a rule in a lower stratum, in order to ensure the evaluation of a relation is
complete when we take its complement. We shall refer to this set-based top-down evaluation
method when we explore the evaluation mechanism of JunGL queries in Chapter 5.
Magic sets The idea of the magic-set transformation [BMSU86, BR87] is to express the
ingredients of a top-down approach in Datalog itself, by rewriting the original program to
make the context explicit. Starting from the same adornment as in the Query-Subquery
approach, the context of each IDB atom is isolated into a magic relation and added to it as
a filter. For instance, the magic version of our example is:
descendantbf(X ,Y ) ← magic descbf(X ), child(X ,Y ).
descendantbf(X ,Y ) ← magic descbf(X ), descendantbf(X ,Z ), child(Z ,Y ).
magic descbf(X ) ← X = 2.
qf(Y ) ← X = 2, descendantbf(X ,Y ).
As in any top-down approach, where the efficiency of a query can be improved by
reordering its subgoals, the efficiency of the magic-set technique depends on the sip strategy
that is used. Although the transformed program contains more joins, it succeeds in restricting
the computation of descendants to those of node 2 only, thus mimicking a top-down resolution.
If we were to find the descendants of a leaf in a deep tree, that transformation would boost
the query considerably.
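Evaluating the transformed program bottom-up, the magic relation filters every derivation, as the following Python sketch shows. The relation names follow the example above; the code is illustrative, not an implementation of the transformation itself:

```python
# Bottom-up evaluation of the magic-set transformed program: the
# magic filter restricts descendant^bf to tuples whose first column
# is a relevant binding (here, only node 2).
child = {(1, 2), (1, 6), (2, 3), (2, 4), (4, 5)}
magic = {(2,)}                                  # magic_desc^bf(X) <- X = 2.

desc = set()
while True:
    rule1 = {(x, y) for (x,) in magic for (x2, y) in child if x == x2}
    rule2 = {(x, y) for (x,) in magic
                    for (x2, z) in desc if x == x2
                    for (z2, y) in child if z == z2}
    step = rule1 | rule2
    if step == desc:
        break
    desc = step
print(sorted(desc))   # [(2, 3), (2, 4), (2, 5)] -- no facts about node 1
```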
In fact, magic sets try to achieve statically what the Query-Subquery technique does dy-
namically. The transformation is, however, not always optimal. It may sometimes introduce
unsafe recursion into an originally safely stratified program, and the cost of breaking these
unwanted cycles is the computation of more irrelevant facts. This actually highlights the
expressiveness limitation of safe Datalog. We shall discuss less restrictive classes of Datalog
programs, after briefly mentioning some existing implementations of safe Datalog.
3.3.3 Existing implementations
Safe Datalog can be implemented in a variety of ways. Most implementations either adopt a
bottom-up approach that manipulates sets (thus favouring magic sets over the Query-Subquery
approach), or follow the top-down tuple-at-a-time route with memoization.
Bottom-up implementations Early bottom-up implementations of Datalog were pro-
posed as part of deductive database systems, e.g. LDL [TZ86], Glue-Nail [PDR91], and
CORAL [RSSS94]. These systems have partly focused on complementing purely declarative
languages with some imperative constructs for manipulating relations, but they do support
various program transformations proper to fully declarative languages, such as magic sets and
the pushing of projections, to optimise queries.
Not surprisingly, an important design decision in implementing the bottom-up approach
is the choice of a representation for sets. One obvious route is to delegate that part to a
relational database. That way, we can easily leverage a scalable and persistent backend. EDB
relations are simply stored in the database, and Datalog queries compiled to procedural SQL.
This strategy is the one used in the code querying system CodeQuest [HVMV05, HVdM06].
CodeQuest implements a limited version of the magic-set transformation, named closure
fusion, that aims at optimising transitive closure only.
Another bottom-up implementation strategy is to represent relations via binary decision
diagrams (BDDs). The work of John Whaley and Monica Lam has demonstrated that such
a Datalog implementation is particularly suitable for evaluating queries that correspond to
advanced dataflow analyses [WACL05]. Indeed, the relations involved in a whole-program
dataflow analysis are sometimes so big that they cannot be efficiently manipulated by a
standard database system. By contrast, a BDD is a compressed data structure that can
efficiently represent a large relation and BDD operations take time proportional to the size
of that compressed data structure, not to the number of tuples in the relation.
Top-down implementations XSB is perhaps the best-known example of a logic
programming system that offers this alternative approach to deductive databases [SSW94]. It
extends Prolog's SLD resolution with tabling, and in fact also adds a scheduling
strategy and delay mechanisms. The whole resolution method is known as SLG [CW96] and
can handle not only stratified Datalog but also more general logic programs that we discuss
now.
3.4 General logic programs
The stratification criterion of safe Datalog programs is quite strong, and unfortunately, there are
common and natural examples of queries that one cannot express in safe Datalog.
Perhaps the most celebrated example in the Datalog literature is a predicate inspired
by a stalemate game:
win(X ) ← move(X ,Y ), not win(Y ).
The rule says that X is a winning position if there is a move from X to Y and Y is
a losing position. It is not statically stratified because of the negative literal not win(Y ).
Nonetheless, if the relation move is acyclic, any query about win has a unique least model.
To illustrate, consider the domain of positions V = {1, 2, 3} and let move be the following
acyclic relation:
1 → 2 → 3 (with an additional move from 1 to 3), that is, move = {(1, 2), (1, 3), (2, 3)}
A way to resolve the query win(X ) is to instantiate the rule in all possible ways with the
positions of our domain (and remove known false subgoals). That is:
win(1) ← move(1, 2), not win(2).
win(1) ← move(1, 3), not win(3).
win(2) ← move(2, 3), not win(3).
As we see, this set of instantiated rules is now correctly stratified and admits a well-
defined model. A program that can be instantiated that way into a stratified set of rules
is said to be locally stratified [Prz88]. Whether a program is locally stratified depends on
the data in its EDB predicates and therefore cannot be decided by looking at the program.
In opposition to static stratification, we say that local stratification is a dynamic criterion.
Clearly, if a program is statically stratified, then it is locally stratified.
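Since move is acyclic, the intended model can be computed by settling positions in reverse topological order, as in this Python sketch (the move relation is read off the instantiated rules above):

```python
# win(X) <- move(X, Y), not win(Y): evaluable despite the negative
# recursion because move is acyclic. Positions are processed in
# reverse topological order, so win(Y) is settled before it is negated.
move = {(1, 2), (1, 3), (2, 3)}
positions = [3, 2, 1]                # reverse topological order of move

win = set()
for x in positions:
    if any((x, y) in move and y not in win for y in positions):
        win.add(x)
print(sorted(win))   # [1, 2]: both 1 and 2 can move to the losing position 3
```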
However, it is sometimes possible to guarantee that a program will be locally stratified by
imposing some conditions on the database. Here for instance, the win query would actually
be locally stratified for any data as long as the move relation is acyclic. As we shall show
later in this thesis, it frequently happens that a program is not statically stratified but is
locally stratified given that some of the EDB relations it refers to are well-founded.
The local stratification criterion is a bit fragile though, because it does depend on the
structure of the program. Consider a variant of the above rule that is expected to have the
exact same model:
win(X ) ← play(X ,Y ), not win(Y ).
play(X ,Y ) ← move(X ,Y ).
If we instantiate that program in all possible ways, knowing initially nothing
about play, we end up with a set of instantiated rules, part of which is:
win(1) ← play(1, 2), not win(2).
win(2) ← play(2, 1), not win(1).
That part is not stratified, as win(2) depends negatively on win(1) and vice versa. However,
it suffices to compute first the minimal model of the module that deals with the play rule to
realise that some subgoals, like play(2, 1), are false. These subgoals can then be pruned away
from the rule win in order to obtain a correctly stratified set of instantiated rules again.
Kenneth Ross made this observation and proposed a new class of Datalog programs with
negation, called modularly stratified programs [Ros94]. A program is modularly stratified if
and only if its mutually recursive components are locally stratified. Naturally, if a program
is locally stratified, then it is modularly stratified. Again, like for local stratification, we can
impose some restrictions on the EDB relations to guarantee that a program is modularly
stratified. We shall come back to the concept of modular stratification, which is more robust
than local stratification, in Chapter 5.
To conclude our brief overview of the different classes of general logic programs, we should
mention that other semantics have been proposed to deal with any general logic program with
no restriction whatsoever, notably the well-founded semantics [vRS91] and the stable model
semantics [GL88]. These two semantics are three-valued semantics: literals may be true,
false or undefined. A notable point is that, when a program has a total semantics (i.e. a
model where every fact is either true or false), the well-founded and the stable model semantics
coincide, and that happens for a larger class of programs than the class of modularly stratified
programs. The diagram in Figure 3.4 taken from [Ull94] summarises the containment of all
semantics classes for Datalog programs.
Figure 3.4: Containment of the different classes of Datalog programs: no negation ⊂ statically stratified ⊂ locally stratified ⊂ modularly stratified ⊂ two-valued well-founded semantics, itself contained in both the stable and the well-founded semantics.
3.5 Summary and references
In this chapter, we have presented Datalog [GM78] with an emphasis on the class of (stati-
cally) stratified programs, which has a clear least fixpoint semantics.
Datalog can be embedded in a Turing-complete logic programming system, such as XSB
[SSW94, CW96], where subgoals are regarded as top-down procedure calls and treated one
by one. In such a case, one shall not use the standard SLD resolution of Prolog [Llo87], which
may lead to non-termination and redundant computations, but instead resort to tabled
resolution to avoid the infinite expansion of the search tree [War92].
Another implementation route relies on the fact that predicates can be seen as relations,
and logical operators as relational algebra operations [RG02]. The computation in that
case proceeds bottom-up treating one recursive stratum after another, each stratum being
a set of recursive rules that do not depend negatively on themselves. This is the approach
taken in many systems, e.g. [TZ86, PDR91, RSSS94, HVMV05, WACL05]. That approach,
however, may suffer from the unnecessary computation of irrelevant facts during the query
evaluation. The magic-set transformation is a well-known technique that tries to overcome
that problem [BMSU86, BR87]. The idea is to rewrite Datalog programs to materialise the
querying context of each predicate. The bottom-up resolution method with magic sets was
shown to be more efficient for definite programs than the tuple-at-a-time top-down approach
[Ull89].
An alternative, called the Query-Subquery approach [Vie86], is to evaluate programs top-
down but a set at a time, thus enabling optimised relational algebra operations. The idea is
similar to that of magic sets but the calling context of predicates is propagated at runtime.
We shall actually see in Chapter 5 that logical parts of JunGL scripts are evaluated using a
variant of that technique.
We have also noted in this chapter that some queries cannot be expressed in stratified
Datalog. The database community has introduced larger classes of Datalog programs with
negation [Ull94], namely the class of locally stratified programs [Prz88] and of modularly
stratified programs [Ros94], as well as the more general well-founded semantics [vRS91] and
stable model semantics [GL88]. A relevant class in the context of this thesis is that of modu-
larly stratified programs. Modular stratification can in the general case only be determined
at runtime since it depends on the input of the program. Nonetheless, strong enough con-
ditions on EDB relations, such as acyclicity, can guarantee the modular stratification of a
program.
We now turn to describing precisely how logical features translate to relational equations
reminiscent of the set-based evaluation of Datalog. As we shall see, the rationale of returning
results in a meaningful order will lead us to depart from the usual Datalog semantics
and to introduce an ordered variant of Datalog that works over sequences rather than sets.
Chapter 4
Ordered semantics of the logical
features
Our language is a functional language in the style of ML with embedded logical features.
The functional constructs have the expected ML semantics, but the evaluation of predicates
differs from that of usual logical languages. It is not Prolog-like, since queries are guaranteed to
terminate (unless one calls a non-terminating or impure function as a non-binding test inside
a query). It is not quite normal Datalog either as we have the additional requirement of
maintaining results in a sequential order. We discuss in this chapter the rationale behind
that requirement, and introduce a novel variant of Datalog which operates over duplicate-free
sequences rather than sets. We then give the semantics of the logical features in JunGL by
translating predicates, edges and path queries constructs to this ordered variant of Datalog.
4.1 Why order matters
Programs are ordered trees. A block is a list of statements; a method has a list of parameters.
The order of statements in a block encodes the meaning of the program, and that order
obviously needs to be maintained during behaviour-preserving transformations. For import
clauses or class members, the order does not encode any meaning, and a permutation of such
elements would not change the behaviour of the program. In the context of a source-to-source
transformation tool, however, preserving the layout of the original program as far as possible is crucial.
The order in which elements occur in the code appears in fact to be relevant in almost all
cases.
Nevertheless, that order, also known as the document order in the XML community,
could be reconstructed at the end of each query. It is indeed straightforward to work out an
appropriate indexing scheme to rebuild an ordered tree from a flat set of elements. Each query
could be internally evaluated as a set, and results would then be returned in the document
order.
The problem is that, often, the document order is not the order intended by the user. If
we look at one of our first edge definitions again, we realise the intent there is to find the
‘closest’ matching variable declaration:
let edge lookup r : Var → ?dec =
    first ([r] treePred+ [?dec:VarDecl] & r.name == ?dec.name)
By ‘closest’, we mean the element that is reachable with the minimum number of iteration
steps when navigating along a particular edge (here the treePred edge). In that case, the
indexing scheme approach would not have worked out properly. The whole predicate and
path queries evaluation mechanism ought to preserve results in an order that is intuitive to
the user. In the remainder of this chapter, we explain precisely what the result order is, and
how it is computed.
4.2 Duplicate-free sequences
The idea for encoding the order is to base the semantics of the logical features in JunGL
on relational operations over sequences of tuples that do not contain duplicates. In this
section, we first introduce some notations and functions related to duplicate-free sequences
and formally define the relational operators over these sequences.
4.2.1 Notations
Tuples We consider n-tuples over a finite domain of elements D. Each n-tuple is of the
form t = (x1, . . . , xn) ∈ Dn .
We use the notation X to denote all the columns of an n-tuple, and X.i to refer to its i-th
column. In addition, we shall use {X1, . . . , Xk} to denote an arbitrary set of columns among
X. In that case, each Xi (with i ≤ k ≤ n) is a unique reference to a column in X (e.g. X1
could refer to the last column of a 4-tuple).
Sequences A sequence sn = 〈t0, . . . , tN−1〉 is an ordered set of n-tuples. As in sets,
duplicates are not allowed, and we shall therefore represent a sequence by a total injective
function:
sn : [0 .. N − 1]→ Dn
The arity n of each tuple is the arity of the sequence, while N is the finite length of the
sequence that we also write |sn |. If N = 0, sn is the empty sequence that we shall write ε.
We use X sn to refer to the columns of sn . Naturally, we have |X sn | = n. Furthermore,
we refer to the range of a sequence sn with the usual notation, ran(sn), and by definition,
we also have |ran(sn)| = |sn | = N .
Finally, we use Seq to denote any set of sequences, and we write seq S for the set of all
sequences built from the elements of the set S . Notably, the set of all sequences of any length
over Dn is seq Dn . We also use seq kS to refer to the set of all sequences of at most k elements
in S . In particular, seq 1Dn is the set of all sequences over Dn of at most one tuple.
Some utility functions We introduce, for later use in the thesis, a function to turn a
sequence into a set:
setify : seq Dn → PDn
setify(sn ) = ran(sn)
We shall also need a function head that takes the head element of a non-empty sequence:
head : seq Dn \ {ε} → Dn
head(sn) = sn(0)
Haskell provides a similar function on lists. In fact, Haskell is ideal for expressing manipulation
of lists, and we shall hence use its model in our coming definitions for the sake of
readability. Note also that for brevity we sometimes omit the arity subscript of a sequence
and refer to sequences just with r or s .
4.2.2 Relational operations
For each standard relational operator on sets, we seek to define in Haskell an equivalent
operator on duplicate-free sequences. We assume a type Column for column references and
a type Tuple for tuples, as well as two basic functions to drop some columns of a tuple or to
project a tuple on certain columns:
tupleDrop :: [Column]→ Tuple → Tuple
tupleKeep :: [Column]→ Tuple → Tuple
In contrast to Chapter 3 where relations were sets of tuples, we wish to work now with
an ordered data structure, namely streams:
type Sequence = Stream Tuple
We still need to enforce, however, that no duplicates are present in the sequences we
manipulate. We shall therefore be careful to use the traditional nub function to rule out
duplicates. Its definition is:
nub :: Sequence → Sequence
nub [] = []
nub (x : xs) = x : [ y | y ← nub xs, y ≠ x ]
In the remainder, we simply give Haskell definitions for relational operations over se-
quences.
Union The union of two sequences is the concatenation of the two, in which duplicates
have been removed. In Haskell:
∪seq :: Sequence → Sequence → Sequence
r ∪seq s = nub (r ++ s)
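As a runnable sketch of this operator, we can model tuples as lists of integers and sequences as finite Haskell lists (a simplification for illustration: the thesis uses an abstract domain D and streams, and the name unionSeq is ours):

```haskell
import Data.List (nub)

-- Tuples modelled as integer lists; a Sequence is a duplicate-free list of tuples.
type Tuple = [Int]
type Sequence = [Tuple]

-- Union: concatenate, then remove duplicates, keeping first occurrences.
unionSeq :: Sequence -> Sequence -> Sequence
unionSeq r s = nub (r ++ s)
```

Note that the left-hand sequence dictates the front of the result: `unionSeq [[1,2],[2,1]] [[2,1],[1,1]]` yields `[[1,2],[2,1],[1,1]]`.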
Projection The projection of a sequence of tuples of arity n on some columns X1, . . . ,Xk
(with k ≤ n) is the sequence where all tuples have been projected to these columns and
where duplicates have been discarded:
πseq X1,...,Xk :: Sequence → Sequence
πseq X1,...,Xk = nub · map (tupleKeep [X1, . . . , Xk])

For convenience, we also introduce a projection-out operator π̄seq that projects out some
columns of the tuples:

π̄seq X1,...,Xk :: Sequence → Sequence
π̄seq X1,...,Xk = nub · map (tupleDrop [X1, . . . , Xk])
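Both projection variants can be sketched concretely; here columns are 0-based indices into integer-list tuples, and all names (projSeq, projOutSeq, etc.) are our own illustrative choices:

```haskell
import Data.List (nub)

type Column = Int          -- 0-based column index (an illustrative convention)
type Tuple = [Int]
type Sequence = [Tuple]

tupleKeep, tupleDrop :: [Column] -> Tuple -> Tuple
tupleKeep cs t = [ t !! c | c <- cs ]
tupleDrop cs t = [ v | (c, v) <- zip [0..] t, c `notElem` cs ]

-- Projection keeps the named columns; projection-out drops them.
-- In both cases nub discards the duplicates the projection may create.
projSeq, projOutSeq :: [Column] -> Sequence -> Sequence
projSeq cs = nub . map (tupleKeep cs)
projOutSeq cs = nub . map (tupleDrop cs)
```

For example, `projSeq [0] [[3,4],[1,3],[3,2]]` yields `[[3],[1]]`: the duplicate value 3 in the first column is kept only at its first position.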
Selection We may filter a sequence of arity n in two ways: either by selecting all tuples
for which two columns Xi and Xj (with i , j ≤ n) share identical values, or by keeping only
the tuples in which a column Xi has the value d (with i ≤ n).
In Haskell, the selection with field equality is:
σseq Xi=Xj :: Sequence → Sequence
σseq Xi=Xj sn = filter f sn
  where f x = (tupleKeep [Xi] x == tupleKeep [Xj] x)
Similarly, the selection with an arbitrary column test is defined as:
σseq Xi=d :: Sequence → Sequence
σseq Xi=d sn = filter f sn
  where f x = (tupleKeep [Xi] x == [d])
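Both selections are order-preserving filters. A concrete sketch (with our own 0-based column convention and illustrative names):

```haskell
type Column = Int
type Tuple = [Int]
type Sequence = [Tuple]

-- Selection with field equality: keep tuples whose columns i and j agree.
selEq :: Column -> Column -> Sequence -> Sequence
selEq i j = filter (\t -> t !! i == t !! j)

-- Selection with a constant test: keep tuples whose column i holds d.
selConst :: Column -> Int -> Sequence -> Sequence
selConst i d = filter (\t -> t !! i == d)
```

Since filter never reorders or duplicates elements, no call to nub is needed here.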
Cross product In the cross product, or cartesian product, of two sequences of possibly
different arity rm and sn , the first tuple of rm is mapped to all the elements of sn , then the
second tuple of rm is mapped to all the elements of sn , and so on. Using list comprehensions:
(×seq) :: Sequence → Sequence → Sequence
rm ×seq sn = [ x ++ y | x ← rm, y ← sn ]
Notice that we can omit the call to nub here as it is clear the list comprehension cannot yield
any duplicate if both rm and sn were themselves duplicate-free.
For reasoning later in the thesis, we shall use an equivalent definition in a combinatorial
style based on map and concat , plus the explicit call to nub. That is:
(×seq) :: Sequence → Sequence → Sequence
rm ×seq sn = (nub · concat · map (λx → map (λy → x ++ y) sn)) rm

Finally, note that we shall sometimes use the exponential form rⁿ as a shorthand for
r ×seq · · · ×seq r (n times).
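The combinatorial form of the product runs directly as Haskell (names ours; the redundant nub is kept to match the definition used for reasoning later):

```haskell
import Data.List (nub)

type Tuple = [Int]
type Sequence = [Tuple]

-- Cross product in combinatorial style: each tuple of r is paired, in order,
-- with every tuple of s. nub cannot remove anything when both inputs are
-- duplicate-free, but we keep it for uniformity with the other operators.
crossSeq :: Sequence -> Sequence -> Sequence
crossSeq r s = (nub . concat . map (\x -> map (\y -> x ++ y) s)) r
```

The order is visibly "left argument first": `crossSeq [[1],[2]] [[3],[4]]` gives `[[1,3],[1,4],[2,3],[2,4]]`.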
Negation The negation is expressed in terms of sequence difference. Of course, it may be
that there is no initial sequence to subtract from. In that case, since we work with a closed
world assumption, we can subtract the negated sequence sn from a universe sequence (of
similar arity) built out of our domain D. This implies that elements of the domain are also
ordered into a sequence. We denote that initial sequence with Dseq . Therefore, in the general
case, negation is formally expressed as:
notseq :: Sequence → Sequence
notseq sn = [ x | x ← Dseqⁿ, x /∈ sn ]
If there is a sequence of greater arity (i.e. m ≥ n) to subtract from, it is usually less
expensive to express it directly as:
rm ∩seq (notseq sn) :: Sequence → Sequence → Sequence
rm ∩seq (notseq sn) = [ x | x ← rm, tupleKeep Xsn x /∈ sn ]
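The difference form can be sketched directly. For illustration we assume (our convention, not the thesis's) that the columns of sn correspond to the first n columns of rm:

```haskell
type Tuple = [Int]
type Sequence = [Tuple]

-- Difference form of negation: keep the tuples of r whose projection on the
-- first n columns does not occur in s. The order of r is preserved.
diffSeq :: Int -> Sequence -> Sequence -> Sequence
diffSeq n r s = [ x | x <- r, take n x `notElem` s ]
```

For example, subtracting the unary sequence `[[2]]` removes every tuple of r starting with 2.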
First We introduce an unusual operator that has no counterpart in set-based relational
algebra. Like projection, first is parameterised by some columns X1, . . . ,Xk (with k ≤ n).
The operator groups a sequence sn on these columns and takes the head of each subsequence:
first X1,...,Xk :: Sequence → Sequence
first X1,...,Xk sn = nub [ head (filter (f x) sn) | x ← sn ]
  where f x y = (tupleKeep [X1, . . . , Xk] x == tupleKeep [X1, . . . , Xk] y)
The following small example over a sequence with columns X illustrates that definition:
firstX .1〈(1, 2), (1, 3), (2, 3), (1, 5)〉 = 〈(1, 2), (2, 3)〉
firstX .2〈(1, 2), (1, 3), (2, 3), (1, 5)〉 = 〈(1, 2), (1, 3), (1, 5)〉
Note that, if we do not group on any column, first simply yields the singleton sequence
containing the head of the whole sequence:

first sn = nub [ head sn | x ← sn ] = 〈head sn〉
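A runnable sketch of first, which reproduces the example above with 0-based column indices in place of the X.i notation (firstSeq is our name):

```haskell
import Data.List (nub)

type Column = Int
type Tuple = [Int]
type Sequence = [Tuple]

-- Group the sequence on the given columns and keep the head of each group,
-- preserving the order in which groups first appear.
firstSeq :: [Column] -> Sequence -> Sequence
firstSeq cs s = nub [ head (filter (\y -> key x == key y) s) | x <- s ]
  where key t = [ t !! c | c <- cs ]
```

Grouping on the first column of 〈(1,2), (1,3), (2,3), (1,5)〉 yields 〈(1,2), (2,3)〉, as in the example; with an empty column list, the whole sequence forms one group and only its head survives.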
For completeness, we also define the remaining classical operators of relational algebra in
terms of the above primitives.
Intersection Intersection is still expressed using cartesian product, selection and projection.
We use X and Y as shorthand for Xrn and Ysn:

(∩seq) :: Sequence → Sequence → Sequence
rn ∩seq sn = πseq X.1,...,X.n (σseq X.1=Y.1, ..., X.n=Y.n (rn ×seq sn))
To wit, the preserved order is the order of elements as they appear in the left-hand side
sequence rn . The previous definition is equivalent to the direct one:
rn ∩seq sn = [ x | x ← rn , y ← sn , x == y ]
Natural join Similarly, the natural join operation that combines information from two
sequences into a possibly bigger one can be expressed using cartesian product, selection and
projection. It is parameterised by the indexes of the columns on which to join, more precisely
by k pairs of indexes (X1,Y1), . . . , (Xk ,Yk) where the first and second elements of each pair
refer respectively to a column of rm and sn . The definition of join is:
(./seq) :: Sequence → Sequence → Sequence
rm ./seq (X1,Y1),...,(Xk,Yk) sn = π̄seq Y1,...,Yk (σseq X1=Y1, ..., Xk=Yk (rm ×seq sn))
In words, we select the tuples of the cartesian product of rm and sn that have identical values
for the specified pairs of columns, and project out the redundant columns. In the remainder
of the thesis, we shall often omit the columns on which to join. For conciseness, we indeed
assume that sequences have labeled columns and that we join on columns that share the
same labels.
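A concrete sketch of the join, with our own 0-based column-pair convention (i indexes the left sequence, j the right one):

```haskell
import Data.List (nub)

type Column = Int
type Tuple = [Int]
type Sequence = [Tuple]

-- Join r and s on the given (i, j) column pairs, keeping all columns of r
-- and dropping the matched (redundant) columns of s. The order of the
-- underlying cross product, driven by r, is preserved.
joinSeq :: [(Column, Column)] -> Sequence -> Sequence -> Sequence
joinSeq ps r s =
  nub [ x ++ dropCols (map snd ps) y
      | x <- r, y <- s, all (\(i, j) -> x !! i == y !! j) ps ]
  where dropCols cs t = [ v | (c, v) <- zip [0..] t, c `notElem` cs ]
```

Joining `[[1,2],[3,4]]` with `[[2,5],[4,6]]` on the pair `(1,0)` combines matching rows into `[[1,2,5],[3,4,6]]`.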
Sequential composition Finally, we shall refer to the sequential composition of two se-
quences. It is obtained by joining two sequences on the last column of the first sequence and
the first column of the second sequence, and projecting out the two intermediate columns.
We define it directly as follows:
(;seq) :: Sequence → Sequence → Sequence
rm ;seq sn = π̄seq X.n, Y.1 (σseq X.n=Y.1 (rm ×seq sn))
Again, the order is guided first by the sequence on the left-hand side. For instance:
〈(1, 2), (1, 3)〉;seq 〈(3, 5), (2, 4)〉 = 〈(1, 4), (1, 5)〉
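Sequential composition admits a direct sketch (seqComp is our name), which reproduces the example just given:

```haskell
import Data.List (nub)

type Tuple = [Int]
type Sequence = [Tuple]

-- Join on (last column of r, first column of s), dropping both matched
-- columns; the left-hand sequence guides the order of the results.
seqComp :: Sequence -> Sequence -> Sequence
seqComp r s = nub [ init x ++ tail y | x <- r, y <- s, last x == head y ]
```

Here `seqComp [[1,2],[1,3]] [[3,5],[2,4]]` gives `[[1,4],[1,5]]`: the tuple (1,2) composes with (2,4) before (1,3) composes with (3,5).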
4.3 Stratified Ordered Datalog
We shall now introduce a novel variant of Datalog which works on these duplicate-free se-
quences rather than usual sets to guarantee that results are returned in a deterministic order.
Quite naturally, we refer to this version of Datalog as Ordered Datalog. Ordered Datalog has
the same constructs as normal Datalog plus the operator first. However, just like negation
in normal Datalog, some of our relational operations on sequences (beyond negation itself)
are nonmonotonic and prevent the correct computation of a least fixpoint. In this
section, we explore the stratification restriction that one must impose on Ordered Datalog
to guarantee the existence of a least fixpoint. We shall notably study the monotonicity of
our relational operators over sequences, and see that stratified Ordered Datalog is just a
refinement of stratified Datalog with an additional order on the results.
4.3.1 Non-termination
To illustrate the problem of non-termination, we shall consider a simple example. Take the
domain made of two elements 1 and 2 and the initial sequence r = 〈(1, 2), (2, 1)〉 whose
setified version is depicted by the graph in Figure 4.1. We wish to compute the transitive
Figure 4.1: Setified graph representation of 〈(1, 2), (2, 1)〉 (two nodes, 1 and 2, with an edge in each direction)
closure of r, and can think of four straightforward relational definitions for it:
i. r+ = r ∪ r+; r
ii. r+ = r ∪ r ; r+
iii. r+ = r+; r ∪ r
iv. r+ = r ; r+ ∪ r
For each of these versions, we sketch each step of the fixpoint computation:
First version: r+ = r ∪ r+; r

0: r+0 = ε
1: r+1 = 〈(1, 2), (2, 1)〉
2: r+2 = 〈(1, 2), (2, 1), (1, 1), (2, 2)〉
3: r+3 = 〈(1, 2), (2, 1), (1, 1), (2, 2)〉 = r+2

A fixpoint is reached.
Second version: r+ = r ∪ r ; r+

0: r+0 = ε
1: r+1 = 〈(1, 2), (2, 1)〉
2: r+2 = 〈(1, 2), (2, 1), (1, 1), (2, 2)〉
3: r+3 = 〈(1, 2), (2, 1), (1, 1), (2, 2)〉 = r+2

Again, a fixpoint is reached.
Third version: r+ = r+; r ∪ r

0: r+0 = ε
1: r+1 = 〈(1, 2), (2, 1)〉
2: r+2 = 〈(1, 1), (2, 2), (1, 2), (2, 1)〉
3: r+3 = 〈(1, 2), (2, 1), (1, 1), (2, 2)〉
4: r+4 = 〈(1, 1), (2, 2), (1, 2), (2, 1)〉 = r+2
5: . . . , and so on, no fixpoint being reached.
Fourth version: r+ = r ; r+ ∪ r

0: r+0 = ε
1: r+1 = 〈(1, 2), (2, 1)〉
2: r+2 = 〈(1, 1), (2, 2), (1, 2), (2, 1)〉
3: r+3 = 〈(1, 2), (1, 1), (2, 1), (2, 2)〉
4: r+4 = 〈(1, 1), (1, 2), (2, 2), (2, 1)〉
5: r+5 = 〈(1, 2), (1, 1), (2, 1), (2, 2)〉 = r+3
6: . . . , no fixpoint is reached.
The evaluation does not terminate for the two latter versions. The intuition behind the
non-termination lies in the order in which we yield results. In the former versions, we return
paths of minimal length first: in front, the paths of length 1 ((1, 2) and (2, 1)), then the paths
of length 2 ((1, 1) and (2, 2)), and nothing more, since paths of greater length will already
have been yielded. In the latter versions, however, we wish to yield the paths of maximal
length first, and the evaluation, in the presence of cyclic data, thus enters an infinite loop
while looking for the longest paths.
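These iterations can be replayed mechanically. The sketch below (helper names ours, with the same toy sequence r) confirms that version (i) stabilises while version (iii) oscillates between two sequences forever:

```haskell
import Data.List (nub)

type Sequence = [[Int]]

unionSeq :: Sequence -> Sequence -> Sequence
unionSeq r s = nub (r ++ s)

seqComp :: Sequence -> Sequence -> Sequence
seqComp r s = nub [ init x ++ tail y | x <- r, y <- s, last x == head y ]

r :: Sequence
r = [[1,2],[2,1]]

-- Version (i):   r+ = r ∪ (r+ ; r)  -- shortest paths first, terminates
stepI :: Sequence -> Sequence
stepI t = r `unionSeq` (t `seqComp` r)

-- Version (iii): r+ = (r+ ; r) ∪ r  -- longest paths first, oscillates
stepIII :: Sequence -> Sequence
stepIII t = (t `seqComp` r) `unionSeq` r

-- All iterates of a step function starting from the empty sequence.
iterates :: (Sequence -> Sequence) -> [Sequence]
iterates f = iterate f []
```

For stepI, the third iterate equals the second: a fixpoint. For stepIII, consecutive iterates keep alternating between two orderings of the same four tuples.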
Our domain of elements being finite, such non-termination can only come from a nonmonotonic
relational operator when handling sequences rather than sets. To understand the
restrictions for Ordered Datalog programs to be safely evaluated through fixpoint computa-
tions, we need to pinpoint operators on duplicate-free sequences that are nonmonotonic.
4.3.2 Chasing nonmonotonic ordered operators
Evidently, the monotonicity property of each operator depends on the inclusion order we
choose for sequences. We shall briefly study two alternatives: the subsequence order and
the prefix order. Because union and cross product are binary operators, we introduce
for each of them two unary operators defined by fixing one of the arguments (either on the
left or on the right). Thus, for every sequence y ∈ Seq, we define the left and right union
operators ⊕y and y⊕ such that:
∀x ∈ Seq · ⊕y(x ) = x ∪seq y
∀x ∈ Seq · y⊕(x ) = y ∪seq x
Similarly, for every sequence y ∈ Seq, we define the left and right cross product operators
⊗y and y⊗.
For our study of monotonicity, it is worth noting a few distributive laws that can be
derived from the definitions of the operators given in the previous section. This approach is
similar to the work by Seres and Spivey on the algebra of logic programming [SSH99, Ser01],
but far less complete: we are only interested in proving the monotonicity of our
operators. We note the following distributive laws over union:
πseq(r ∪seq s) = πseq(r) ∪seq πseq(s) (4.1)
σseq(r ∪seq s) = σseq(r) ∪seq σseq(s) (4.2)
y⊕(r ∪seq s) = y⊕(r) ∪seq y⊕(s) (4.3)
⊗y(r ∪seq s) = ⊗y(r) ∪seq ⊗y(s) (4.4)
All these laws are easily shown by using the Haskell model of our relational operations
over sequences. To illustrate, we shall prove here that left cross product distributes over union
(4.4). We first give a few useful laws involving nub, which are easily shown by induction on
lists. For any well-typed function f and any pair of lists r and s , we have:
nub ·map f · nub = nub ·map f (4.5)
nub · concat · nub = nub · concat (4.6)
nub (r ++ s) = nub (nub r ++nub s) (4.7)
Then, to prove the distributive law itself, we reuse the definition of cross product in
combinatorial style that we gave in the previous section:
(r ∪seq s)×seq y
= {definitions of cross product and union}
(nub · concat · map f ) (nub (r ++ s)) where f t = map (λt′ → t ++ t′) y
= {law (4.6)}
(nub · concat · nub ·map f · nub) (r ++ s)
= {law (4.5)}
(nub · concat · nub ·map f ) (r ++ s)
= {distributivity of map over ++ }
(nub · concat · nub) (map f r ++map f s)
= {law (4.6)}
(nub · concat) (map f r ++map f s)
= {distributivity of concat over ++ }
nub (concat (map f r)++ concat (map f s))
= {law (4.7)}
nub ((nub · concat ·map f ) r ++(nub · concat ·map f ) s)
= {definitions of cross product and union}
(r ×seq y) ∪seq (s ×seq y)
All other distributive laws can be proved in the same manner. We now focus on the
monotonicity of our operators on two different partial orders.
Monotonicity under subsequence order The subsequence order is the usual order to
consider on sequences (whether they are duplicate-free or not, infinite or not).
Definition 4.1 (Subsequence) A subsequence of some sequence is a new sequence which
is formed from the original sequence by deleting some of the elements without disturbing the
relative positions of the remaining elements.
Definition 4.2 (Subsequence order) The subsequence order is a binary relation ⊆ on Seq
such that for all r , s ∈ Seq, r ⊆ s if and only if r is a subsequence of s.
For instance, we have:
〈2, 3, 5〉 ⊆ 〈1, 2, 3, 4, 5, 6, 7〉
The subsequence order appears however to be inappropriate for the evaluation of Ordered
Datalog. The counterexample below shows that projection is indeed nonmonotonic with
respect to that order.
Counterexample: Take two sequences r = 〈(3, 4), (1, 3)〉 and s = 〈(1, 2), (3, 4), (1, 3)〉. From
the definition of subsequence, we have r ⊆ s. Now, consider the projection of each sequence
on its first column. Because we rule out duplicates in the projected sequence, we have
πseq1(r) = 〈3, 1〉 and πseq1(s) = 〈1, 3〉, hence πseq1(r) ⊈ πseq1(s).
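The counterexample can be checked mechanically (proj1 and isSubseqOf are our illustrative names; recent versions of base also provide Data.List.isSubsequenceOf):

```haskell
import Data.List (nub)

-- Projection on the first column, with duplicate removal.
proj1 :: [[Int]] -> [[Int]]
proj1 = nub . map (take 1)

-- Subsequence test: every element of the first list occurs in the second,
-- in the same relative order.
isSubseqOf :: Eq a => [a] -> [a] -> Bool
isSubseqOf [] _ = True
isSubseqOf _ [] = False
isSubseqOf (x:xs) (y:ys)
  | x == y    = isSubseqOf xs ys
  | otherwise = isSubseqOf (x:xs) ys

r, s :: [[Int]]
r = [[3,4],[1,3]]
s = [[1,2],[3,4],[1,3]]
```

Indeed r is a subsequence of s, yet proj1 r = [[3],[1]] is not a subsequence of proj1 s = [[1],[3]].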
Projection is implicitly used in almost all Datalog programs, and its use cannot be restricted
as such. This is unfortunate because all the other operators (except first and not, of course)
would appear to be monotonic under subsequence order. We must therefore look at another
inclusion order.
Monotonicity under prefix order Another order that makes sense, and which is perhaps
more intuitive, is the prefix order.
Definition 4.3 (Prefix) A sequence r is a prefix of a sequence s if s consists of the sequence
r followed by zero or more other elements. That is, for all n such that r(n) is well defined,
r(n) = s(n).
Definition 4.4 (Prefix order) The prefix order is a binary relation v on Seq such that for
all r , s ∈ Seq, r v s if and only if r is a prefix of s.
Under this order, the monotonicity of projection, selection, right union and left cross product
follows easily from the distributive laws we gave at the beginning of the section.
Proof: Let the function f represent any unary operator among projection, selection, right
union and left cross product. Suppose r , s ∈ Seq such that r v s . From the definition of
prefix, there is a smallest sequence t such that s = r ∪seq t .
f (s)
= {by definition of prefix}
f (r ∪seq t)
= {by distribution of f over union (4.1), (4.2), (4.3), (4.4)}
f (r) ∪seq f (t)
w {by definition of union}
f (r)
Hence f is monotonic. ∎
On the other hand, the remaining operators negation, left union and right cross product
are nonmonotonic. We note, however, that in the special case where the sequence y is of
length at most one, the corresponding right cross product y⊗ is monotonic. The case is
obvious for y = ε. Using the same proof as above, the case for y = 〈t〉 is also straightforward
if we show that 〈t〉 ×seq (r ∪seq s) = (〈t〉 ×seq r) ∪seq (〈t〉 ×seq s). That equality is apparent
if we reduce its two sides independently.
Finally, in Ordered Datalog, we allow the extra operator first. This operator has no
counterpart in normal Datalog. Nonetheless, if we were to draw an analogy, we would
suggest a non-deterministic operator choose that restricts each relevant group of tuples to
one of its elements. Unlike choose, which would have to be handled like a nonmonotonic
aggregate in normal Datalog, first is similar to projection and hence monotonic (under
prefix order only).
We can hence summarise the monotonicity of our primitive sequence-based relational
operators as follows:
Operator                        Monotonicity
Projection (πseq)               monotonic
Selection (σseq)                monotonic
Right union (y⊕)                monotonic (for all y ∈ Seq)
Left union (⊕y)                 nonmonotonic (for some y ∈ Seq)
Right cross product (y⊗)        nonmonotonic (for some y ∈ Seq with |y| > 1)
Left cross product (⊗y)         monotonic (for all y ∈ Seq)
Negation (not)                  nonmonotonic
First (first)                   monotonic
All other useful operators are derived from these primitive operators. In particular, the
intersection, join and sequence operators are expressed with cross product, projection and
selection. Consequently, their left versions are monotonic but not their right versions.
Following the concept of stratified normal Datalog, we conclude that if an Ordered Datalog
program is stratified in such a way that there is no use of negation, left union and right cross
product inside recursion, then it can be evaluated by computing the fixpoint of each stratum
one after the other in topological order. More formally, an Ordered Datalog program is safely
stratified if every rule Ri in a stratum si complies with the following grammar:
Ri ::= ε                          empty sequence
     | Dseq                       universal sequence
     | Ri−1                       rule in a lower stratum
     | Ri                         rule in the same stratum
     | Ri−1 ∪seq Ri               union
     | Ri ×seq Ri−1               cartesian product
     | πseq X1,...,Xk (Ri)        projection
     | σseq Xi=Xj (Ri)            selection with field equality
     | σseq Xi=d (Ri)             selection with arbitrary test
     | notseq (Ri−1)              negation
     | first (Ri)                 first
In the end, we realise that statically stratified Ordered Datalog allows only a limited form of
recursion. Indeed, the restriction on cartesian product is quite severe: it notably rules out
non-linear recursion. We shall see in the next chapter how to accept more general queries and
overcome the restrictions on union and cross product. We first wish to show that Ordered
Datalog programs are consistent with their counterparts in normal Datalog.
4.3.3 A refinement of stratified Datalog
Interestingly, stratified Ordered Datalog can be seen as a data refinement of normal stratified
Datalog where finite sets are refined to finite duplicate-free sequences. We can prove, as
Figure 4.2 shows, that we get the same results by transforming an Ordered Datalog program
into a normal Datalog program and evaluating it with the set-based semantics, as by eval-
uating the original Ordered Datalog program with respect to the sequence-based semantics
and then removing the order from the results.
Figure 4.2: Data refinement (a commuting square: fseq maps sequences to sequences, f maps
sets to sets, and setify relates the two on each side)
We restrict the proof to stratified Ordered Datalog programs which contain no use of
first, since that operator has no counterpart in normal Datalog, and we reuse the fixpoint
formalism that we have introduced for the evaluation of strata in Chapter 3.
There we defined a step function fRj for each Datalog rule Rj in a stratum si, and
lifted these to a step function fi for the entire stratum si. We could then define in (3.1) the
minimal model of each stratum si as the least fixpoint of its step function. We recall the
definition here:

[[si]] = lfp(fi)
Ordered Datalog programs are evaluated stratum by stratum, like normal stratified Datalog
programs. We write fseqRj for the step function of a corresponding Ordered Datalog rule
Rj, and fseqi for the step function lifted to the whole stratum si. By analogy with [[si]], we
denote the least model of that stratum with:

〈〈si〉〉 = lfp(fseqi)    (4.8)
Our data refinement proof hence reduces to the proof that for any stratum si we have:
[[si ]] = setify(〈〈si 〉〉)
where setify is the function obtained by lifting setify to tuples of sequences:

setify : seq Dn(i,1) × · · · × seq Dn(i,ki) → P Dn(i,1) × · · · × P Dn(i,ki)
setify((X1, . . . , Xk)) = (setify(X1), . . . , setify(Xk))

Similarly, we shall write ∅ for (∅, . . . , ∅) and ε for (ε, . . . , ε).
Proof: We first note that each set-based relational operator ⊕ relates to its counterpart on
sequences ⊕seq with:
⊕ · setify = setify · ⊕seq
This is no surprise as we have defined our relational operators over sequences to follow the
semantics of their set-based counterparts.
Now, by swapping each operator of a rule Rj one by one (e.g. ⊕1 · ⊕2 · setify =
⊕1 · setify · ⊕2seq = setify · ⊕1seq · ⊕2seq), we infer the same equality for the step function
of each Rj:

fRj · setify = setify · fseqRj
Finally, we can lift up the result to any entire stratum si and obtain:

fi · setify = setify · fseqi    (4.9)

The end of the proof is then straightforward. By definition, there exists an n such that
[[si]] = fi^n(∅), and an n′ such that 〈〈si〉〉 = fseqi^n′(ε). The values fi^n(∅) and fseqi^n′(ε)
being fixpoints, we can take N = max(n, n′) so that:

[[si]] = fi^N(∅)
〈〈si〉〉 = fseqi^N(ε)

Finally, we can prove the equality:

[[si]] = fi^N(∅)
       = fi^N(setify(ε))
       = setify(fseqi^N(ε))    {by N applications of (4.9)}
       = setify(〈〈si〉〉) ∎
The fact that stratified Ordered Datalog is a data refinement of stratified Datalog is
important. Indeed, stratified Datalog programs are known to have a very intuitive semantics,
and we now know that Ordered Datalog follows that same intuitive semantics. Furthermore,
if we do not need query results in order, we can simply treat any first-free query as a
normal Datalog query.
We shall now give the semantics of the logical features in JunGL by translating predicates,
edges and path queries constructs to Ordered Datalog.
4.4 Data model
Before we explain the semantics of the logical features, we need to describe precisely the
underlying data structure that is being queried in JunGL. While introducing the design of
JunGL, we have stressed the fact that the representation of the object program is initially a
simple AST (or collection of ASTs). The tree is then further decorated through the definition
of edges, which turns it into a graph. We treat that initial tree as a collection of EDB relations
(i.e. a database representing the program), and we handle the super-imposed graph defined
by the various edges as IDB relations (i.e. a view on top of the program tree).
In this section, we describe the data model of the initial program tree and introduce some
useful functions for querying it.
Domain The database on top of which queries are evaluated consists of a collection of
ASTs that is stored in memory. Hence the values that are manipulated are mostly, but not
exclusively, nodes: a node may admit a field that is not a node. For instance, in the AST of a
While program (whose grammar is given in Figure 2.1), nodes of type Var have a field name
of type string.
We denote the full domain of values that can be queried with D, and the set of all nodes
among it with Node. Although D includes nodes, booleans, strings, integers, lists and tuples,
we must stress that it is finite. It does not include the set of all possible strings, or all possible
lists, but only the elements that are currently held in memory. As we will see shortly, we
ban the creation of new values during query evaluation. Notably, there is no mechanism for
binding a logical variable to a fresh constant. The set D is fixed during each query evaluation.
As we wish to return the results of each query in a deterministic order, the arrangement
of elements in the original domain D obviously matters. We hence need to assign an order
to values in memory. Let Dseq be the sequence over our domain that reflects that order, and
Nodeseq the subsequence that gives the order of node values only.
Types In the presentation of the semantics to follow, we need to refer to the precise data
type of each AST node. We denote the set of all types with Type and the subset of all
AST data types with NodeType. Each node has a type τ ∈ NodeType, and we introduce the
following function to retrieve the type of a node:
type :: Node → NodeType
We shall need a well-founded relation to reflect the type hierarchy of the AST data types.
We write τ ≺ τ ′ when τ is a proper subtype of τ ′, and τ � τ ′ if and only if τ = τ ′ or τ ≺ τ ′.
Fields We also need to refer to the labeled fields of each node. We call FieldName the set
of all field names, and we introduce a function to return all the field names of a node type:
fields :: NodeType → P FieldName
Furthermore, we need a function to retrieve the value of a field for a given node:
fieldValue :: Node → FieldName → D
Finally, we introduce a special function children that, given a field name ℓf and a node
n, returns the sequence of children of n present in field ℓf. Precisely:

children :: FieldName → Node → seq Node
children ℓf n = let v = fieldValue n ℓf in
                if v is a node then 〈v〉
                else if v is a list of nodes then sequenceOf v
                else ε
The function sequenceOf simply turns a list into a sequence while preserving the order of
elements. We use it to make it explicit that we yield a sequence here. Note that the same
node cannot occur twice in a list of children nodes, and therefore we do not need to call nub.
If node n has no field ℓf, or if the field ℓf of n is neither a node nor a list of nodes, then
children ℓf n simply returns the empty sequence.
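A toy model of this lookup can make the three cases explicit. The node representation below is hypothetical (not JunGL's actual data structures), chosen only to exercise the definition:

```haskell
-- Hypothetical field values: a single node, a list of nodes, or a leaf value.
data Value = VNode Node | VList [Node] | VLeaf String
  deriving (Eq, Show)

newtype Node = Node { nodeFields :: [(String, Value)] }
  deriving (Eq, Show)

-- children returns the node children stored under the given field name,
-- or the empty sequence when the field is absent or not node-valued.
children :: String -> Node -> [Node]
children f n = case lookup f (nodeFields n) of
  Just (VNode m)  -> [m]
  Just (VList ms) -> ms
  _               -> []
```

A field holding a list of two statement nodes yields both children in order; a string-valued or missing field yields the empty sequence.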
Built-in tree navigation We also have the following navigation functions that relate
nodes to other nodes in the original tree representation of the program. These functions
correspond to the built-in edges described in Table 2.1.
parent :: Node → seq1 Node
child :: Node → seq Node
firstChild :: Node → seq1 Node
lastChild :: Node → seq1 Node
successor :: Node → seq1 Node
predecessor :: Node → seq1 Node
listSuccessor :: Node → seq1 Node
listPredecessor :: Node → seq1 Node
4.5 Translating predicates, edges and path queries
We shall now explain how we evaluate predicates, edges and path queries over the above
data model. First, we introduce the syntactic constructs for building up predicates, edge
bodies and path queries. Next, we give the semantics of each of the constructs by translating
them to relational equations over duplicate-free sequences that are given, for now, the least
fixpoint interpretation of stratified Ordered Datalog programs. We shall see in Chapter 5
that, in fact, we support more general ordered Datalog programs, but the translation scheme
we give here is general and shall remain the same.
4.5.1 Abstract syntax
For clarity, we do not use the exact parse tree of JunGL presented in Appendix A, but give
a core abstract syntax for predicates and path queries. Differences are minor though, and we
explain them briefly below.
First, we require any existential predicate local to bind a single identifier.
This is not a restriction, since
local ?x ?y : p(?x, ?y)
is just syntactic sugar for
local ?x : local ?y : p(?x, ?y)
Furthermore, we represent simple tests (such as an equality ?x = ?y + 1) as a pure
function from identifiers to booleans. Indeed, tests never bind logical variables; they are just
used to filter tuples. Having only pure functions guarantees that we do not update the tree
structure during query evaluation: facts about the tree structure are extensional only, i.e.
known before the evaluation of any query.
In addition, the only terms we consider are logical variables. It might look like complex
non-ground terms are allowed in JunGL, but that is simply not the case. There is indeed no
unification in our evaluation mechanism and, through the simple use of predicate calls, it is
impossible to bind a logical variable to a freshly built value. Any complex expression, such as
a function call, a list constructor or an arithmetic expression, that appears in the argument
of a predicate call can be replaced by a fresh logical variable. In that case, the replacement
comes with an additional filter conjunct on the side of the predicate call, in order to enforce
the equality of the fresh variable to the expression that has been extracted. To illustrate,
p(?x + 1, ?y)
translates, in our core language of logical features, to
local ?z : p(?z, ?y) & f ?z ?x
where f is resolved in the environment to the following function:
f :: Integer → Integer → Bool
f z x = (z == x + 1)
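The flattening can be replayed directly. The relation p, the active domain DOM, and the concrete tuples below are invented sample data; the point is only that the fresh variable ?z is bound by the call to p and then related to ?x by the pure filter f:

```python
# Invented sample relation p and active domain for ?x.
p = [(3, 10), (5, 20)]
DOM = range(10)

# The pure filter extracted from the argument ?x + 1.
def f(z, x):
    return z == x + 1

# Matches of p(?x + 1, ?y), evaluated as local ?z : p(?z, ?y) & f ?z ?x,
# with ?z projected away from the result.
answers = [(x, y) for (z, y) in p for x in DOM if f(z, x)]
```

Crucially, f never binds anything: it only filters candidate tuples, so the tree structure cannot be updated during query evaluation.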
Finally, we omit namespaces and assume we evaluate attributes beforehand so that we
can simply consider them as additional fields in the AST structure. We also assume we have an
environment ρ for resolving names which is already extended with definitions of all predicates,
edges, functions and AST data types. We shall use the function resolve to lookup a definition
for a given name in that environment. Furthermore, when translating the body of an edge,
we assume we have access to the name of the variable capturing the source node of the edge.
That name is obtained as a singleton by calling the function sourceVar on our environment
ρ. If we are not currently translating an edge body, sourceVar returns the empty set.
In the end, we really focus on the semantics of predicates and path queries and their core
abstract grammar reads as follows:
i : LogicalIdentifier
`p : PredicateName
`e : EdgeName
`f : FieldName
`τ : TypeName
`λ : FunctionName

p : Predicate
p ::= true
    | false
    | p | p
    | p & p
    | ! p
    | local i : p
    | first p
    | `p ( i1, . . . , in )
    | `λ ( i1, . . . , in )
    | pp

pp : PathPredicate
pp ::= np ( ep np )*

np : NodePredicate
np ::= [ i [ : [ ! ] `τ ] ]

ep : EdgePredicate
ep ::= `f
    | `e
    | ( cep )
    | ep ; ep
    | ep +
    | ep *

cep : ComplexEdgePredicate
cep ::= ep [ pp ]
    | pp ep
    | local i : cep
    | cep & p
Note that we distinguish basic edge predicates from more complex ones. In that way we
mirror an important syntactic restriction: a complex edge predicate, which can be a path
predicate without a starting node or without an ending node, needs to be bracketed to serve
as a basic edge predicate. Inside brackets, the complex edge predicate can be further exploited
through its (possibly reflexive) transitive closure. As an illustration, we recall a definition for data
dependency introduced in Chapter 2:
[?y] (local ?z : cfsucc [?z] & ![?z] def [?v])* ; cfsucc [x]
We are now ready to give the semantics of predicates and path queries by translating each
syntactic construct appearing in the above grammar to a relational equation. Each equation
shall be expressed with the relational operators over sequences that we have introduced at
the beginning of the chapter.
4.5.2 Relational equations
We introduce five functions Sp , Spp , Snp , Sep , and Scep to denote, respectively, the sequences
resulting from the evaluation of a predicate, a path predicate, a node predicate, an edge
predicate and a more complex edge predicate:
Sp : Predicate → seq D∗
Spp : PathPredicate → seq D∗
Snp : NodePredicate → seq Node
Sep : EdgePredicate → seq D∗
Scep : ComplexEdgePredicate → seq D∗
We use the notation [[. . . ]] to indicate the syntactic structure to which we give a meaning.
Bits of pure syntax are written in teletype font, whereas terms in italic fonts stand for other
constructs. The meaning of each construct depends on the meaning of these other terms.
We say the semantics are given by induction on the syntactic structure of the program.
To be fully precise, the evaluation functions should all be parameterised by the envi-
ronment ρ but we keep it implicit most of the time. When needed, we will simply write
S[[. . . ]]ρ.
Usual predicate constructs We start by giving the definition of Sp for each usual pred-
icate construct. Not surprisingly, this translation follows the usual mapping of predicate
calculus to relational algebra mentioned in Section 3.2.3 and extends it with the noteworthy first
operator present in JunGL.
We assume, for ease of reading, that each column of a sequence is labelled with the name of
the corresponding variable in its predicate counterpart. In particular, the conjunction of two
predicates p and q is evaluated as a join of their relational interpretations on the columns
that have the same labels. Also note that we omit the usual renaming of columns that is
needed when evaluating a predicate call `p(i1, . . . , in).
Sp [[true]] = 〈()〉
Sp [[false]] = ε
Sp [[p | q]] = Sp [[p]] ∪seq Sp [[q]]
Sp [[p & q]] = Sp [[p]] ./seq Sp [[q]]
Sp [[!p]] = notseq(Sp [[p]])
Sp [[local i : p]] = πseq i(Sp [[p]])
Sp [[first p]] = firstS (Sp [[p]]) where S = sourceVar ρ
Sp [[`p(i1, . . . , in)]] = Sp [[resolve ρ `p ]]
Sp [[`λ(i1, . . . , in)]] = σseq f(Dnseq) where f = resolve ρ `λ
Sp [[pp]] = Spp [[pp]]
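The operators on the right-hand sides above can be read as executable operations on duplicate-free sequences of rows. The following sketch is a simplification for illustration, not the exact model of Chapter 3; a row is represented as a mapping from variable names to values, and the order of the left operand dominates:

```python
def nub(rows):
    """Keep only the first occurrence of each duplicate row."""
    out = []
    for r in rows:
        if r not in out:
            out.append(r)
    return out

def union_seq(xs, ys):
    """∪seq: ordered union, dropping later duplicates."""
    return nub(xs + ys)

def join_seq(xs, ys):
    """⋈seq: natural join on the columns the two rows share."""
    def agree(r, s):
        return all(r[v] == s[v] for v in r.keys() & s.keys())
    return nub([{**r, **s} for r in xs for s in ys if agree(r, s)])

def project_away_seq(i, xs):
    """πseq i: project the column i away (translation of local i : p)."""
    return nub([{v: k for v, k in r.items() if v != i} for r in xs])

def select_seq(f, xs):
    """σseq f: filter rows with a pure boolean function."""
    return [r for r in xs if f(r)]
```

Because every operator ends with nub, each result is again duplicate-free, and the enumeration order of the outer comprehension fixes the deterministic order of matches.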
Path predicates We now focus on the definition of Spp . Importantly, the sequence re-
sulting from the evaluation of a path predicate is not necessarily a binary sequence. Indeed,
some other logical variables may be bound inside a path predicate: the most obvious case
is when one binds some intermediate nodes inside the path. To illustrate briefly, here is a
stream comprehension in JunGL that is based on a path predicate of arity three:
{ (?x, ?y, ?z) | [?x:Program] child* [?y:If] condition [?z] }
We search for paths from a node ?x of type Program to the guard ?z of a conditional state-
ment, and we are also interested in the If node ?y itself.
For path predicates, the translation to relational equations is as follows:
Spp [[np]] = Snp [[np]]
Spp [[pp ep np]] = (Spp [[pp]];seq Sep [[ep]]);seq Snp [[np]]
This last definition is valid since we have defined sequential composition to work on sequences
of arbitrary arity, and not just on binary sequences. In Spp [[pp]];seq Sep [[ep]] for instance, the
join occurs on the last column of Spp [[pp]] and the first column of Sep [[ep]], so that the ending
node of the path predicate pp is equal to the starting node of the edge ep.
Node predicates We shall now define Snp :
Snp [[[i]]] = Nodeseq
Snp [[[i:`τ]]] = σseq f(Nodeseq) where f n = true ⇔ type n ⪯ resolve ρ `τ
Snp [[[i:!`τ]]] = σseq f(Nodeseq) where f n = true ⇔ type n ⋠ resolve ρ `τ
The first construct simply binds the logical variable i to any node. The second construct
binds i to any node whose type is a subtype of the type designated by `τ in our environment.
If this condition were to be translated to Datalog, it could just be a non-binding test.
Similarly, the third construct binds i to any node whose type is not a subtype of the type
designated by `τ .
Edge predicates Here we give the meaning for all different ways of constructing edge
predicates. The first two constructs require all our attention, since it is at that point that
fields and edges are viewed as relations. In both cases the strategy is the same: we build a
binary sequence where each node in Nodeseq is potentially mapped to the target nodes of the
edge.
In the case of fields notably, Sep [[`f ]] is the sequence of pairs that map each node to
its children nodes in field `f : a tuple (x , y) is in the sequence Sep [[`f ]] if and only if y ∈
children `f x . Furthermore, the order of tuples in Sep [[`f ]] is given both by the order of nodes
in Nodeseq , and by the arrangement of children nodes in the ASTs. The precise definition of
the built sequence is:
Sep [[`f ]] = buildSequence (children `f )
The function children retrieves the children nodes of a node — we have defined it in Section
4.4 — and the function buildSequence is defined as follows:
buildSequence :: (Node → seq Node) → seq Node2
buildSequence f = concat (map (λn. map (λv . [n; v ]) (f n)) Nodeseq )
In words, we map over all nodes in the sequence Nodeseq and obtain a sequence of sequences
that we flatten to a single sequence using concat. Note that buildSequence takes a function f that
returns sequences of arity one, but itself returns a binary sequence.
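In executable form, buildSequence is a one-liner; node_seq and child_f below are invented stand-ins for Nodeseq and (children `f ):

```python
# Invented stand-in for Nodeseq.
node_seq = [1, 2, 3, 4]

# Invented stand-in for children `f: node 1 has children 2 and 3,
# node 2 has child 4, the other nodes have no children in this field.
def child_f(n):
    return {1: [2, 3], 2: [4]}.get(n, [])

def build_sequence(f):
    """Pair every node of node_seq with each of its targets, in order."""
    return [(n, v) for n in node_seq for v in f(n)]
```

The resulting order is exactly the one described above: primarily the order of nodes in node_seq, and secondarily the arrangement of children within each node.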
Similarly, we wish to give the meaning of a call to an edge predicate. When resolving
the edge with label `e , there are however two possibilities. Either the edge has been defined
by the user, i.e. declared via a let edge definition, or it is built-in for navigating the tree.
In the latter case, we simply build a sequence from the appropriate tree navigation function
and hence relate each node of Nodeseq to its expected neighbours.
Sep [[‘parent ’]] = buildSequence parent
Sep [[‘child ’]] = buildSequence child
Sep [[‘firstChild ’]] = buildSequence firstChild
Sep [[‘lastChild ’]] = buildSequence lastChild
Sep [[‘successor ’]] = buildSequence successor
Sep [[‘predecessor ’]] = buildSequence predecessor
Sep [[‘listSuccessor ’]] = buildSequence listSuccessor
Sep [[‘listPredecessor ’]] = buildSequence listPredecessor
The definition is a little trickier, however, when `e resolves to a user-defined edge. The
complexity is two-fold. First, the body of an edge definition is actually a predicate that may
refer to other user-defined predicates and edges, and notably to itself recursively. While field
accesses and built-in edges are easily regarded as EDB predicates (which are evaluated by
constructing the binary sequences described above), user-defined edges are inherently IDB
predicates and can be mutually recursive. Second, we have seen in Chapter 2 that one may
actually give different overriding definitions of the same edge for different source node types.
That means predicate dispatch must happen at runtime to determine which edge body should
be evaluated to retrieve the right targets. For now, we leave aside the detailed explanation
on how we encode predicate dispatch and simply write:
Sep [[`e ]] = Sp [[dispatch (resolve ρ `e)]]
It is however important to stress again the potential presence of recursion in the relational
equation we give here. The predicate body given by dispatch (resolve ρ `e) may indirectly
refer back to the edge predicate Sep [[`e ]] for instance. In that case, relational equations
must be solved using the least fixpoint interpretation of Ordered Datalog programs we have
explained earlier.
We now move on to the meaning of the remaining constructs for edge predicates, where
the presence of recursion is explicit:
Sep [[ep ; eq]] = Sep [[ep]];seq Sep [[eq]]
Sep [[ep +]] = µX · Sep [[ep]] ∪seq X ;seq Sep [[ep]]
Sep [[ep *]] = ρseq i=j(Node2seq) ∪seq Sep [[ep +]]
Sep [[(cep)]] = Scep [[cep]]
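The least-fixpoint equation for ep + can be iterated directly: starting from the empty sequence, one applies X ↦ ep ∪seq (X ;seq ep) until the sequence stops changing. A sketch over an invented edge relation:

```python
def nub(pairs):
    out = []
    for t in pairs:
        if t not in out:
            out.append(t)
    return out

def union_seq(xs, ys):
    """∪seq: ordered union, keeping first occurrences."""
    return nub(xs + ys)

def comp_seq(xs, ys):
    """;seq restricted to binary sequences: relational composition."""
    return nub([(a, c) for (a, b) in xs for (b2, c) in ys if b == b2])

def plus_closure(ep):
    """µX · ep ∪seq (X ;seq ep), iterated to the fixpoint."""
    x = []
    while True:
        x2 = union_seq(ep, comp_seq(x, ep))
        if x2 == x:
            return x
        x = x2

# Invented sample edge relation: a chain 1 → 2 → 3 → 4.
edge = [(1, 2), (2, 3), (3, 4)]
```

Because ∪seq keeps first occurrences only, tuples discovered in earlier rounds keep their positions, which is what makes the order of the closure deterministic.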
More complex edge predicates For the sake of expressiveness of our language, we allow
more complex edge predicates which notably provide scopes for local logical variables that
may be bound to multiple different values across successive repetitions of an edge. One
can also name one end of an edge (which then looks like the start or the end of a path) to
further constrain that end across repetitions. Here is how these constructs translate to relational
equations over sequences:
Scep [[ep]] = Sep [[ep]]
Scep [[ep pp]] = Sep [[ep]];seq Spp [[pp]]
Scep [[pp ep]] = Spp [[pp]];seq Sep [[ep]]
Scep [[local i : cep]] = πseq i(Scep [[cep]])
Scep [[cep & p]] = Scep [[cep]] ./seq Sp [[p]]
Binding equality To be complete, we should also mention the presence of two special built-
in predicates. In JunGL ‘==’ can only be used as a non-binding filter and we have therefore
introduced the binary predicate equals to provide binding equality. A naive translation of a
call to equals is given by:
Sp [[equals(i , j)]] = ρseq i=j(D2seq)
In addition, it is often convenient to bind a variable with the values of a pre-computed
sequence. The predicate isIn, whose sequence argument s must be bound, translates to:
Sp [[isIn(i , s)]] = s
4.5.3 Ordered Datalog rules
We have exposed the translation of logical constructs to relational equations over duplicate-
free sequences, which can be interpreted as Ordered Datalog programs. For readability
purposes, we now propose to write these relational equations as Ordered Datalog rules in the
usual syntax of Datalog. We shall consider several EDB predicates for accessing the EDB
relations we have mentioned earlier. Notably, we call node the predicate whose interpretation
is Nodeseq , parent the predicate whose interpretation is (buildSequence parent), and so on for
the other navigation edge predicates. As for field accesses, we shall use field_name to denote
the predicate whose interpretation is (buildSequence (children ‘name’)). Furthermore, we
introduce fresh predicate names for recursively-defined IDB relations. Hence, given our
translation to relational equations, the tiny JunGL query
{ ?c | first ( [1] child+ [?c] ) }
can be written as an Ordered Datalog query where we now adopt lowercase for variable
names, like in the original JunGL program:
child_plus(x, y) ← child(x, y); ∃z · child_plus(x, z), child(z, y).
query(c) ← first(child_plus(1, c)).
Note how the ‘+’ appended to child is translated to the recursive rule child_plus.
Because we work with sequences, order at intermediate steps of the query evaluation is
preserved. If we evaluate this query on our sample child relation of Chapter 3, the result is
always 2.
4.5.4 Encoding dynamic edge dispatch
We have seen in Chapter 2 that edge definitions can be overridden. One can indeed define for
some source type τ an edge `e that is already defined for another source type τ ′. If τ ≺ τ ′,
we say the edge definition `e is overridden for τ . Here we illustrate how predicate dispatch
[EKC98, Mil04] is used for dynamic edge dispatch.
We shall consider a precise example to support our explanation, namely three AST data
types A, B and C such that B ≺ A and C ≺ A, as well as an edge e defined for nodes of all
these types.
type A =
  | B
  | C

let edge e x:A → ?y = p(x, ?y)
let edge e x:B → ?y = q(x, ?y)
let edge e x:C → ?y = r(x, ?y)
The idea is to introduce some special unary predicates node_A, node_B and node_C to
enforce a node variable to be of a specific type. We can then express the fact that p should
be called only if x is of type A but neither of type B nor of type C, whereas q is called only
if x is of type B, and r is called only if x is of type C. To wit, the translation of the predicate
dispatch to an Ordered Datalog disjunct is as follows:
edge_e(x, y) ← node(x), (
      node_A(x), not node_B(x), not node_C(x), p(x, y)
    ; node_B(x), q(x, y)
    ; node_C(x), r(x, y)
  ).
Note the presence of the predicate node(x) at the beginning of the body. This is to force
the tuples of the edge to be returned in an order that follows the order of nodes in Nodeseq.
Otherwise, tuples would be returned in an order following the type of their first element:
first the As that are neither Bs nor Cs, then the Bs and finally the Cs.
The predicate node_A is simply defined with the following rule, where the second conjunct
is a non-binding test on the type of the variable x:
node_A(x) ← node(x), type x ⪯ A.
The predicates node_B and node_C are defined in the same way. It is easy to see that
the interpretation for node_B is a subsequence of the interpretation for node_A, and similarly
for node_C.
Also, because each node has a single precise type at runtime, we know that the interpre-
tations for node_B and node_C are disjoint. Therefore, we are guaranteed that only one of
the calls p(x, y), q(x, y) and r(x, y) is actually relevant for a specific x.
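The dispatch encoding can be mimicked as follows. The runtime types, the hierarchy, and the three bodies p, q and r are all invented sample data; the leading iteration over node_seq plays the role of the node(x) subgoal that fixes the order:

```python
# Invented sample nodes and their runtime types.
node_seq = [1, 2, 3]
TYPE = {1: "A", 2: "B", 3: "C"}

def subtype(t, u):
    """In the example hierarchy, B and C are the subtypes of A."""
    return t == u or u == "A"

def p(x): return [x + 10]   # invented body for plain As
def q(x): return [x + 20]   # invented body for Bs
def r(x): return [x + 30]   # invented body for Cs

def body(x):
    """The guards make exactly one body relevant per node."""
    t = TYPE[x]
    if subtype(t, "A") and t != "B" and t != "C":
        return p(x)
    if t == "B":
        return q(x)
    if t == "C":
        return r(x)
    return []

edge_e = [(x, y) for x in node_seq for y in body(x)]
```

Since the three type tests are mutually exclusive, the union of the three branches never duplicates a tuple, and the result follows the order of node_seq.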
4.5.5 A full translation example
We conclude the section by presenting a more complete example of how the logical parts of
a JunGL program translate to an Ordered Datalog program. We shall draw our example
from Chapter 2. It consists of several ingredients for querying the control-flow graph of a
While program, namely a predicate for checking post-dominance plus some relevant edges.
The hierarchy of the different AST data types is the one given in Figure 2.1.
let edge defaultCFSucc x:Statement → ?y =
  first ( [x] listSuccessor [?y]
        | [x] parent [?y:WhileLoop]
        | [x] parent ; defaultCFSucc [?y]
        | [x] parent ; exit [?y]
        )

let edge cfsucc x:Statement → ?y = [x] defaultCFSucc [?y]

let edge cfsucc x:Block → ?y =
  first ( [x] firstChild [?y] | [x] defaultCFSucc [?y] )

let edge cfsucc x:If → ?y =
  [x] thenBranch [?y]
  | first ( [x] elseBranch [?y] | [x] defaultCFSucc [?y] )

let edge cfsucc x:WhileLoop → ?y =
  [x] body [?y] | [x] defaultCFSucc [?y]

let predicate postDominates (?x, ?y) =
  [?y:Statement] cfsucc+ [?x:Statement] &
  !( [?y] (local ?z : cfsucc [?z] & ?z != ?x)+ [:Exit] )
The full translation to follow is, in our opinion, much less readable than the original
program, but it is a good intermediate representation for building an evaluator. Note that,
for the sake of readability, we have even omitted some useless calls to the predicate node.
edge_defaultCFSucc(x, y) ← node(x), node_Statement(x), firstx(
      listSuccessor(x, y)
    ; parent(x, y), node_WhileLoop(y)
    ; (∃z · parent(x, z), edge_defaultCFSucc(z, y))
    ; (∃z · parent(x, z), field_exit(z, y))
  ).

edge_cfsucc(x, y) ← node(x), (
      node_Statement(x), not node_Block(x),
      not node_If(x), not node_WhileLoop(x),
      edge_defaultCFSucc(x, y)
    ; node_Block(x),
      firstx(firstChild(x, y); edge_defaultCFSucc(x, y))
    ; node_If(x), (
        field_thenBranch(x, y)
      ; firstx(field_elseBranch(x, y); edge_defaultCFSucc(x, y))
      )
    ; node_WhileLoop(x),
      (field_body(x, y); edge_defaultCFSucc(x, y))
  ).
cfsucc_plus(x, y) ← edge_cfsucc(x, y); ∃z · cfsucc_plus(x, z), edge_cfsucc(z, y).

local_cfsucc(i, z, x) ← edge_cfsucc(i, z), z ≠ x.

local_cfsucc_plus(i, j, x) ← local_cfsucc(i, j, x);
    ∃k · local_cfsucc_plus(i, k, x), local_cfsucc(k, j, x).

postDominates(x, y) ← node_Statement(y), cfsucc_plus(y, x),
    node_Statement(x),
    not (∃z · local_cfsucc_plus(y, z, x), node_Exit(z)).
One may wonder at that point where the lazy computation of edges comes in. Indeed, we
have stressed in Chapter 2 that edges are evaluated lazily, and not exhaustively computed
for all nodes in our program tree. Nonetheless, if we were to compute this Ordered Datalog
program with the usual bottom-up approach, we would have to compute all edges for all
nodes. We shall see in the coming chapter that, for this reason and less obvious ones, we
have in fact based the resolution of queries on the Query-Subquery approach.
A trained reader of Datalog programs may have also noticed that the rule local_cfsucc is
not range-restricted, since the variable x is not positively bound in its body. To overcome the
problem in a bottom-up framework, it would be sufficient to append an additional conjunct
node(x) to bind x to all possible nodes. In a top-down framework, however, the program is
just fine as it is, because the third argument of local_cfsucc is bound anyway at all call sites.
It can also be noticed that the subgoal node(x) in edge_defaultCFSucc(x, y) is useless given
the presence of node_Statement(x) afterwards, and it could be optimised away. One may
want to tackle this kind of optimisation in future work.
4.6 Summary and references
In this chapter, we have introduced a novel variant of Datalog, called Ordered Datalog,
whose least fixpoint semantics is based on duplicate-free sequences rather than sets. In order
to state the conditions under which an Ordered Datalog program is stratified, we have first
given a Haskell model of relational operators on such duplicate-free sequences. Then, we have
studied the monotonicity of our new relational operators with respect to a particular partial
order, namely prefix order. For that purpose, we have notably derived useful distributive
laws of our operators from our Haskell model, using an approach similar to the work by Seres
and Spivey on the algebra of logic programming [SSH99, Spi00, Ser01], but in a much less
exhaustive way.
Indeed our modelling of the first and orelse operators with sequences builds on a long
tradition of algebraic approaches to search. For instance, function composition based on a
monad with an extra plus operation and a zero element can be instantiated with either the
Maybe or the List monad, providing different models of nondeterminism: the plus operation
is the logical ‘;’ and it corresponds to ‘if-then-else’ when used with the Maybe monad; our
operator first can be seen as head in the case of using the List monad. To our knowledge, the
first to explain the semantics of functional strategic programming in these terms was Spivey
in [Spi90]. The same ideas were then extended further in the full algebraic account of logic
programming we have just mentioned.
One difference between that pioneering work and our own, however, is the way recursion
is treated. When using a shallow embedding of logic programming via these monads in a
language like Haskell, one inherits the semantics of recursion from the host language. As
we have argued in this chapter, the desired semantics of Datalog is instead one based on
a dedicated partial order on the given monad. The ρ-calculus introduced by Cirstea and
Kirchner [CK01] does not suffer that drawback, as it has an answer-set semantics supporting
various kinds of choice as well as an analogue of first. By contrast, the Stratego language
originally developed by Visser [BKVV06] could be thought of as mostly based on the Maybe
monad, supporting only the simple success/failure-based model.
Next, we have explained how to translate all logical features of JunGL to this novel ordered
variant of Datalog. The translation includes edge definitions, predicates and path queries for
querying the graph representation of a program. One notable feature of the translation is the
use of predicate dispatch to deal with potentially overridden definitions of edges. Predicate
dispatch has been proposed before to naturally unify and generalise several common forms
of dynamic dispatch, including traditional object-oriented dispatch [EKC98, Mil04].
The use of Datalog in software engineering tools has been explored before, both for
expressing precise program analyses [Rep93, DRW96, WACL05] and in the general context
of code queries [CMR92, HVMV05]. Liu et al. also proposed in [LS06] to translate path
queries into Datalog. The crucial difference here, however, is that we have introduced Ordered
Datalog and described a translation to this variant in order to maintain results in a meaningful
order.
One may therefore wonder why we did not simply embed XPath queries [Wad99b] into
JunGL, or even have recourse to XQuery [W3C07] to encode our refactoring transformations. In
XPath-based languages, results have indeed a well-defined order that matters too for the
reconstruction of XML documents. Furthermore, XQuery has been considered before as a
meta-programming language and has proved to be fairly scalable and effective. Magellan,
an open static analysis framework to enable cross-artifact information retrieval, indeed offers
the possibility to write code queries in XQuery [EMOS04, EGM+06].
The semantics of XPath-based languages, however, only refer to the initial document
order [Wad99b, Wad99a]. This allows many optimisations as it is always sufficient to work
out an adequate indexing scheme to tag the position of nodes in the original tree document.
Yet, in our context, the fact that we sometimes wish to retrieve results in an order that is
not the document order rules out the adoption of XPath. In addition, as explained in [LS06],
XPath allows segments of queries to be skipped, but does not allow the expression of repeated
matching segments.
The case is different for XQuery. There, although the result of a path expression is
still returned in document order, the result of a For-Let-Where-Return expression can be
determined both by an optional Order-by clause and by the expressions in its For clauses.
Hence the result of an XQuery query may reflect not only the implicit XML document order
but also the explicit order imposed in the query. In fact, edges in JunGL are quite comparable
to navigator functions in XQuery. Those extend the idea of axes, in the terminology of XPath,
to relate arbitrary nodes in the graph — in XPath, axes are restricted to navigation on the
tree structure only. Thus, it is possible to emulate edges by defining functions in XQuery.
Apart from the syntax that would be particularly verbose in that case, the main issue is
that, unlike XPath expressions, user-defined functions in XQuery admit arbitrary types of
recursion. To handle possibly cyclic queries on graphs, one needs to introduce adequate
guards to prevent the query evaluator from entering an infinite loop.
Instead, we have based the semantics of our logical features on an ordered variant of
Datalog, which like normal Datalog has a clear least fixpoint semantics and enables the
natural expression of complex cyclic queries. So far, we have presented this variant in its
stratified form, which limits the kind of recursion that we can handle. We shall now explain,
however, how to accept more general queries.
Chapter 5
Evaluating more general ordered queries
In the previous chapter, we have explained how to translate logical features in JunGL to
Ordered Datalog, a variant of Datalog that operates over duplicate-free sequences rather
than sets, and studied the precise conditions under which Ordered Datalog programs are safe
— i.e. under which the existence of a least fixpoint is guaranteed for stratified programs.
However, stratified Ordered Datalog turns out not to be expressive enough for our application
of scripting refactoring transformations. In this chapter, we highlight the need for more
general queries, and introduce a broader class of stratified programs that is sufficiently
expressive for our needs but smaller than the class of modularly stratified programs presented
in Chapter 3. Furthermore, we shall describe the evaluation of this broader class in a demand-
driven manner, in a top-down stream-based framework. Finally, in the last part of the chapter,
we shall discuss how to express some of our Ordered Datalog queries in normal Datalog.
5.1 On accepting more queries
Consider the edge definition in JunGL to encode an ancestor relationship between nodes:
let edge ancestor x → ?y =
  [x] parent ; ancestor [?y] | [x] parent [?y]
Following the translation of path queries to relational equations given in Chapter 4, this edge
is equivalent to the Ordered Datalog predicate:
edge_ancestor(x, y) ← node(x), parent(x, z), edge_ancestor(z, y), node(y)
    ; node(x), parent(x, y), node(y).
CHAPTER 5. EVALUATING MORE GENERAL ORDERED QUERIES 94
Unfortunately, although the edge_ancestor predicate is safely stratified in normal Datalog, it
is not in Ordered Datalog, for two reasons: a recursive call to edge_ancestor appears both on
the right-hand side of a cross product and on the left-hand side of a union.
Out of context, it is difficult to understand why the order expressed in edge_ancestor is
important, and why we do not simply rewrite the predicate to a safely stratified rule by
swapping the two disjuncts and moving edge_ancestor(z, y) up in front of all conjuncts:
edge_ancestor(x, y) ← node(x), parent(x, y), node(y)
    ; edge_ancestor(z, y), node(x), parent(x, z), node(y).
We shall therefore consider a more concrete example. In the process of encoding static-
semantic information for different languages, we have found a recurrent scenario where such
a recursion on the left of the union operator is needed. When we try to find the first match
of a series of alternatives, it may actually be the case that one of the disjuncts (and not the
last one) involves a recursion. To illustrate, we recall here an edge that we have defined back
in Chapter 2:
let edge defaultCFSucc x:Statement → ?y =
  first ( [x] listSuccessor [?y]
        | [x] parent [?y:WhileLoop]
        | [x] parent ; defaultCFSucc [?y]
        | [x] parent ; exit [?y]
        )
We have translated it to Ordered Datalog in Chapter 4:
edge_defaultCFSucc(x, y) ← node(x), node_Statement(x), firstx(
      listSuccessor(x, y)
    ; parent(x, y), node_WhileLoop(y)
    ; (∃z · parent(x, z), edge_defaultCFSucc(z, y))
    ; (∃z · parent(x, z), field_exit(z, y))
  ).
The last but one alternative in the definition of the edge defaultCFSucc relies recursively
on defaultCFSucc itself. Such a scenario is very common in the definition of contextual
semantic information for mainstream languages. We will see in Chapter 6 that looking up
entity references in Java for instance is a typical case where we first try to resolve something,
and if that fails we try something else.
Now a relevant question is whether such a query, where we wish to choose the first
matching alternative, could be expressed in stratified normal Datalog. Unfortunately, we
would need to guard the last disjunct with a check that the third disjunct does not have any
proper solution, in order to prevent getting solutions for both the third and last alternatives
in our final result. Yet, if we negate the third disjunct, we end up with negation inside
recursion which is not allowed in stratified Datalog programs either.
Stratified Datalog is not expressive enough to encode the contextual semantic information
that we need for expressing refactoring transformations. And neither is stratified Ordered
Datalog. Therefore we need to look at accepting a more general class of logic programs.
In our introductory overview of Datalog in Chapter 3, we have mentioned the class of
modularly stratified programs, and the example of the win rule that we recall here:
win(X) ← move(X, Y), not win(Y).
Remember that if move is acyclic, by instantiating the rule in every possible way such that
move subgoals are true, we obtain a stratified program. The situation is similar for our
definitions of edge_ancestor and edge_defaultCFSucc: the rules are modularly stratified if the
relation parent is acyclic, which is indeed the case.
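Over an acyclic move relation, the win rule can be evaluated goal-directedly by plain recursion; the move graph below is invented sample data, and tabling is omitted for brevity:

```python
# Invented acyclic move relation: positions 3 and 4 have no moves left.
MOVES = {1: [2, 3], 2: [4]}

def win(x):
    """win(X) ← move(X, Y), not win(Y):
    x is winning iff some move from x reaches a losing position."""
    return any(not win(y) for y in MOVES.get(x, []))
```

The recursion terminates precisely because the move relation is acyclic; the same evaluation would loop on a cyclic graph, which is why more careful resolution strategies are needed in general.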
Unfortunately, modularly stratified programs cannot be evaluated within the usual set-
based bottom-up framework of safe Datalog. To overcome the problem, Ross proposed in
[Ros94] a variant of Datalog with extra operators to track dependencies between atoms.
Modularly stratified programs can be transformed to that extended Datalog and evaluated
through a succession of bottom-up fixpoints. Another solution is to use a goal-oriented
top-down resolution method with tabling and delaying such as SLG [CW96].
Here we wish to suggest an evaluation strategy reminiscent of the Query-Subquery
approach. Unlike Ross's solution, it uses the standard operators of Datalog. Also, it contrasts
with SLG by being set-based, thus leveraging efficient implementations of relational
operations. Therefore, we now introduce the new notions of partial interpretation and partial
stratification, which apply both to normal Datalog programs and Ordered Datalog programs.
5.2 Beyond stratified Ordered Datalog
5.2.1 Partial instantiation
We adopt the same terminology of a complete program component as in [Ros94], except that
we assume that there are never two rules with the same head in a program, since we can
express union with ";" instead.
Definition 5.1 Let F be a program component (i.e. a subset of the rules) of a logic program
P. We say F is a complete component if for every predicate p appearing in the head of a
rule in F , if p is recursive through a predicate q, then the rule in P with head q is in F .
If the predicate p appears in the head of a rule in F then we say p belongs to F. If the
predicate q appears in the body of a rule in F , but does not belong to F, then we say q is used
by F.
Furthermore, we write HeadVars(F ) for the set of head variables found in a program
component F . To avoid name conflicts, we annotate each head variable with the name of the
rule it occurs in. For instance, in the set F of rules below, HeadVars(F ) = {xp , yp , xq , yq}:
p(x , y) ← q(x , y), not (∃y.q(x , y)).
q(x , y) ← r(x , y).
We can now define partial instantiation (which is akin to the idea of partial evaluation of
logic programs, for instance mentioned in [War92]).
Definition 5.2 (Partial instantiation) Let F be a program component, V a subset of
HeadVars(F) and D a domain of values, i.e. a set of constants. The partial instantiation
I^V_D(F) of F with respect to V and D is the set of rules obtained by substituting constants
from D for all variables in V in every possible way.
We rewrite each partially instantiated rule (i.e. rules that have a head variable in V ) by
moving the head variables that have been instantiated to the name of the rule. Furthermore,
for each set of instantiations of the same rule R, we introduce a new rule, called the dispatch
rule of R, whose head is the same as in R and whose body is the union of the instantiated rules
of R in which each disjunct has been amended with a binding equality for the instantiated
head variables.
We illustrate that definition on our Ordered Datalog example of edge ancestor, which
we recall is not statically stratified. With F = {edge ancestor}, D = {1, 2, 3} and V =
{x_edge ancestor}, I^V_D(F) reads as follows:
edge ancestor_1(y) ← node(1), parent(1, z), edge ancestor(z, y), node(y)
; node(1), parent(1, y), node(y).
edge ancestor_2(y) ← node(2), parent(2, z), edge ancestor(z, y), node(y)
; node(2), parent(2, y), node(y).
edge ancestor_3(y) ← node(3), parent(3, z), edge ancestor(z, y), node(y)
; node(3), parent(3, y), node(y).
edge ancestor(x, y) ← x = 1, edge ancestor_1(y)
; x = 2, edge ancestor_2(y)
; x = 3, edge ancestor_3(y).
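The mechanics of the construction are easy to mimic: each constant of the domain produces a specialised copy of the rule, and the dispatch rule reassembles them behind binding equalities. The Python sketch below generates the rules for the edge ancestor example as plain text (the formatting is illustrative, not JunGL's internal representation):

```python
def partial_instantiation(domain):
    """Partially instantiate edge_ancestor(x, y) on x over `domain`,
    producing the specialised rules plus the dispatch rule as text."""
    rules = []
    for c in domain:
        rules.append(
            f"edge_ancestor_{c}(y) <- node({c}), parent({c}, z), "
            f"edge_ancestor(z, y), node(y) "
            f"; node({c}), parent({c}, y), node(y)."
        )
    # Dispatch rule: each disjunct guards a specialised copy with a
    # binding equality for the instantiated head variable.
    dispatch = " ; ".join(f"x = {c}, edge_ancestor_{c}(y)" for c in domain)
    rules.append(f"edge_ancestor(x, y) <- {dispatch}.")
    return rules
```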
In the case of Ordered Datalog, the order of the disjuncts in edge ancestor(x , y) obviously
matters. For our particular application, we shall apply partial instantiation to the first
argument of edge predicates only. We have explained in Section 4.5.4 that the resulting
order of an edge predicate follows the order of nodes in Nodeseq . Furthermore, the law (4.4)
tells us that for any sequence s :
Node_seq ×_seq s = ⋃^seq_{t ∈ Node_seq} (⟨t⟩ ×_seq s)
Therefore it suffices to make the introduced disjunct follow the same order as in Nodeseq to
preserve the general order of the query.
5.2.2 Partial stratification
We now turn to define the partial reduction of a component.
Definition 5.3 (Partial reduction) Let F be a program component, S be the set of predi-
cates used by F, V a subset of HeadVars(F ). Suppose furthermore that S is fully defined by
a model M and that D is the domain of values appearing in M and as constants in F .
Form the partial instantiation I^V_D(F) of F with respect to V and D. Replace any call to
a dispatch rule R in I^V_D(F) by a call to a specialised version of R where all disjuncts that are
known to be irrelevant at the call site with respect to M have been pruned away.
We call the obtained rules R^V_M(F) the partial reduction of F modulo M with respect to
V .
This definition of reduction differs from the definition of reduction in [Ros94] because
we obtain rules that are not fully instantiated (i.e. R^V_M(F) contains some free variables).
Again, we illustrate that definition with the partial reduction of edge ancestor with respect
to D = {1, 2, 3} and V = {x_edge ancestor}. For the example we define M to be:
{node(1),node(2),node(3), parent(1, 2), parent(1, 3)}
We therefore obtain the following set of rules for R^V_M(F):
edge ancestor_1(y) ← node(1), parent(1, z), edge ancestor_{2,3}(z, y), node(y)
; node(1), parent(1, y), node(y).
edge ancestor_2(y) ← node(2), parent(2, z), edge ancestor_∅(z, y), node(y)
; node(2), parent(2, y), node(y).
edge ancestor_3(y) ← node(3), parent(3, z), edge ancestor_∅(z, y), node(y)
; node(3), parent(3, y), node(y).
edge ancestor_{2,3}(x, y) ← x = 2, edge ancestor_2(y)
; x = 3, edge ancestor_3(y).
edge ancestor_∅(x, y) ← false.
Note that the set of Ordered Datalog rules R^V_M(F) is now statically stratified (thanks to the
parent relation being well-founded). This leads to the definition of partial stratification.
Definition 5.4 (Partial stratification) Let ≺ be the dependency relation between com-
plete components. We say the program P is partially stratified with respect to a set of head
variables V if, for every component F of P,
• there is a total well-defined model M for the union of all components F ′ ≺ F, and
• the partial reduction of F modulo M with respect to V is statically stratified.
The class of partially stratified programs is smaller than the class of modularly stratified
programs (i.e. any partially stratified program is modularly stratified), but it highlights an
interesting evaluation mechanism that follows the top-down strategy of the Query-Subquery
approach. We can generate the partial reduction of each component one partial subgoal at a
time, but evaluate each reduction in a set-based framework.
That is exactly the strategy we use in JunGL, and we therefore define the set of JunGL
programs we admit as follows. Let J be a JunGL program and P be the Ordered Datalog
program obtained by translating the predicates, edges and path queries of J as explained
in Chapter 4. Take V the set of all the first head variables of the edge predicates in P .
If P is partially stratified with respect to V , then we accept J as a valid JunGL program.
Less formally, the idea in the case of JunGL is to evaluate edge predicates one source node
at a time in a top-down manner, but to compute the targets of each specific source node
using a set-based evaluation, or rather a sequence-based evaluation in the context of Ordered
Datalog. If no specialised edge predicate (i.e. instantiated for a specific node) depends on
itself through a nonmonotonic construct, then P can be safely evaluated.
Of course, the main difficulty remains in generating the correct reductions of the edge
predicates on the fly. The generation is correct and fairly straightforward when, at each call
site of a dispatch edge rule, the first parameter of the call is already bound. As we are about
to see, it is however more complex if the source parameter is not yet bound.
We shall now explain the importance of laziness in the construction of edges and how
we achieve it using two complementary mechanisms, namely the top-down evaluation of
predicates and the use of streams. We will then come back to the generation of partial
reductions for edge predicates, and see how it fits in the top-down evaluation.
5.3 Demand-driven evaluation
Demand-driven evaluation is crucial for a language that aims at expressing refactorings,
because transformations are often run in an interactive setting. In fact, much of the needed
contextual semantic information and many of the transformations are fairly local. Demand-driven
evaluation takes advantage of that locality and makes it possible to run transformations in
acceptable time, whereas a full analysis of the program would simply be inconceivable. It is
clear enough that we should not adorn the object program tree with all possible edges, given
that most of the time only a few of these will actually be required during a refactoring.
We have explained in Chapter 4 how we translate logical features in JunGL to Ordered
Datalog programs. Edges notably translate to edge predicates using predicate dispatch.
Therefore, in the end, edges are simply seen as pairs of nodes inhabiting the interpretation of
their corresponding edge predicates. A query that is run, for instance, to find some elements
in the program or to check preconditions of a transformation is also an Ordered Datalog
program that refers to a certain number of edge predicates. If we were to evaluate the whole
Ordered Datalog program bottom-up, we would have to compute all the edges of a particular
kind that are referred to from the query. On the other hand, a top-down framework minimises
the computation of irrelevant facts, i.e. of useless edges. In a top-down framework, edges
are evaluated only when their value is needed.
5.3.1 Top-down sequence-based evaluation
We have presented in Chapter 3 two top-down approaches. One is a memoised version of
SLD resolution and works a tuple at a time. The other, the Query-Subquery approach, is
set-based and benefits from efficient algorithms for relational algebra operations, notably
hash joins.
The idea of the latter approach, we recall, is similar to the more popular transformation
of magic sets. The aim is to minimise the computation of irrelevant facts by pushing the
calling context of each predicate inside calls. A sideways information-passing strategy is used
to drive the propagation of the context. It merely consists in an appropriate ordering of the
subgoals in rule bodies. In our setting of Ordered Datalog, we cannot arbitrarily reorder
subgoals. The left-to-right order is the meaningful order and that is the one we use.
We shall apply the Query-Subquery approach adapted to Ordered Datalog (i.e. to work
with sequences) on a small example and show that it indeed reduces the computation of
irrelevant edges. The example in question is taken from Chapter 2 and refers to the abstract
grammar of Figure 2.1. The function assignedVariable returns the declaration of the variable
assigned in a statement a (the function pick actually returns the first element of the stream
or null if the stream is empty):
let edge treePred n → ?pred =
    first([n] listPredecessor [?pred] | [n] parent [?pred])
let edge lookup r : Var → ?dec =
    first([r] treePred+ [?dec : VarDecl] & r.name == ?dec.name)
let edge def x : Assignment → ?y = [x] var; lookup [?y]
let assignedVariable a =
    pick { ?d | [a] def [?d] }
The translation rules of Chapter 4 transform the logical part of this JunGL snippet to the
following Ordered Datalog program that we adorn with binding information to support our
coming explanations.
edge treePredbf (n, pred) ← node(n), firstn(
listPredecessor(n, pred)
; parent(n, pred)
).
treePred plusbf (x , y) ← edge treePredbf (x , y)
; ∃z · treePred plusbf (x , z ), edge treePredbf (z , y).
edge lookupbf (r , dec) ← node(r), node Var(r), firstr (
treePred plusbf (r , dec), node VarDecl(dec),
r .name == dec.name
).
edge def bf (x , y) ← node(x ), node Assignment(x ),
∃z · field var(x , z ), edge lookupbf (z , y).
qbf (a, d) ← edge def bf (a, d).
We shall now walk step by step through the top-down evaluation on the following While
program @p. We annotate nodes to be able to refer to them in our explanations.
[
[ int i; ]@a
[ [ i ]@u = 0; ]@b
[ while ([ [ i ]@v ≤ 10 ]@t )
[ {
[ [ int i; ]@e
[ [ i ]@x = [ i ]@y + 1; ]@f
} ]@d
]@c
[ print([ i ]@z ); ]@g
]@p
As a side remark, the program is incorrect, because i in @y is used before being assigned.
We consider the precise query qbf (@f , d). Hence, the initial calling context for the rule
edge def bf (x , y) is C(x ) = 〈@f 〉. Inside that rule, the same context is propagated down to
field var(x , z ), like in the Query-Subquery approach, by joining it first with node(x ) then
with node Assignment(x ). After field var(x , z ), the context becomes C(x , z ) = 〈(@f , @x )〉
(@x is indeed the field labeled var of node @f ) and we are now faced with a call to
edge lookupbf (z , y).
The process there is similar but with the context C(r) = 〈@x 〉 and we reach the call to
treePred plusbf (r , dec) with an unchanged context again. Now the call is slightly different
because treePred plusbf is recursively defined. We need to introduce a fixpoint computation.
We start with the first iteration. The left-hand disjunct calls edge treePredbf and
returns the sequence 〈(@x , @f )〉 because @x has no list predecessor but has @f as its parent.
In the second disjunct, however, the nested call back to treePred plusbf fails at this stage.
The first iteration of treePred plusbf thus returns 〈(@x , @f )〉 only. Next, the second iteration
evaluates the left-hand side disjunct in the same way, but this time also succeeds on the
right-hand side because treePred plusbf is not empty anymore. We end up with the new
match (@x , @e) where @e is the list predecessor of @f . We continue these iterations until
no new tuple is found. The final resulting sequence for the call treePred plusbf (r , dec) in
edge lookupbf is:
〈(@x , @f ), (@x , @e), (@x , @d), (@x , @c), (@x , @b), (@x , @a), (@x , @p)〉
Each result found for a particular calling context is cached with its context so that no
predicate is evaluated twice with the same context. The cache corresponds to the inst_R^γ
and ans_R^γ relations in the Query-Subquery approach (see Section 3.3.2 for more details).
In particular, we do not evaluate edge treePredbf (n, pred) with the same value for n more
than once during the fixpoint computation.
With the results from treePred plusbf , we continue our computation inside the body of
first in edge lookupbf . This filters the sequence to keep 〈(@x , @e), (@x , @a)〉. Applying first
reduces it to 〈(@x , @e)〉, in effect discarding the farthest match.
The result then propagates back to the top of the program and the final result of the
query is 〈@e〉, indeed the closest declaration of i.
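The walk just described can be replayed concretely. Below is a Python sketch (an illustration, not JunGL's implementation) that hand-encodes the listPredecessor, parent and VarDecl relations of the sample program @p and performs the same nearest-declaration lookup for the reference @x:

```python
# Hand-encoded relations of the sample program; node names are the
# @-annotations from the text.
list_predecessor = {"f": "e", "c": "b", "b": "a"}
parent = {"x": "f", "e": "d", "d": "c", "a": "p"}
var_decls = {"a": "i", "e": "i"}   # VarDecl nodes and their declared names

def tree_pred(n):
    # First matching alternative: list predecessor, or else parent.
    return list_predecessor.get(n) or parent.get(n)

def tree_pred_plus(n):
    """Yield the proper treePred ancestors of n, nearest first."""
    m = tree_pred(n)
    while m is not None:
        yield m
        m = tree_pred(m)

def lookup(ref, name):
    # First VarDecl with the right name along the treePred chain.
    return next((d for d in tree_pred_plus(ref)
                 if var_decls.get(d) == name), None)
```

The chain from @x visits @f, @e, @d, @c, @b, @a, @p in that order, and the lookup stops at the inner declaration @e, matching the evaluation above.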
We see through this detailed description of the evaluation that we have not computed
too many irrelevant edges. Notably, we have computed the lookup of @x only. It is true,
however, that most of the transitive closure of treePred from @x was computed and then discarded
with first. We shall actually see that, when using streams, this does not even happen. But
first, we wish to look at a different query and raise a problematic case in our adoption of the
Query-Subquery approach.
5.3.2 The issue with first
We now turn to the converse JunGL query that finds all the references of a specific declaration:
let references d =
    pick { ?r | [?r] lookup [d] }
This time, the adornment of the equivalent Ordered Datalog program looks like this:
edge treePredbf (n, pred) ← node(n), firstn(
listPredecessor(n, pred)
; parent(n, pred)
).
treePred plusbf (x , y) ← edge treePredbf (x , y)
; ∃z · treePred plusbf (x , z ), edge treePredbf (z , y).
edge treePredbb(n, pred) ← node(n), firstn(
listPredecessor(n, pred)
; parent(n, pred)
).
treePred plusbb(x , y) ← edge treePredbb(x , y)
; ∃z · treePred plusbf (x , z ), edge treePredbb(z , y).
edge lookupfb(r , dec) ← node(r), node Var(r), firstr (
treePred plusbb(r , dec), node VarDecl(dec),
r .name == dec.name
).
q fb(r , d) ← edge lookupfb(r , d).
We now consider the query q fb(r , @a) and focus our attention on the body of first in
edge lookupfb . Both r and dec being bound by our context, the conjunction evaluates to:
〈(@u, @a), (@v , @a), (@x , @a), (@y, @a), (@z , @a)〉
Applying first on the first column of each pair leaves the sequence unchanged. Therefore the
final result of the query is the sequence 〈@u, @v , @x , @y, @z 〉, which is clearly not what we
expect. Only @u, @v and @z resolve to the declaration of i in @a. The references @x and
@y resolve to @e.
The problem comes from an unauthorised step that we take during the Query-Subquery
propagation of the context. Indeed, we are not always allowed to propagate the context inside
first. A conjunct c can be moved inside a first if and only if each variable that is bound by
c is considered by the operator first for grouping. Formally,
c(~x), first_S(p(~x, ~y)) = first_S(c(~x), p(~x, ~y))   iff ~x ⊆ S
In particular, we are not allowed to push the binding for the variable dec in the first of
edge lookupfb :
c(r, dec), first_r(treePred plus(r, dec), node VarDecl(dec), · · · )
≠ first_r(c(r, dec), treePred plus(r, dec), node VarDecl(dec), · · · )
If we do not assume dec to be bound inside the first of edge lookupfb , the query evaluates
correctly.
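The unauthorised step is easy to reproduce with plain lists. In the Python sketch below, pairs holds (reference, candidate declaration) matches, ordered nearest declaration first for each reference; filtering on the declaration before taking the groupwise first wrongly keeps @x, just as in the faulty evaluation above (the data is a cut-down version of the example):

```python
# (reference, candidate declaration) matches, nearest candidate first.
pairs = [("u", "a"), ("x", "e"), ("x", "a")]

def first_by(key, seq):
    """Groupwise first: keep the first tuple for each key value."""
    seen, out = set(), []
    for t in seq:
        k = key(t)
        if k not in seen:
            seen.add(k)
            out.append(t)
    return out

# Correct: take the first candidate per reference, THEN filter on "a".
correct = [t for t in first_by(lambda t: t[0], pairs) if t[1] == "a"]
# Wrong: push the filter inside, i.e. filter first, then take firsts.
wrong = first_by(lambda t: t[0], [t for t in pairs if t[1] == "a"])
```

Only @u actually resolves to @a; the pushed-in filter spuriously reports @x as well.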
This of course has consequences on the demand-driven nature of the evaluation of our
queries. It basically means that, in this case, we have to look up the definitions of all references
in our program. However, it is often possible to considerably reduce the amount of useless
computations. The idea comes from the following observation:
first~x (p(~x , ~y)) = p(~x , ~y), first~x (p(~x , ~y))
If ~y is bound but not ~x , we can first evaluate p to bind ~x , and move that binding inside
the first during the evaluation. This is expressed by the same equation as above but with
binding information and an extra predicate c just to make the context explicit:
c(~y), first_~x(p^ff(~x, ~y)) = c(~y), p^fb(~x, ~y), first_~x(p^bf(~x, ~y))
Although we have not implemented that optimisation, it is clear that it reduces the
number of useless computations in many of our queries. For instance, in our example query
where we search for references to a specific declaration, only references that can reach that
declaration through a chain of treePred edges will be considered in the computation of lookup.
5.3.3 Streams
The use of streams allows us to specify a search problem in a nice compositional way: generate
a stream of successes, and use the operator first on streams to take the first answer — no
further elements will be computed. We employ a technique originally due to Mycroft and
Jones, who were the first to model the operational semantics of logic programs in terms of
streams [JM84]. The same technique was used by Spivey and Seres in their embedding of
Prolog in Haskell [SS99]: there, they used the lazy lists of Haskell to conveniently represent
streams.
In contrast, JunGL is implemented on the .NET platform. For implementing sequences,
we took our inspiration from Cω [BMS05], a language developed at Microsoft Research, where
streams are generated using the same iterator constructs that are available in C# 2.0. An
iterator function is a function that returns an ordered sequence of values by using a yield
statement to return each value in turn. When a value is yielded, the state of the iterator
function is preserved and the caller is allowed to execute. The next time the iterator is
invoked, it continues from the previous state and yields the next value. Iterators are a special
kind of coroutine, a well-known construct that generalises subroutines to allow multiple entry
points and suspension and resumption of execution at certain locations. It is commonly accepted
that coroutines are well-suited for implementing familiar program patterns such as iterators,
infinite lists and pipes.
To illustrate the use of iterators in C#, we give the interface details of the function Union
that takes two source sequences (modelled as IEnumerable<T>) and yields a new sequence
that is the lazy union of the two:
static IEnumerable<T> Union<T>(
    GetKey<T> getKey,
    IEnumerable<T> source1,
    IEnumerable<T> source2
)
The parameter getKey is a delegate to a function that takes a T and returns a key. One may
wonder why we need such a parameter. This is in fact because we do not simply append
the two sequences: we also filter out any duplicates, where two elements are considered
duplicates if they have the same key. We have similarly defined all the sequential relational
operators of Chapter 4.
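In a language with generators, such a lazy, key-deduplicated union takes only a few lines. The following Python sketch mirrors the behaviour described for the C# Union (it is an illustration, not the JunGL source):

```python
import itertools

def union(get_key, source1, source2):
    """Lazy, duplicate-free union of two streams, keyed by get_key.
    Elements are yielded on demand; an element is skipped when an
    element with the same key has already been seen."""
    seen = set()
    for x in itertools.chain(source1, source2):
        k = get_key(x)
        if k not in seen:
            seen.add(k)
            yield x
```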
Another more complex example is Join:
static IEnumerable<V> Join<T, U, V>(
    GetKey<T> getInnerKey,
    GetKey<U> getOuterKey,
    Function<T, U, V> append,
    IEnumerable<T> inner,
    IEnumerable<U> outer
)
There we need three delegates: one to get the key for elements of the inner sequence, another
to get the key for elements of the outer sequence, and a last one to append two source
elements into a result (like ++ in our Haskell definitions). For efficiency reasons, we have
not implemented a nested loop but a hash join. We have also found it useful to have a function
to memorise a stream. The function returns a generator that saves all the elements as they
are first discovered, so that any new iteration on the same sequence will directly return the
elements previously discovered.
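A sketch of such a memoising wrapper in Python (illustrative only): it returns a factory for replayable iterations, saving elements as they are first pulled from the underlying stream.

```python
def memoise(stream):
    """Return a factory of iterators over `stream`: elements are
    computed once, cached, and replayed on every later pass."""
    cache = []
    it = iter(stream)
    def replay():
        i = 0
        while True:
            if i >= len(cache):
                try:
                    cache.append(next(it))   # discover a new element
                except StopIteration:
                    return
            yield cache[i]
            i += 1
    return replay
```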
All these definitions would benefit from the new features of C# 3.0 and LINQ [MBB06]
with very few changes. Our implementation is indeed strikingly similar to some parts of the
LINQ API, which also has its roots in Cω. In LINQ, for instance, Join is declared as:
static IQueryable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IQueryable<TOuter> outer,
    IEnumerable<TInner> inner,
    Expression<Func<TOuter, TKey>> outerKeySelector,
    Expression<Func<TInner, TKey>> innerKeySelector,
    Expression<Func<TOuter, TInner, TResult>> resultSelector
)
The principal difference is that it does not accept functions like we do, but expression trees of
the functions. This is to allow the runtime interpretation of the trees, for instance to generate
SQL code and delegate the query to a database system.
We now turn back to the evaluation of our Ordered Datalog programs. We translate each
query to a pipeline of operations on streams. This pipeline may of course contain recursion
if the query that we represent contains recursively defined predicates. To illustrate, we have
drawn in Figure 5.1 the recursive pipeline of treePred plus .
[Figure: the recursive pipeline for treePred plus — the union operator ∪seq combines
treePred with the output of the join ;seq, whose left input is fed back from the output of
treePred plus and whose right input is treePred.]
Figure 5.1: Example of recursively defined pipeline
In the case of a recursive pipeline, results are yielded before the end of the whole
computation. Calling treePred plus with a specific context returns a sequence. Retrieving one
element of that sequence triggers the first fixpoint iteration. When all the elements of the
first iteration are discovered, asking for a next element triggers the second iteration, and so
on. The benefit of such a pipeline is clear when we use first on a sequence and all elements
we group on are already known (i.e. bound). When we have found all the first tuples for
each of the elements we group on, we do not need to explore the sequence any further. This
reduces the number of irrelevant computations.
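The behaviour of such a recursive pipeline can be sketched with a generator that computes a transitive closure one fixpoint iteration at a time; a consumer may abandon the iteration early, exactly as first does. The encoding below is a simplified illustration, not JunGL's pipeline machinery:

```python
def transitive_closure(edges):
    """Yield tuples of the transitive closure of `edges`, iteration by
    iteration. Asking for the next element after an iteration is
    exhausted triggers the next fixpoint iteration; a consumer may
    stop pulling at any point, leaving later iterations uncomputed."""
    known = set()
    frontier = set(edges)
    while frontier:
        for t in sorted(frontier):   # deterministic order per iteration
            yield t
        known |= frontier
        # Join everything known so far against the base edges.
        frontier = {(x, z) for (x, y) in known
                           for (y2, z) in edges if y == y2} - known
```

For example, retrieving a single element of `transitive_closure({(1, 2), (2, 3)})` runs only the first iteration; draining the generator yields the derived tuple (1, 3) as well.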
5.4 Generating partial reductions
We shall now turn to a different aspect of the implementation, which interestingly relies on
the top-down evaluation mechanism explained in the previous section. The class of partially
stratified programs we have introduced earlier is indeed defined through partial reductions of
strongly connected components, and we explain here how we generate them. We recall that
a partial reduction of a recursive rule R(~x , ~y) with respect to ~x is a partial instantiation of
R for all values of ~x where any recursive call to R in a specialised version Ri(~y) is further
restricted to a context that prevents a call back to Ri .
We shall take again the example of the descendant edge. We make it not statically
stratified (in the context of Ordered Datalog) on purpose, in order to support our explanations. We
give here its translation to Ordered Datalog together with a query and binding information:
edge descendantbf (x , y) ← node(x ), child(x , z ),
edge descendantbf (z , y), node(y)
; node(x ), child(x , y), node(y).
qbf (x , y) ← edge descendantbf (x , y).
Now suppose that we wish to evaluate qbf (@p, y) where @p still refers to our sample program
of the previous section. When the context C(x ) = 〈@p〉 reaches the call to edge descendantbf ,
the rule edge descendantbf (x , y) (because it is not statically stratified) is called with the spe-
cialising context x = @p. This is equivalent to generating on the fly the partial instantiation
of edge descendantbf (x , y) for x = @p:
edge descendant f@p(y) ← node(@p), child(@p, z ),
edge descendantbf (z , y), node(y)
; node(@p), child(@p, y), node(y).
There, the context after the call child(@p, z ) in the first disjunct contains only the children
of @p, that is C(z ) = 〈@a, @b, @c, @g〉. Consequently, we refine the partial instantiation
edge descendant f@p(y) and transform the call to edge descendantbf (z , y) to a union of four
calls in different specialising contexts:
( edge descendant f@a (y); edge descendant f@b(y)
; edge descendant f@c(y); edge descendant f@g(y) )
Fortunately, @p does not appear in that context, so edge descendant@p is safely
stratified. Because child is acyclic, @p will actually never appear in the calling context
of a recursive call. If it did, we would raise an error at runtime. The Query-Subquery
approach thus allows us to generate the partial reductions of edge predicates on the fly as
we propagate down our binding context.
However, it is not always so simple and following the approach just sketched might wrongly
reject some programs that are partially stratified. Take the following variant of our example
where we swap the two middle conjuncts in the first disjunct:
edge descendant f@p(y) ← node(@p), edge descendantff (z , y),
child(@p, z ), node(y)
; node(@p), child(@p, y), node(y).
The issue there is that, before the recursive call to edge descendantff (z , y), the context
for z contains all nodes, and notably @p. We would end up with an error, although any tuple
with z = @p would later be discarded because child(@p, @p) is false.
To overcome that problem, we take inspiration from SLG. In order to handle possible
loops through negation, SLG supports a delaying operation of subgoals to dynamically adjust
a rule, along with a simplification operation to resolve away delayed subgoals when their
truth value becomes known [CW96]. When generating partial reductions, we should allow a
nonmonotonic recursive call to the same specialised version of an edge predicate (for instance
edge descendant f@p(y)), but directly return a singleton sequence with a fake node and a special
marker saying that this ground atom is unsafe (i.e. unknown). We denote such a sequence
by 〈⊥^⊥〉, where the superscript ⊥ is the marker. We could then propagate that marker to any
fact that is inferred using an unsafe fact. At the end of the evaluation, if the result contains
an unsafe fact, then we raise an error.
More formally, we should change the Haskell definition of a sequence we gave in Chapter
4 to:
type Sequence = Stream (Tuple × Bool)
where the second element of each pair is true if and only if the tuple is unsafe. We also need
to change the functions tupleDrop and tupleKeep in the obvious way for preserving the state
of the input tuple, and make the concatenation of two tuples unsafe if one of them is unsafe:
(++) :: (Tuple × Bool) → (Tuple × Bool) → (Tuple × Bool)
(t1, b1) ++ (t2, b2) = (t1 ++ t2, b1 || b2)
The definitions of σseq also need to be modified to look at the value of tuples only:
σ^seq_{Xi=Xj} :: Sequence → Sequence
σ^seq_{Xi=Xj} s = filter f s
    where f (t, b) = (tupleKeep [Xi] t == tupleKeep [Xj] t)

σ^seq_{Xi=d} :: Sequence → Sequence
σ^seq_{Xi=d} s = filter f s
    where f (t, b) = (tupleKeep [Xi] t == [d])
With these new definitions, all but one of the relational operators on sequences now
correctly propagate the unsafe marker. For instance,
〈(1, 2), (1, 3)^⊥〉 ∪seq 〈(1, 2), (1, 3)〉 = 〈(1, 2), (1, 3)^⊥, (1, 3)〉
〈(1, 2), (1, 3)〉 ;seq 〈(3, 5), (2, ⊥)^⊥〉 = 〈(1, ⊥)^⊥, (1, 5)〉
π^seq_{X.1} 〈(1, ⊥)^⊥, (1, 5)〉 = 〈1^⊥, 1〉
σ^seq_{X.2=2} 〈(1, 2), (1, 3)^⊥, (1, 3)〉 = 〈(1, 2)〉
Note that we keep copies of the same tuple if one is unsafe and the other is not. Two tuples
are indeed considered equal if they have the same marker state, except in a filter operation.
Negation as failure, however, needs a more important change:
not^seq :: Sequence → Sequence
not^seq sn = [ (t, marker t) | t ← D^n_seq, (t, false) ∉ sn ]
    where marker t = ((t, true) ∈ sn)
If t is unsafe in sn then we cannot deduce anything about it in the complement of sn and it
is marked as unsafe there too.
Back to our example, the recursive call to edge descendantff (@p, y) now returns 〈(@p, ⊥)^⊥〉.
The context after the union of all specialised calls is then C(x, z, y) = 〈(@p, @p, ⊥)^⊥, · · ·〉,
but the first unsafe tuple is then filtered away when intersecting with the result of child(@p, z),
which does not contain (@p, @p).
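The marker discipline amounts to a small amount of bookkeeping on each tuple. A Python sketch of two of the operators (names and encoding are illustrative, with a pair (t, True) representing an unsafe fact t):

```python
def concat(p1, p2):
    """Concatenate two marked tuples; the result is unsafe if either
    operand is, mirroring the ++ definition above."""
    (t1, u1), (t2, u2) = p1, p2
    return (t1 + t2, u1 or u2)

def union_seq(s1, s2):
    """Duplicate-free union: two pairs are equal only when both the
    tuple and the marker state coincide, so a safe and an unsafe copy
    of the same tuple are both kept."""
    seen, out = set(), []
    for p in s1 + s2:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out
```

Run on the first example above, the union keeps (1, 3) twice: once unsafe, once safe.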
We have described how we adapt the Query-Subquery approach to evaluate partially
stratified programs. Most parts of this evaluation mechanism are actually not specific to
Ordered Datalog, and we shall discuss now whether we could, in fact, have based logical
features on sets.
5.5 Back to sets
5.5.1 Motivation
With partial stratification, we have overcome the restrictions of safe Ordered Datalog.
Ordered Datalog programs, however, have other pitfalls compared to normal Datalog. In
particular, they are not good candidates for optimisation because their subgoals cannot be
arbitrarily permuted. In that sense normal Datalog appears to be more declarative than
Ordered Datalog. Furthermore, having the logical parts of the script expressed in normal
Datalog would allow us to integrate JunGL in a tool chain and notably reuse existing efficient
implementations of Datalog. Finally, Datalog seems to be a better choice for reasoning about
the scripts because it is closer to first-order logic. We therefore discuss in this section how
we could fall back to logical features that are based on sets rather than on sequences but still
support a large range of useful scenarios.
During our experiments with JunGL, we have actually found out that many parts of the
scripts do not rely on any order at all. Notably the order is hardly relevant when checking
the preconditions of a refactoring or when computing dataflow properties. Having said that,
we know from Section 4.3.3 that it is perfectly fine to evaluate these parts as normal Datalog
programs. By simply annotating the queries where order does not matter, we could benefit
from the advantages of Datalog over Ordered Datalog mentioned above.
There are many cases however where the order is of course relevant. The order of an
Ordered Datalog query is expressible as a normal Datalog program if we give up on
stratification. The order of a sequence is indeed just a binary relation on tuples. By flattening
each pair of tuples to tuples of double arity, we can represent the order of a sequence-based
predicate as a set-based predicate. However, we would need to encode the behaviour of our
sequence-based relational operators with set-based relational operators. Beside the fact that
such encoding would be very verbose, the presence of the function nub (which, we recall, en-
forces that no duplicate is present in a sequence) in almost all our definitions is challenging.
It can be encoded with negation, but this may lead to unsafe recursion. In the next section,
we address a recurrent scenario where the order is always needed, namely when the operator
first is used. We notably propose a convenient set-based construct to replace it.
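The flattening mentioned above can be illustrated with a minimal Python sketch (illustrative only, outside JunGL): a duplicate-free sequence is encoded as its underlying set together with a set-based binary ordering relation, where each ordered pair flattens two tuples into one of double arity.

```python
def as_set_with_order(seq):
    """Encode a duplicate-free sequence as a pair (set, order relation).

    The order of the sequence becomes a set of pairs (a, b), meaning
    'a occurs before b' in the sequence."""
    assert len(seq) == len(set(seq)), "input must be duplicate-free (nub)"
    elems = set(seq)
    order = {(a, b) for i, a in enumerate(seq) for b in seq[i + 1:]}
    return elems, order
```

As the text notes, this encoding only captures the order itself; re-expressing the sequence-based operators (and nub) on top of it is the verbose part.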
5.5.2 The B operator
A way to get rid of the construct first is to introduce a new binary operator B (pronounced
orelse), which tries to satisfy its right-hand side predicate if and only if its left-hand side
predicate fails. To illustrate, we update some of our earlier examples in Chapter 2 with that
new operator. The treePred edge definition now reads:
let edge treePred n → ?pred =
  [n] listPredecessor [?pred] B [n] parent [?pred]
In words, if n has a list predecessor, then ?pred is the list predecessor of n, or else ?pred
possibly matches the parent of n. Similarly, for defaultCFSucc, we have:
let edge defaultCFSucc x:Statement → ?y =
    [x] listSuccessor [?y]
  B [x] parent [?y:WhileLoop]
  B [x] parent; defaultCFSucc [?y]
  B [x] parent; exit [?y]
Finally, cfsucc is defined as follows:
let edge cfsucc x:If → ?y =
    [x] thenBranch [?y]
  | ([x] elseBranch [?y] B [x] defaultCFSucc [?y])
All three definitions with B are elegant and even more readable than the original ones.
Note, however, that the equivalence of the definitions is contingent on the kind of edges that
are used. Here, the new definitions are equivalent to the previous ones because the edges
involved have at most one target: a node has at most one list predecessor, at most one
parent, at most one list successor, and so on. Therefore we are guaranteed to match at most
one node, just as if we were using first.
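As a minimal model of the sequence semantics of B, the following Python sketch (illustrative only; the data in list_pred and parent is hypothetical) tries the right-hand side for a given source node if and only if the left-hand side yields no match:

```python
def orelse(left, right, sources):
    """Sequence semantics of the B (orelse) operator, sketched.

    left and right map a source node to the (possibly empty) sequence
    of its targets; for each source, the right-hand side is tried only
    when the left-hand side fails."""
    for n in sources:
        matches = list(left(n))
        yield from matches if matches else right(n)

# treePred example: a node's predecessor is its list predecessor,
# or else (possibly) its parent
list_pred = {"stmt2": ["stmt1"]}
parent = {"stmt1": ["block"], "stmt2": ["block"]}
tree_pred = orelse(lambda n: list_pred.get(n, []),
                   lambda n: parent.get(n, []),
                   ["stmt1", "stmt2"])
```

Here stmt1 has no list predecessor and falls back to its parent, while stmt2 keeps its list predecessor, mirroring the treePred definition above.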
We now propose to give the translation of B to Datalog. As an example, consider the
treePred edge again. We want to match the parent of n only when n has no list predecessor.
We could hence write the edge as follows:
let edge treePred n → ?pred =
    [n] listPredecessor [?pred]
  | (![n] listPredecessor []) & [n] parent [?pred]
Note again the asymmetric role of the two variables n and ?pred. We reflect this asymmetry
in the definition of B at the level of Datalog by annotating the operator, as for first:

a(x, y) Bx b(x, y) = a(x, y) ; not a(x, _), b(x, y)
In JunGL, however, we can omit the annotation: we assume it is implicitly given through
the existing asymmetry between the source and the target variables of an edge. In the end,
the notation is very elegant, and we have decided to add it to our language. There, we have
made it work on sequences, with the idea that it can be evaluated as normal Datalog when
order does not matter. This is a win over first, which itself has no counterpart in normal
Datalog.
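The set-based reading of that translation can be sketched as follows (an illustrative Python model, where a and b are sets of (x, y) pairs rather than JunGL predicates):

```python
def orelse_set(a, b):
    """Set-based reading of a(x,y) Bx b(x,y), as translated to Datalog:
    a(x,y) ; not a(x,_), b(x,y)."""
    has_a = {x for (x, _y) in a}  # sources for which a succeeds
    # keep all of a, plus the b-tuples whose source has no a-match
    return a | {(x, y) for (x, y) in b if x not in has_a}
```

Because the result is a set, no order is imposed; this is exactly what makes the construct evaluable as normal Datalog when order does not matter.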
Unfortunately, the queries that use first to take the first success of a list of alternatives
cannot always be directly expressed with B. For instance, the problem is more complex in
the case of the lookup edge also defined in Chapter 2:
let edge lookup r:Var → ?dec =
  first ([r] treePred+ [?dec:VarDecl] & r.name == ?dec.name)
To use B, the idea is to unroll the transitive closure as in:
let edge lookup r:Var → ?dec =
    [r] treePred [?dec:VarDecl] & r.name == ?dec.name
  B [r] treePred; treePred [?dec:VarDecl] & r.name == ?dec.name
  B ...
Therefore, we would need to introduce an auxiliary recursive predicate:
let predicate lookupFrom (?from, ?r, ?d) =
    [?from] treePred [?d:VarDecl] & ?r.name == ?d.name
  B(?from, ?r) [?from] treePred [?p] & lookupFrom (?p, ?r, ?d)

let edge lookup r:Var → ?dec =
  lookupFrom (r, r, ?dec)
We find this definition harder to express and harder to read than our original definition using
first. A possibility, though, would be to introduce, for such a use pattern, yet another operator
that would translate to the appropriate auxiliary predicate.
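The behaviour of the auxiliary predicate lookupFrom can be modelled procedurally (a Python sketch under the assumption, stated in the text, that each node has at most one tree predecessor): walk the treePred chain and stop at the first matching declaration.

```python
def lookup_from(frm, ref_name, tree_pred, var_decls):
    """Model of lookupFrom: follow the treePred chain from `frm` and
    stop at the first VarDecl whose name matches ref_name.

    tree_pred maps a node to its unique tree predecessor (or None);
    var_decls maps declaration nodes to their declared names."""
    node = tree_pred.get(frm)
    while node is not None:
        if var_decls.get(node) == ref_name:
            return node           # first match wins, as with `first`
        node = tree_pred.get(node)
    return None                   # no visible declaration
```

The recursion in the JunGL definition plays the role of the while loop here; the B operator guarantees that deeper predecessors are only considered when the nearer ones fail.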
5.6 Summary and references
In this chapter, we have seen that stratified Datalog programs (whether ordered or not) are
not expressive enough for our application of scripting refactoring transformations. Indeed,
the conditions on static stratification proved too restrictive to successfully express the
computation of static-semantic information with JunGL. The limited expressiveness of
stratified Datalog has been brought out on many occasions in the Datalog literature, e.g.
[Ull94, Prz88], but more rarely in the context of a particular application.
To augment the expressiveness of JunGL, we have therefore introduced the broader class
of partially stratified Datalog programs. Partial stratification is the idea that a Datalog
program, when partially instantiated and reduced with respect to some of its head variables,
becomes stratified. Partial instantiation is akin to the old idea of partial evaluation of logic
programs [War92]. The class of partially stratified Datalog programs is a subset of the class of modularly
stratified programs [Ros94], but it highlights an interesting evaluation mechanism that follows
the set-based top-down strategy of the Query-Subquery approach [Vie86]. We can indeed
perform the partial reduction of components (that are not initially statically stratified) at
runtime, i.e. when the calling context of each relevant predicate is precisely known. Unlike
the solution proposed in [Ros94] for evaluating modularly stratified programs bottom-up, our
approach uses standard relational operators. Furthermore, in contrast to SLG, it is set-based,
thus allowing us to leverage efficient implementations of relational operations.
When generating the partial reduction of a partially stratified component, however, cycles
through nonmonotonic constructs may still occur. This is due to the fact that the reduction
is sensitive to the order of subgoals. To overcome that issue, we propose to temporarily allow
such cycles but to mark tuples inferred from them as unsafe. This proposal is inspired by
the technique of delaying subgoals in SLG resolution [CW96].
Apart from allowing the evaluation of partially stratified components, the Query-Subquery
approach has two other benefits for the evaluation of JunGL scripts. First, it enables the
demand-driven computation of edges. Second, it allows caching of intermediate results. In-
deed, if a calling context of an edge predicate is identical to or subsumed by a previous one,
the edge predicate is solved using answers already computed. In the end, this is roughly
similar to the caching technique used in attribute grammar systems like JastAdd [EH04].
Another interesting point in the implementation of the logical features is the use of
streams. As we shall see in the next chapter, it is convenient to specify a search prob-
lem in a compositional way, generate a stream of successes, and use the operator first on
streams to take the first answer. The technique of modelling the operational semantics of
logic programs in terms of streams was first proposed by Mycroft and Jones [JM84], and
exploited by Spivey and Seres in their embedding of Prolog in Haskell [SS99].
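In this stream-based style, a search problem is built compositionally from goals that map a partial answer to a lazy stream of extended answers, and first simply takes the head of the resulting stream. The following Python sketch (illustrative only; the goal constructors are hypothetical, not JunGL's API) captures the idea:

```python
from itertools import chain

def conj(goal_a, goal_b):
    """Conjunction: for each answer of goal_a, extend it with goal_b."""
    return lambda env: (e2 for e1 in goal_a(env) for e2 in goal_b(e1))

def disj(goal_a, goal_b):
    """Disjunction: lazily concatenate the two streams of answers."""
    return lambda env: chain(goal_a(env), goal_b(env))

def first(stream):
    """Take the first success of a stream of answers, if any."""
    return next(iter(stream), None)

# toy goals: `bind(tag)` succeeds once, extending the answer; `fail` never succeeds
fail = lambda env: iter(())
bind = lambda tag: (lambda env: iter([env + [tag]]))

goal = disj(conj(bind("a"), fail), conj(bind("b"), bind("c")))
```

Because the streams are lazy, applying first to goal explores only as much of the search space as is needed to produce one answer.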
Our implementation of streams follows the one originally proposed in Cω, where streams
are typically generated using iterators of C# 2.0 [ecm06]. A key aspect in Cω is that streams
are always flattened so as to coincide with XPath and XQuery sequences. Of course, in
Ordered Datalog, we manipulate flat sequences too. Many of the research ideas of Cω have
reappeared in the recent LINQ framework [MBB06]. If we had started the implementation of
JunGL slightly later, we would probably have used the LINQ API rather than implementing
relational operations on streams ourselves.
Finally, we have explored how we could base the semantics of the logical features in JunGL
on sets rather than sequences, notably to facilitate reasoning about the transformations.
Thanks to our formalism, most parts of the scripts — the ones in which order does not
matter — would actually require no change. Indeed, if we ignore the order, we have shown
earlier that the stratified evaluation of an Ordered Datalog program leads to the same results
as the stratified evaluation of its normal Datalog counterpart. Yet, other parts of the scripts,
which rely on the operator first, would benefit from a new set-based operator B. To close the
gap with normal Datalog even more, we have actually introduced that operator in JunGL.
The picture is less clear for the remaining parts. Although it is certainly possible to
express any desired order in (non-stratified) normal Datalog, that approach would be quite
verbose. In future work, one may wish to explore the best translation of an arbitrary Ordered
Datalog program to normal Datalog and its consequences on stratification.
Having exposed the design, the semantics and the implementation of JunGL, we are now
ready to put it to test. We shall show, in the next chapter, that JunGL enables the clear
and concise specification of complex real refactoring transformations.
Chapter 6
Scripting refactorings
In this chapter, we wish to validate the design of our language and show that our approach
scales to the expression of refactorings for mainstream languages. We present the implementation
of three of the most frequently used refactorings, which, moreover, are very different in
nature: Rename Variable, Extract Interface and Extract Method. Rename Variable deals with
name binding and scoping. Extract Interface alters the type hierarchy of a program. Finally,
Extract Method manipulates the control and data flow of a program.
We shall describe these refactorings for subsets of mainstream object-oriented languages
like Java or C#. We do not fully support a single language, but we show how to handle the
language features that present a challenge in the correct mechanisation of these transforma-
tions.
6.1 Rename Variable
The automation of Rename Variable goes far beyond a simple search-and-replace mechanism,
because it requires variable binding information and the ability to detect potentially conflicting
declarations of variables with the same name.
Conflicting declarations To understand more precisely the intricacies of renaming, let
us consider the following Java code:
class A {
  int i;
  public int getI() {
    int j = 0;
    return i;
  }
}
One may want to rename the local variable j to i , although the instance member i is used
in the same context. In Eclipse or Visual Studio, post-transformation checks are performed
to ensure that variable bindings have not changed, and in particular no inadvertent variable
capture occurred. In the example, the transformation would be rejected a posteriori — in
Visual Studio, after the tool has offered a view of how the transformation applied.
In a past version of IntelliJ IDEA (5.0 precisely), the above refactoring resulted without
any warning in code where the occurrence of j had simply been changed to i . In such a case,
the code still compiles but i in the second statement of the method is no longer bound to the
instance member, but to the freshly renamed local variable. This situation is certainly the
worst in a refactoring process since your code remains compilable, but now has a different
meaning. JetBrains fixed that bug in IntelliJ IDEA 5.1 shortly after we reported it.
Aim and outline of the script Using JunGL, we wish to detect such conflicts before
the actual transformation, and also attempt to resolve them. We shall make the reasonable
assumption that Rename Variable is correct if all name bindings are preserved by the trans-
formation. That is, any reference to a declaration d should refer to the same declaration d
after the transformation.
In the specification of the refactoring, we hence aim to ensure that the freshly renamed
declaration does not conflict with any pre-existing declaration and that none of the pre-existing
declarations conflicts with the renamed declaration. We shall allow shadowing of a declaration
only if its references are not endangered or if all of them can be qualified appropriately to
make sure that they still refer to the same shadowed declaration. In the above problematic
example for instance, we could remove the ambiguity by changing i in the return statement
to this .i in order to refer to the instance member, even in the presence of a new local variable
i .
The remainder of this section is organised as follows. First, we present an object language
that is both simple for the clarity of our explanations and challenging for the automation of
Rename Variable. That language indeed follows closely the complex name lookup rules of the
Java language. We then describe how to express in JunGL the computation of name lookup
for that language. Finally, we present two versions of the Rename Variable refactoring. One
checks for conflicts, but rejects the transformation if any variable capture occurs. The other is
an extension of the former that tries to minimise rejection by recomputing a non-ambiguous
access for the captured references.
6.1.1 The object language
We consider a subset of Java inspired by the language used in [EH06]. We support packages,
top-level and nested classes, field declarations, class initialisers, local variable declarations
(as the only kind of statements) and any type or variable reference. In addition, we include
super, this and cast expressions. We call this particular new subset of Java NameJava.
As we did before for the toy language While, we can give the abstract grammar of Name-
Java via the following JunGL data type definitions:
type
  Program = { compUnits: CompUnit list }
and
  CompUnit = { packageName: string; classDecls: ClassDecl list }
and
  BodyDecl =
  | MemberDecl = (
    | ClassDecl = { name: string; super: Name;
                    bodyDecls: BodyDecl list }
    | FieldDecl = { fieldType: Name; name: string; expr: Expr }
    )
  | Initializer = { block: Block }
and
  Block = { stmts: Stmt list }
and
  Stmt = (
    | LocalVariableDecl = { varType: Name; name: string; expr: Expr }
    )
and
  Expr =
  | ThisOrSuperOrName = (
    | Name = (
      | SingleName = { name: string }
      | DotName = { left: Expr; right: ThisOrSuperOrName }
      )
    | This
    | Super
    )
  | ParenthesisedExpr = { expr: Expr }
  | Cast = { castType: Name; expr: Expr }
In words, a program is a list of compilation units. Each compilation unit has a package
declaration and a list of class declarations. A class declaration ClassDecl has a name and an
optional extends clause that refers to the name of its superclass. That optional superclass
name is potentially qualified. The data type ClassDecl has therefore a field labeled super of
type Name which is indeed a simple name or a qualified name.
A ClassDecl has also a list of body declarations, each of which is either a class initialiser
(i.e. a block of local variable declarations) or a member declaration. A member declaration
is in turn either a class declaration (thus allowing nested classes) or a field declaration. Field
and local variable declarations have the same structure: they admit a type name, a variable
name and an initialiser expression.
Finally, an expression is either a parenthesised expression, a cast, a super reference, a this
reference, a simple name reference or a qualified name reference. Note that we only use one
data type DotName to represent qualified names or expressions. This grammar therefore al-
lows programs that are not valid NameJava programs. However, such a single representation
is convenient to treat similar cases at once and we use a less permissive grammar for parsing
NameJava programs anyway.
Naturally, we make NameJava follow the same name lookup rules as in Java [jls05].
NameJava therefore exhibits most of the intricacies of name resolution in Java that present
a challenge in the context of Rename Variable. For instance, in the program of Figure 6.1,
the local variable l is initialised with the value of the field f in A.B . In the initialisation
of m, the different access C .this .f also resolves to the field f in A.B . The reference f in
the initialisation of n refers to the field f of the directly enclosing class D . Finally, one can
package a;

class A {
  class B {
    int f;
  }
  class C extends B {
    int g;
    class B {}
    class D extends A.B {
      int f;
      class x {}
      int x;
      {
        int l = super.f;
        int m = C.this.f;
        int n = f;
        int o = ((A.B)C.this).f;
        int p = x;
        x x;
      }
    }
  }
}

Figure 6.1: A NameJava program
refer to a member of a class via a fairly complex qualifier: ((A.B)C .this).f also refers to the
field f in the superclass A.B of the enclosing class C . Note that we could not simply write
((B)C .this).f in that case as B would resolve to the class B in C . Furthermore, in contrast
to C#, it is possible in Java to give the same name to two members of the same class if one
is a field, and the other a class. In our example, it is perfectly fine to define both the class x
and the field x as members of the class D . The context of a reference is then used to resolve
its correct declaration. In the initialisation of p, x refers to the field x of D . On the following
line, however, the local variable x is declared of type x , which is the class in D .
6.1.2 Name lookup
Now that we have introduced our object language informally, we shall describe how we specify
in JunGL the computation of name bindings. Precisely, we give several edge definitions for
relating a type or variable reference to its declaration. The reference might be of the form of
a simple name or a qualified name. Therefore we give an edge definition for both alternatives:
let edge lookup x:SingleName → ?y =
  first ([x] lookupAll [?y] & getName x == getName ?y)

let edge lookup x:DotName → ?y = [x] right; lookup [?y]
The first edge definition from SingleName node x retrieves all visible declarations ?y in
a precise order and takes the first one with a name that matches the name in x . The second
definition extends the lookup mechanism to DotName nodes. Resolving the declaration
referred to by a qualified name x simply reduces to resolving the declaration from the qualified
right subtree of x. This is because, when looking up a single name, we in fact account for
its surrounding context. Indeed, the declarations visible at a single name x depend on the
specific sort of reference that is expected at the position of x (e.g. a variable reference
or a type reference), and also obviously on the presence of a qualifier for x . The former
constraint is handled in the definition of lookupAll, while the latter is treated in the definition
of lookupAllWithDotContext:
let edge lookupAll x:SingleName → ?y =
  [x] lookupAllWithDotContext [?y] &
  (   isVariableName (x) & ([?y:FieldDecl] | [?y:LocalVariableDecl])
    B isTypeName (x) & [?y:ClassDecl]
    B isPackageOrTypeName (x) & ([?y:ClassDecl] | [?y:CompUnit])
    B isAmbiguous (x) )

let edge lookupAllWithDotContext x:SingleName → ?y =
    onTheRightOfDot (x) &
      [x] parent; left; typeLookup; lookupAllMembers [?y]
  | !onTheRightOfDot (x) & [x] lookupAllDecls [?y]
  | !onTheRightOfDot (x) & [x] lookupAllPackages [?y]
The lookupAll edges of x are computed by filtering the lookupAllWithDotContext edges of x
with the information on the kind of reference that is expected at x. If x is expected to be
a variable, then we keep only field and local variable declarations in the stream of possible
lookups. If x is expected to be a type, we keep class declarations only. If x can be a package
or a type reference, then we keep both class declarations and compilation units (we represent
a package by the set of its compilation units). Finally, if x is in an ambiguous context, then
we keep all declarations.
We do not present here the predicates isVariableName, isTypeName, isPackageOrTypeName
and isAmbiguous. Their definition is straightforward and can be found in Appendix
B. We shall however illustrate their behaviour. In the expression ((A.B)C .this).f , f must
be resolved as a variable name, C as a type name, B as a type name too, and A as package
or type name.
More interesting is the account for context in the definition of lookupAllWithDotContext.
There, the stream of declarations depends on whether the reference is qualified or not. If it
is, we resolve the static type of the receiver and we look up its members. If it is not, we first
return all visible declarations from that unqualified context, and then all packages.
The definition of the edge typeLookup is simple in a language with few kinds of expres-
sions. Details can be found again in the full script for Rename Variable in Appendix B. We
shall rather focus here on the definitions of lookupAllMembers, lookupAllDecls and
lookupAllPackages in turn.
All members The edge lookupAllMembers is defined for ClassDecl and CompUnit nodes:
let edge lookupAllMembers x:ClassDecl → ?y =
  [x] (super; lookup)* [?s] &
  ([?s] bodyDecls [?y:FieldDecl] | [?s] bodyDecls [?y:ClassDecl])

let edge lookupAllMembers x:CompUnit → ?y =
  [x] classDecls [?y:ClassDecl]
In words, the potentially visible members of a class declaration x are the fields and nested
classes of x , or of any of the direct or transitive superclasses of x . Of course, not all these
members are actually visible from class x . The order in which we build the stream of edges
is therefore crucial, since it captures member hiding rules. The free variable ?s will match
in order first x itself, then the parent class of x , then the parent of the parent class of x
and so on. To find the parent class of x , we simply recursively call the lookup edge on the
name reference of the superclass of x . For each ?s in the ordered sequence of parent classes,
we then first look up field declarations and then class declarations. Indeed, if a class has
both a field n and a nested class n, then we need to match the field declaration first, as any
ambiguous reference with name n should resolve to that field.
The edge definition of lookupAllMembers is much simpler for compilation units: we
simply return all top-level classes.
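The ordering captured by lookupAllMembers can be modelled as follows (an illustrative Python sketch with hypothetical data, not the JunGL evaluator): members are produced class by class along the superclass chain, with fields before nested classes within each class, so that taking the first name match respects the hiding rules.

```python
def lookup_all_members(cls, superclass, fields, nested_classes):
    """Model of lookupAllMembers: yield members class by class along
    the superclass chain, fields before nested classes within each
    class, so the first match encodes member hiding."""
    c = cls
    while c is not None:
        yield from fields.get(c, [])         # fields first
        yield from nested_classes.get(c, []) # then nested classes
        c = superclass.get(c)                # then the superclass
```

For a class C extending B, the stream lists C's fields and nested classes before any inherited member of B, exactly the order the free variable ?s induces in the edge definition.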
All declarations We shall now describe lookupAllDecls edges. There are three definitions:
one for all nodes in the program and two overridden definitions for ClassDecl and CompUnit
nodes.
let edge lookupAllDecls x → ?y =
    [x] enclosingStmt; listPredecessor+ [?y:LocalVariableDecl]
  | [x] enclosingScope; lookupAllDecls [?y]

let edge lookupAllDecls x:ClassDecl → ?y =
    [x] equals [?y]
  | [x] lookupAllMembers [?y]
  | [x] enclosingScope; lookupAllDecls [?y]

let edge lookupAllDecls x:CompUnit → ?y =
  [x] parent; compUnits [?cu] lookupAllMembers [?y] &
  ( ?cu.packageName == x.packageName | ?cu.packageName == "" )
For a node x that is neither a class, nor a compilation unit, we first try to find an enclosing
statement of x and search for local variable declarations preceding that statement. Then
we move up to the direct enclosing scope of x (i.e. either its direct enclosing class or its
compilation unit), and search for all declarations potentially visible from that point.
The potentially visible declarations of a class declaration x are first the class x itself, then
all members of x , and finally all declarations visible from the enclosing scope of x . Again,
the order of the disjuncts is significant. This time, it captures the shadowing rules of our
language: a member with name n shadows any declaration with the same name n of an
enclosing class.
Finally, the visible declarations of a compilation unit are all the declarations of compilation
units in the same package, or declarations in the root package. Again, we do not describe
here auxiliary edges like enclosingStmt or enclosingScope. Their full definition is given in
Appendix B.
All packages Finally, we shall define the edge that points to all packages. This is straight-
forward as we represent each package by the compilation units it contains. Therefore, it
suffices to climb up to the program root and find all compilation units:
let edge lookupAllPackages x → ?y =
  [x] parent* [:Program] compUnits [?y]
At this point, one might be concerned about the efficiency of our variable binding mecha-
nism. It would be more efficient to compute bindings in a single pass, like in classical compiler
construction. Nevertheless, it is very convenient for prototyping to declaratively specify the
binding rules like we did, by translating the specifications of the language to concise edge
predicates. Our implementation is workable as it stands, and yet improvements are possible,
for instance by specifying additional edges for storing binding information in intermediate
nodes such as blocks.
We conclude the description of the name lookup rules with a pictorial overview of the
lookup process in Figure 6.2. The declarations potentially visible at a point x are returned
in a meaningful order. We first look at members of the direct enclosing class C0,0 of x . Then,
we inspect all inherited members in the chain of superclasses of C0,0, i.e. in all C0,k , first
with k = 1, then with k = 2, and so on. Finally, we process recursively on the enclosing class
of C0,0 itself, that is C1,0. In our figure, the vertical axes represent the inheritance chains
while the horizontal axis represents the nesting chain. For instance, C1,0 is nested in C2,0.
Note that once we have started moving up in an inheritance chain to look for members, we
cannot move to an enclosing class of a superclass.
Figure 6.2: Ordered stream of declarations following first the chain of inheritance, then that
of nesting. [Diagram: from x in C0,0, lookup climbs the vertical inheritance axis through
C0,1, C0,2, ..., then moves along the horizontal nesting axis to C1,0, C2,0, and repeats.]
NameJava provides no support for access controls and interfaces. One might rightly
wonder how we would cope with these in our style of specification. The different rules of
accessibility can be modelled as filters on the stream of visible declarations as we did to
account for the context of a reference. Interfaces, however, bring in multiple inheritance.
As explained in Section 6.4 of the Java language specification [jls05], a class may have two
or more fields with the same simple name if they are declared in different interfaces and
inherited. In that case, it is not possible to refer to any of these fields by its simple name. In
lookupAllMembers, we would hence be careful not to retrieve members whose simple name
refers to more than one member in all superclasses.
6.1.3 Detecting conflicts and renaming
We may now turn to scripting the Rename Variable refactoring. We shall first limit ourselves
to a basic version of it where we reject the transformation in case of any conflict or variable
capture. Interestingly, that basic version is very similar to the Rename Variable script we
gave in Section 2.6.2, although the name binding rules of NameJava are much more complex
than those of the While language we used back there.
In both cases, we have defined the lookup edge of a variable reference x as the first match
in the flow of declarations potentially visible from x . We recall here the very simple definition
of lookup for While programs:
let edge lookup r:Var → ?dec =
  first ([r] treePred+ [?dec:VarDecl] & r.name == ?dec.name)
This is to compare to the lookup definition for NameJava:
let edge lookup x:SingleName → ?y =
  first ([x] lookupAll [?y] & getName x == getName ?y)
The complexity of the lookup is in fact hidden in the stream of potentially visible declarations.
In While, it suffices to climb up the tree of statements. In NameJava, that stream is defined
by carefully traversing classes along inheritance and nesting axes.
Therefore, we can detect variable captures exactly like we did for While programs, by
checking that the declaration to be renamed is not going to capture any existing variable
and that no existing declaration will capture the renamed variable. The full script reads as
follows:
let renameVariable program node newName =
  let dec = pick { ?d | [node] lookup [?d] B equals (node, ?d) } in
  if not isVariableDeclaration dec then
    error "Please choose a variable";
  if dec.name == newName then
    error "Please give a different name";
  if alreadyExists dec newName then
    error "Declaration already exists";
  let findFirst x =
    pick { ?y | [x] lookupAll [?y] &
                (newName == getName ?y | ?y == dec) } in
  let mayBeCaptured =
    { ?x | [program] child+ [?x:SingleName] &
           ?x.name == newName } in
  let needRename =
    { ?x | [program] child+ [?x:SingleName] lookup [dec] } in
  foreach x in mayBeCaptured do
    if findFirst x == dec then error "Variable capture";
  foreach x in needRename do
    if findFirst x != dec then error "Variable capture";
  foreach x in needRename do
    x.name ← newName;
  dec.name ← newName
The description of the core part for detecting variable capture can be found in Section
2.6.2. The only difference in this version is the additional check that no declaration with the
new name and under the same enclosing class already exists. We do not spell out the details
here. The definition of alreadyExists is also given in Appendix B.
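The two capture checks at the heart of the script can be modelled as follows (an illustrative Python sketch; the representation of references as ordered lists of visible declarations is an assumption for the model, not JunGL's actual data structures):

```python
def capture_errors(refs, dec, new_name):
    """Sketch of the two capture checks in renameVariable.

    refs: list of (ref_name, visible), where visible is the ordered list
    of (decl_name, decl_id) pairs from lookupAll; a reference currently
    binds to the first visible entry whose name matches ref_name.
    dec is the id of the declaration to be renamed to new_name."""
    def find_first(visible):
        # first declaration the reference would resolve to after renaming
        return next((d for (n, d) in visible if n == new_name or d == dec), None)
    errors = []
    for ref_name, visible in refs:
        if ref_name == new_name and find_first(visible) == dec:
            errors.append(ref_name)   # existing reference captured by the rename
        bound = next((d for (n, d) in visible if n == ref_name), None)
        if bound == dec and find_first(visible) != dec:
            errors.append(ref_name)   # renamed reference captured by another decl
    return errors
```

In the getI example, renaming j to i makes the reference i in the return statement resolve first to the renamed local declaration, so the model reports a capture just as the script does.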
We shall now try to resolve variable capture in order to minimise rejection.
6.1.4 Minimising rejection
Consider again the example program of Figure 6.1 and suppose that we wish to rename the
field g in class C to f . By doing so, we are hiding the field f of the superclass B of C . This
is a case of variable capture because f of B is actually referred to deeper in the program with
C .this .f . Let us trace what our previous script would do in that case. The reference to f in
C .this .f would be classified as a mayBeCaptured reference simply because it is named after
the new name we wish to give to g. Then it would be checked that, in the flow of declarations
potentially visible from the qualified reference C .this .f , the declaration of g in C does not
appear before that of f in B . Since this is the case, an error would be raised to prevent
the capture. Indeed, we cannot simply rename g to f in C because that would change the
binding of C .this .f to point to that renamed declaration instead of f in B .
Nonetheless, it is actually possible here to change the reference C .this .f to a more explicit
one, say ((A.B)C .this).f . In the remainder of this section, we describe how to implement
this process in JunGL.
The first thing to notice is that any reference qualified with a this or super access is of the
form ((〈Y 〉)〈X 〉.this).f where 〈X 〉 and 〈Y 〉 are both optional qualified type names, and f is
a variable name. In a surrounding class B that extends A, any qualified access of the form
B .super .f can always be replaced with ((A)B .this).f . In addition, any qualified reference,
whose receiver is a general expression, is of the general form ((〈Y 〉)〈expression〉).f where 〈Y 〉
is again any optional qualified type name, and 〈expression〉 is any access of that same form
or of the previous form.
From this observation, we shall amend our initial script to rewrite any reference that is
endangered with variable capture, instead of rejecting the transformation. There are two
different rewrite cases.
CHAPTER 6. SCRIPTING REFACTORINGS 123
Self references The first kind of rewrite proceeds on any reference that is either unqualified
or qualified by a this or a super access. We call them self references for short. Let d be the
declaration node of a field f , and x a self reference to f (i.e. of the form f or A.this .f for
instance). We shall get rid of the qualifier of f (because it is not explicit enough) and rebuild
a new access of the form ((〈Y 〉)〈X 〉.this). Therefore, we need to instantiate the types X and
Y that allow us to refer to d in the context of x . Figure 6.3 shows how to find such types.
The class Y is the direct enclosing class of d , C is the direct enclosing class of x , and X is
both an enclosing class of x and a subclass of Y .

Figure 6.3: Finding X and Y for building the access ((〈Y 〉)〈X 〉.this).
In JunGL, we can find C , X and Y with the following path query:
[x] enclosingClass [?C] enclosingClass* [?X]
    (super ; lookup)* [?Y] bodyDecls [d]
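Read operationally, this query chains single edge steps with reflexive-transitive closures. The sketch below evaluates the same pattern on an explicit graph encoding; it is only an illustration of the semantics, and all node names ("x", "C", "X", "Y", "d") are hypothetical stand-ins mirroring Figure 6.3.

```python
# Minimal sketch: a program graph as a dict of labeled edges, and an
# enumeration of all (C, X, Y) bindings for the path query above.
from itertools import chain

edges = {
    "x": {"enclosingClass": ["C"]},   # x sits directly in class C
    "C": {"enclosingClass": ["X"]},   # C is nested in X
    "X": {"super": ["Y"]},            # X is a subclass of Y
    "Y": {"bodyDecls": ["d"]},        # Y's body declares d
}

def step(node, label):
    return edges.get(node, {}).get(label, [])

def closure(node, labels):
    """Reflexive-transitive closure along any of the given edge labels."""
    seen, todo = [], [node]
    while todo:
        n = todo.pop(0)
        if n in seen:
            continue
        seen.append(n)
        todo.extend(chain.from_iterable(step(n, l) for l in labels))
    return seen

def find_C_X_Y(x, d):
    """All (C, X, Y) matching
       [x] enclosingClass [?C] enclosingClass* [?X] (super;lookup)* [?Y] bodyDecls [d]."""
    for C in step(x, "enclosingClass"):
        for X in closure(C, ["enclosingClass"]):
            for Y in closure(X, ["super", "lookup"]):
                if d in step(Y, "bodyDecls"):
                    yield (C, X, Y)

print(list(find_C_X_Y("x", "d")))  # [('C', 'X', 'Y')]
```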
Type references We now need to build a name to access the class declarations X and
Y from the context of x . One might think that it is always safe to build the fully qualified
name of the class, but in Java, and also in NameJava, the context might prevent us from
referring to a class with its fully qualified name. Take the example below:
package a;
class C {
  class A {
  }
  class B {
    class a {
    }
    class A {
    }
    {
      C.A x;
    }
  }
}
In the declaration of x , it is not possible to write a.C .A because a would resolve to the class
in that context, not to the package. Similarly, it is not possible to simply write A because
that would reference the closest class A. Thus, we need to be careful when building such
type accesses. We sometimes even have to reject the transformation if no valid access can be
built. Arguably, this is a flaw in the design of Java, and the problem could easily have been
avoided. In C# for instance, one can always refer to a member in the global namespace by
qualifying it with global::.
Suppose we wish to build a type access to the class Y . The idea is again to write a
path query to find an enclosing class E of Y which is itself visible from the context of x .
To test for visibility, we check that the first visible class or package declaration that has the
same name as E is E itself. If we cannot find any E that is visible from x , then we have to
reject the transformation. The function buildTypeReference is as follows. The first part for
checking visibility uses an auxiliary function lookupScopeFrom. The second part for building
the actual access uses the foldr function, which is standard in functional programming:
let lookupScopeFrom x name =
  pick { ?s |
    first ([x] allTypesOrPackages [?s] & getName ?s == name) }

let buildTypeReference x c =
  let es = pick { ?es |
    first ([c] enclosingScope* [?es] &
           ?es == lookupScopeFrom x (getName ?es)) } in
  if es == null then
    error ("Cannot build type access for " + c.name)
  else
    let chain = toList { ?ic |
      [c] enclosingScope* [?ic] enclosingScope+ [es] } in
    let esRef = new SingleName { name = getName es } in
    List.foldr
      (fun node ic → new DotName {
        left = node,
        right = new SingleName { name = getName ic }
      })
      esRef chain
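The final fold deserves a note: starting from the reference to es, it wraps one DotName around the access per intermediate class, yielding a dotted type access such as C.A. The sketch below models that step only, flattening the DotName/SingleName constructors to plain strings for brevity; the names and the chain ordering are illustrative assumptions.

```python
# Hedged sketch of the fold in buildTypeReference: es_name is the name of
# the outermost visible scope, chain the intermediate class names ordered
# from es inwards. The result is the dotted access the script would build
# as a nested DotName AST.
from functools import reduce

def build_type_reference(es_name, chain):
    # one "DotName" per intermediate class, modelled as string concatenation
    return reduce(lambda node, ic: node + "." + ic, chain, es_name)

print(build_type_reference("C", ["A"]))       # C.A
print(build_type_reference("E", ["B", "C"]))  # E.B.C
```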
Note that we are sometimes rejecting too much. Indeed, we might reject the transfor-
mation if we cannot build the type access for Y in the qualified reference ((〈Y 〉)〈X 〉.this).f
although it would have been possible to build a type access for Y ′, a subclass of Y such
that the reference ((〈Y ′〉)〈X 〉.this).f is also valid. In an improved version, we could actually
incorporate the test for building the type access inside the path query that finds X and Y .
Foreign references The second kind of rewrite is simpler. It applies to any qualified refer-
ence whose receiver is a general expression, but not a self reference. We call such a reference
a foreign reference. In that case, we shall cast the original receiver with an appropriate type
name 〈Y 〉 and build ((〈Y 〉)〈expression〉). Let d be the declaration node of a field f , and x a
foreign reference to f (i.e. of the form ((B)a).f or A.this .b.f for instance). We simply have
to cast the receiver of f in x to the enclosing type Y of d and build a type access for Y .
Again, we might reject the transformation if we cannot build such a type access.
Before:

package a;
class A {
  class B {
    C f;
  }
  class C extends B {
    int g; // rename to f
    class B {
    }
    class D {
      {
        C x = f.f;
      }
    }
  }
}

After:

package a;
class A {
  class B {
    C f;
  }
  class C extends B {
    int f;
    class B {
    }
    class D {
      {
        C x = ((A.B)((A.B)C.this).f).f;
      }
    }
  }
}

Figure 6.4: Rename Variable scenario successfully handled by our script.
Concluding example We have explained the mechanisation of Rename Variable with
careful checks and little rejection. The full script in Appendix B (for name lookup and the
version of Rename Variable that tries to minimise rejection) is five pages long. To conclude
the section, we illustrate in Figure 6.4 what our script does on a small but tricky example.
Of course, one could argue that the way we deal with variable hiding is undesirable because
the resulting code might sometimes be much less readable. In our view, this objection comes
more under coding style and best practices, and such concerns could also be checked with
JunGL.
6.2 Extract Interface
We shall now change the focus of our discussion to a different kind of refactoring
transformation: those that alter the type structure of an object-oriented program. Perhaps the
most popular example of such a type-based refactoring is Extract Interface.
In the mechanised version of that refactoring, one selects a class from which to extract
a new interface, chooses a name for the new interface and decides on the members to pull
up there. The tool then automatically creates the new interface with the chosen name and
member signatures and makes the original class implement the new interface. This is fairly
straightforward and available as is in Visual Studio 2005 for instance. To illustrate, we give
in Figure 6.5 an example transformation of a C# program: a new interface IContainer is
created with methods void Put(int) and int Get(), and Singleton is made to implement this
new interface.
Before:

class Singleton {
  private int e;
  public void Put(int i) {
    e = i;
  }
  public void Put(Singleton s) {
    e = s.Get();
  }
  public int Get() {
    return e;
  }
}

After:

interface IContainer {
  void Put(int i);
  int Get();
}
class Singleton : IContainer {
  private int e;
  public void Put(int i) {
    e = i;
  }
  public void Put(Singleton s) {
    e = s.Get();
  }
  public int Get() {
    return e;
  }
}

Figure 6.5: Example of Extract Interface in its simple version.
Generalising declared types Frank Tip et al. have proposed a more advanced variant
of Extract Interface, and a rigorous method for automating it, where they attempt to change
the type of each declaration involving the refactored class to use the newly-created interface
[TKB03]. This enhanced version is motivated by the observation that not updating these
declarations leads to overspecific variable declarations, which is not good object-oriented
design. In our example of Figure 6.5 for instance, it would be safe (and desirable) to change
the type of the parameter s in the second method to IContainer . Indeed, the only method
that is called on s is the method Get which has been pulled up to the type IContainer . The
mechanisation of that process consists of three steps: generate a set of type constraints from
the source code, solve the constraints to find the upper bound of each type variable, and
modify the type references that can be generalised. Eclipse supports that more advanced
version of Extract Interface.
Aim and outline of the script We have shown for Rename Variable how to specify the
name lookup rules of a subset of Java with very few expressions. For Extract Interface,
we are however interested in how JunGL scales up to include many more constructs of a
mainstream language. The aim is to express name and type lookup for all these constructs,
together with the type constraints they imply. During Extract Interface, it then suffices to
collect the relevant type constraints, solve these constraints externally and use the results to
modify the original program.
Our presentation of the script shall be much more succinct than that of Rename Variable.
Extract Interface was indeed implemented in JunGL by Arnaud Payement for a large subset
of C# and one can refer to [Pay06] for full details. In this section, we briefly mention the
key ingredients of the automation. We start by describing informally the object language
and the static-semantic information required for Extract Interface. Then, we illustrate type
constraints and explain how they are collected using JunGL. Finally we discuss very briefly
the technique used to solve the constraints and incorporate back the results in the original
program.
6.2.1 The object language
The object language considered here is a substantial subset of the C# 2.0 language [ecm06].
That subset notably includes non-trivial features such as generics or structs. We do not
spell out all the details of the abstract grammar of C#, but wish to give an overview of the
data types. One particularity is the support for both source code and libraries. Indeed, to handle
realistic programs, we need to have access to namespaces, types and members declared in
external .NET assemblies, which we shall model as compilation units for simplicity. Hence,
a compilation unit shall be either a source file or an assembly:
type
  CompilationUnit =
  | SourceUnit = { usings : Using list;
                   members : NamespaceMemberDecl list }
  | Assembly = { members : NamespaceMemberDecl list }
A source unit and an assembly both encapsulate a list of namespace member declarations.
Such a declaration introduces either a namespace or a type member. A type member decla-
ration is either a type declaration, a field declaration or a callable declaration. In turn a type
is either a class, a struct, an interface or a type parameter. All these kinds of declarations
are represented with the following data types:
NamespaceMemberDecl =
| NamespaceDecl = { name : string;
                    members : NamespaceMemberDecl list }
| MemberDecl = (
  | TypeDecl = (
    | ConcreteTypeDecl = (
      | ClassDecl = ...
      | StructDecl = ...
    )
    | InterfaceDecl = ...
    | TypeParamDecl = ...
  )
  | FieldDecl = ...
  | CallableDecl = ...
)
A callable is either a method or a constructor. We shall come back to them and to
statements when we discuss the automation of Extract Method. For now, we focus on sup-
porting expressions, as they are involved in most of the type constraints required for Extract
Interface. The following data types represent the different kinds of expressions we support:
type
  Expression =
  | ObjectCreateExpr = { typeRef : TypeRef;
                         arguments : MethodArgument list }
  | ArrayCreateExpr = { typeRef : TypeRef;
                        qualifiers : Qualifier list }
  | MethodInvokeExpr = { target : Expression;
                         arguments : MethodArgument list }
  | ArrayAccessExpr = { target : Expression;
                        qualifiers : Qualifier list }
  | MemberAccessExpr = { target : Expression;
                         entityRef : EntityRef }
  | Reference = (
    | EntityRef = { name : string; typeArgs : TypeRef list }
    | ThisRef
    | BaseRef
    | TypeRef = { path : NamespacePath; qualifiers : Qualifier list }
  )
  | AssignExpr = { left : Expression; operator : AssignOperator;
                   right : Expression }
  | BinaryExpr = { left : Expression; operator : BinaryOperator;
                   right : Expression }
  | PrefixExpr = { operator : PrefixOperator;
                   target : Expression }
  | PostfixExpr = { target : Expression;
                    operator : PrefixAndPostfixOperator }
  | ParenthesisExpr = { target : Expression }
  | PrimitiveExpr = (
    | StringLiteral = { value : string }
    | CharLiteral = { value : string }
    | IntegerLiteral = { value : string }
    | RealLiteral = { value : string }
    | Null | False | True
  )
and
  MethodArgument = { direction : ParamDirection; target : Expression }
As one can see, this is a substantial set of constructs. One may wonder what the field direction
in the data type MethodArgument is for. It simply indicates the passing mode of the argument, and
we shall come back to that when we discuss Extract Method. An important construct that is
not apparent here is the ability to cast an expression to a given type. In fact, cast
is defined as a prefix operator:
type
  PrefixOperator =
  | UnaryAdd | UnarySub | Not | OnesComplement
  | Cast = { typeRef : TypeRef }
  | PrefixAndPostfixOperator = (
    | Increment | Decrement
  )
In the end, we support a large subset of the language, and notably features that are often
considered harder to accommodate in the construction of compilers. The principal
features that we do not support are exceptions, labeled statements, multiple variable and
field declarations, parameter arrays, constructor initializers, operator declarations, indexer
declarations, delegates and partial classes. This seems to be a fairly large list, but most of
the constructs we cite here are specific to C# 2.0 and are not present in a language like Java,
apart of course from exceptions, labeled statements and multiple declarations. Omitting
these simplifies our next discussion of Extract Method, and we do not envision any difficulty
in supporting them.
6.2.2 Name and type lookup
The name and type lookup edges are built from about 50 sub-edges and predicates, by
naturally translating into JunGL the ECMA specifications of the language [ecm06]. The
name lookup edge declLookup links an entity reference or a method call to its definition.
Definitions of that edge are therefore similar to what we have expressed for NameJava in
the previous section. On the other hand, the type lookup edge typeLookup links an entity
reference or a method call to the declaration of its type.
To wit, the name lookup edge emanating from a method call points to the most abstract
definition of the method that may be called, while the type of a method call is the return
type of the method that is invoked:
let edge typeLookup e : MethodInvokeExpr → ?t =
  [e] declLookup ; typeRef ; typeLookup [?t]
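The composition declLookup ; typeRef ; typeLookup can be read as chaining three partial maps over graph nodes: resolve the call to its method, follow the method's return-type reference, and resolve that reference to a type declaration. The sketch below models that reading on invented node names; it is an illustration of edge composition, not the JunGL evaluator.

```python
# Each edge is modelled as a partial map from nodes to nodes; ';' chains
# them, so the type of a method call is the type declaration reached via
# the resolved method's return-type reference. All names are hypothetical.
decl_lookup = {"call": "method_decl"}    # call site -> method definition
type_ref    = {"method_decl": "int_ref"} # method -> its return-type reference
type_lookup = {"int_ref": "Int32_decl"}  # type reference -> type declaration

def compose(*edges):
    def follow(node):
        for e in edges:
            node = e.get(node)
            if node is None:
                return None   # composition fails if any edge is missing
        return node
    return follow

call_type = compose(decl_lookup, type_ref, type_lookup)
print(call_type("call"))  # Int32_decl
```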
Another example of typeLookup is for resolving the type of literals. For example:
let edge typeLookup e : IntegerLiteral → ?t =
  [e] root ; systemClasses [?t] & ?t.name == "Int32"

let edge typeLookup e : Null → ?t =
  [e] root ; systemClasses [?t] & ?t.name == "Object"
The edge root climbs up to the root node of the program, that is, the node holding assemblies
and source units. As for the edge systemClasses, it points directly to the system classes in
the namespace System of the core .NET assembly.
There are other interesting aspects of the implementation of name and type lookup for C#.
More details can be found in [Pay06]. In particular, the definition of accessibility domains
follows the wording of the rules given in the specifications of the language [ecm06].
6.2.3 Generating type constraints
We now illustrate what kind of type constraints are generated. A constraint is composed
of two elements and one operator, which is either a strict subtyping, a subtyping, or a type
equality. We distinguish two different types of elements: variables and constants. Variables
are nodes that need to be typed while constants are types of the program.
As a brief example, we shall draw a few type constraints from our example program of
Figure 6.5. We write type variables in square brackets.
Code                              Constraints
class Singleton : IContainer      Singleton < IContainer
e = i                             [i ] ≤ [e]
e = s .Get()                      [s .Get()] ≤ [e]
                                  [s .Get()] = [int ]
                                  [s ] ≤ IContainer
return e;                         [e] ≤ [int ]
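The constraints in the table above could be encoded as plain triples of two elements and an operator. The sketch below is one illustrative encoding (not the script's actual representation), using the table's convention that bracketed elements are type variables and the rest are constant types:

```python
# Each constraint relates two elements by strict subtyping, subtyping,
# or type equality. The string encoding is an assumption for this sketch.
STRICT_SUB, SUB, EQ = "<", "<=", "="

constraints = [
    ("Singleton", STRICT_SUB, "IContainer"),  # class Singleton : IContainer
    ("[i]",       SUB,        "[e]"),         # e = i
    ("[s.Get()]", SUB,        "[e]"),         # e = s.Get()
    ("[s.Get()]", EQ,         "[int]"),
    ("[s]",       SUB,        "IContainer"),
    ("[e]",       SUB,        "[int]"),       # return e
]

def variables(cs):
    """Type variables are the bracketed elements; the rest are constants."""
    return {t for c in cs for t in (c[0], c[2]) if t.startswith("[")}

print(sorted(variables(constraints)))
```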
As we see, constraints are generated on the refactored program. Extract Interface has
indeed two distinct phases. We first create the new interface, pull up selected members and
make the original class implement the new interface. This corresponds to the naive version
of the refactoring. Then we look at type constraints on the refactored program to eventually
find a more general type for each declaration involving the refactored class.
Type constraints can be numerous even for a fairly small program. Therefore, we do
not want to generate all of them. We are only interested in those relevant to declarations
involving the refactored class. We can define an edge declarationPoint that finds all the
declarations whose type refers to a particular class declaration x :
let edge declarationPoint x : ClassDecl → ?s =
  ([?s : FieldDecl] | [?s : MethodDecl] | [?s : ParamDecl]
   | [?s : VariableDeclStmt] | [?s : ForEachStmt]) &
  [?s] typeRef ; typeLookup [x]
Once we have found all the declarations to be potentially generalised, we need to identify
any node in the program whose type constraints may involve these declarations. These are
all the declarations themselves, plus all expressions and statements containing a reference to
them:
let edge constraintPoint x : ClassDecl → ?p =
  [x] declarationPoint [?p]
  | ([?p : Expression] | [?p : Statement]) &
    [?p] child* [: EntityRef] declLookup [?d] &
    [x] declarationPoint [?d]
If x is the class declaration being refactored, we can then get a stream of constraints with:
{ buildConstraint ?s | [x] constraintPoint [?s] }
where buildConstraint would be a function that takes a node that is either a declaration, a
statement or an expression and builds a set of constraints similar to the example constraints
of the above table. One might want to look at the report of Payement [Pay06] for a full
description of type constraints, which are the adaptation for C# of the constraints given in
[TKB03] for Java.
6.2.4 Solving and transforming
The solving process is in the same vein as the work done on Soot, a Java bytecode opti-
misation framework, for efficiently inferring static types at the level of bytecode [GHM00].
The constraints collected using JunGL are turned into a graph whose nodes are elements of
the constraints and whose edges represent constraints themselves. Edges hence correspond
either to strict subtyping, to subtyping or to type equality, and are labeled accordingly. The
graph is then simplified through a succession of operations, namely collapsing of nodes and
transitive reduction, in order to find the upper bound of each type variable. Again, the
full solving process is thoroughly described in [Pay06] together with an optimisation for the
special problem of Extract Interface, where the set of constant types can be simplified to two
elements: one that represents the newly-created interface and another one that represents
any other type.
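To make the shape of this process concrete, here is a heavily simplified sketch: equality constraints collapse nodes (modelled here with union-find rather than the actual graph operations of [Pay06]), and subtype edges then yield, for each collapsed variable, its set of reachable constant upper bounds. All constraint elements are hypothetical.

```python
# Simplified constraint-graph solving: collapse equalities, then walk
# subtype edges to collect each variable's constant upper bounds.
from collections import defaultdict

def solve(eqs, subs, constants):
    parent = {}
    def find(x):                       # union-find with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in eqs:                   # collapse a = b
        parent[find(a)] = find(b)
    succ = defaultdict(set)
    for a, b in subs:                  # edge for a <= b
        succ[find(a)].add(find(b))
    def upper_consts(v):
        seen, todo, out = set(), [find(v)], set()
        while todo:
            n = todo.pop()
            if n in seen:
                continue
            seen.add(n)
            if n in constants:
                out.add(n)
            todo.extend(succ[n])
        return out
    return upper_consts

bounds = solve(eqs=[("[s.Get()]", "int")],
               subs=[("[s]", "IContainer"), ("[i]", "[e]"), ("[e]", "int")],
               constants={"IContainer", "int"})
print(bounds("[s]"))  # {'IContainer'}
```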
Currently, the constraint solver is external to JunGL and interfaced using external calls.
It could have been implemented using the ML features of JunGL, but a more interesting
future work would be to express constraints as predicates in JunGL. The work by Speicher
et al. with GenTL [SAK07] suggests that it can be done elegantly.
Once the constraints are solved, they are exploited as follows. We take all the elements
that have been collapsed into the node containing the freshly-created interface, say I , and
we change the type of their declaration to I . Any other declaration remains unchanged.
Note that we need to be careful when changing the type of a declaration. Indeed, we
cannot simply replace the type reference by the single name of the new interface, as another
member with the same name might be hiding the interface declaration. To illustrate, we
show in Figure 6.6 a flaw in Eclipse 3.3. Types of declarations are correctly generalised, but
type accesses are not properly updated. In the refactored field declaration, I points to the
wrong interface. It should have been a.I . Since we must account for the binding context of
the declaration, the solution is to reuse exactly what has been done in Rename Variable for
generating type references.
Before:

package a;
class A {
  class I {}
  A a = this;
}

After:

package a;
interface I { }
class A implements I {
  class I {}
  I a = this;
}

Figure 6.6: Issue with type references in Eclipse 3.3.
We have presented the main ingredients for automating Extract Interface in JunGL. Nat-
urally, other type-based refactorings can be mechanised, for instance to introduce generic
types [DKTE04, vDD04, FTK+05, KETF07] or to support class library migration [BTF05].
Most of these refactorings involve an analysis of the class hierarchy and require solving type
constraints as we have described here.
6.3 Extract Method
Let us now turn to the Extract Method refactoring. We have already described informally
what this refactoring is about in Chapter 1, notably by quoting the informal recipe that is
typically found in refactoring books, e.g. [Fow99]. We have also reported a few flaws in IDEs
like Visual Studio 2005 or Eclipse 3.3. The problem is either that no true control and data
flow analyses are performed, or that preconditions of the transformation are not correctly
implemented. As we shall see, these preconditions are fairly complex, and being able to give
a clear executable specification of them is one of the main benefits of JunGL.
Aim and outline of the script The aim here is to give a precise, concise and rigorous
specification of Extract Method. Notably, we shall use logical path queries to express the
control and data flow properties that are necessary for the correct automation of the trans-
formation. Our object language shall be the same as for Extract Interface, namely a large
subset of C#, but we focus this time mostly on functions and statements.
The remainder of the section is organised as follows. We first present the abstract
grammar of the statements that we shall consider, and describe how we super-impose the
control-flow graph of a method on its statement nodes. Only then do we turn to the implementation
of the refactoring. Its input is a name for the new method and two statement nodes in the
graph, namely the start and the end of the region to be extracted. There are four major
phases in the implementation, and we shall consider each in turn: checking the validity of
the selection, determining what parameters must be passed, where declarations should be
moved, and finally doing the transformation itself.
6.3.1 The object language
The object language is roughly the same subset of C# as for Extract Interface. Here we give
the data types for callables and statements. We start with callables:
type
  CallableDecl = (
  | MethodDecl = { name : string; modifiers : Modifier list;
                   parameters : ParamDecl list; block : Block }
  | ConstructorDecl = { name : string; modifiers : Modifier list;
                        parameters : ParamDecl list; block : Block }
  )
and
  ParamDecl = { direction : ParamDirection; typeRef : TypeRef;
                name : string }
and
  ParamDirection =
  | Value | Ref | Out
A callable is either a method or a constructor, and indeed we wish to allow the extraction
from a constructor too. A callable has a (possibly empty) list of parameter declarations
ParamDecl. Each of the parameters has of course a reference to a type and a name, but also
a parameter direction that indicates the passing mode of the parameter.
While the default passing mode is by value, C# also allows for two other modes, namely out
and ref. Output and reference parameter passing modes are used to allow a method to alter
variables passed in by the caller. The caller of a method which takes an output parameter
is not required to assign the variable passed as that parameter prior to the call; however,
the callee is required to assign the output parameter before returning. In a way, output
parameters are like additional return values of a method. In contrast, reference parameters
must be initially assigned by the caller, and therefore the callee is not required to assign them
before their use. In effect, reference parameters are passed both in and out of a method.
The presence of these alternative passing modes means that Extract Method
in C# is less likely to be rejected than the same refactoring in Java. Indeed, there are
more opportunities for handling parameters. Yet, the transformation has to account for all
three passing modes. We shall describe how we do that with JunGL in the remainder.
Other data types are important for the transformation, in particular the ones for repre-
senting different kinds of statements:
Statement =
| VariableDeclStmt = { modifiers : Modifier list; typeRef : TypeRef;
                       name : string; initializer : Expression }
| ExprStmt = { target : Expression }
| ReturnStmt = { target : Expression }
| BreakStmt
| ContinueStmt
| IfStmt = { condition : Expression; thenBranch : Statement;
             elseBranch : Statement }
| Loop = (
  | WhileStmt = { condition : Expression; body : Statement }
  | DoWhileStmt = { condition : Expression; body : Statement }
  | ForStmt = { initializer : Statement; condition : Expression;
                iterator : Expression; body : Statement }
  | ForEachStmt = { typeRef : TypeRef; name : string;
                    target : Expression; body : Statement }
)
| Block = (
  | EmptyStmt
  | BlockStmt = { statements : Statement list }
)
Note how we group all different kinds of loop statements under a common abstract data type
Loop. Similarly, we consider that empty statements and block statements are just Blocks.
Semantically, an empty statement is indeed just a block with an empty list of statements.
This allows us to simplify our coming reasoning on the control flow.
There is no support here for exceptions, labeled statements, goto statements, switch
statements, lock statements, using statements and anonymous methods. We do not express
the complete static-semantic rules of C# 2.0 required for the automation of Extract Method,
but we illustrate how one can fully accomplish it.
6.3.2 Control and data flow
For the mechanisation of Extract Method, we primarily rely on three sorts of static-semantic
information: name binding, control flow and data flow. We have addressed name binding
before, and we assume we can look up the declaration of a reference by following its declLookup
edge. We shall focus here on control and data flow only.
We recall that initially we have a raw syntax tree with no static-semantic information.
We must therefore define lazy edges to super-impose control-flow information on that tree.
We proceed like we did in our examples with the While language in Chapter 2. That is, we
first define two dummy attributes for the entry and exit of a callable:
type Entry
type Exit

let attribute callableEntry c : CallableDecl = new Entry {}
let attribute callableExit c : CallableDecl = new Exit {}
Then, we introduce an edge that links a statement to its exit statement:
let edge exit x : Statement → ?y =
  [x] listSuccessor [?y]
  B [x] parent [: Loop] continue [?y]
  B [x] parent ; exit [?y]
  B [x] parent ; callableExit [?y]
In most cases, the exit of a statement is simply the following statement in the list that
contains it. This is handled by the first disjunct of the B alternative. The last statement in
a loop, however, exits to the part of the loop that needs to be evaluated after each iteration.
This is what the second attempt expresses and we shall come back to the edge continue in
an instant. The third disjunct tries to exit from a block to the exit of its parent. Finally, if
none of the previous predicates succeeded, we exit the method or the constructor itself.
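This ordered-default behaviour (written B in the listing) can be modelled as a chain of alternatives where a later one is consulted only when all earlier ones produced no match. The sketch below captures that reading on a hypothetical two-statement example; it mirrors our description of the construct, not its actual implementation.

```python
# Ordered default: try each alternative in turn; a later alternative
# fires only when every earlier one came up empty.
def ordered_default(*alternatives):
    def query(node):
        for alt in alternatives:
            matches = alt(node)
            if matches:
                return matches
        return []
    return query

# Hypothetical mini-AST: "s0" precedes "s" in a list; "s" is the last
# statement of a while loop whose continue edge points to its condition.
list_successor = {"s0": ["s"]}
loop_continue = {"s": ["cond"]}

exit_edge = ordered_default(
    lambda n: list_successor.get(n, []),  # first disjunct: list successor
    lambda n: loop_continue.get(n, []),   # fallback: loop's continue edge
)
print(exit_edge("s0"))  # ['s']
print(exit_edge("s"))   # ['cond']
```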
The edge continue is defined for all different loops as follows:
let edge continue x : WhileStmt → ?y = [x] condition [?y]
let edge continue x : DoWhileStmt → ?y = [x] condition [?y]
let edge continue x : ForStmt → ?y = [x] iterator [?y]
let edge continue x : ForEachStmt → ?y = [x] target [?y]
After each iteration of a while or a do-while loop, the control flows to the condition of the
loop. In the case of a traditional for loop, it is however the iterator expression that should be
evaluated first. Finally, in the case of an enhanced foreach loop, we assimilate the invocation
of MoveNext on the target collection [ecm06] with the target itself.
Again, we only wish to give a taste of how to define the control flow and we do it at
the level of statements only. We give here a few examples. The successor of an expression
statement is the exit node of that statement (as defined above):
let edge cfsucc x : ExprStmt → ?y = [x] exit [?y]
The successor of a return statement is the dummy exit of the callable:
let edge cfsucc x : ReturnStmt → ?y =
  [x] parent+ [: CallableDecl] callableExit [?y]
This is partly because we do not support try-catch-finally clauses. If we were to handle
them, we would have to make any return statement enclosed in a try-catch block exit to the
corresponding finally clause. The successor of an if statement is its guard expression, because
we consider the statement itself as an intermediate meaningless node in the control flow:
let edge cfsucc x : IfStmt → ?y = [x] condition [?y]
The control-flow successor of a break statement is the exit of its enclosing loop:
let edge cfsucc x : BreakStmt → ?y =
  first ([x] parent+ [: Loop] exit [?y])
In contrast, the successor of a continue statement is the part of the loop that needs to
be executed after each iteration:
let edge cfsucc x : ContinueStmt → ?y =
  first ([x] parent+ [: Loop] continue [?y])
To express dataflow properties, we shall need information about variables that are used
and defined in each statement or expression. A use edge links a statement or an expression
to the variables that are read during its execution. Dually, a def edge relates a statement
or an expression to the variables that it writes. We shall give here a couple of definitions
only, for handling the different parameter passing modes. A method argument x uses all the
variables found in its expression if its passing direction is not out:
let edge use x : MethodArgument → ?y =
  ![x] direction [: Out] & [x] target ; use [?y]
On the other hand, a method argument x defines a variable declared in ?y if its passing mode
is either out or ref:
let edge def x : MethodArgument → ?y =
  ([x] direction [: Out] | [x] direction [: Ref])
  & [x] target [: EntityRef] declLookup [?y]
To conclude, we also define a useOrDef edge which is shorthand for the union of the two
former edges:
let edge useOrDef x → ?y = [x] use [?y] | [x] def [?y]
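The use/def rules for method arguments can be summarised in a few lines of ordinary code: an argument reads the variables in its expression unless it is an out argument, and it writes its target variable when passed out or ref. The sketch below is only a model of these two edges; the Arg structure is a hypothetical stand-in for the JunGL graph nodes.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Arg:
    direction: str          # "value", "ref" or "out"
    target_vars: List[str]  # variables occurring in the argument expression

def use(arg):
    # an out argument is write-only, so it uses nothing
    return [] if arg.direction == "out" else arg.target_vars

def defines(arg):
    # only out/ref arguments write their (variable) target
    return arg.target_vars if arg.direction in ("out", "ref") else []

a = Arg("out", ["x"])
b = Arg("value", ["y", "z"])
print(use(a), defines(a))  # [] ['x']
print(use(b), defines(b))  # ['y', 'z'] []
```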
6.3.3 Checking validity
We now turn to specifying preconditions of the transformation. The refactoring will first
need to check that it is a valid selection: for instance, one can only extract a block of code
into a method if it is single-entry single-exit. These are the usual conditions: the start
node dominates the end node, the end node post-dominates the start node, and the set of
cycles containing the start node is equal to the set of cycles containing the end node. These
conditions are easily expressed in terms of path patterns like we did in Chapter 2. For
example, here is the definition of dominates:
let dominates entryNode startNode endNode =
    Stream.isEmpty
        { () | [entryNode]
               (local ?z : cfsucc [?z] &
                ?z != startNode)*
               [endNode] }
It takes three parameters: the entry node of the method or constructor that contains the
block, the start node of the block, and the end node of the block. By definition, the start
node dominates the end node if all paths from the entry node to the end node pass through
the start node. The predicate
[entryNode]
(local ?z : cfsucc [?z] & ?z != startNode)*
[endNode]
signifies a path none of whose elements after the entry node equals the start node. We hence
require that no such path exists, by testing that the above set is empty. The function isEmpty is simply
defined as:
let isEmpty s = pick s == null
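The same check can be phrased as plain graph reachability: startNode dominates endNode exactly when endNode cannot be reached from the entry while avoiding startNode. A Python sketch, assuming the control-flow graph is given as an explicit successor map rather than lazy cfsucc edges:

```python
def path_avoiding(succ, source, target, avoid):
    """Is there a path source ->* target on which every node after
    the source differs from `avoid`? (Mirrors the starred pattern.)"""
    seen, stack = set(), [source]
    while stack:
        n = stack.pop()
        if n == target:
            return True
        if n in seen:
            continue
        seen.add(n)
        for m in succ.get(n, []):
            if m != avoid:      # steps may never land on `avoid`
                stack.append(m)
    return False

def dominates(succ, entry, start, end):
    """start dominates end iff no entry-to-end path avoids start."""
    return not path_avoiding(succ, entry, end, avoid=start)
```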
Other similar checks are required. The control-flow graph indeed lacks some scoping
information, and therefore we also need to check that the selection does not straddle different
scopes.
6.3.4 Inferring parameters
When we have verified that the selection is indeed amenable to method extraction, the next
task is to determine what the parameters of the method should be, and what results must
be returned. We shall consider different sets of parameters for the different passing modes.
Those parameters are chosen among the local variables that are used and defined in the
selection.
We start by describing how we compute that set of local variables. The JunGL script
that follows is an excerpt of the full script given in Appendix C. The node outerEndNode is
the direct successor of endNode, i.e. the first node which follows the selection but is not in
the selection.
let selectionStatements = { ?s |
    [startNode] (local ?z : cfsucc [?z] &
                 ?z != outerEndNode &
                 ?z != exitNode)* [?s] } in
let predicate mayUseOrDefInSelection(?x) =
    isIn(?s, selectionStatements) & [?s] useOrDef [?x] in
let variables = { ?x |
    mayUseOrDefInSelection(?x) &
    ([?x : VariableDeclStmt] | [?x : ParamDecl]) }
As we see, we first select the statements that are contained in the selection, namely the
statements reachable from startNode without going through outerEndNode. Then we define
one local predicate: mayUseOrDefInSelection(?x) holds for variables that are used or defined
inside selection statements. Finally, we compute the stream variables by restricting the
variables for which mayUseOrDefInSelection holds to local variable declarations and parameters.
We now turn to classifying variables. A variable x in variables will become a value
parameter if the following conditions are satisfied:
• x is live upon entry in the extracted block, that is, it may be used in the selection, and
it is not redefined before it is used. The condition that x may be used is obvious; if x
is always redefined before such a use, there is no need to pass it as a parameter, as its
value can be computed locally in the extracted method.
• It is not the case that x may both be redefined in the selection, and used before it is
redefined after the selection. If x is live at the end of the selection, but not redefined
in the selection, it is fine to pass it by value.
We can thus compute the set of value parameters as follows:
let valueParams =
    { ?x | isIn(?x, variables) &
           mayUseBeforeDefInSelection(?x) &
           !(mayDefInSelection(?x) &
             mayUseBeforeDefAfterSelection(?x))
    }
The predicates used here have again an elegant definition in JunGL. To illustrate, consider
mayUseBeforeDefAfterSelection(?x). This predicate holds if there is a path from the end
node to a use of x with no intervening definition of x . A node u uses x if it has a user-defined
lazy edge labeled use to x . Similarly, an intervening node z does not define x if it has no
lazy edge labeled def to x .
let predicate mayUseBeforeDefAfterSelection(?x) =
    [outerEndNode]
    (local ?z : [?z] cfsucc & ![?z] def [?x])*
    [?u] use [?x]
  | [method] parameters [?x] direction [?:!Value]
Note that this definition also deals, thanks to the second disjunct, with the possibility for a
use outside the callable method where the extraction occurs, namely when x is a non-value
parameter. Like outerEndNode, the node method is retrieved at the beginning of the script.
All details are exposed in Appendix C.
We now consider when a variable x should become an output parameter of the extracted
method. Here the specification consists of three conjuncts:
• First, there exists a potential use without prior definition of the variable x after the
selected statements: without such a potential use, there is no point in returning x as a
result of the method.
• Second, there should be no use of x before a definition of x in the selection itself. If
there was such a use, it would not be sufficient to pass x merely as an output parameter:
its initial value is important too.
• Third, x must actually be defined in the selection. If it were not, then the result of
the refactoring would not be compilable, because C# requires all output variables to be
definitely assigned.
In summary, we can define the set of output parameters as follows:
let outParams =
    { ?x |
      isIn(?x, variables) &
      mayUseBeforeDefAfterSelection(?x) &
      !mayUseBeforeDefInSelection(?x) &
      mustDefInSelection(?x)
    }
Again, the definitions of these predicates are all straightforward in JunGL, and the details
can be found in Appendix C.
At this point, we have precisely defined what should be the value and output parameters
of the extracted method. It remains to define the reference parameters. At first glance, one
might say that any variable in the selected block that is not a value or output parameter is
a reference parameter. Such a criterion would however be much too crude. Some variables
will just be local to the selection, and such variables do not need to be passed as parameters
at all. They will become local variables of the extracted method body. A more accurate
definition of the set of reference parameters is therefore as follows:
let refParams =
    { ?x |
      isIn(?x, variables) &
      (mayUseBeforeDefInSelection(?x) |
       (mayDefInSelection(?x) & !mustDefInSelection(?x))) &
      mayUseBeforeDefAfterSelection(?x) &
      !isIn(?x, valueParams) &
      !isIn(?x, outParams)
    }
That is, either x may be used before it is redefined in the selection, or x is only potentially
defined in the selection; moreover, x may be used before it is redefined after the selection,
and it is not already a value or output parameter.
It is interesting to work out the effect of these definitions on an example such as the
one of Figure 1.1. For convenience, we recall here the full method and the selection under
consideration:
public void F(bool b)
{
    int i;
    // from
    if (b)
    {
        i = 0;
        Console.WriteLine(i);
    }
    // to
    i = 1;
    Console.WriteLine(i);
}
Clearly b is classified as a value parameter. But what about i? As explained in the introduc-
tion, the bug in Visual Studio was that i became an output parameter (and being the only
such parameter, in fact the method result). In our definition, that is prevented by the final
conjunct in the definition of out because we have
!mustDefInSelection(i)
Note that we also don’t get i as a value parameter because there is a definition before its
use in the selection. Finally, it does not become a reference parameter because it is defined
before being used after the selection. We conclude that according to our definition, i does
not become a parameter at all.
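To make this classification concrete, here is a Python sketch (not JunGL) that models the predicates as reachability over a hand-built control-flow graph. It uses illustrative node names and omits the extra disjunct for non-value parameters of the enclosing method:

```python
def may_use_before_def(succ, source, x, uses, defs, within=None):
    """Path from source to a use of x such that no node from which a
    step is taken defines x; optionally restricted to `within`."""
    seen, stack = set(), [source]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if within is not None and n not in within:
            continue
        if x in uses.get(n, ()):
            return True
        if x in defs.get(n, ()):
            continue              # a definition blocks the path here
        stack.extend(succ.get(n, []))
    return False

def must_def_in_selection(succ, start, outer_end, x, defs):
    """Every control-flow path from start to outer_end defines x."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n == outer_end:
            return False          # a def-free path escaped the selection
        if n in seen or x in defs.get(n, ()):
            continue
        seen.add(n)
        stack.extend(succ.get(n, []))
    return True

def classify(succ, start, outer_end, selection, variables, uses, defs):
    """Split `variables` into value, output and reference parameters."""
    value, out, ref = set(), set(), set()
    for x in variables:
        use_in = may_use_before_def(succ, start, x, uses, defs, selection)
        use_after = may_use_before_def(succ, outer_end, x, uses, defs)
        may_def = any(x in defs.get(n, ()) for n in selection)
        must_def = must_def_in_selection(succ, start, outer_end, x, defs)
        if use_in and not (may_def and use_after):
            value.add(x)
        elif use_after and not use_in and must_def:
            out.add(x)
        elif (use_in or (may_def and not must_def)) and use_after:
            ref.add(x)
    return value, out, ref
```

Run on a CFG for the method of Figure 1.1, this classifies b as a value parameter and leaves i unclassified, as argued above.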
6.3.5 Placing declarations
Having decided on the parameters of the extracted method, we now turn to placing declara-
tions for its local variables. In doing so, we consider three cases: declarations that must be
moved out of the selection, declarations that must be moved into the selection, and finally
those that need to be duplicated. We discuss each of these in turn.
A declaration needs to be moved out of the selected block if it is declared there, and if it
is used or defined outside the selection:
let needDecMoveOut =
    { ?x |
      decInSelection(?x) &
      mayUseOrDefOutOfSelection(?x)
    }
Conversely, if a declaration does not occur in the selected block, it is defined or used in that
block, and it is not a parameter, then the declaration should be moved into the extracted
method’s body:
let needDecMoveIn =
    { ?x |
      isIn(?x, variables) &
      !decInSelection(?x) &
      !isIn(?x, valueParams) &
      !isIn(?x, outParams) &
      !isIn(?x, refParams)
    }
Finally, there are the declarations that must be duplicated. This can happen because the
use of a variable in the selection is in fact independent of the use of the variable outside the
selection: effectively, we can split the variable into two independent ones. The declarations
in question are defined by:
let needDecDuplication =
    { ?x |
      isIn(?x, needDecMoveIn) &
      mayUseOrDefOutOfSelection(?x)
    | isIn(?x, needDecMoveOut) &
      !isIn(?x, valueParams) &
      !isIn(?x, outParams) &
      !isIn(?x, refParams)
    }
To wit, either the declaration of x needs to be moved into the extracted method's body (as we
have just defined it), but there are also uses and/or definitions outside the selection; or the
declaration of x needs to be moved out, but x is not passed as a parameter to the new method.
Again, let us return to Figure 1.1 and see what happens to the variable i . Because it
is not a parameter of any kind, but it occurs in the selection and it is not declared in the
selection, i will be a member of needDecMoveIn. However, note that because it also occurs
after the selection, it will in fact be classified as a declaration that needs duplication: the
two uses of i , inside and outside the selection, have been correctly separated.
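The three placement sets are just set algebra over the earlier predicates; a Python sketch, with all names illustrative:

```python
def place_declarations(variables, decl_in_selection, used_or_def_outside,
                       value_params, out_params, ref_params):
    """Compute the move-out, move-in and duplication sets of declarations."""
    params = value_params | out_params | ref_params
    # declared in the selection, but also touched outside it
    move_out = {x for x in decl_in_selection if x in used_or_def_outside}
    # touched in the selection, declared elsewhere, and not a parameter
    move_in = {x for x in variables
               if x not in decl_in_selection and x not in params}
    # needs both a declaration inside and one outside the new method
    duplicate = ({x for x in move_in if x in used_or_def_outside} |
                 {x for x in move_out if x not in params})
    return move_out, move_in, duplicate
```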
6.3.6 Transforming
Armed with all the necessary information, we can now actually perform the required trans-
formation of creating a new method. This is, in fact, the least interesting part of the code:
all that needs to be done is to reconstruct the relevant portions of the graph.
As a small example fragment, consider the operation of inserting a new statement before
an existing one:
let insertStatementBefore n s =
    if not Utils.isEmpty { ?b | [s] parent [?b : BlockStmt] } then
        insertBefore n s
    else let block = new BlockStmt in
        replaceWith s block;
        block.statements ← [n; s]
First we check whether s is itself in fact part of a sequence in the AST. If so, we simply add
n as the left-hand sibling of s . If not, however, we first need to create a new block statement,
which replaces s in the AST; both n and s become descendants of this new block statement.
The functions insertBefore and replaceWith are built-in functions of JunGL to manipulate
the syntax tree of the program. There are also insertAfter, detach for detaching a node from
its parent and clone for cloning a subtree. Note that it is not necessary to define control-flow
edges (cfsucc) on the new block statement, because we defined these to be lazy, so they will
CHAPTER 6. SCRIPTING REFACTORINGS 144
public void F(bool b){
int i ;// fromi f (b){
i = 0 ;Console . WriteLine ( i ) ;
}// toi = 1 ;Console . WriteLine ( i ) ;
}
public void F(bool b){
int i ;NewMethod(b ) ;i = 1 ;Console . WriteLine ( i ) ;
}
private void NewMethod(bool b){
int i ;i f (b){
i = 0 ;Console . WriteLine ( i ) ;
}}
Figure 6.7: Correct refactoring of Figure 1.1.
be automatically constructed when necessary. We need, however, to recompute the edges
that may have been invalidated by the transformation. Currently, we do not provide any
support for incremental evaluation, and therefore, we flush all edges after each refactoring.
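The insert-before operation can be sketched in Python over a toy AST; the classes and helpers here are hypothetical stand-ins for JunGL's built-in insertBefore and replaceWith:

```python
class Stmt:
    def __init__(self, label="stmt"):
        self.label, self.parent = label, None

class BlockStmt(Stmt):
    def __init__(self, statements=()):
        super().__init__("block")
        self.statements = []
        for s in statements:
            self.append(s)
    def append(self, s):
        s.parent = self
        self.statements.append(s)

class IfStmt(Stmt):
    def __init__(self, then):
        super().__init__("if")
        self.then = then
        then.parent = self

def replace_with(old, new):
    """Replace old by new in whichever child slot of old's parent holds it."""
    p = old.parent
    for name, val in list(vars(p).items()):
        if val is old:
            setattr(p, name, new)
    new.parent = p

def insert_statement_before(n, s):
    """Insert n before s, wrapping both in a fresh block when s's
    parent is not already a statement sequence."""
    p = s.parent
    if isinstance(p, BlockStmt):
        n.parent = p
        p.statements.insert(p.statements.index(s), n)
    else:
        block = BlockStmt()
        replace_with(s, block)
        block.append(n)
        block.append(s)
```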
We return once again to the example of Figure 1.1. Figure 6.7 shows the result of applying
this refactoring in our own tool. Note that at present, we do not detect whether the selected
block contains any instance references, so as yet we only make the extracted method static if
the original method was itself static; it would, however, be very easy to add that improvement.
In the exposition above we have assumed that the original program compiles without errors.
Of course in practice it is very common to apply refactorings to programs that cannot be
compiled for subtle reasons such as the definite assignment rule of C# (which states that every
local must be initialised before it is used). In such cases, the refactoring should, at best,
preserve the compilation errors in the result of the transformation. By refining our predicates,
it would be fairly easy with JunGL to conservatively transform such slightly faulty input
programs.
6.4 Summary and references
We have discussed JunGL scripts for specifying three refactorings very different in nature:
Rename Variable, Extract Interface and Extract Method.
Automatically renaming a variable requires variable binding information and the ability to
detect potential conflicting declarations of variables with a similar name. We have modelled
name lookup with streams of visible declarations. This allows us to check, via a simple
traversal of a stream, whether a variable reference is in the scope of a variable definition with
a similar name. Whenever variable capture occurs, we try not to reject the transformation
and instead add a more explicit qualifier to variable references so as to avoid their capture.
Properly handling name bindings is crucial in many program transformations. Strangely,
there is little work in formalising name visibility and binding semantics for mainstream
languages in which binding rules are numerous and complex. Perhaps the closest work is
that of Vorthmann on modelling and specifying name binding rules for Ada via visibility
networks [Vor93]. Our streams of potentially visible declarations are akin to such networks.
Of course, compilers for mainstream languages have to implement those complex binding
rules. JastAddJ is a full compiler for Java 5 using the attribute grammar system JastAdd
[EH07]. There, binding rules are expressed as a set of attributes. The general mechanism was
illustrated in [EH06] on a non-trivial subset of Java [jls05]. It is actually that subset plus the
additional support for this , super and casts that we have considered for our implementation
of Rename Variable.
The idea of introducing an access of the complex form ((〈Y 〉)〈X 〉.this).f is due to Schäfer
et al. [SEdM08]. In their work, Y is called the source and X the bend. Their approach for
resolving X and Y is, however, very different. Their framework for renaming is based on
JastAddJ and they express the computation of accesses by inverting lookup attributes in a
systematic way. Consequently, their framework is easily extensible to new constructs with
new lookup rules: one simply needs to define, for each new lookup rule, a corresponding rule
for the access computation.
Extract Interface is a type-based refactoring which alters the type structure of a program.
It consists of two phases: collecting type constraints over the original program and solving
them to find out whether some variables can be given the type of the newly introduced inter-
face. Payement has followed the approach first introduced in [TKB03] for Java and adapted
the type constraints to a large subset of C# 2.0 [ecm06]. A report of the implementation of
Extract Interface with JunGL is available in [Pay06].
Currently, the constraint solver is external to JunGL and it would be interesting to
express constraints as predicates, as has been discussed in [SAK07]. However, the fact that
many other type-based refactorings have been proposed [DKTE04, vDD04, BTF05, KETF07]
suggests that such a constraint solver could also be a built-in functionality.
Finally, Extract Method is a low-level refactoring requiring control and data flow informa-
tion about the program. There are four phases: checking validity of the selection, determining
what parameters must be passed, where declarations should be moved, and finally doing the
transformation itself. Control and data flow information is used in the first two steps. We
gave a first account of the automation of Extract Method using JunGL in [VEdM06]. At that
time, the language was quite different, as we were computing static-semantic information
via lazy functions rather than edge predicates. The preconditions of the refactoring and the
classification of parameters were, however, the same.
Of course we are not the first to attempt a precise description of Extract Method. Griswold
and Notkin in [GN93], and Fowler in his book [Fow99] gave quite detailed recipes, but
unfortunately no precise hint for mechanising the transformation. One noteworthy work is
that of Ralf Lämmel in [Lam02] towards language-parametric refactoring based on the
Strafunski style of functional strategic programming in Haskell [LV03]. There, the refactoring for
extracting an abstraction, such as a method, is phrased in a generic manner and instantiated
for different languages, notably Haskell and Java (or rather JOOS, a subset of Java). The
approach is very appealing for its genericity, but the instantiated version of Extract Method
for JOOS is not precise enough as there is no account for dataflow. It is only checked that
the block to extract does not contain a return statement (since a return will lead to a dif-
ferent control flow once placed in another method), and that there are no assignments to
non-instance variables declared outside the block to extract (since it would be difficult to
propagate these side effects). On the other hand, Juillerat et al. have described how to
better track dataflow dependencies [JH07]. They have implemented in Eclipse, in about 1000
lines of code, an improved version of Extract Method for a large subset of Java. They do not
explain, however, how to place declarations correctly. To our knowledge, we are the first to
give a complete, concise and executable specification of the core part of Extract Method.
Chapter 7
Discussion and future work
We conclude this thesis with a summary and an overview of related work. In particular, we
compare JunGL to existing tools and languages that are most closely related to it. We also
give hints on interesting future work. Some falls into integrating well-understood ideas from
other tools to make JunGL an end-to-end solution beyond a prototype. Other future work
is more challenging, such as the automatic verification of some correctness properties of our
scripts, or the incremental evaluation of edges and predicates.
7.1 Summary
We summarise here the contributions and results of this thesis, from the design of JunGL
and Ordered Datalog to the specification of complex refactoring transformations.
Design of the language We identified the need for a language to script refactoring trans-
formations. New refactorings are proposed all the time, and yet even common examples like
Rename or Extract Method are incorrectly implemented in leading development environments.
We exposed the requirements for such a scripting language. It should provide functional fea-
tures to easily manipulate the AST of the object program and allow the computation of
static-semantic information that is crucial for expressing refactoring preconditions. To fa-
cilitate reasoning on the transformations, scripts should be very declarative. Therefore, we
ought to provide logical features to query the program tree and the static-semantic infor-
mation associated to it. We proposed a concrete, coherent design for such a language. Our
proposal, named JunGL, has three principal features: stream comprehensions, path queries
and lazy edges for seamlessly maintaining static-semantic relationships between program en-
tities. Stream comprehension is the glue between the logical and the functional parts of
JunGL scripts. Path queries are a special kind of predicates to concisely express complex
graph queries. Combined with user-defined lazy edges, they enable the elegant expression of
long-distance relationships in the program tree, such as a type reference to its declaration.
Furthermore, we briefly described our implementation of JunGL on top of the .NET platform
using both C# and F#, as well as the toolkit around the language for quickly prototyping
refactoring transformations and, more generally, semantic-aware editors.
Logical constructs Most parts of our scripts rely on logical constructs. Predicates, edges
and path queries enable the concise expression of
• static-semantic information (e.g. name lookup, type lookup, control flow) which is
computed in a demand-driven manner when a transformation requires it,
• code queries for finding program entities of interest during a refactoring, and
• program analyses as preconditions of a refactoring.
All logical constructs translate to a novel variant of Datalog, called Ordered Datalog, which
returns query results in a deterministic order. Ordered Datalog gives control over the order
of results and preserves the meaningful order of entities in a program. Furthermore, it
enables the expression of computations in an elegant compositional way. We showed, for
instance, how to model name lookup in a Java-like language as a stream of potentially visible
declarations. By taking the first declaration that matches the name of a reference, we get
the declaration for that reference. The approach is elegant, and quite generic. We can model
name lookup for radically different languages in the exact same manner, and hence propose a
generic script for correctly detecting variable capture while renaming a variable. Our Rename
Variable script for the toy language While introduced in Chapter 2 is indeed similar to that
of the more complex language NameJava of Chapter 6 which supports nested classes and
inheritance.
Ordered Datalog We explained in Chapter 3 the least fixpoint semantics of Datalog and
showed that it coincides with a simple operational semantics based on relational algebra,
where each predicate is interpreted as a set relation. Nonmonotonic constructs need to be
handled carefully. The class of safe Datalog programs is defined with the static restriction
that no predicate depends negatively on itself. Such programs can hence be arranged as a
collection of strata that must be evaluated in topological order, each stratum being itself a set
of mutually recursive predicates. In contrast to the classical set-based semantics of Datalog,
Ordered Datalog manipulates sequences, thus encoding a precise order at each intermediate
step of the query. We redefined relational operators to operate on duplicate-free sequences,
and studied the consequences on monotonicity and program stratification. Next, we proved
an important property of Ordered Datalog, namely that it is a refinement of normal Datalog.
Yet, we saw that neither stratified Datalog nor stratified Ordered Datalog is sufficiently
expressive for our needs. In particular, there is a common pattern in the computation of
certain static-semantic information that requires negating a recursive predicate call. To
overcome this issue, we introduced the new class of partially stratified programs. This class of
programs is a subset of the well-known class of modularly stratified programs, but it highlights
an interesting evaluation mechanism inspired from the top-down set-based Query-Subquery
approach. When a call to a non-stratified rule is reached, the context is split to generate
several partial reductions of the called predicate. Those partial reductions being stratified,
they can be evaluated further with a set-based evaluation. Not so incidentally, our top-down
evaluation mechanism enables the computation of edges in a demand-driven manner, i.e.
only when their value is needed for the evaluation of a query. That lazy mechanism is further
enhanced by the fact that duplicate-free sequences are encoded as streams.
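As a rough illustration of the sequence-based relational operators, here is a Python sketch of union and difference on duplicate-free sequences, keeping the first occurrence of each element in order. This is a simplification of the actual Ordered Datalog operators, which work on lazily evaluated streams:

```python
def ordered_union(xs, ys):
    """Union of two duplicate-free sequences; elements keep the
    position of their first occurrence, so xs's order dominates."""
    seen, out = set(), []
    for v in list(xs) + list(ys):
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

def ordered_difference(xs, ys):
    """Remove the elements of ys from xs, preserving xs's order."""
    drop = set(ys)
    return [v for v in xs if v not in drop]
```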
Evaluation We validated the design of JunGL through a number of non-trivial refactoring
scripts on substantial subsets of languages like Java and C#. In particular, we described the
important steps of Rename Variable and Extract Method and demonstrated how some bugs
in mainstream development environments are easily discussed and avoided by expressing
the refactorings in JunGL. The scripts are attached in Appendices B and C. In view of the
complexity of the refactorings they address, we find their small size very encouraging. In fact,
the only verbosity lies in the construction of code fragments and in the destructive updates
of the program tree. Those transformation parts of our scripts are indeed less declarative
than their equivalent in term rewrite systems. Besides rewrite rules, perhaps another missing
feature of our design is quotation for object programs. We discuss these two features in
future work. JunGL has proved very successful, however, for expressing all other parts of
the scripts. In particular, we were able to express concisely and elegantly the computation of
static-semantic information, such as name binding and control flow, which are usually hard
to accommodate within existing transformation systems.
7.2 Related work
Rigorous refactoring We are by no means the first to realise the need for a formal,
precise approach to refactoring. In their PhD theses, both Opdyke and Roberts insisted
on the importance of preconditions and postconditions for refactoring transformations to
ensure that a transformed program is always semantically equivalent to the original [Opd92,
Rob99]. Naturally, one cannot guarantee full behaviour preservation while refactoring real-
world programs, as there are always features that are not tractable (e.g., concurrency or
dynamic class loading).
Therefore, most rigorous specifications of refactorings rely on reasonable assumptions
and focus on certain properties to preserve during the refactoring. For instance, in our
specification of Rename Variable, we assume our transformation to be correct if it preserves
name bindings, i.e. if each variable reference points to the same declaration before and after
the transformation.
Similarly, specifications of type-based refactorings, which alter the type structure of a pro-
gram, mostly focus on maintaining type-correctness and on preserving bindings. For instance,
changes to the declared types of method parameters should account for the static nature of
overloading resolution to ensure that the program behaviour is not affected. Based on such
reasonable assumptions, type-based refactorings have been precisely defined for mainstream
languages like Java [TKB03, BTF05]. Some of them even deal with introducing generic types
[DKTE04, vDD04, FTK+05, KETF07].
Provably-correct refactorings Other works, however, try to formally prove the complete
correctness of refactorings on simpler languages using program refinement calculi. In his
PhD thesis, Cornelio formalises a large collection of refactorings as algebraic refinement rules
[Cor04] for ROOL, a Refinement Object-Oriented Language. In the tradition of refinement
calculi, the formal semantics of ROOL are based on weakest preconditions, from which can
be derived a set of programming laws. These programming laws are then used to prove that
a refactoring transformation is indeed behaviour preserving.
Ettinger takes a similar approach in his PhD thesis, in which he develops a theoretical
framework for slicing-based behaviour-preserving transformations and derives refactorings
that have never been mechanised before [Ett06]. His language also has a formal semantics
based on weakest preconditions. It is, however, restricted to imperative constructs, as the
focus of his work is exclusively on statement-level refactorings that deal with control and
data flow.
Both theses address only simple languages, because any formal development would hardly
be manageable otherwise. However, they are clearly inspirational in specifying refactorings
for mainstream languages. In particular, they give invaluable insight into the correct
preconditions of a refactoring.
An attempt to cope with more complex languages is to mechanise the verification of
refactorings. Sultana and Thompson have shown how to perform the verification of different
refactorings for untyped and typed lambda-calculi in the proof assistant Isabelle/HOL [ST08].
Using an interactive theorem prover has several benefits. First, it keeps track of all the
details to be proved. Second, the formal development can be used to automatically extract
the implementation of the refactoring. As with any fully-formal work, however, verifying
non-trivial refactorings requires discharging a considerable amount of proof obligations. This
again restricts the scope of the work.
Garrido and Meseguer have followed yet another approach. They use Maude, an algebraic
specification language, to specify and verify refactorings for Java with no concurrent features
[GM06]. Their specification builds on previous work in which Maude is used to formalise the
semantics of Java. Their implementation appears to be very concise, but the refactorings
they currently verify are very local. Nonetheless, the approach is very encouraging and,
hopefully, should scale to less local and more complex refactorings. In the same vein, Junior
et al. have built on the work of Cornelio and used CafeOBJ, another algebraic specification
language, to encode the programming laws of ROOL and verify refactorings [JSC07].
Finally, Bannwart and Müller have addressed the problem of proving refactorings correct
by introducing an explicit I/O model to ensure that the original and refactored programs are
externally equivalent [Ban06, BM06]. They specify the preconditions of various refactorings
for a subset of Java and give a formal proof that any application of a refactoring preserves
the external behaviour of the program, provided that the program satisfies the correctness
conditions of that refactoring. A peculiarity of their approach is the way they add correctness
conditions as assertions onto the refactored program. These contracts can then be checked
at runtime or statically using a program verifier such as Boogie or ESC/Java. Interestingly,
such specifications, and more generally any contract specification, ought to be refactored too
when refactorings are applied. This issue is addressed in [GFT06] by Goldstein et al. who
explain how to account for contracts in refactoring. In particular, they show how contracts
should be modified when code changes and how contracts may prevent certain changes.
Composition of refactorings It is widely accepted that complex refactoring transforma-
tions can be built from low-level primitive transformations. Opdyke was the first to make this
observation in his thesis [Opd92] and gave a set of useful primitives for refactoring object-
oriented programs, such as create an empty class or change a member function name. Later,
Roberts introduced additional postconditions for the composition of primitive transforma-
tions into high-level refactorings, and showed how to derive the precondition of a composite
refactoring from the preconditions of its components [Rob99].
More recently, Kniesel and Koch have improved Roberts’ approach by providing a formal
model for the static composition of refactorings [KK04], i.e. in a program-independent way.
In their model, each basic transformation is accompanied by a forward description and its
dual backward description, that act as predicate transformers. A forward description takes a
condition that holds before the transformation and returns the condition that will hold after
it. Conversely, a backward description takes a condition that holds after the transformation
and yields a joint precondition of the transformation. These descriptions are then used to
automatically infer the joint precondition of a chain of refactorings, thus allowing users to
correctly compose arbitrarily many refactorings from basic transformations. To validate their
approach, the authors have implemented a prototype framework where “basic” operations like
RenameField(class, field, newName), AddInterface(name), or Extract Method(class, method,
parameterType, sel, newName) are assumed to be hard-coded. As we have seen in this thesis,
these primitive transformations are in fact very difficult to mechanise correctly.
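The predicate-transformer view of backward descriptions can be illustrated with a small sketch. The following Python toy models conditions as sets of facts and derives the joint precondition of a chain by folding backward descriptions; the encoding of RenameField and all fact names are our own illustrative assumptions, not Kniesel and Koch's actual rules.

```python
# Toy model of backward descriptions as predicate transformers.
# Conditions are sets of facts; the RenameField encoding below is
# an illustrative assumption, not Kniesel and Koch's framework.

def rename_field_backward(old, new):
    """Backward description of RenameField(old -> new): rewrite any
    post-condition fact about the new name into one about the old
    name, and add the transformation's own precondition."""
    def backward(post):
        rewritten = {("exists_field", old) if f == ("exists_field", new) else f
                     for f in post}
        return rewritten | {("exists_field", old), ("not_exists_field", new)}
    return backward

def joint_precondition(chain, post=frozenset()):
    """Derive the joint precondition of a composite refactoring by
    propagating the final condition backwards through the chain."""
    for backward in reversed(chain):
        post = backward(post)
    return post

chain = [rename_field_backward("size", "count"),    # step 1
         rename_field_backward("count", "length")]  # step 2
pre = joint_precondition(chain)
# The composite needs "size" to exist and both other names to be free.
```

Folding backwards makes the intermediate name "count" disappear from the precondition: it is required to be free before the chain runs, but its existence between the two steps is an internal detail.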
Kniesel and Koch’s work is hence complementary to ours. Their framework deals with
the composition of primitive transformations into much bigger refactorings, while JunGL ad-
dresses the complex mechanisation of those primitive transformations that require compiler-
like analyses at source level. Having implemented some of these transformations, we can
confidently say, however, that some of them could in fact be decomposed into yet smaller
primitives, thus promoting their reusability. This is notably the case of Extract Method, which
appears to be the composition of other atomic transformations for encapsulating the selec-
tion into a block, moving variable declarations in and out, and extracting this block to a
new method. By introducing new temporary abstractions into the language, one could even
break the last transformation into a first step that creates an inner method, and a second
step that lifts that inner method to the level of the class.
We have not addressed such decomposition carefully with JunGL, as we do not yet support
the backward propagation of our preconditions. One way to provide such support would
be to adopt Kniesel and Koch’s approach and require that each primitive transformation be
annotated with backward descriptions. A much more challenging route would be to infer
these descriptions from the scripts themselves. This directly relates to the verification of
some correctness properties of our scripts, which we discuss in future work.
Specifications of compiler optimisations We indicated that the design of JunGL heav-
ily borrows from the literature on declarative specifications of compiler optimisations. In
particular, our use of path queries can be traced back to the design of Gospel by Whitfield
and Soffa [WS97]. Gospel has a similar feature, but the dataflow facts are hard-coded in
the implementation, whereas in JunGL they are user-definable via lazy edges. The idea to
achieve that flexibility via a form of logic programming augmented with path expressions
originated in Lacey and De Moor’s work [LM01, DdMS02]. A separate branch of research,
instigated by Lacey, is the formal verification of compiler optimisations that are specified
in this style [LJVWF02]. Lerner et al. have demonstrated how to automate such proofs
[LMC03, LMRC05].
A completely different approach to scripting compiler optimisations was proposed by Ol-
mos and Visser in [OV02]. There, the optimisations are rewrites of the syntax tree expressed
in the term rewriting system Stratego [BKVV06]. Rewrite rules are usually context-free,
meaning that they only have access to the term to which they are applied. Stratego
extends the formalism of term rewriting both with programmable rewriting strategies and
scoped dynamic rewrite rules [BvDOV06]. Programmable rewriting strategies enable the
combination of simple rewrite rules into complex transformations and provide control over
the application of rules, by defining the order in which rewrite rules should be applied.
Such strategies can be used to carry a data structure with contextual information, but they
do not provide any direct answer to the issue of context-sensitivity for the computation of
static-semantic information, such as name binding. A better approach is the use of scoped
dynamic rewrite rules, which allows the definition of new rewrite rules at run-time. These
rewrite rules may indeed access information from the context in which they are defined and
propagate it to the location where they are applied. As a small example, the operation
?Let(x, e1, e2)
; rules( Substitute : Var(x) -> e1 )
matches a Let construct that binds x to e1, and defines a new rewrite rule that replaces any
variable reference to x with e1. Here, the substitution should be valid in the expression e2
only, and Stratego therefore provides constructs for controlling the lifetime of any dynamic
rule. The technique is effective as it has been used to develop a frontend for Java 1.5.
Nonetheless, we feel the specification of static-semantic information is less declarative than
in JunGL and sometimes difficult to express. Furthermore, as soon as a semantic analysis
involves a graph structure such as the control flow graph of a program, it is yet harder to
express. On the other hand, for the description of the transformation itself, i.e. the change to
the object program, the situation is reversed since rewriting strategies provide a much neater
formalism than our destructive updates.
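The behaviour of a scoped dynamic rule like the Substitute rule above can be sketched in a few lines of Python. The AST encoding, the `substitute` traversal, and the decision to inline the binding away are our own illustrative choices, not Stratego's implementation; the point is only that the rule defined at a Let node is valid solely inside its body and is shadowed by inner bindings.

```python
# Sketch of a Stratego-style scoped dynamic rule: visiting Let(x, e1, e2)
# defines a Substitute rule rewriting Var(x) to e1, in scope only while
# traversing e2. The tuple-based AST encoding is illustrative.

def substitute(term, env):
    """Apply the dynamic Substitute rules currently in scope."""
    kind = term[0]
    if kind == "Var":
        return env.get(term[1], term)       # rewrite if a rule is in scope
    if kind == "Let":                       # Let(x, e1, e2)
        _, x, e1, e2 = term
        inner = dict(env)
        inner[x] = substitute(e1, env)      # rule defined at the Let...
        return substitute(e2, inner)        # ...lives only inside e2
    # generic traversal for any other construct
    return (kind,) + tuple(substitute(t, env) if isinstance(t, tuple) else t
                           for t in term[1:])

prog = ("Let", "x", ("Num", 1),
        ("Add", ("Var", "x"), ("Let", "x", ("Num", 2), ("Var", "x"))))
print(substitute(prog, {}))   # the inner Let shadows the outer rule
```

Copying the environment on entry to a Let is what models the controlled lifetime of the dynamic rule: the outer rule is restored automatically once the inner scope is left.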
Graph rewriting for program transformations The idea of declarative specifications
of refactorings via graph transformations was first put forward by Tom Mens in [MDJ02].
The refactorings considered there are different variants of moving class members. Their
specification is purely declarative, as a graph rewrite system. A big advantage of using graph
rewrite systems is that it becomes possible, for example, to detect conflicting refactorings
[MTR05]. The main difference with our work is that none of the refactorings require dataflow
analysis. It would be interesting to see whether Mens’s techniques scale up to the full-blown
refactorings of [TKB03].
An earlier attempt to use graph rewrite systems for specifying program transformations
is the Optimix system by Aßmann [Aßm98]. Optimix can be used to generate program
analyses and transformations. Interestingly, its input language is based both on Datalog and
on two classes of graph rewrite systems: edge addition rewrite systems (EARS) and more
general graph rewrite systems (GRS). EARS rules are used to add new edges to the program
graph. They are therefore quite similar to edge definitions in JunGL, and besides both can
be translated to Datalog. The added edges of an EARS rule correspond to the head of a
Datalog rule, while the tested edges and nodes correspond to the rule body. Aßmann notes
that strong confluence of EARS and fixpoint semantics of Datalog are in fact related. On
the other hand, GRS rules are used for deleting and attaching subgraphs to the original
graph. Each GRS rule has a precondition, the left-hand side of the rule, which is a graph
pattern expressed in Datalog. In Optimix, the transformation of the program tree is hence
performed by repeatedly applying small rewrites to the graph, while in JunGL we have opted
for destructive updates à la ML. Again, our approach for manipulating the graph is hence
less declarative: one may want to include rewriting primitives to streamline that part of our
specifications.
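The correspondence between EARS rules and Datalog that Aßmann notes can be made concrete with a minimal sketch. In the Python toy below, the edge-addition rule "add reaches(x, z) where edge(x, y) and reaches(y, z)" is evaluated bottom-up to a fixpoint; the relation names and the naive evaluation strategy are illustrative assumptions, not Optimix's implementation.

```python
# Sketch of the EARS/Datalog correspondence: the added edge is the head
# of a Datalog rule, the tested edges are its body, and naive bottom-up
# evaluation runs the rule to a fixpoint (mirroring strong confluence).

def fixpoint(edges):
    reaches = set(edges)            # reaches(x, y) :- edge(x, y).
    while True:
        # reaches(x, z) :- edge(x, y), reaches(y, z).
        new = {(x, z) for (x, y) in edges for (y2, z) in reaches if y == y2}
        if new <= reaches:
            return reaches          # no rule can add an edge: fixpoint
        reaches |= new

g = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(fixpoint(g)))          # transitive closure of g
```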
Optimix was used to express various compiler optimisations, along with the information
necessary to perform those optimisations. Its rules are indeed expressive enough to compute
dataflow facts, such as use-def chains, which are crucial for implementing compiler optimisa-
tions. JunGL is therefore quite similar to Optimix, in the sense that it allows the computation
of auxiliary information required for a particular transformation. In Optimix, however, rules
relate to stratified Datalog. As we explained in Chapter 5, stratified Datalog is not expressive
enough for specifying the computation of static-semantic information such as name binding.
Compiler optimisations are typically performed on an intermediate representation output by
the compiler frontend. Therefore it is reasonable to assume symbolic names at that stage.
On the other hand, refactoring transformations are performed at the level of source code,
thus making name lookup a crucial component in their automation.
GraphStratego is another example of graph rewriting for program transformations [KV06].
Kalleberg and Visser observed that some program analyses may be either difficult or unnat-
ural to express in term rewriting systems. Therefore, they extended Stratego with references
to represent structures that are inherently graph-like (typically the control flow of a program)
in a more natural way. One challenge of such an extension is to handle the termination of
graph traversals. In GraphStratego, the answer is the concept of phased traversal to guaran-
tee that each reference is only visited once. Phases are supported through the introduction
of new primitive strategies into the language. It is the programmer’s responsibility to em-
ploy the right strategy to ensure termination. For collecting information on a graph, the
approach is hence less declarative than in JunGL, where the evaluation of Datalog predicates
is guaranteed to terminate in any event.
Logic meta-programming As we said in the introduction, logic programming has been
proposed on many occasions for program analysis and code queries. Many of these proposals
are based on Prolog, and a few among them also address the issue of program transformations.
Perhaps the systems that resemble JunGL most are JTransformer [KHR07] and more
recently GenTL [AK07]. JTransformer, available as a plug-in for Eclipse, combines a query
and a transformation engine for Java. It represents the AST of a program as a Prolog
database, which can then be queried with Prolog queries. The other system, GenTL, extends
JTransformer to allow concrete syntax patterns containing meta-variables. In contrast with
other code querying tools of that sort, e.g. JQuery [JV03], the underlying source code can be
transformed via Conditional Transformations (CTs). By first specifying the pattern to match
and then the transformation, CTs allow a clear separation between the use of pure Prolog
for the querying part, and impure functions for the transformation. This is, in a way, similar
to JunGL where we forbid updates and creation of new values in the querying parts. CTs
are more organised, however, and may be seen as rewrite rules. JunGL differs from GenTL
in being based on Datalog rather than Prolog. Yet, this is just a subtle implementation
detail if GenTL users restrict themselves to pure Prolog in the matching part of CTs and
if tabled resolution is used [War92]. A bigger difference between the two systems is the
support by GenTL of concrete syntax which enhances the readability of code queries. On
the other hand, although GenTL has been used for specifying refactorings, the computation
of static-semantic information does not seem to be part of the scripts. It is certainly possible
to express name bindings in Prolog, but we believe that edges and path queries as sugar for
predicates make the expression of this kind of information much more elegant.
Another example of logic meta-programming for program transformations is DeepWeaver,
a tool supporting cross-cutting program analysis and transformation components [FKI+07].
DeepWeaver operates at the bytecode level and provides a declarative way to access the
internal structure of methods, as well as control flow information. The design of DeepWeaver
is motivated by domain-specific optimisations. One example is the optimisation of database
calls by replacing a query of the form “select * from ...” by a more precise select statement
that retrieves only the columns that are actually accessed later in the execution. Like in
Optimix or GenTL, however, DeepWeaver assumes that some static-semantic information
about the object program is already available. This is, in the end, probably the biggest
difference between those systems and JunGL.
Attribute grammar systems Systems based on attribute grammars have proved very
successful in expressing static-semantic information. We have already mentioned in the intro-
duction the examples of the Synthesizer Generator [RT84] and of JastAdd [EH04]. Another
example is the Eli system for the flexible construction of compilers [GLH+92]. A particu-
larity of Eli is to allow the definition of attribution modules that can easily be reused. As
noted by Kastens and Waite in [KW94], attribute grammars can only be widely accepted
as a viable specification formalism if they can be decomposed into logical modules that can
be treated in isolation. JastAdd builds on the same observation, but one of its main addi-
tional strengths is its integration with a mainstream language. Attribute bodies are indeed
expressed in plain Java, which makes the system more widely applicable. JastAdd also builds
on reference attribute grammars to relate nodes in the program tree, for instance a variable
reference to its declaration. This makes it possible to encode long-distance dependencies in the AST,
pretty much like JunGL does with lazy edges. Finally, JastAdd supports circular attributes,
which comes in very handy to express control and dataflow properties. Ekman, the main
designer of JastAdd, used all these features to implement in a modular and clean formalism
JastAddJ, a full compiler for Java 1.4 and its extension for Java 5 [EH07].
Recently, Schäfer and Ekman have started to express refactorings on top of JastAddJ. In
particular, as we said in Chapter 6, they have designed a framework for sound and extensible
renaming for Java, where they re-qualify ambiguous accesses by inverting lookup attributes
in a systematic way [SEdM08]. Schäfer has also implemented our specifications of Extract
Method, using circular attributes to express dataflow properties. The result is less declarative
than the conditions in JunGL, as attributes in JastAdd are written in plain Java, but the
specification is still concise. In general, we believe the expression of attribute bodies in
Java makes the code less tractable compared to JunGL edges, but it has the advantage of
bringing in more flexibility. For instance, greatest fixpoints can be computed as well, whereas
in JunGL we are limited to least fixpoints. Furthermore, while JunGL is still pretty much
a research prototype, JastAdd is a mature tool that can be used in an industrial setting.
Interesting future work would be to extend JastAdd with logic features.
Perhaps the main advantage of logic programming over attribute grammars is that logic
programs can be run backwards. To illustrate, consider again the part of Rename Variable
where we look for all references to the declaration we wish to rename. JastAdd supports
collection attributes for gathering such uses when they are first encountered, thus minimising
the number of tree traversals. When the lookup edge is expressed as a logic program, however,
the computation can be reversed from the declaration to all uses. We are of course not the
first to express the computation of name bindings and other contextual information as logic
programs. This was indeed the idea behind Pan, an environment generator in the spirit of
the Synthesizer Generator [BGV92]. The formalism of Pan’s semantic descriptions is that of
logic constraint grammars, which combine logic programming and consistency maintenance.
A logic constraint grammar is a context-free grammar with Prolog-based goals attached
to the productions in the grammar. As with normal attribute grammars, however, the
evaluation does not terminate if it encounters circular dependencies. On the other hand,
JunGL guarantees termination by using Datalog, rather than Prolog. Furthermore, JunGL
provides much syntactic sugar to facilitate the expression of predicates, in particular via path
queries.
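The sense in which a logic program "runs backwards" is easy to sketch: once a binding relation is computed, it can be queried by either column. In the Python toy below, the `lookup` facts and all node names are made up for illustration; the point is that the same relation answers both the forward question (which declaration does this use bind to?) and the backward one that Rename Variable needs (which uses bind to this declaration?).

```python
# Sketch of running a logic program backwards: lookup(use, decl),
# computed once, answers queries in both directions. Facts are
# illustrative, not an actual JunGL program graph.

lookup = {("use1", "decl_x"), ("use2", "decl_x"), ("use3", "decl_y")}

def decl_of(use):
    """Forward direction: attribute-grammar style name resolution."""
    return next(d for (u, d) in lookup if u == use)

def uses_of(decl):
    """Backward direction: all references to a declaration."""
    return sorted(u for (u, d) in lookup if d == decl)

print(decl_of("use2"))
print(uses_of("decl_x"))
```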
Path queries The idea of path queries in the context of program transformations is due
to De Moor et al. [DdMS02, dMLVW03, SdML04]. For the version used in JunGL, we drew
inspiration from the syntax in [LS06], which followed on from the design in the work cited
above. A similar style of queries is of course very common in the literature on semi-structured
databases, e.g. [BFS00]. In [LRY+04], Liu et al. proposed parametric regular path queries,
which we have not yet introduced in the design of JunGL. If we were to support such path
queries, we would be able to define a new control-flow successor edge from a statement x and
parameterise it with the variable that is written when executing x:

let edge write x : Statement -(?v)-> ?y =
    [x] cfsucc [?y] & [x] def [?v]
We could then use write to collect, for instance, all variables that are written before exiting
a method:
{ ?v | [] write(?v) [:Exit] }
At first sight, it seems this new feature would have quite an impact on the underlying graph
representation of the program, since it allows for directed hyperedges, i.e. edges that have
more than one source and one target. In fact, in our framework, hyperedges would simply
be translated to relations of greater arity, and parameterised edge labels in path queries
converted to calls to these relations. This last comment naturally leads us to explore future
work.
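The translation of such a hyperedge to a relation of greater arity can be sketched as follows. In this Python toy, write(x, v, y) is derived as a ternary relation from binary control-flow and definition facts, and the query above becomes an ordinary conjunctive query over it; all node and variable names are invented for illustration.

```python
# Sketch: a parameterised edge write -(?v)-> becomes the ternary
# relation write(x, v, y), and the path query { ?v | [] write(?v) [:Exit] }
# becomes a conjunctive query over it. All facts below are made up.

cfsucc = {("s1", "s2"), ("s2", "exit"), ("s3", "exit")}   # control flow
defs = {("s1", "i"), ("s2", "n"), ("s3", "tmp")}          # def(x, v)
exits = {"exit"}

# write(x, v, y) :- cfsucc(x, y), def(x, v).   (hyperedge as 3-ary relation)
write = {(x, v, y) for (x, y) in cfsucc for (x2, v) in defs if x == x2}

# variables written on an edge into an Exit node
written_before_exit = {v for (x, v, y) in write if y in exits}
print(sorted(written_before_exit))
```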
7.3 Future work
Concrete syntax and rewriting We mentioned on several occasions that JunGL would
benefit vastly from concrete syntax. Some parts of our scripts indeed remain quite verbose,
in particular the one for creating new fragments of code. Queries on the structure of the
object program would also be much more readable and easier to write in concrete syntax.
Visser explained in [Vis02] all the benefits of concrete syntax over abstract syntax for meta-
programming. He showed with Stratego [BKVV06] how the syntax definition formalism SDF
[vdBHdJ+01] can be used to extend a language with elements of concrete syntax notation.
Rewrite rules in Stratego may accept concrete syntax patterns, enclosed in semantic brackets
to distinguish them from normal term patterns. Those syntax terms are then expanded
in-place by the Stratego compiler to their equivalent AST term patterns. Employing such
syntax terms results in more concise and much more readable rewrite rules.
The GenTL transformation language by Appeltauer and Kniesel also supports concrete
syntax [AK07]. The foundations of GenTL are not in term rewriting like Stratego, but in
logic programming like JunGL. Concrete syntax may be embedded in any predicate and
employed in any precondition of a Conditional Transformation. As said before, CTs are
quite similar to rewrite rules in the way they are applied. Both can match precise nodes
in the program tree and replace them with new code fragments, possibly embedding some of
the matched nodes. Clearly, such a mechanism would streamline the transformation parts of
our specifications, making them more declarative.
Stratego provides strategies to control the order of rule application and the traversal over
term structures. In our experiments, Ordered Datalog has always provided enough control
for transformations. Indeed, we have always been able to query nodes in an order that
was adequate to safely perform the destructive updates of the underlying tree. We believe,
however, that if JunGL were to support rewriting, strategies would become important, notably
to handle rule application failures.
Proving some correctness properties A more challenging avenue of future work is on
proving some correctness properties of our scripts. The declarative approach we adopt to
express static-semantic information and refactoring preconditions already provides a sound
basis for rigorous reasoning on the transformations. For instance, an important aspect in the
specification of Extract Method is the classification of local variables into different kinds of
parameters, namely those passed by value, those passed by reference and output parameters.
This classification is complex and yet crucial. An important property of the classification is
that no variable will be classified as two different kinds of parameter. It is easy to check, from
the definitions of valueParams, outParams and refParams, that this requirement is indeed
satisfied. Another desirable property, which is again quite easy to check, is that no variable
use will become orphaned, with no declaration to match it.
Of course, one may want to check more complex aspects of the transformations and
mechanise such checks. For instance, as we have already said, one may want to verify
statically that two transformations can be safely composed. We believe our formalism for
defining auxiliary information and expressing preconditions makes our scripts more tractable
than in other solutions, such as JastAdd where attributes are expressed in plain Java. The
path we have started to explore is the application of lightweight verification techniques using
Why, a verification condition generator back-end [Fil03, why07]. Why takes as input an
annotated program in HL, its internal language, and outputs proof obligations to be further
discharged by a proof assistant or an automatic decision procedure. HL is a small ML-like
language with imperative features, such as references, and annotations written in first-order
logic. With an appropriate memory model, we can express in HL the transformation parts of
our scripts. The translation does not require much annotation, except for the loop invariants,
preconditions and postconditions of our very few functions. Indeed, not all program points
need to be annotated, as Why uses a calculus of weakest preconditions to infer annotations
at most intermediate points. As for the logical parts, we have experimented with several
first-order axiomatisations of edge definitions, predicates and path queries.
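The weakest-precondition propagation that Why performs over straight-line code can be sketched very simply. The Python toy below treats conditions as strings and assignments as textual substitution into the postcondition; this is a minimal illustration of the calculus, not Why's actual machinery.

```python
# Minimal sketch of a weakest-precondition calculus: annotations at
# intermediate points are inferred by substituting backwards through
# assignments, so only the endpoints need to be annotated by hand.
import re

def wp_assign(var, expr, post):
    """wp(var := expr, post) = post[expr / var]  (textual substitution)."""
    return re.sub(rf"\b{var}\b", f"({expr})", post)

def wp_seq(stmts, post):
    """Propagate a postcondition backwards through a list of assignments."""
    for var, expr in reversed(stmts):
        post = wp_assign(var, expr, post)
    return post

# wp(x := x + 1; y := x * 2, "y > 4")
print(wp_seq([("x", "x + 1"), ("y", "x * 2")], "y > 4"))
```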
The big advantage of Why is to offer the use of multiple backend provers. Our first
attempt has been with the automatic decision procedure Simplify [Sim07]. Simple proof
obligations are easily discharged with Simplify but we found more complicated obligations,
notably involving transitive closure, much harder to discharge. Simplify may fail to prove an
obligation either because it is not true or because it is too difficult to prove. In both cases, we
have found it difficult to track down the reasons for a failed proof. One may wish to explore
this line of research further and take advantage of Why to generate verification conditions
for a proof assistant such as Coq [coq07].
Incremental evaluation Another challenging area of research is the incremental evalua-
tion of edges and predicates in JunGL. Usual refactoring scenarios occur in an interactive
development environment where the object program changes frequently. In addition to the
user edits, the refactoring transformations themselves may invalidate some of the semantic
information attached to the JunGL graph. As we explained in Chapter 2, lazy edges relieve
us of maintaining semantic information explicitly at every tree node. The information
is computed on-demand when it is required. Currently, however, we do not maintain that
information incrementally on every change. Instead, we flush the whole cache of lazy edges
at the end of each transformation, to ensure that no edge that ought to be invalidated and
recomputed is incorrectly reused in further refactorings.
Lazy edges translate to Ordered Datalog, but they share similarities with reference at-
tribute grammars. This is not too surprising as attribute grammars can be implemented
as logic programs [DM85]. Thus, to address the problem of incrementally maintaining lazy
edges in our program graph, one can build on the work of two research areas: the work done
on incremental evaluation of reference attribute grammars (e.g. [Hed91, Mad98, Boy02]),
and the work done on incremental evaluation of logic programs and maintenance of database
views (e.g. [DT92, SR05]).
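The current flushing strategy can be sketched as follows; this is a minimal illustration of on-demand caching with whole-cache invalidation, with made-up names, not JunGL's actual implementation.

```python
# Sketch of lazy-edge caching with whole-cache flushing: results are
# memoised on demand, and the entire cache is discarded after each
# transformation, the conservative alternative to incremental
# maintenance. Names are illustrative.

class LazyEdgeCache:
    def __init__(self, compute):
        self._compute = compute        # e.g. a Datalog-backed lookup
        self._cache = {}
        self.computations = 0

    def get(self, node):
        if node not in self._cache:    # computed on demand, then reused
            self._cache[node] = self._compute(node)
            self.computations += 1
        return self._cache[node]

    def flush(self):
        """Called at the end of every transformation: sound, but
        recomputes everything, even edges the change did not affect."""
        self._cache.clear()

cache = LazyEdgeCache(lambda n: n.upper())
cache.get("a"); cache.get("a")         # second call hits the cache
cache.flush()                          # end of a transformation
cache.get("a")                         # recomputed after the flush
print(cache.computations)
```

Incremental maintenance would replace `flush` with a targeted invalidation of only the cache entries whose derivations depend on the changed part of the graph.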
More program analyses Finally, one may want to specify more program analyses in
JunGL, in order to implement more complex refactorings that have not been fully automated
so far for mainstream languages. In [Ett06], Ettinger describes how to correctly automate
complex statement-level refactoring based on slicing, such as the Untangling refactoring we
proposed in [EV04]. To implement that kind of refactoring, one would first need to build
a slicing tool for a mainstream language. We have shown in Chapter 2 how to implement
a naive slicer using JunGL, for a toy imperative language with no pointers. One may now
want to encode a points-to analysis in JunGL before implementing a slicer for a Java-like
language.
Such program analyses can be expressed elegantly in JunGL. Logic programming lan-
guages, like Prolog and Datalog, have been proposed on several occasions to express static
program analyses in a natural and concise way [Rep93, DRW96]. Reps et al. showed in
[Rep93] how to use Datalog for on-demand interprocedural slicing. However, one issue in im-
plementing that kind of program analysis as logic programs has been performance. Indeed,
early logic-based implementations did not scale very well compared to traditional
implementations. More recent work on the use of Binary Decision Diagrams for program analysis
[LH04, BNL05] suggest that better performance can be achieved now. In particular, Whaley
et al. showed in [WACL05] how to encode a context-sensitive points-to analysis in Datalog
and evaluate it efficiently with BDDs. Without any change in the design of our language, we
believe such an alternative implementation would be very valuable to a wider applicability
of JunGL.
Appendix A
JunGL grammar
⟨letter⟩ ::= A..Z | a..z

⟨digit⟩ ::= 0..9

⟨Number⟩ ::= ⟨digit⟩+

⟨String⟩ ::= " ⟨any⟩? "

⟨Identifier⟩ ::= ⟨letter⟩ ( ⟨letter⟩ | ⟨digit⟩ | _ | ’ )?

⟨@Identifier⟩ ::= @ ⟨Identifier⟩

⟨?Identifier⟩ ::= ? ⟨Identifier⟩

⟨CompoundName⟩ ::= ⟨Identifier⟩?. ⟨Identifier⟩

Figure A.1: Lexemes, identifiers and compound names
⟨Program⟩ ::= ⟨TopLevelStatement⟩+

⟨TopLevelStatement⟩ ::= using ⟨CompoundName⟩+, { ⟨Statement⟩? }
    | ⟨Statement⟩

⟨Statement⟩ ::= ⟨Declaration⟩
    | do ⟨Block⟩

⟨Declaration⟩ ::= ⟨NamespaceDeclaration⟩
    | ⟨NodeTypeDeclaration⟩
    | ⟨LetDeclaration⟩

⟨NamespaceDeclaration⟩ ::= namespace ⟨CompoundName⟩ { ⟨Declaration⟩+ }

⟨NodeTypeDeclaration⟩ ::= type ⟨NodeTypeFragment⟩ ( and ⟨NodeTypeFragment⟩ )?

⟨NodeTypeFragment⟩ ::= ⟨Annotation⟩? ⟨Identifier⟩ [ = ⟨NodeTypeBody⟩ ]

⟨Annotation⟩ ::= ⟨@Identifier⟩ [ ( ⟨String⟩ ) ]

⟨NodeTypeBody⟩ ::= { ⟨FieldDeclaration⟩+; }
    | ( | ⟨NodeTypeFragment⟩ )+
    | ( ⟨NodeTypeBody⟩ )

⟨FieldDeclaration⟩ ::= ⟨Identifier⟩ : ⟨Type⟩

⟨LetDeclaration⟩ ::= let ⟨Pattern⟩ = ⟨Block⟩
    | let ⟨Identifier⟩ ⟨Pattern⟩+ = ⟨Block⟩
    | let rec ⟨Identifier⟩ ⟨Pattern⟩+ = ⟨Block⟩
    | let predicate ⟨Identifier⟩ ( ⟨?Identifier⟩?, ) = ⟨Predicate⟩
    | let rec predicate ⟨Identifier⟩ ( ⟨?Identifier⟩?, ) = ⟨Predicate⟩
      ( and ⟨Identifier⟩ ( ⟨?Identifier⟩?, ) = ⟨Predicate⟩ )?
    | let edge ⟨Identifier⟩ ⟨Identifier⟩ [ : ⟨CompoundName⟩ ] -> ⟨?Identifier⟩ = ⟨Predicate⟩
    | let attribute ⟨Identifier⟩ ⟨Identifier⟩ [ : ⟨CompoundName⟩ ] = ⟨Expression⟩

Figure A.2: Syntax of JunGL programs
⟨Block⟩ ::= ⟨Expression⟩+;

⟨Expression⟩ ::= ⟨SimpleExpression⟩
    | begin ⟨Block⟩ end
    | let ⟨Pattern⟩ = ⟨Expression⟩ in ⟨Block⟩
    | let ⟨Identifier⟩ ⟨Pattern⟩+ = ⟨Expression⟩ in ⟨Block⟩
    | let predicate ⟨Identifier⟩ ( ⟨?Identifier⟩?, ) = ⟨Predicate⟩ in ⟨Block⟩
    | if ⟨SimpleExpression⟩ then ⟨Expression⟩ [ else ⟨Expression⟩ ]
    | match ⟨SimpleExpression⟩ with ( | ⟨Pattern⟩ -> ⟨Expression⟩ )+
    | foreach ⟨Pattern⟩ in ⟨SimpleExpression⟩ do ⟨Expression⟩
    | ⟨SimpleExpression⟩ . ⟨Identifier⟩ <- ⟨Expression⟩

⟨SimpleExpression⟩ ::= true
    | false
    | null
    | ⟨Number⟩
    | ⟨String⟩
    | ⟨Identifier⟩
    | ⟨SimpleExpression⟩ . ⟨Identifier⟩
    | ⟨SimpleExpression⟩+
    | ⟨SimpleExpression⟩ ⟨InfixOperator⟩ ⟨SimpleExpression⟩
    | ⟨PrefixOperator⟩ ⟨SimpleExpression⟩
    | ⟨SimpleExpression⟩ is ⟨CompoundName⟩
    | fun ⟨Pattern⟩+ -> ⟨Expression⟩
    | new ⟨CompoundName⟩ [ ⟨FieldInitialiser⟩+, ]
    | { ⟨?SimpleExpression⟩ | ⟨Predicate⟩ }
    | ( ⟨Expression⟩?, )
    | ⟨SimpleExpression⟩ :: ⟨Expression⟩
    | [ ⟨Expression⟩?; ]

⟨FieldInitialiser⟩ ::= ⟨Identifier⟩ = ⟨Expression⟩

Figure A.3: Syntax of expressions
⟨?SimpleExpression⟩ ::= true
    | false
    | null
    | ⟨Number⟩
    | ⟨String⟩
    | ⟨Identifier⟩
    | ⟨?Identifier⟩
    | ⟨?SimpleExpression⟩ . ⟨Identifier⟩
    | ⟨?SimpleExpression⟩+
    | ⟨?SimpleExpression⟩ ⟨InfixOperator⟩ ⟨?SimpleExpression⟩
    | ⟨PrefixOperator⟩ ⟨?SimpleExpression⟩
    | ⟨?SimpleExpression⟩ is ⟨CompoundName⟩
    | fun ⟨Pattern⟩+ -> ⟨?SimpleExpression⟩
    | new ⟨CompoundName⟩ [ ⟨?FieldInitialiser⟩+, ]
    | { ⟨?SimpleExpression⟩ | ⟨Predicate⟩ }
    | ( ⟨?SimpleExpression⟩?, )
    | ⟨?SimpleExpression⟩ :: ⟨?SimpleExpression⟩
    | [ ⟨?SimpleExpression⟩?; ]

⟨?FieldInitialiser⟩ ::= ⟨Identifier⟩ = ⟨?SimpleExpression⟩

Figure A.4: Syntax of expressions with logical identifiers
⟨InfixOperator⟩ ::= or | and | == | != | < | <= | > | >= | + | - | * | /

⟨PrefixOperator⟩ ::= not | -

Figure A.5: Operators
⟨Pattern⟩ ::= _
    | true
    | false
    | null
    | ⟨Number⟩
    | ⟨String⟩
    | ⟨Identifier⟩
    | ( ⟨Pattern⟩?, )
    | ⟨Pattern⟩ :: ⟨Pattern⟩
    | [ ⟨Pattern⟩?; ]

Figure A.6: Syntax of patterns
⟨Predicate⟩ ::= true
    | false
    | local ⟨?Identifier⟩+ : ⟨Predicate⟩
    | ⟨Predicate⟩ | ⟨Predicate⟩
    | ⟨Predicate⟩ |> ⟨Predicate⟩
    | ⟨Predicate⟩ & ⟨Predicate⟩
    | ! ⟨Predicate⟩
    | first ⟨Predicate⟩
    | ( ⟨Predicate⟩ )
    | ⟨CompoundName⟩ ( ⟨Term⟩?, )
    | ⟨?SimpleExpression⟩
    | ⟨PathPredicate⟩

⟨Term⟩ ::=
    | ⟨?SimpleExpression⟩

⟨PathPredicate⟩ ::= ⟨NodePredicate⟩ ( ⟨EdgePredicate⟩ ⟨NodePredicate⟩ )?

⟨NodePredicate⟩ ::= [ ⟨Term⟩ [ : [ ! ] ⟨CompoundName⟩ ] ]

⟨EdgePredicate⟩ ::= ⟨CompoundName⟩ [ + | * ]
    | ( ⟨ComplexEdgePredicate⟩ ) [ + | * ]
    | ⟨EdgePredicate⟩ ; ⟨EdgePredicate⟩

⟨ComplexEdgePredicate⟩ ::= ⟨EdgePredicate⟩ [ ⟨PathPredicate⟩ ]
    | ⟨PathPredicate⟩ ⟨EdgePredicate⟩
    | local ⟨?Identifier⟩+ : ⟨ComplexEdgePredicate⟩
    | ⟨ComplexEdgePredicate⟩ & ⟨Predicate⟩

Figure A.7: Syntax of predicates
⟨Type⟩ ::= bool
    | int
    | string
    | ⟨CompoundName⟩
    | ⟨Type⟩ list
    | ⟨Type⟩ stream
    | ⟨Type⟩ -> ⟨Type⟩
    | ⟨Type⟩ * ⟨Type⟩
    | ( ⟨Type⟩ )

Figure A.8: Syntax of type references
Appendix B
Rename Variable
The name binding rules for the object language described in Section 6.1.1:
using NameJava.Ast
{
namespace NameJava.NameResolution
{
  (* main lookup edges *)
  let edge lookup x:SingleName → ?y =
    first ([x] lookupAll [?y] & getName x == getName ?y)
  let edge lookup x:DotName → ?y = [x] right; lookup [?y]
  let getName x =
    if x is CompUnit then x.packageName
    else x.name

  (* static context *)
  let predicate isVariableName(?x) =
    [:FieldDecl] expr [?x:Name] | [:LocalVariableDecl] expr [?x:Name]
    | isVariableName(?z) & [?z:DotName] right [?x:Name]
  let predicate isTypeName(?x) =
    [:ClassDecl] super [?x:Name] | [:FieldDecl] fieldType [?x:Name]
    | [:LocalVariableDecl] varType [?x:Name] | [:Cast] castType [?x:Name]
    | [?x:Name] parent [:DotName] right [:This]
    | isTypeName(?z) & [?z:DotName] right [?x:Name]
  let predicate isPackageOrTypeName(?x) =
    [?z:DotName] left; child* [?x:Name] & isTypeName(?z)
  let predicate isAmbiguous(?x) =
    [?x:Name] & !isVariableName(?x) &
    !isPackageOrTypeName(?x) & !isTypeName(?x)
  let edge exprQualifier x:SingleName → ?y =
    [x] parent; right [x] parent; left; (expr*; right*)* [?y:SingleName]
  let predicate onTheRightOfDot(?x) = [?x] parent [:DotName] right [?x]
  let edge enclosingStmt x → ?y = first ([x] parent* [?y:Stmt])
  let edge enclosingClass x → ?y = first ([x] parent+ [?y:ClassDecl])
  let edge enclosingScope x → ?y =
    first ([x] parent+ [?y:ClassDecl] & ![x] parent+ [?y] super; child* [x])
    |> [x] parent+ [?y:CompUnit]

  (* type lookup *)
  let edge typeLookup x:SingleName → ?y =
    [x] lookup [:FieldDecl] fieldType; lookup [?y:ClassDecl]
    | [x] lookup [?y:ClassDecl]
    | [x] lookup [?y:CompUnit]
  let edge typeLookup x:DotName → ?y = [x] right; typeLookup [?y]
  let edge typeLookup x:This → ?y =
    onTheRightOfDot(x) & [x] parent; left; lookupEnclosingClass [?y]
    | !onTheRightOfDot(x) & [x] enclosingClass [?y]
  let edge typeLookup x:Super → ?y =
    onTheRightOfDot(x) &
      [x] parent; left; lookupEnclosingClass; super; lookup [?y]
    | !onTheRightOfDot(x) & [x] enclosingClass; super; lookup [?y]
  let edge typeLookup x:ParenthesisedExpr → ?y = [x] expr; typeLookup [?y]
  let edge typeLookup x:Cast → ?y = [x] castType; lookup [?y]
  let edge lookupEnclosingClass x:SingleName → ?y =
    onTheRightOfDot(x) &
      [x] parent; left; lookupEnclosingClass; bodyDecls [?y] &
      [x] enclosingClass+ [?y] & getName x == getName ?y
    | !onTheRightOfDot(x) & first ([x] enclosingClass+ [?y] &
        getName x == getName ?y)
  let edge lookupEnclosingClass x:DotName → ?y =
    [x] right; lookupEnclosingClass [?y]

  (* lookup auxiliary edges *)
  let edge lookupAll x:SingleName → ?y =
    [x] lookupAllWithDotContext [?y] &
    ( isVariableName(x) & ([?y:FieldDecl] | [?y:LocalVariableDecl])
      |> isTypeName(x) & [?y:ClassDecl]
      |> isPackageOrTypeName(x) & ([?y:ClassDecl] | [?y:CompUnit])
      |> isAmbiguous(x) )
  let edge lookupAllWithDotContext x:SingleName → ?y =
    onTheRightOfDot(x) & [x] parent; left; typeLookup; lookupAllMembers [?y]
    | !onTheRightOfDot(x) & [x] lookupAllDecls [?y]
    | !onTheRightOfDot(x) & [x] lookupAllPackages [?y]
  let edge lookupAllMembers x:ClassDecl → ?y =
    [x] (super; lookup)* [?s] &
    ([?s] bodyDecls [?y:FieldDecl] | [?s] bodyDecls [?y:ClassDecl])
  let edge lookupAllMembers x:CompUnit → ?y =
    [x] classDecls [?y:ClassDecl]
  let edge lookupAllDecls x → ?y =
    [x] enclosingStmt; listPredecessor+ [?y:LocalVariableDecl]
    | [x] enclosingScope; lookupAllDecls [?y]
  let edge lookupAllDecls x:ClassDecl → ?y =
    [x] equals [?y]
    | [x] lookupAllMembers [?y]
    | [x] enclosingScope; lookupAllDecls [?y]
  let edge lookupAllDecls x:CompUnit → ?y =
    [x] parent; compUnits [?cu] lookupAllMembers [?y] &
    (?cu.packageName == x.packageName | ?cu.packageName == "")
  let edge lookupAllPackages x → ?y =
    [x] parent* [:Program] compUnits [?y]
}
}
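The biased-choice operator |> used above (in enclosingScope and lookupAll) and the first combinator both rely on matches being produced as ordered, duplicate-free sequences. A toy Python model of their behaviour, representing a predicate's matches as a plain list (this is an illustration only, not JunGL's stream-based implementation):

```python
def biased_or(left, right):
    """Model of `p1 |> p2`: all matches of p1 if there are any,
    otherwise all matches of p2."""
    return left if left else right

def first(matches):
    """Model of `first p`: keep only the first match, in order."""
    return matches[:1]

# enclosingScope-style fallback: prefer an enclosing class declaration;
# only when none exists, fall back to the compilation unit.
class_matches, comp_unit_matches = [], ["compUnit"]
assert biased_or(class_matches, comp_unit_matches) == ["compUnit"]
assert biased_or(["classDecl"], comp_unit_matches) == ["classDecl"]
assert first(["inner", "outer"]) == ["inner"]
```

Because the match order is deterministic, `first` is well defined, which is exactly what the lookup edges exploit to pick the innermost declaration.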
And the full script for Rename Variable itself:
using NameJava.Ast, NameJava.NameResolution
{
namespace NameJava.Rename
{
  let isVariableDeclaration d =
    d is FieldDecl or d is LocalVariableDecl
  let alreadyExists dec newName =
    not Utils.isEmpty { ?d |
      [dec:LocalVariableDecl] parent; child [?d:LocalVariableDecl] &
        ?d.name == newName
      | [dec:FieldDecl] parent; child [?d:FieldDecl] & ?d.name == newName }
  let edge allTypesOrPackages x:Name → ?y =
    [x] lookupAllDecls [?y:ClassDecl] | [x] lookupAllPackages [?y]
  let findSelfCrossPoint (x, d) =
    pick { (x, ?c, ?ec, ?sc, d) | [x] enclosingClass [?c] enclosingClass* [?ec]
      (super; lookup)* [?sc] bodyDecls [d] }
  let lookupScopeFrom x name =
    pick { ?s | first ([x] allTypesOrPackages [?s] & getName ?s == name) }
  let buildTypeReference x c =
    let es = pick { ?es | first ([c] enclosingScope* [?es] &
      ?es == lookupScopeFrom x (getName ?es)) } in
    if es == null then
      error ("Cannot build type access for " + c.name)
    else
      let chain = toList { ?ic |
        [c] enclosingScope* [?ic] enclosingScope+ [es] } in
      let esRef = new SingleName { name = getName es } in
      List.foldr
        (fun node ic → new DotName {
          left = node,
          right = new SingleName { name = getName ic }
        })
        esRef chain
  let buildThisReference (x, c, ec, sc, d) =
    let this = new This in
    let ee = if ec == c then this else
      new DotName { left = buildTypeReference x ec, right = this } in
    let se = if sc == ec then ee else
      new ParenthesisedExpr {
        expr = new Cast { castType = buildTypeReference x sc, expr = ee }
      } in
    se
  let getExprRewrite (x, d) =
    let (oldQualifier, oldType, newType) =
      pick { (?q, ?ot, ?nt) |
        [x] parent [:DotName] right [x] parent; left [?q] typeLookup [?ot] &
        [d] enclosingClass [?nt] } in
    if oldType == newType then (fun () → ())
    else let cast = new Cast { castType = buildTypeReference x newType } in
      let rewrite () = begin
        replaceWith oldQualifier (new ParenthesisedExpr { expr = cast });
        cast.expr ← oldQualifier
      end in
      rewrite
  let getThisRewrite (x, d) =
    let oldQualifier = pick { ?q |
      [x] parent [:DotName] right [x] parent; left [?q] } in
    let newQualifier = buildThisReference (findSelfCrossPoint (x, d)) in
    let rewrite () = (
      if oldQualifier == null then begin
        let e = new DotName { left = newQualifier } in
        replaceWith x e;
        e.right ← x
      end else
        replaceWith oldQualifier newQualifier
    ) in
    rewrite
  let renameVariable program node newName =
    let dec = pick { ?d | [node] lookup [?d] |> [node] equals [?d] } in
    if not isVariableDeclaration dec then
      error "Please choose a variable";
    if dec.name == newName then
      error "Please give a different name";
    if alreadyExists dec newName then
      error "Declaration already exists";
    let findFirst x =
      pick { ?y |
        [x] lookupAll [?y] & (newName == getName ?y | ?y == dec) } in
    let needRename =
      { ?x | [program] child+ [?x:SingleName] lookup [dec] } in
    let mayBeCaptured =
      { (?x, ?d) | [program] child+ [?x:SingleName] lookup [?d] &
        ?x.name == newName } in
    let needNewQualifier = List.foldl
      (fun l (x, d) → if findFirst x == dec then (x, d) :: l else l)
      [] (toList mayBeCaptured) in
    let needNewQualifier = List.foldl
      (fun l x → if findFirst x != dec then (x, dec) :: l else l)
      needNewQualifier (toList needRename) in
    foreach (x, d) in needNewQualifier do
      if d is LocalVariableDecl then error "Cannot hide local variable";
    let getRewrite (x, d) =
      if pick { () | [x] exprQualifier [] } != null then
        getExprRewrite (x, d)
      else
        getThisRewrite (x, d) in
    let rewrites = List.map getRewrite needNewQualifier in
    foreach rewrite in rewrites do rewrite ();
    foreach x in needRename do x.name ← newName;
    dec.name ← newName
}
}
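The heart of renameVariable is its capture check: an occurrence needs a new qualifier exactly when the rename would change what it binds to, either because it currently binds to dec but would be captured by another declaration named newName, or vice versa. That condition can be modelled in a few lines of Python, where `bind_before` and `bind_after` are hypothetical stand-ins for evaluating JunGL's lookup edge on the original and on the renamed program:

```python
def occurrences_needing_qualifier(occurrences, bind_before, bind_after):
    """An occurrence must be re-qualified iff renaming changes its binding."""
    return [x for x in occurrences if bind_before(x) != bind_after(x)]

# Occurrence "a" keeps its binding after the rename; occurrence "b" would be
# captured by the renamed declaration and so must be re-qualified.
before = {"a": "decl1", "b": "decl2"}
after = {"a": "decl1", "b": "renamedDecl"}
assert occurrences_needing_qualifier(["a", "b"], before.get, after.get) == ["b"]
```

The script above computes the same information without actually performing the rename twice: findFirst simulates lookup in the renamed program by treating newName and dec as interchangeable targets.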
Appendix C
Extract Method
Extract Method for the object language described in Section 6.3.1:
using CSharp.Ast, CSharp.Binding, CSharp.Flow
{
namespace CSharp.ExtractMethod
{
  (* checks for a well-defined region *)
  let dominates entryNode startNode endNode =
    (startNode == endNode) or
    Utils.isEmpty { () |
      [entryNode] (local ?z : cfsucc [?z] & ?z != startNode)* [endNode] }
  let postDominates startNode endNode exitNode =
    (startNode == endNode) or
    Utils.isEmpty { () |
      [startNode] (local ?z : cfsucc [?z] & ?z != endNode)* [exitNode] }
  let haveSameParent x y =
    not Utils.isEmpty { () | [x] parent; child [y] }

  (* transformation code *)
  let createVoidTypeRef () =
    new TypeRef {
      path = new NamespacePath {
        entityRef = new EntityRef { name = "void" }
      }
    }
  let createParamDecl name typeRef direction =
    new ParamDecl {
      name = name, typeRef = typeRef, direction = direction
    }
  let createArg name direction =
    new MethodArgument {
      target = new EntityRef { name = name }, direction = direction
    }
  let createEntityRef name =
    new EntityRef { name = name }
  let createEmptyPrivateVoidMethod name parameters =
    new MethodDecl {
      name = name, modifiers = [new Private],
      typeRef = createVoidTypeRef (), parameters = parameters,
      block = new BlockStmt
    }
  let createCallSiteStmt methodName arguments =
    new ExprStmt {
      target = new MethodInvokeExpr {
        target = new EntityRef { name = methodName },
        arguments = arguments
      }
    }
  let insertStatementBefore n s =
    if not Utils.isEmpty { ?b | [s] parent [?b:BlockStmt] } then
      insertBefore n s
    else let block = new BlockStmt in
      replaceWith s block;
      block.statements ← [n; s]
  let cloneDecl d =
    new VariableDeclStmt {
      modifiers = List.map clone d.modifiers,
      typeRef = clone d.typeRef,
      name = d.name
    }
  let detachDecl d =
    if d.initializer == null then
      detach d
    else
      replaceWith d (new ExprStmt {
        target = new AssignExpr {
          left = new EntityRef { name = d.name },
          operator = new Assign,
          right = clone d.initializer
        }
      })
  let isStatic x =
    not Utils.isEmpty { () | [x] modifiers [:Static] }
  (* Extract Method *)
  let extractMethod startNode endNode newMethodName =
    let (method, class, entryNode, exitNode) = pick { (?m, ?c, ?entry, ?exit) |
      [startNode] parent+ [?m:CallableDecl] directEnclosingType [?c:TypeDecl]
      & [endNode] parent+ [?m]
      & [?m] callableEntry [?entry]
      & [?m] callableExit [?exit] } in
    let outerEndNode = pick { ?n | [endNode] exit [?n] } in
    if not dominates entryNode startNode endNode then
      error "Not all possible flows go through the start of selection";
    if not postDominates startNode outerEndNode exitNode then
      error "Not all possible flows go through the end of selection";
    if not haveSameParent startNode endNode then
      error "Selected block is not enclosed in a single parent statement";
    let selectionStatements = { ?s |
      [startNode] (local ?z : cfsucc [?z] & ?z != outerEndNode)* [?s] } in
    let predicate mayUseOrDefInSelection(?x) =
      isIn(?s, selectionStatements) & [?s] useOrDef [?x] in
    let variables = { ?x | mayUseOrDefInSelection(?x) &
      ([?x:VariableDeclStmt] | [?x:ParamDecl]) } in
    let predicate mayUseOrDefOutOfSelection(?x) =
      [entryNode] cfsucc+ [?s] cfsucc+ [exitNode] &
        !isIn(?s, selectionStatements) & [?s] useOrDef [?x]
      | [method] parameters [?x] in
    let predicate decInSelection(?x) =
      isIn(?d, selectionStatements) & [?d] dec [?x] in
    let predicate mayUseInSelection(?x) =
      isIn(?u, selectionStatements) & [?u] use [?x] in
    let predicate mayDefInSelection(?x) =
      isIn(?d, selectionStatements) & [?d] def [?x] in
    let predicate mustDefBeforeSelection(?x) =
      !([entryNode] (local ?z : cfsucc [?z] & ![?z] def [?x])+ [startNode]) in
    let predicate mayUseAfterSelection(?x) =
      [outerEndNode] cfsucc* [?d] cfsucc* [exitNode] & [?d] use [?x]
      | [?x:ParamDecl] direction [:!Value] in
    let predicate mustDefInSelection(?x) =
      !([startNode] (local ?z : [?z] cfsucc & ![?z] def [?x])+ [outerEndNode]) in
    let predicate mayUseBeforeDefInSelection(?x) =
      isIn(?u, selectionStatements) &
      [startNode] (local ?z : [?z] cfsucc & ![?z] def [?x])* [?u] use [?x] in
    let predicate mayUseBeforeDefAfterSelection(?x) =
      [outerEndNode] (local ?z : [?z] cfsucc & ![?z] def [?x])* [?u] use [?x]
      | [method] parameters [?x] direction [:!Value] in
    let predicate mayUseOrDefBeforeSelection(?x) =
      [entryNode] cfsucc+ [?s] cfsucc+ [startNode] & [?s] useOrDef [?x]
      | [method] parameters [?x] direction [:!Out] in
    let predicate mayUseOrDefAfterSelection(?x) =
      [outerEndNode] cfsucc* [?s] cfsucc+ [exitNode] & [?s] useOrDef [?x]
      | [method] parameters [?x] direction [:!Value] in
    let predicate decBeforeSelection(?x) =
      [entryNode] cfsucc+ [?d] cfsucc+ [startNode] & [?d] dec [?x]
      | [method] parameters [?x] in
    let valueParams =
      { ?x | isIn(?x, variables) &
        mayUseBeforeDefInSelection(?x) &
        !(mayDefInSelection(?x) &
          mayUseBeforeDefAfterSelection(?x))
      } in
    let outParams =
      { ?x | isIn(?x, variables) &
        mayUseBeforeDefAfterSelection(?x) &
        !mayUseBeforeDefInSelection(?x) &
        mustDefInSelection(?x)
      } in
    let refParams =
      { ?x | isIn(?x, variables) &
        (mayUseBeforeDefInSelection(?x) |
         mayDefInSelection(?x) & !mustDefInSelection(?x)) &
        mayUseBeforeDefAfterSelection(?x) &
        !isIn(?x, valueParams) &
        !isIn(?x, outParams)
      } in
    let needDecMoveOut =
      { ?x | decInSelection(?x) &
        mayUseOrDefOutOfSelection(?x)
      } in
    let needDecMoveIn =
      { ?x | isIn(?x, variables) &
        !decInSelection(?x) &
        !isIn(?x, valueParams) &
        !isIn(?x, outParams) &
        !isIn(?x, refParams)
      } in
    let needDecDuplication =
      { ?x | isIn(?x, needDecMoveIn) &
        mayUseOrDefOutOfSelection(?x) |
        isIn(?x, needDecMoveOut) &
        !isIn(?x, valueParams) &
        !isIn(?x, outParams) &
        !isIn(?x, refParams)
      } in
    let build l d = List.map (fun x → (x, d)) l in
    let parameters = List.concat
      [ build (toList valueParams) (new Value);
        build (toList refParams) (new Ref);
        build (toList outParams) (new Out) ] in
    let paramDecls = List.map
      (fun (x, d) → createParamDecl x.name (clone (x.typeRef)) (clone d))
      parameters in
    let newMethod = createEmptyPrivateVoidMethod newMethodName paramDecls in
    let args = List.map
      (fun (x, d) → createArg x.name (clone d)) parameters in
    let callSite = createCallSiteStmt newMethodName args in
    insertStatementBefore callSite startNode;
    foreach d in needDecMoveOut do
      insertStatementBefore (cloneDecl d) callSite;
    let topStatements = { ?ts | isIn(?ts, selectionStatements) &
      [?ts] (local ?z : parent [?z] & !isIn(?z, selectionStatements))+ [method] } in
    foreach ts in topStatements do detach ts;
    newMethod.block.statements ← List.append
      (List.map clone (toList needDecMoveIn)) (toList topStatements);
    if isStatic method then
      newMethod.modifiers ← List.append newMethod.modifiers [new Static];
    insertAfter newMethod method;
    foreach dec in { ?d |
      isIn(?d, needDecMoveOut) & !isIn(?d, needDecDuplication)
      | isIn(?d, needDecMoveIn) & !isIn(?d, needDecDuplication) } do
      detachDecl dec
}
}
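The region checks dominates and postDominates ask whether any control-flow path can bypass the start or end of the selection: the JunGL query tests for the existence of a path that avoids a given node, and the region is well defined when no such path exists. A small Python sketch of the same idea, assuming the control-flow successor relation cfsucc is given as an adjacency dictionary (an illustration only, not the JunGL evaluator):

```python
def reachable_avoiding(cfsucc, source, target, avoid):
    """Is `target` reachable from `source` along cfsucc edges that never
    pass through `avoid`? This mirrors the path query
    (local ?z : cfsucc [?z] & ?z != avoid)*."""
    seen, stack = set(), [source]
    while stack:
        n = stack.pop()
        if n == target:
            return True
        if n in seen or n == avoid:
            continue
        seen.add(n)
        stack.extend(cfsucc.get(n, []))
    return False

def dominates(cfsucc, entry, start, end):
    """`start` dominates `end`: every path from `entry` to `end`
    passes through `start`."""
    return start == end or not reachable_avoiding(cfsucc, entry, end, start)

# Straight-line flow entry -> s1 -> s2 -> exit: s1 dominates s2,
# but s2 does not dominate s1.
cfg = {"entry": ["s1"], "s1": ["s2"], "s2": ["exit"]}
assert dominates(cfg, "entry", "s1", "s2")
assert not dominates(cfg, "entry", "s2", "s1")
```

postDominates is the same check run in the direction of the exit node, as in the script above.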
Bibliography
[AK07] Malte Appeltauer and Gunter Kniesel. Towards concrete syntax patterns forlogic-based transformation rules. In Eighth International Workshop on Rule-Based Programming (RULE ’07), Paris, France, 2007.
[App98] Andrew W. Appel. Modern Compiler Implementation in ML. CambridgeUniversity Press, 1998.
[Aßm98] Uwe Aßmann. OPTIMIX — a tool for rewriting and optimizing programs.In H. Ehrig, G. Engels, H. J. Kreowski, and G. Rozenberg, editors, Hand-book of Graph Grammars and Computing by Graph Transformation, volume2: Applications, Languages and Tools, pages 307–318. World Scientific, 1998.
[Ban06] Fabian Bannwart. Changing software correctly. Technical Report 509, Depart-ment of Computer Science, ETH Zurich, 2006.
[BBK+07] Emilie Balland, Paul Brauner, Radu Kopetz, Pierre-Etienne Moreau, and An-toine Reilles. Tom: Piggybacking rewriting on Java. In Proceedings of the18th Conference on Rewriting Techniques and Applications (RTA ’07), Lec-ture Notes in Computer Science. Springer-Verlag, 2007.
[BBPR05] Rajesh Bordawekar, Michael Burke, Igor Peshansky, and MukundRaghavachari. Simplify XML processing with XJ.http://www.ibm.com/developerworks/xml/library/x-awxj.html, 2005.
[BFS00] Peter Buneman, Mary Fernandez, and Dan Suciu. UnQL: A query languageand algebra for semistructured data based on structural recursion. VLDBJournal, 9(1):76–110, 2000.
[BGH07] Marat Boshernitsan, Susan L. Graham, and Marti A. Hearst. Aligning devel-opment tools with the way programmers think about code changes. In Pro-ceedings of the SIGCHI conference on Human Factors in Computing Systems(CHI ’07), pages 567–576, New York, NY, USA, 2007. ACM Press.
[BGV92] Robert A. Ballance, Susan L. Graham, and Michael L. Van De Vanter. The Panlanguage-based editing system. ACM Transactions on Software Engineeringand Methodology, 1(1):95–127, 1992.
[Bir98] Richard Bird. Introduction to Functional Programming using Haskell (secondedition). Prentice Hall, New York, USA, 1998.
[BKVV06] Martin Bravenboer, Karl Trygve Kalleberg, Rob Vermaas, and Eelco Visser.Stratego/XT Tutorial, Examples, and Reference Manual (latest). Department
175
BIBLIOGRAPHY 176
of Information and Computing Sciences, Universiteit Utrecht, Utrecht, TheNetherlands, 2006. http://www.strategoxt.org.
[BM06] Fabian Bannwart and Peter Muller. Changing programs correctly: Refactoringwith specifications. In J. Misra, T. Nipkow, and E. Sekerinski, editors, FormalMethods (FM), volume 4085 of Lecture Notes in Computer Science, pages 492–507. Springer-Verlag, 2006.
[BMR07] Emilie Balland, Pierre-Etienne Moreau, and Antoine Reilles. Bytecode rewrit-ing in tom. In Second Workshop on Bytecode Semantics, Verification, Analysisand Transformation (Bytecode ’07), Braga,Portugal, 2007.
[BMS05] Gavin Bierman, Erik Meijer, and Wolfram Schulte. The essence of data accessin Cω - the power is in the dot!, 2005.
[BMSU86] Francois Bancilhon, David Maier, Yehoshua Sagiv, and Jeffrey D Ullman.Magic sets and other strange ways to implement logic programs (extendedabstract). In Proceedings of the fifth ACM SIGACT-SIGMOD symposium onPrinciples of Database Systems (PODS ’86), pages 1–15, New York, NY, USA,1986. ACM Press.
[BNL05] Dirk Beyer, Andreas Noack, and Claus Lewerentz. Efficient relational cal-culation for software analysis. IEEE Transactions on Software Engineering,31(2):137–149, 2005.
[Boy02] John Boyland. Incremental evaluators for remote attribute grammars. Elec-tronic Notes in Theoretical Computer Science, 63(3), 2002.
[BR87] Catriel Beeri and Raghu Ramakrishnan. On the power of magic. In Proceedingsof the sixth ACM SIGACT-SIGMOD symposium on Principles of DatabaseSystems (PODS ’87), pages 269–284, 1987.
[BTF05] Ittai Balaban, Frank Tip, and Robert Fuhrer. Refactoring support for classlibrary migration. In Proceedings of the 20th ACM conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA ’05),pages 265–279, 2005.
[BvDOV06] Martin Bravenboer, Arthur van Dam, Karina Olmos, and Eelco Visser. Pro-gram transformation with scoped dynamic rewrite rules. Fundamenta Infor-maticae, 69(1–2):123–178, 2006.
[CEF+08] Don Chamberlin, Daniel Engovatov, Daniela Florescu, Giorgio Ghelli, JimMelton, and Jerome Simeon. XQuery Scripting Extension 1.0 (W3C workingdraft), 2008. Available at http://www.w3.org/TR/xquery-sx-10/.
[CFM+08] Don Chamberlin, Daniela Florescu, Jim Melton, Jonathan Robie, and JeromeSimeon. XQuery Update Facility 1.0 (W3C candidate recommendation), 2008.Available at http://www.w3.org/TR/xquery-update-10/.
[CK01] Horatiu Cirstea and Claude Kirchner. The rewriting calculus — Part I and II.Logic Journal of the Interest Group in Pure and Applied Logics, 9(3):427–498,May 2001.
[Cla78] Keith L. Clark. Negation as failure. In Herve Gallaire and Jack Minker,editors, Logic and Databases, pages 293–322. Plenum Press, New York, 1978.
BIBLIOGRAPHY 177
[CMR92] Mariano Consens, Alberto Mendelzon, and Arthur Ryman. Visualizing andquerying software structures. In Proceedings of the 14th international con-ference on Software engineering (ICSE ’92), pages 138–156, New York, NY,USA, 1992. ACM Press.
[coq07] The Coq proof assistant. http://coq.inria.fr/, 2007.
[Cor04] Marcio Lopes Cornelio. Refactorings as Formal Refinements. PhD thesis,Universidade de Pernambuco, 2004.
[Cor06] James R. Cordy. The TXL source transformation language. Science of Com-puter Programming, 61(3):190–210, 2006.
[Cre97] Roger F. Crew. ASTLOG: A language for examining abstract syntax trees. InUSENIX Conference on Domain-Specific Languages, pages 229–242, 1997.
[CW96] Weidong Chen and David S. Warren. Tabled evaluation with delaying forgeneral logic programs. Journal of the ACM, 43(1):20–74, 1996.
[DDGM07] Brett Daniel, Danny Dig, Kely Garcia, and Darko Marinov. Automated testingof refactoring engines. In Proceedings of the ACM SIGSOFT Symposium onthe Foundations of Software Engineering (ESEC/FSE ’07), New York, NY,USA, 2007. ACM Press.
[DdMS02] Stephen J. Drape, Oege de Moor, and Ganesh Sittampalam. Transforming the.NET intermediate language using path logic programming. In Principles andPractice of Declarative Programming (PPDP ’02), pages 133–144, 2002.
[DKTE04] Alan Donovan, Adam Kiezun, Matthew S. Tschantz, and Michael D. Ernst.Converting Java programs to use generic libraries. In Proceedings of the 19thACM conference on Object-Oriented Programming, Systems, Languages andApplications (OOPSLA ’04), pages 15–34, 2004.
[DM85] Pierre Deransart and Jan Maluszynski. Relating logic programs and attributegrammars. Journal of Logic Programming, 2(2):119–155, 1985.
[dMLVW03] Oege de Moor, David Lacey, and Eric Van Wyk. Universal regular path queries.Higher-order and Symbolic Computation, 16(1-2):15–35, 2003.
[DP02] Brian A. Davey and Hilary Priestley. Introduction to Lattices and Order (sec-ond edition). Cambridge University Press, 2002.
[DRW96] Stephen Dawson, C. R. Ramakrishnan, and David Scott Warren. Practicalprogram analysis using general purpose logic programming systems. In Pro-ceedings of the ACM Symposium on Programming Language Design and Im-plementation (PLDI ’96), pages 117–126. ACM Press, 1996.
[DT92] Guozhu Dong and Rodney W. Topor. Incremental evaluation of datalogqueries. In Proceedings of the 4th International Conference on Database The-ory (ICDT ’92), pages 282–296, London, UK, 1992. Springer-Verlag.
[ecm06] C# Language Specification. Standard ECMA-334. http://www.ecma-international.org/publications/standards/Ecma-334.htm, 2006.
BIBLIOGRAPHY 178
[EESV08] Torbjorn Ekman, Ran Ettinger, Max Schafer, and Mathieu Verbaere.Refactoring bugs in Eclipse, IntelliJ IDEA and Visual Studio, 2008.http://progtools.comlab.ox.ac.uk/projects/refactoring/bugreports.
[EGM+06] Michael Eichberg, Daniel Germanus, Mira Mezini, Lukas Mrokon, andThorsten Schafer. QScope: an open, extensible framework for measuring soft-ware projects. In Proceedings of the Conference on Software Maintenanceand Reengineering (CSMR ’06), pages 113–122, Washington, DC, USA, 2006.IEEE Computer Society.
[EH04] Torbjorn Ekman and Gorel Hedin. Rewritable reference attributed grammars.In Martin Odersky, editor, Proceedings of the European Conference on Object-Oriented Programming (ECOOP ’04), pages 144–169, 2004.
[EH06] Torbjorn Ekman and Gorel Hedin. Modular name analysis for Java usingJastAdd. In Generative and Transformational Techniques in Software Engi-neering, International Summer School (GTTSE ’05) Braga, Portugal, volume4143 of Lecture Notes in Computer Science, pages 422–436. Springer, 2006.
[EH07] Torbjorn Ekman and Gorel Hedin. The JastAdd extensible Java compiler.In Proceedings of the 22th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA ’07),2007.
[EKC98] Michael D. Ernst, Craig S. Kaplan, and Craig Chambers. Predicate dispatch-ing: A unified theory of dispatch. In Proceedings of the 12th European Confer-ence on Object-Oriented Programming (ECOOP ’98), pages 186–211, Brussels,Belgium, July 20-24, 1998.
[EMOS04] Michael Eichberg, Mira Mezini, Klaus Ostermann, and Thorsten Schafer.XIRC: A kernel for cross-artifact information engineering in software devel-opment environments. In Proceedings of the 11th Working Conference on Re-verse Engineering (WCRE’04), volume 00, pages 182–191, Los Alamitos, CA,USA, 2004. IEEE Computer Society.
[Ett06] Ran Ettinger. Refactoring via Program Slicing and Sliding. PhD thesis, Uni-versity of Oxford, 2006.
[EV04] Ran Ettinger and Mathieu Verbaere. Untangling: a slice extraction refactoring.In Gail C. Murphy and Karl J. Lieberherr, editors, Proceedings of the 3rdinternational conference on Aspect-oriented software development (AOSD ’04),pages 93–101, 2004.
[Fal07] Luis Diego Fallas. Creating Java refactorings with Scala and EclipseLTK. http://langexplr.blogspot.com/2007/07/creating-java-refactorings-with-scala.html, 2007.
[Fil03] Jean-Christophe Filliatre. Why: a multi-language multi-prover verificationtool. Technical Report 1366, LRI, Universite Paris Sud, 2003.
[Fit02] Anne Fitzpatrick. A well-intentioned query and the halloween problem. Annalsof the History of Computing, IEEE, 24(2):86–89, Apr-Jun 2002.
BIBLIOGRAPHY 179
[FKI+07] Henry Falconer, Paul H. J. Kelly, David M. Ingram, Michael R. Mellor, TonyField, and Olav Beckmann. A declarative framework for analysis and opti-mization. In Proceedings of Compiler Construction (CC ’07), pages 218–232.Springer, 2007.
[FKK07] Robert M. Fuhrer, Adam Kiezun, and Markus Keller. Advanced refactoringin Eclipse: Past, present, and future. In Proceedings of the 1st Workshop onRefactoring Tools, pages 30–31, 2007.
[Fow99] Martin Fowler. Refactoring: Improving the Design of Existing Code. AddisonWesley, 1999.
[Fow01] Martin Fowler. Crossing refactoring’s rubicon.http://www.martinfowler.com/articles/refactoringRubicon.html, 2001.
[FTK+05] Robert Fuhrer, Frank Tip, Adam Kiezun, Julian Dolby, and Markus Keller. Ef-ficiently refactoring Java applications to use generic libraries. In Proceedings ofthe 19th European Conference on Object-Oriented Programming (ECOOP ’05),pages 71–96, Glasgow, Scotland, July 27–29, 2005.
[GFT06] Maayan Goldstein, Yishai A. Feldman, and Shmuel Tyszberowicz. Refactoringwith contracts. In Proceedings of the AGILE Conference (AGILE ’06), pages53–64, Washington, DC, USA, 2006. IEEE Computer Society.
[GHM00] Etienne Gagnon, Laurie J. Hendren, and Guillaume Marceau. Efficient infer-ence of static types for Java bytecode. In Proceedings of the 7th InternationalSymposium on Static Analysis (SAS ’00), pages 199–219, London, UK, 2000.Springer-Verlag.
[GL88] Michael Gelfond and Vladimir Lifschitz. The stable model semantics for logicprogramming. In Robert A. Kowalski and Kenneth Bowen, editors, Proceedingsof the Fifth International Conference on Logic Programming (ICLP ’88), pages1070–1080, Cambridge, Massachusetts, 1988. The MIT Press.
[GLH+92] Robert W. Gray, Steven P. Levi, Vincent P. Heuring, Anthony M. Sloane, andWilliam M. Waite. Eli: a complete, flexible compiler construction system.Communications of the ACM, 35(2):121–130, 1992.
[GM78] Herve Gallaire and Jack Minker. Logic and Databases. Plenum Press, NewYork, 1978.
[GM06] Alejandra Garrido and Jose Meseguer. Formal specification and verificationof Java refactorings. In Proceedings of the Sixth IEEE International Work-shop on Source Code Analysis and Manipulation (SCAM ’06), pages 165–174,Washington, DC, USA, 2006. IEEE Computer Society.
[GN93] William G. Griswold and David Notkin. Automated assistance for programrestructuring. ACM Transactions on Software Engineering and Methodology,2(3):228–269, 1993.
[Hed91] Gorel Hedin. Incremental static-semantic analysis for object-oriented lan-guages using door attribute grammars. In Proceedings on Attribute Grammars,Applications and Systems, pages 374–379, London, UK, 1991. Springer-Verlag.
BIBLIOGRAPHY 180
[HR92] Susan Horwitz and Thomas Reps. The use of program dependence graphsin software engineering. In Proceedings of the International Conference onSoftware Engineering (ICSE ’92), pages 392–411, 1992.
[HRB90] Susan Horwitz, Thomas Reps, and David Binkley. Interprocedural slicingusing dependence graphs. ACM Transactions on Programming Languages andSystems, 12(1):26–61, 1990.
[HVdM06] Elnar Hajiyev, Mathieu Verbaere, and Oege de Moor. CodeQuest: scalablesource code queries with Datalog. In Dave Thomas, editor, Proceedings of theEuropean Conference on Object-Oriented Programming (ECOOP ’06), volume4067 of Lecture Notes in Computer Science, pages 2–27. Springer, 2006.
[HVMV05] Elnar Hajiyev, Mathieu Verbaere, Oege de Moor, and Kris de Volder. Code-Quest with Datalog. In Companion to the 20th ACM SIGPLAN conference onObject-Oriented Programming, Systems, Languages and Applications (OOP-SLA ’05), New York, NY, USA, 2005. ACM Press.
[imp07] IMP home page. http://www.eclipse.org/imp/, 2007.
[Jar98] Stan Jarzabek. Design of flexible static program analyzers with PQL. IEEETransactions on Software Engineering, 24(3):197–215, 1998.
[JH07] Nicolas Juillerat and Beat Hirsbrunner. Improving method extraction: Anovel approach to data flow analysis using boolean flags and expressions. InProceedings of the 1st Workshop on Refactoring Tools, pages 48–49, 2007.
[jls05] The Java Language Specification (third edition).http://java.sun.com/docs/books/jls/, 2005.
[JM84] Neil D. Jones and Alan Mycroft. Stepwise development of operational anddenotational semantics for prolog. In Symposium on Logic Programming, pages281–288, 1984.
[JSC07] Antonio Carvalho Junior, Leila Silva, and Marcio Cornelio. Using CafeOBJ tomechanise refactoring proofs and application. Electronic Notes in TheoreticalComputer Science, 184:39–61, 2007.
[JV03] Doug Janzen and Kris De Volder. Navigating and querying code withoutgetting lost. In Proceedings of the 2nd international conference on Aspect-oriented software development (AOSD ’03), pages 178–187, New York, NY,USA, 2003. ACM Press.
[Ker05] Joshua Kerievsky. Refactoring to Patterns. Addison Wesley, 2005.
[KETF07] Adam Kiezun, Michael D. Ernst, Frank Tip, and Robert M. Fuhrer. Refac-toring for parameterizing Java classes. In Proceedings of the 29th Interna-tional Conference on Software Engineering (ICSE ’07), Minneapolis, MN,USA, May 23–25, 2007.
[KHR07] Gunter Kniesel, Jan Hannemann, and Tobias Rho. A comparison of logic-basedinfrastructures for concern detection and extraction. In Proceedings of the 3rdworkshop on Linking aspect technology and evolution (LATE ’07). ACM, 2007.
BIBLIOGRAPHY 181
[KK04] Günter Kniesel and Helge Koch. Static composition of refactorings. Science of Computer Programming, 52(1-3):9–51, 2004.
[KKKS96] Marion Klein, Jens Knoop, Dirk Koschützki, and Bernhard Steffen. DFA & OPT-METAFrame: a toolkit for program analysis and optimization. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS '96), volume 1055 of Lecture Notes in Computer Science, pages 418–421. Springer, 1996.
[Kli05] Paul Klint. A tutorial introduction to RScript. Centrum voor Wiskunde en Informatica, draft, 2005.
[KSR07] Raffi Khatchadourian, Jason Sawin, and Atanas Rountev. Automated refactoring of legacy Java software to enumerated types. In Proceedings of the International Conference on Software Maintenance (ICSM '07), 2007.
[KV06] Karl Trygve Kalleberg and Eelco Visser. Strategic graph rewriting: Transforming and traversing terms with references. In Proceedings of the 6th International Workshop on Reduction Strategies in Rewriting and Programming, Seattle, Washington, August 2006.
[KW94] Uwe Kastens and William M. Waite. Modularity and reusability in attribute grammars. Acta Informatica, 31(7):601–627, 1994.
[Lam02] Ralf Lämmel. Towards Generic Refactoring. In Proceedings of Third ACM SIGPLAN Workshop on Rule-Based Programming (RULE '02), Pittsburgh, USA, 2002. ACM Press.
[LDG+04] Xavier Leroy, Damien Doligez, Jacques Garrigue, Didier Rémy, and Jérôme Vouillon. The Objective Caml System. http://caml.inria.fr/, 2004.
[LH04] Ondřej Lhoták and Laurie Hendren. Jedd: A BDD-based relational extension of Java. In Proceedings of the ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI '04), pages 158–169, 2004.
[LJVWF02] David Lacey, Neil D. Jones, Eric Van Wyk, and Carl Christian Frederiksen. Proving correctness of compiler optimizations by temporal logic. In Proceedings of the 29th ACM symposium on Principles of Programming Languages (POPL '02), pages 283–294, 2002.
[Llo87] John W. Lloyd. Foundations of Logic Programming (second edition). Springer-Verlag, 1987.
[LM01] David Lacey and Oege de Moor. Imperative program transformation by rewriting. In R. Wilhelm, editor, Proceedings of the 10th International Conference on Compiler Construction (CC '01), volume 2027 of Lecture Notes in Computer Science, pages 52–68. Springer Verlag, 2001.
[LM07] Ralf Lämmel and Erik Meijer. Revealing the X/O impedance mismatch (Changing lead into gold). In Roland Backhouse, Jeremy Gibbons, Ralf Hinze, and Johan Jeuring, editors, Datatype-Generic Programming, LNCS. Springer-Verlag, 2007.
[LMC03] Sorin Lerner, Todd Millstein, and Craig Chambers. Automatically proving the correctness of compiler optimizations. In Proceedings of the ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI '03), pages 220–231, 2003.
[LMRC05] Sorin Lerner, Todd Millstein, Erika Rice, and Craig Chambers. Automated soundness proofs for dataflow analyses and transformations via local rules. In Proceedings of the 32nd ACM symposium on Principles of Programming Languages, pages 364–377, 2005.
[LRY+04] Yanhong Annie Liu, Tom Rothamel, Fuxiang Yu, Scott D. Stoller, and Nanjun Hu. Parametric regular path queries. In Proceedings of the ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI '04), pages 219–230, New York, NY, USA, 2004. ACM Press.
[LS06] Yanhong Annie Liu and Scott D. Stoller. Querying complex graphs. In P. Van Hentenryck, editor, Proceedings of the 8th International Symposium on Practical Aspects of Declarative Languages (PADL '06), pages 16–30, 2006.
[LV02] Ralf Lämmel and Joost Visser. Typed Combinators for Generic Traversal. In Proceedings of Practical Aspects of Declarative Programming (PADL '02), volume 2257 of LNCS, pages 137–154. Springer-Verlag, January 2002.
[LV03] Ralf Lämmel and Joost Visser. A Strafunski Application Letter. In Proceedings of Practical Aspects of Declarative Programming (PADL '03), volume 2562 of LNCS, pages 357–375. Springer-Verlag, 2003.
[Mad98] William Maddox. Incremental static semantic analysis. Technical Report UCB/CSD-97-948, University of California, Berkeley, 1998.
[MBB06] Erik Meijer, Brian Beckman, and Gavin Bierman. LINQ: reconciling object, relations and XML in the .NET framework. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data (SIGMOD '06), pages 706–706, New York, NY, USA, 2006. ACM Press.
[MDJ02] Tom Mens, Serge Demeyer, and Dirk Janssens. Formalising behaviour preserving program transformations. In Graph Transformation, volume 2505 of Lecture Notes in Computer Science, pages 286–301, 2002.
[Mil04] Todd Millstein. Practical predicate dispatch. In Proceedings of the 19th ACM conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '04). ACM Press, 2004.
[MLVW03] Oege de Moor, David Lacey, and Eric Van Wyk. Universal regular path queries. Higher-Order and Symbolic Computation, 16(1-2):15–35, 2003.
[Mos06] Maxim Mossienko. Structural search and replace: What, why and how-to. http://www.jetbrains.com/idea/docs/ssr.pdf, 2006.
[MTHM97] Robin Milner, Mads Tofte, Robert Harper, and David MacQueen. The Definition of Standard ML (Revised). MIT Press, May 1997.
[MTR05] Tom Mens, Gabriele Taentzer, and Olga Runge. Detecting structural refactoring conflicts using critical pair analysis. Electronic Notes in Theoretical Computer Science, 127(3):113–128, 2005.
[Muc97] Steven S. Muchnick. Advanced compiler design and implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[MV04] Edward McCormick and Kris De Volder. JQuery: finding your way through tangled code. In Companion to the 19th annual ACM SIGPLAN conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '04), pages 9–10, New York, NY, USA, 2004. ACM Press.
[NNH99] Flemming Nielson, Hanne Riis Nielson, and Chris Hankin. Principles of Program Analysis. Springer, 1999.
[Ode07] Martin Odersky. The Scala Programming Language. http://www.scala-lang.org, 2007.
[OO84] Karl J. Ottenstein and Linda M. Ottenstein. The program dependence graph in a software development environment. Software Development Environments (SDE), pages 177–184, 1984.
[Opd92] William F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis,University of Illinois at Urbana-Champaign, 1992.
[OV02] Karina Olmos and Eelco Visser. Strategies for source-to-source constant propagation. In B. Gramlich and S. Lucas, editors, Workshop on Reduction Strategies in Rewriting and Programming, volume 70 of Electronic Notes in Theoretical Computer Science. Elsevier Science Publishers, May 2002.
[Pai94] R. Paige. Viewing a program transformation system at work. In Manuel Hermenegildo and Jaan Penjam, editors, Proceedings of the Sixth International Symposium on Programming Language Implementation and Logic Programming, pages 5–24. Springer Verlag, 1994.
[Pay06] Arnaud Payement. Type-based refactoring using JunGL. Master's thesis, University of Oxford, 2006.
[PDR91] Geoffrey Phipps, Marcia A. Derr, and Kenneth A. Ross. Glue-Nail: a deductive database system. In Proceedings of the 1991 ACM SIGMOD international conference on Management of data (SIGMOD '91), pages 308–317, New York, NY, USA, 1991. ACM.
[Prz88] Teodor C. Przymusinski. On the declarative semantics of deductive databases and logic programs. In Foundations of Deductive Databases and Logic Programming, pages 193–216. Morgan Kaufmann, 1988.
[RBJ97] Don Roberts, John Brant, and Ralph Johnson. A refactoring tool for Smalltalk. Theory and Practice of Object Systems, 3(4):253–263, 1997.
[Rep93] Thomas W. Reps. Demand interprocedural program analysis using logic databases. In Proceedings of the Workshop on Programming with Logic Databases, pages 163–196, 1993.
[RG02] Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems (third edition). McGraw-Hill Higher Education, 2002.
[Rob99] Don B. Roberts. Practical Analysis for Refactoring. PhD thesis, University of Illinois at Urbana-Champaign, 1999.
[Ros94] Kenneth A. Ross. Modular stratification and magic sets for Datalog programs with negation. Journal of the ACM, 41(6):1216–1266, 1994.
[RS82] J. Alan Robinson and Ernest E. Sibert. LOGLISP: Motivation, design and implementation. In K. L. Clark and S.-Å. Tärnlund, editors, Logic Programming, pages 299–313. Academic Press, 1982.
[RSSS94] Raghu Ramakrishnan, Divesh Srivastava, S. Sudarshan, and Praveen Seshadri. The CORAL deductive system. The VLDB Journal, 3(2):161–210, 1994.
[RT84] Thomas Reps and Tim Teitelbaum. The Synthesizer Generator. ACM SIGSOFT Software Engineering Notes, 9(3):42–48, 1984.
[SAK07] Daniel Speicher, Malte Appeltauer, and Günter Kniesel. Code analyses for refactoring by source code patterns and logical queries. In Proceedings of the 1st Workshop on Refactoring Tools, pages 17–20, 2007.
[SdML04] Ganesh Sittampalam, Oege de Moor, and Ken Friis Larsen. Incremental execution of transformation specifications. In Proceedings of the 31st ACM symposium on Principles of Programming Languages (POPL '04), pages 26–38, 2004.
[SEdM08] Max Schäfer, Torbjörn Ekman, and Oege de Moor. Sound and extensible renaming for Java. In Proceedings of the 23rd ACM conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '08), 2008. To appear.
[Ser01] Silvija Seres. The Algebra of Logic Programming. PhD thesis, University of Oxford, 2001.
[SH04] Peter Sestoft and Henrik I. Hansen. C# Precisely. MIT Press, 2004.
[Sim07] The Simplify decision procedure. http://kind.ucd.ie/products/opensource/Simplify/, 2007.
[Spi90] Michael Spivey. A functional theory of exceptions. Science of Computer Programming, 14(1):25–42, 1990.
[Spi00] Michael Spivey. Combinators for breadth-first search. Journal of FunctionalProgramming, 10(4):397–408, 2000.
[SR05] Diptikalyan Saha and C. R. Ramakrishnan. Incremental and demand-driven points-to analysis using logic programming. In Proceedings of the 7th ACM SIGPLAN international conference on Principles and Practice of Declarative Programming (PPDP '05), pages 117–128, New York, NY, USA, 2005. ACM.
[SS99] Michael Spivey and Silvija Seres. Embedding Prolog in Haskell. In Haskell '99, Technical Report UU-CS-1999-28, Department of Computer Science, University of Utrecht, 1999.
[SSH99] Silvija Seres, Michael Spivey, and C. A. R. Hoare. Algebra of logic programming. In Proceedings of the International Conference on Logic Programming (ICLP '99), pages 184–199, 1999.
[SSL01] Frank Simon, Frank Steinbrückner, and Claus Lewerentz. Metrics based refactoring. In Proceedings of the Fifth European Conference on Software Maintenance and Reengineering (CSMR '01), page 30, Washington, DC, USA, 2001. IEEE Computer Society.
[SSW94] Konstantinos Sagonas, Terrance Swift, and David S. Warren. XSB as an efficient deductive database engine. In Proceedings of the 1994 ACM SIGMOD international conference on Management of data (SIGMOD '94), pages 442–453, New York, NY, USA, 1994. ACM.
[ST08] Nik Sultana and Simon Thompson. Mechanical verification of refactorings. In Proceedings of the 2008 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation (PEPM '08), pages 51–60, New York, NY, USA, 2008. ACM.
[Sym05] Don Syme. F# Home Page. http://research.microsoft.com/fsharp/fsharp.aspx, 2005.
[TKB03] Frank Tip, Adam Kiezun, and Dirk Bäumer. Refactoring for generalization using type constraints. In Proceedings of the 18th ACM conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '03), pages 13–26, 2003.
[TM03] Tom Tourwé and Tom Mens. Identifying refactoring opportunities using logic meta programming. In Proceedings of the Seventh European Conference on Software Maintenance and Reengineering (CSMR '03), page 91, Washington, DC, USA, 2003. IEEE Computer Society.
[Tom87] Masaru Tomita. An efficient augmented-context-free parsing algorithm. Computational Linguistics, 13(1-2):31–46, 1987.
[TZ86] Shalom Tsur and Carlo Zaniolo. LDL: A logic-based data language. In Proceedings of the 12th International Conference on Very Large Data Bases (VLDB '86), pages 33–41, San Francisco, CA, USA, 1986. Morgan Kaufmann Publishers Inc.
[Ull89] J. D. Ullman. Bottom-up beats top-down for Datalog. In Proceedings of the eighth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems (PODS '89), pages 140–149, New York, NY, USA, 1989. ACM.
[Ull94] Jeffrey D. Ullman. Assigning an appropriate meaning to database logic with negation. Computers as Our Better Partners, pages 216–225, 1994.
[Van05] Ivan Vankov. Relational approach to program slicing. Master's thesis, University of Amsterdam, 2005.
[vdBHdJ+01] Mark van den Brand, Jan Heering, Hayco de Jong, Merijn de Jonge, Tobias Kuipers, Paul Klint, Leon Moonen, Pieter Olivier, Jeroen Scheerder, Jurgen Vinju, Eelco Visser, and Joost Visser. The ASF+SDF Meta-Environment: a Component-Based Language Development Environment. In Proceedings of Compiler Construction (CC '01), LNCS. Springer, 2001.
[vDD04] Daniel von Dincklage and Amer Diwan. Converting Java classes to use generics. In Proceedings of the 19th ACM conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '04), pages 1–14, 2004.
[VEdM06] Mathieu Verbaere, Ran Ettinger, and Oege de Moor. JunGL: a scripting language for refactoring. In Dieter Rombach and Mary Lou Soffa, editors, Proceedings of the 28th International Conference on Software Engineering (ICSE '06), pages 172–181, New York, NY, USA, 2006. ACM Press.
[Vie86] Laurent Vieille. Recursive axioms in deductive databases: The query-subquery approach. In Larry Kerschberg, editor, Proceedings of International Conference on Expert Database Systems, 1986.
[Vis02] Eelco Visser. Meta-programming with concrete object syntax. In Generative programming and component engineering, pages 299–315, 2002.
[Vor93] Scott A. Vorthmann. Modelling and specifying name visibility and binding semantics. Technical Report CMU//CS-93-158, Carnegie Mellon University, 1993.
[VPdM06] Mathieu Verbaere, Arnaud Payement, and Oege de Moor. Scripting refactorings with JunGL. In Companion to the 21st ACM SIGPLAN conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA '06), pages 651–652, New York, NY, USA, 2006. ACM Press.
[vRS91] Allen van Gelder, Kenneth Ross, and John S. Schlipf. The well-founded semantics for general logic programs. Journal of the ACM, 38(3):620–650, 1991.
[W3C07] W3C. XQuery 1.0 and XPath 2.0 formal semantics. http://www.w3.org/TR/xquery-semantics/, 2007.
[WACL05] John Whaley, Dzintars Avots, Michael Carbin, and Monica S. Lam. Using Datalog and binary decision diagrams for program analysis. In Kwangkeun Yi, editor, Proceedings of the 3rd Asian Symposium on Programming Languages and Systems (APLAS '05), volume 3780. Springer-Verlag, 2005.
[Wad99a] Philip Wadler. A formal semantics of patterns in XSLT. In Markup Technologies, 1999.
[Wad99b] Philip Wadler. Two semantics for XPath. Available at http://www.cs.bell-labs.com/who/wadler/topics/xml.html, 1999.
[War92] David S. Warren. Memoing for logic programs. Communications of the ACM,35(3):93–111, 1992.
[Wei84] Mark Weiser. Program slicing. IEEE Transactions on Software Engineering, 10:352–357, 1984.
[why07] The Why verification tool. http://why.lri.fr/, 2007.
[WS97] Deborah Whitfield and Mary Lou Soffa. An approach for exploring code-improving transformations. ACM Transactions on Programming Languages and Systems, 19(6):1053–1084, 1997.