part i : introduction to polyglot with swapj · part i : introduction to polyglot with swapj...

Part I : Introduction to Polyglot with SwapJ

Raoul-Gabriel Urma - Imperial College London

This is a tutorial written for researchers and students getting started with using

Polyglot to modify or extend Java. Many thanks to Prof. Sophia Drossopoulou,

Prof. Nathaniel Nystrom and Prof. Andrew Myers for encouraging and reviewing

this tutorial.

1 Polyglot

1.1 Introduction

Polyglot is a highly extensible compiler construction framework developedby Nystrom, Clarkson and Myers at Cornell University [1]. It performsparsing and semantic checking on a language extension and the compileroutputs Java source code. It is implemented as a Java class frameworkusing design patterns to promote extensibility.

Several projects have successfully used Polyglot to extend the Java pro-gramming language; they range from large-to middle-scale projects. Forexample, Jif [2] is a language modification that extends Java with supportfor information flow control and access control, SessionJ introduces session-based Distributed Programming in Java [3] and J0 is a subset of Java usedfor education [4].

Language modifications follow the same pattern: they are implementedby extending the base grammar, type system and defining new code trans-formations on top of the base Polyglot framework. The result is a compilerthat outputs Java source code. We can then compile the output with thestandard Java compiler javac to generate bytecode runnable by the JVM.

Currently, Polyglot only supports Java version 1.4, but a language ex-tension supporting most Java version 5.0 features has been developed [5].

Polyglot comes with the Polyglot Parser Generator (PPG), a customisedLook-Ahead LR Parser based on CUP [6, 7]. It is specially developed forthe language extension developer to easily extend the Java base grammardefined with CUP by specifying the required set of changes [1]. In fact,PPG provides grammar inheritance, which enables the language extensiondeveloper to augment, modify or delete symbols and production rules fromthe Java base grammar.

The architecture of Polyglot follows standard compiler construction tech-niques. First of all, it first uses JFlex, a lexical analyzer generator [8], to

1

perform lexical analysis on the source language. This step transforms thesequence of characters from the source code file into tokens. This chainof tokens is then parsed by PPG, which creates an abstract syntax tree(AST). An AST is a tree data structure that reflects the syntactic structureof a program. During the Polyglot process, this data structure is visitedby several passes; the default set of passes include for example type check-ing, ambiguities removing and translation. In addition, Polyglot enablesthe introduction of extra passes in order to perform operations related tothe compiler purposes: for example, optimising the AST. Finally, if no ex-ceptions are thrown during the Polyglot process, the created compiler isexpected to produce valid Java source code that can be compiled into Javabytecode.

Figure 1: Polyglot high-level process

1.2 In Details

In practice, implementing new language modification requires some knowl-edge about the Polyglot structure and internals. In this section, we explorePolyglot at a deeper level.

The latest revision of Polyglot can be fetched from the project SVN [9].The Polyglot distribution contains several directories:

• /bin/ : contains Polyglot base compiler and script newext.sh thatgenerates the skeleton for a new language extension

• /doc/ : various documentation about Polyglot

• /examples/ : source codes of sample language extensions using Poly-glot

• /skel/ : skeleton directory hierarchy and files used for building newlanguage extensions

• /src/ : complete source code of Polyglot framework

• /tools/java cup/ : source code of tweaked version of Java CUP

• /tools/ppg/ : source code of PPG

2

To create a language extension called [extname], the first step is to run/bin/nexext.sh, which creates the necessary skeleton package hierarchyand files:

• [extname]/bin/[extname]c : compiler for [extname]

• [extname]/compiler/src/[extname]/ : source code for languageextension

– ast : AST nodes specific to [extname]

– extension : Extension and delegate objects specific to [extname]

– types : type objects and typing logic specific to [extname]

– visit : visitors specific to [extname]

– parse : symbols table, lexer and parser for [extname], PPG gram-mar file ([extname].ppg), lexer grammar file([extname].flex) Thenewext.sh script takes several parameter:

Listing 1: newext.sh Parameters

1 Usage: ./ newext.sh dir package LanguageName ext

2 where dir - name to use for the top -level

directory

3 and for the compiler script

4 package - name to use for the Java

package

5 LanguageName - full name of the language

6 ext - file extension for source files

In addition, a class ExtensionInfo.java defines how the language ex-tension will be parsed and type-checked. A file Version.java specifiesthe version of the language extension. Finally, the class Main.javabrings all the parts together and performs the compilation.

• [extname]/tests/ : sample [extname] source code test files

The second step is to define the language modifications. This is doneby modifying the [extname].ppg file, which is processed by PPG. It specifiesthe changes to the base Java grammar required to generate a parser forthe new language extension. Sometime the developer has to updated thelexer grammar file [extname].flex too, if new tokens are added. The full listof available instructions for the PPG grammar can be found on the PPGProject page [6]. They include:

• extend S ::= productionsthe specified productions are added to the nonterminal S.

• override S ::= productionsthe specified productions completely replace S

3

The PPG file also specifies how the parsed information is used to createa new AST. New AST nodes are instantiated through the language exten-sion’s NodeFactory, which has factory methods to create each AST node.This NodeFactory typically extends Polyglot’s Java node factory, which isdefined by the class NodeFactory c class and implements the NodeFactory

interface. This interface specifies all the factory methods for each node.Figure 2 depicts a UML diagram explaining the different classes involved inthe node factory.

Figure 2: Language extension NodeFactory UML diagramA language extension’s node factory mechanism is split into two parts: an

interface implementing the base node factory and a concrete class nodefactory extending the base concrete node factory.

The node factory can be accessed within the PPG file through theparser.nf instance. Listing 2 shows as an example the parsing and creationof an Assert Java base node.

Listing 2: Parsing and Creation of Assert AST Node

1 assert_statement ::=

2 ASSERT:x expression:a COLON expression:b SEMICOLON:d

3 {:

4 RESULT = parser.nf.Assert(parser.pos(x, d), a, b);

5 :};

This snippet defines the production rule assert statement. It is defined by an assertsymbol and an expression representing the assert condition, a colon and anotherexpression representing the error message followed by a semicolon. For example:

assert(size == 0) is a valid assertion. If the parsing is successful, a new Assert ASTnode is created through the parser.nf.Assert(Position,Expr) node factory method.

4

After defining the syntactic changes through the grammar and definingthe new AST node classes for the language modification, the next step is tospecify the new semantic changes. New passes can be defined by defining andincluding them in Extensions.java. Most of the time, semantic changes canbe implemented directly as part of the type-checking pass. Note that eachnode has a method visitChildren(NodeVisitor v) that is called beforethe type-checking process in order to disambiguate the types of the Node’sfields. New nodes defined for the language extension must therefore alsoexecute the visit on each fields. This is done by overriding visitChildren

(NodeVisitor v) and passing the instance of the visitor to each field usingthe method visitChild(Node,Visitor). For the sake of completeness,Listing 3 illustrates this mechanism and shows the method visitChildren

of the Assert node.

Listing 3: Assert node’s visitChildren(NodeVisitor) method

1 /** Visit the children of the statement. */

2 public Node visitChildren(NodeVisitor v) {

3 Expr cond = (Expr) visitChild(this.cond , v);

4 Expr errorMessage = (Expr) visitChild(this.errorMessage , v);

5 return reconstruct(cond , errorMessage);

6 }

The Swap node has two fields: the condition expression (this.cond) and the errormessage (this.errorMessage). Both are passed to the visitChild method. The node is

then reconstructed using the returned instances.

Type checking is done in each node by the method typeCheck(ContextVisitor

tc) of a Node. If a type error exists the method throws a SemanticException.To continue with our example of the Assert node, Listing 4 illustrates typechecking for an assert statement.

5

Listing 4: Assert node’s type checking

1 public Node typeCheck(ContextVisitor tc) throws

SemanticException {

2 if (! cond.type().isBoolean ()) {

3 throw new SemanticException (" Condition of assert

statement " +

4 "must have boolean type

.",

5 cond.position ());

6 }

7

8 if (errorMessage != null && errorMessage.type().isVoid

()) {

9 throw new SemanticException (" Error message in

assert statement " +

10 "cannot be void.",

11 errorMessage.position ()

);

12 }

13

14 return this;

15 }

The method type-checks if the expression cond is of type boolean qnds if the expressionerrorMessage is defined and not void.

In addition, the language developer can access the TypeSystem instancethrough the ContextVisitor. The TypeSystem instance defines the typesof the language and how they are related. For example, it is used to comparethe equality between two types, set new types on the expressions or checkif a type can be cast to another type. Listing 5 shows an example of usingthe TypeSystem from the ContextVistor.

6

Listing 5: Switch c’s node typechecking

1 /** Type check the statement. */


SemanticException {

3 TypeSystem ts = tc.typeSystem ();

4 Context context = tc.context ();

5

6 if (!ts.isImplicitCastValid(expr.type(), ts.Int(), context)

7 && !ts.isImplicitCastValid(expr.type(), ts.Char(),

context))

8 {

9 throw new SemanticException (" Switch index must be an

integer.", position ());

10 }

11 return this;

12 }

The index of a switch statement (switch(index)) can only be a char type or an integertype. However, any type that can be cast to an int or a char is also allowed. For

example, an Integer or a short is valid but a String isn’t. The Switch c’s typeCheckmethod gets the type system from the instance tc of the ContextVisitor and then callsthe method isImplicitCastValid(Type, Type, Context) to perform the casting checks.

The final step after the semantic analysis of each AST node is to pro-duce valid Java code. There are several options available to the languageextension developer.

First, the most extensible way is to introduce a separate pass that trans-forms the language extensions AST nodes to valid Java AST nodes andthen rely on the default Java AST translation pass to output valid Javasource code. New passes can be added by extending the default Polyglotscheduler: JLScheduler. One would then have to create a pass by extend-ing an appropriate Polyglot visitor class and schedule the pass before theCodeGenerated pass, which is responsible for translation. The created passwill be responsible for rewriting language extensions AST nodes into JavaAST nodes using the NodeFactory methods. In addition, one can also usethe Polyglot quasiquoting feature, which enables the generation of Java ASTnodes from a String (polyglot.qq.QQ). This standard technique ensures thatthe output is well-formed Java code.

Secondly, another way to translate to Java code is to simply override themethod prettyPrint(CodeWriter, PrettyPrinter) of each new Node.This method is responsible for printing the generated code to the output fileand is called by the default implementation of the method translate(CodeWriter,

Translator), which is called during the Translation pass. Although this isa quick and easy way to perform code generation, it makes it harder to ex-tend the modified language later. Furthermore, Polyglot won’t check thatthe generated Java code is valid, so errors may show up when the code iscompiled.

7

As an example, Listing 6 shows the code generation for the Assert ASTnode and Listing 7 shows the code generation for the Throw AST node.

Listing 6: Assert node translation

1 /** Write the statement to an output file. */

2 public void prettyPrint(CodeWriter w, PrettyPrinter tr) {

3 w.write (" assert ");

4 print(cond , w, tr);

5

6 if (errorMessage != null) {

7 w.write (": ");

8 print(errorMessage , w, tr);

9 }

10

11 w.write (";");

12 }

13

14 public void translate(CodeWriter w, Translator tr) {

15 if (! Globals.Options ().assertions) {

16 w.write (";");

17 }

18 else {

19 prettyPrint(w, tr);

20 }

21 }

The translate method from the Assert node by default calls the prettyPrint methodwhich handles the code generation to the output file. Note that the print(Node child,

CodeWriter w, PrettyPrinter pp) method will handle the code generation for the Nodeinstance child. Essentially it calls its prettyPrinter method.

Finally, the language extension compiler is ready and can be used byrunning Main.java.

Listing 7: Throw node translation

1 /** Write the statement to an output file. */


3 w.write ("throw ");

4 print(expr , w, tr);

5 w.write (";");

6 }

Similarly to the Assert node, the Throw node’s prettyPrint method handles codegeneration and writes code to the output file.

8

2 SwapJ

In this section, we bring the concepts introduced about Polyglot together toshow how to create a compiler for a language extension. We create SwapJ,a language that extends Java with a swap functionality. The modificationmade to Java is simple: we introduce a new swap(x,y) keyword that swapsthe contents of the arguments x and y if they are of the same type. Wedon’t support swapping array accesses and field accesses, for simplicity.

We start by formally defining the syntax changes to the Java base gram-mar and also provide the semantics and type systems for the swap operation.After, we work step by step and implement the SwapJ compiler.

2.1 Formal Definition

We describe the syntax of our new build-in swap operation:

Listing 8: SwapJ BNF

1 <statement > ::= "swap" "(" <identifier > "," <identifier > ")"

";"

2 | <statement >

We define the operational semantics of our swap operation. Swappingthe two variables x and y means to assign the content of y to x and assigningthe content of x to y in the store φ:

SwapOS

stmts, φ[x 7→ φ(y), y 7→ φ(x)], χ; v′, χ′

swap(x, y); stmts, φ, χ; v′, χ′

We also define the type rule of our swap operation:

SwapTS

P, Γ ` x : tP, Γ ` y : t

P, Γ ` swap(x, y) : void

9

2.2 Implementation

1. Build skeleton extensionAs explained in the previous section, the first step is to run newext.shto build the skeleton files and directories for our language extension:

Listing 9: Creation of SwapJ files structure

1 ./ newext.sh swapJ swapj SwapJ sj

The directory containing the skeleton file structure is SwapJ. The package name isswapj, the language name is SwapJ and the extension file for our SwapJ source

files is .sj

2. PPG grammar specificationThe next step is to specify the syntactic grammar differences to theJava base grammar. We translate our BNF specification into PPGgrammar and specify the changes in swapJ.ppg (Listing 10) as ex-plained in the previous section. Since we are adding a new token swap

we also have to modify the lexer grammar file (Listing 11).

Listing 10: PPG grammar for SwapJ

1

2 terminal Token SWAP;

3 non terminal Stmt swap_stmt;

4

5 start with goal;

6

7 swap_stmt ::= SWAP:a LPAREN name:l COMMA name:r RPAREN

SEMICOLON:b {:

8 RESULT = parser.nf.Swap(parser.pos(a,b),l.toExpr (), r.

toExpr ());

9 :};

10

11 extend statement_without_trailing_substatement ::=

swap_stmt:a {: RESULT = a; :};

We extend the Java statement and add a new Swap statement. Note that we alsoadded a token SWAP that is defined in the lexer grammar file.

Listing 11: Lexer grammar for SWAP token

1 keywords.put("swap", new Integer(sym.SWAP));

3. NodeFactory and AST Nodes

The next step is to define the new SwapJNodeFactory and create thenecessary AST node. Listing 10 shows that if parsing is successful,

10

a node Swap will be created through the node factory. We now needto specify the interfaces required, implement the concrete classes andfollow UML diagram 2 describing the AST node creations.

We create an interface Swap (Listing 12), a concrete class Swap c

(Listing 13) providing the implementation, a new factory interfaceSwapJNodeFactory (Listing 14) which provides the template for aswap node creation but also implements the base NodeFactory inter-face and the concrete node factory SwapJNodeFactory c (Listing 15)that handles the instantiation of concrete Swap nodes. UML diagram 3summarises the hierarchy and relations between the different classes.

Listing 12: Swap interface

1 public interface Swap extends Stmt{

2

3 }

A swap node is a statement and therefore implements the Stmt interface.

Listing 13: Swap c concrete class constructor

1 public class Swap_c extends Stmt_c implements Swap{

2

3 private Expr left_e;

4 private Expr right_e;

5

6 public Swap_c(Position pos , Expr left_e , Expr right_e) {

7 super(pos);

8 this.left_e = left_e;

9 this.right_e = right_e;

10

11 }

The constructor of a Swap node takes the Position in the source file from theparser and also two expressions representing the left variable and right variable toswap. The Swap c also extends the Stmt c class, which encapsulates behaviour of

any Java statement.

Listing 14: SwapJNodeFactory interface

1 /**

2 * NodeFactory for swapJ extension.

3 */

4 public interface SwapJNodeFactory extends NodeFactory {

5

6 Swap Swap(Position pos , Expr left_e , Expr right_e);

7

8 }

11

Listing 15: SwapJNodeFactory c concrete class

1 /**

2 * NodeFactory for swapJ extension.

3 */

4 public class SwapJNodeFactory_c extends NodeFactory_c

implements SwapJNodeFactory {

5

6 public Swap Swap(Position pos , Expr left_e , Expr right_e

) {

7 return new Swap_c(pos , left_e , right_e);

8 }

9

10 }

The factory method Swap(Position, Expr, Expr) returns a concrete node Swap c

Figure 3: SwapJ class diagram

4. Semantic changesThe next step is to perform type checking on the swap operation: weneed to ensure that the two arguments to swap are of the same type.As explained in the previous section, type checking is performed bythe typeCheck(ContextVisitor) method of each node. We overridethis method so it checks if the two expressions of the Swap node areof the same type. Note that we also need to override the methodvisitChildren(NodeVisitor) to ensure the types of the Swap nodearguments are disambiguated. Listing 16 shows how the Swap c con-crete class’ methods visitChildren and typeCheck are implemented.

12

Listing 16: Swap node type checking

1 public Swap reconstruct(Expr expr_l , Expr expr_r) {

2 if (this.left_e != expr_l || this.right_e != expr_r) {

3 Swap_c n = (Swap_c) copy();

4 n.left_e = expr_l;

5 n.right_e = expr_r;

6 return n;

7 }

8 return this;

9 }

10

11 @Override

12 public Node visitChildren(NodeVisitor v) {

13 Expr expr_l = (Expr) visitChild(left_e , v);

14 Expr expr_r = (Expr) visitChild(right_e , v);

15

16 return reconstruct(expr_l , expr_r);

17 }

18

19 @Override


SemanticException {

21

22 SwapJTypeSystem ts = (SwapJTypeSystem)tc.typeSystem ();

23

24 Type left_t = left_e.type();

25 Type right_t = right_e.type();

26

27 if(! left_t.typeEquals(right_t))

28 {

29 throw new SemanticException ("swap() arguments of

different types !");

30 }

31

32 return this;

33

34 }

The overriden visitChildren disambiguate the Swap c’s fields and return adisambiguated Swap node. The overriden typeCheck method uses the type

systems to verify the type equality between the two swap’s arguments.

5. Translation and code generationThe final step is translation. Since the swap translation to Java sourcecode is straightforward, we can override the prettyPrint(CodeWriterw, PrettyPrinter tr) method directly and define code generation asexplained in the previous section. We also need a fresh variable nameto perform the swap, and Polyglot provides a helper static method forthis: Name.makeFresh(). Listing 17 shows how to perform the codegeneration of a Swap node.

13

Listing 17: Swap c node’s code generation

1 @Override


3

4 // fresh variable name

5 String fresh = Name.makeFresh ().toString ();

6

7 // int temp = y;

8

9 left_e.type().print(w);

10 w.write (" " + fresh + " = ");

11 print(right_e ,w,tr);

12 w.write (";\n");

13

14 // y = x;

15 print(right_e ,w,tr);

16 w.write (" = ");

17 print(left_e ,w,tr);

18 w.write (";\n");

19

20 // x = y;

21 print(left_e ,w,tr);

22 w.write (" = " + fresh);

23 w.write (";\n");

24 }

6. TestingThe SwapJ compiler is now operational, and we can test code gener-ation. We create a swapTest class written in SwapJ and compile itwith the SwapJ compiler. Listing 18 shows the class written in SwapJand Listing 19 shows the generated Java output.

Listing 18: swapTest class written in SwapJ

1 public class swapTest {

2

3 public static void main(String [] args) {

4

5 String x = "Swapj !";

6 String y = "Java";

7

8 swap(x,y);

9 // Java Swapj!

10 System.out.println(x + " " +y);

11 swap(x,y);

12 // Swapj! Java

13 System.out.println(x + " " +y);

14 }

15 }

14

Listing 19: swapTest class after compilation

1 public class swapTest {

2

3 public static void main(String [] args) {

4 String x = "Swapj !";

5 String y = "Java";

6 java.lang.String id0 = y;

7 y = x;

8 x = id0;

9

10 System.out.println(x + " " + y);

11 java.lang.String id1 = y;

12 y = x;

13 x = id1;

14

15 System.out.println(x + " " + y);

16 }

17 }

2.3 Summary

We showed how to quickly implement a simple language extension by usingPolyglot. This section was written as a tutorial for researchers and studentsinterested to develop language extensions. In the next part of the tutorial,we will go in further depth and explore the scheduling of passes in Polyglot:we will add an AST Rewriting pass to directly transform SwapJ nodes intoJava nodes. Table 2.3 summaries the lines of code added for each new classintroduced to implement the SwapJ language extension.

Figure 4: SwapJ code modifications summaryClasses Lines of code

SwapJNodeFactory 14SwapJNodeFactory c 14Swap 7Swap c 117

Total 152

15

References

[1] Nathaniel Nystrom, Michael R. Clarkson, and Andrew C. Myers. Poly-glot: An extensible compiler framework for Java. Technical report, Cor-nell University, 2003.

[2] Nate Nystrom, Lantian Zheng, Steve Zdancewic, Andrew Myers,Stephen Chong, and K. Vikram. Java information flow. http://www.

cs.cornell.edu/jif/.

[3] Raymond Hu, Nobuko Yoshida, and Kohei Honda. Session-based dis-tributed programming in java. Technical report, Imperial College Lon-don, 2008.

[4] Andrew Jonas, Daniel Lee, and Andrew Myers. J0: A Java extensionfor beginning (and advanced) programmers. http://www.cs.cornell.

edu/Projects/j0/.

[5] Milan Stanojevic and Todd Millstein. Polyglot for Java 5. http://www.cs.ucla.edu/~todd/research/polyglot5.html.

[6] Michael Brukman and Andrew C. Myers. A parser generator for exten-sible grammars. http://www.cs.cornell.edu/projects/polyglot/

ppg.html.

[7] Princeton University and Technical University of Munich. LALR parsergenerator for java. http://www2.cs.tum.edu/projects/cup/.

[8] JFlex - the fast scanner generator for Java. http://jflex.de/.

[9] Polyglot svn. http://polyglot-compiler.googlecode.com/svn/

trunk/polyglot/.

16

http://www.cs.cornell.edu/jif/

http://www.cs.cornell.edu/jif/

http://www.cs.cornell.edu/Projects/j0/

http://www.cs.cornell.edu/Projects/j0/

http://www.cs.ucla.edu/~todd/research/polyglot5.html

http://www.cs.ucla.edu/~todd/research/polyglot5.html

http://www.cs.cornell.edu/projects/polyglot/ppg.html

http://www.cs.cornell.edu/projects/polyglot/ppg.html

http://www2.cs.tum.edu/projects/cup/

http://jflex.de/

http://polyglot-compiler.googlecode.com/svn/trunk/polyglot/

http://polyglot-compiler.googlecode.com/svn/trunk/polyglot/

List of Figures

1 Polyglot high-level process . . . . . . . . . . . . . . . . . . . . 22 Language extension NodeFactory UML diagram . . . . . . . . 43 SwapJ class diagram . . . . . . . . . . . . . . . . . . . . . . . 124 SwapJ code modifications summary . . . . . . . . . . . . . . . 15

17

Listings

1 newext.sh Parameters . . . . . . . . . . . . . . . . . . . . . . 32 Parsing and Creation of Assert AST Node . . . . . . . . . . . 43 Assert node’s visitChildren(NodeVisitor) method . . . . . . . 54 Assert node’s type checking . . . . . . . . . . . . . . . . . . . 65 Switch c’s node typechecking . . . . . . . . . . . . . . . . . . 76 Assert node translation . . . . . . . . . . . . . . . . . . . . . 87 Throw node translation . . . . . . . . . . . . . . . . . . . . . 88 SwapJ BNF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Creation of SwapJ files structure . . . . . . . . . . . . . . . . 1010 PPG grammar for SwapJ . . . . . . . . . . . . . . . . . . . . 1011 Lexer grammar for SWAP token . . . . . . . . . . . . . . . . 1012 Swap interface . . . . . . . . . . . . . . . . . . . . . . . . . . 1113 Swap c concrete class constructor . . . . . . . . . . . . . . . . 1114 SwapJNodeFactory interface . . . . . . . . . . . . . . . . . . . 1115 SwapJNodeFactory c concrete class . . . . . . . . . . . . . . . 1216 Swap node type checking . . . . . . . . . . . . . . . . . . . . 1317 Swap c node’s code generation . . . . . . . . . . . . . . . . . 1418 swapTest class written in SwapJ . . . . . . . . . . . . . . . . 1419 swapTest class after compilation . . . . . . . . . . . . . . . . 15

18

part i : introduction to polyglot with swapj · part i : introduction to polyglot with swapj...

Documents