
USENIX Association

Proceedings of the Java™ Virtual Machine Research and Technology Symposium (JVM '01)

Monterey, California, USA, April 23–24, 2001

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2001 by The USENIX Association. All Rights Reserved. For more information about the USENIX Association:

Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: [email protected] WWW: http://www.usenix.org

Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.

SableVM: A Research Framework for the Efficient Execution of Java Bytecode*

Etienne M. Gagnon and Laurie J. Hendren

Sable Research Group

School of Computer Science

McGill University

[gagnon,hendren]@sable.mcgill.ca

Abstract

SableVM is an open-source virtual machine for Java intended as a research framework for efficient execution of Java bytecode¹. The framework is essentially composed of an extensible bytecode interpreter using state-of-the-art and innovative techniques. Written in the C programming language, and assuming minimal system dependencies, the interpreter emphasizes high-level techniques to support efficient execution.

In particular, we introduce a bidirectional layout for object instances that groups reference fields sequentially to allow efficient garbage collection. We also introduce a sparse interface virtual table layout that reduces the cost of interface method calls to that of normal virtual calls. Finally, we present a technique to improve thin locks[13] by eliminating busy-wait in the presence of contention.

1 Introduction & Motivation

Over the last few years, Java[21] has rapidly become one of the most popular general purpose object-oriented (OO) programming languages. Java programs are compiled into class files which include type information and platform independent bytecode instructions. On a specific platform, a runtime system called a virtual machine[24] loads and links class files then executes bytecode instructions. The virtual machine collaborates with the standard class libraries to provide key services to Java programs, including threads and synchronization, automatic memory management (garbage collection), safety features (array bound checks, null pointer detection, code verification), reflection, dynamic class loading, and more.²

* This research is partly supported by FCAR, NSERC, and Hydro-Québec.

¹ In this document, the term Java means: the Java programming language.

Early Java virtual machines were simple bytecode interpreters. Soon, the quest for efficiency led to the addition of Just-In-Time compilers (JIT) to virtual machines, an idea formerly developed for other OO runtime systems like Smalltalk-80[17] and Self-91[15]. In a few words, a just-in-time compiler works by compiling bytecodes to machine specific code on the first invocation of a method. JITs range from the very naive, that use templates to replace each bytecode with a fixed sequence of native code instructions (early versions of Kaffe[5] did this), to the very sophisticated that perform register allocation, instruction scheduling and other scalar optimizations (e.g. [8, 23, 29, 32]).

JITs face two major problems. First, they strive to generate good code in very little time, as compile time is lost to the running application. Second, the code of compiled methods resides in memory; this increases the pressure on the memory manager and garbage collector. Recent virtual machines try to overcome these problems. The main trend is to use dynamic strategies to find hot execution paths, and only optimize these areas (e.g. [4, 10, 16]). HotSpot[4], for example, is a mixed interpreter and compiler environment. It only compiles and optimizes hot spots. Jalapeno[10, 11], on the other hand, always compiles methods (naively at first), then uses adaptive online feedback to recompile and optimize hot methods. These techniques are particularly suited to virtual machines executing long running programs in server environments. The optimizer can be relatively slow and consist of a fully fledged optimizing compiler using intermediate representations and performing costly aggressive optimizations, as compile time will be amortized over the overall running time.

² There exist static compilers that directly compile Java programs to machine code (e.g. [2, 3, 7]). The constraints of static and dynamic environments are quite different. Our research focuses on dynamic Java execution environments.

Our research complements these approaches by exploring opportunities for making the virtual machine execute efficiently. Rather than looking at fine grain techniques, like register allocation and instruction scheduling, we address the fundamental problem of data layout in a dynamic Java environment. While Java shares many properties with other object-oriented languages, the set of runtime constraints enforced by the verifier and the basic services provided for each object (hash code, locking) are unique. This leads us to revisit traditional data structures used in object-oriented runtime environments, and adapt them to fully take advantage of the properties of the Java runtime environment.

As a testbed for evaluating our proposed data structures and algorithms, we are designing and implementing SableVM, a standards conforming open-source virtual machine. Written in the C programming language, and depending on the POSIX application programming interface (API), it is meant as a small and portable interpreter³. It can be used as an experimental framework for extending the bytecode language. It can also be used as an efficient virtual machine for embedded systems, or as a profiling interpreter in a hybrid interpreter/just-in-time optimizing-compiler environment.

The remaining part of this document is structured as follows. In section 2, we state the contributions of this paper. In section 3, we give an overview of the SableVM framework. In section 4, we describe SableVM's threaded interpreter. In section 5, we introduce our classification of virtual machine memory. In section 6, we introduce our new layouts for object instances and virtual tables, and our improved thin locks. In section 7, we discuss our proposed experiments. Finally, in section 8, we present our conclusions.

³ SableVM depends on the open-source GNU Classpath[1] class library for providing standard library services.

2 Contribution

The specific contributions of this paper are as follows.

• Introduction of a bidirectional object instance layout that groups reference fields sequentially, enabling simpler and faster garbage collection tracing.

• Introduction of a sparse interface virtual table layout that enables constant time interface method lookup in presence of dynamic loading.

• Improvement of the bimodal field thin lock algorithm[13, 26] to eliminate busy-wait, without overhead in the object instance layout.

• Categorization of virtual machine memory into separate conceptual areas exhibiting different management needs.

3 Framework Overview

As shown in Figure 1, the SableVM experimental framework is a virtual machine composed of five main components: interpreter, memory manager, verifier, class loader, and native interface. In addition, the virtual machine implements various services required by the class library (e.g.: synchronization and threads).

SableVM is entirely⁴ written in portable C. Thus, its source code is readable and simple to modify. This makes it an ideal framework for testing new high-level implementation features or bytecode language extensions. For example, adding a new arithmetic bytecode instruction entails making a minor modification to the class loader, adding a few rules to the verifier, and finally adding the related interpreter code. This is relatively easy to do in SableVM, as compared to a virtual machine written in assembly language, or a virtual machine with an embedded compiler (e.g. JIT).

The current implementation of SableVM targets the Linux operating system on Intel x86 processors. It uses the GNU libc implementation of POSIX threads to provide preemptive operating system level threads.

⁴ Exceptions: We assume a POSIX system library, we use labels as values (see Figure 2(b)), and there is a single line of assembly code (compare-and-swap).

Figure 1: The SableVM experimental framework (components shown: Class Libraries / Applications, Threaded Interpreter, Memory Manager, Verifier, Class Loader, Java Native Interface, Profiler, JIT)

4 Threaded Interpreter

SableVM's interpreter is a threaded interpreter. Pure bytecode interpreters suffer from expensive dispatch costs: on every iteration, the dispatch loop fetches the next bytecode, looks up the associated implementation address in a table (explicitly, or through a switch statement), then transfers the control to that address. Direct threading[20] reduces this overhead: in the executable code stream, each bytecode is replaced by the address of its associated implementation. In addition, each bytecode implementation ends with the code required to dispatch the next opcode. This is illustrated in Figure 2. This technique eliminates the table lookup and the central dispatch loop (thus eliminating a branch instruction to the head of the loop). As these operations are expensive on modern processors, this technique has been shown to be quite effective[20, 27].

Method bodies are translated to threaded code on their first invocation. We take advantage of this translation to do some optimizations. For example, we precompute absolute branch destinations, we translate overloaded bytecodes like the GET_FIELD instruction to separate implementation addresses (GET_FIELD_INT, GET_FIELD_FLOAT, ...), and we inline constant pool references to direct operand values.
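The translation step described above can be sketched as follows. This is a minimal illustration, not SableVM's code: the four-opcode instruction set, the `cell` union, and all function names are hypothetical, and the sketch uses portable function pointers where Figure 2(b) uses labels as values. It also assumes one output cell per bytecode byte, so that a branch target can be computed in a single pass.

```c
#include <stddef.h>

/* Hypothetical mini instruction set, for illustration only. */
enum { OP_PUSH, OP_ADD, OP_GOTO, OP_HALT };

typedef union cell cell;
typedef struct vm { int stack[16]; int *sp; } vm;
typedef const cell *(*handler)(vm *, const cell *);

union cell {
    handler fn;          /* address of the opcode implementation */
    int value;           /* operand inlined at translation time */
    const cell *target;  /* precomputed absolute branch destination */
};

/* Each handler receives the cell after its opcode and returns the next pc. */
static const cell *do_push(vm *m, const cell *pc) { *++m->sp = pc[0].value; return pc + 1; }
static const cell *do_add(vm *m, const cell *pc)  { m->sp[-1] += *m->sp; --m->sp; return pc; }
static const cell *do_goto(vm *m, const cell *pc) { (void)m; return pc[0].target; }
static const cell *do_halt(vm *m, const cell *pc) { (void)m; (void)pc; return NULL; }

/* One pass: bytecode offset i maps to cell i (assumption of this sketch). */
static void translate(const signed char *code, size_t len, cell *out) {
    size_t i = 0;
    while (i < len) {
        switch (code[i]) {
        case OP_PUSH:
            out[i].fn = do_push;
            out[i + 1].value = code[i + 1];
            i += 2;
            break;
        case OP_ADD:
            out[i].fn = do_add;
            i += 1;
            break;
        case OP_GOTO:
            out[i].fn = do_goto;
            /* Relative offset (from the opcode) becomes an absolute address. */
            out[i + 1].target = out + ((ptrdiff_t)i + code[i + 1]);
            i += 2;
            break;
        default:
            out[i].fn = do_halt;
            i += 1;
            break;
        }
    }
}

static int run(const cell *code) {
    vm m;
    const cell *pc = code;
    m.sp = m.stack;
    while (pc != NULL) {
        handler h = pc->fn;
        pc = h(&m, pc + 1);
    }
    return *m.sp;
}

int threaded_demo(void) {
    /* PUSH 2; GOTO +4 (skips PUSH 99); PUSH 99; PUSH 3; ADD; HALT */
    static const signed char code[] = {
        OP_PUSH, 2, OP_GOTO, 4, OP_PUSH, 99, OP_PUSH, 3, OP_ADD, OP_HALT
    };
    cell out[sizeof code];
    translate(code, sizeof code, out);
    return run(out);
}
```

Calling `threaded_demo()` translates the small program once and then executes it without any opcode table lookup. A real translator would also split overloaded opcodes into typed variants at this point.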

This one pass translation is much simpler than the translation done by even the most naive just-in-time compiler, as each bytecode maps to an address, not a variable sized implementation. However, unlike a JIT, the threaded interpreter still pays the cost of an instruction dispatch for each bytecode. Piumarta[27] has shown a technique to eliminate this overhead within a basic block using selective inlining in a portable manner, at the cost of additional memory⁵. SableVM implements this technique optionally through a compile-time flag, as it might not be appropriate for systems with little memory.

5 Memory Management

Memory management is a central issue in the design of SableVM. Most of the high-level performance enhancements introduced in this research are related to memory management.

In this section, we classify the memory of the Java virtual machine according to the control on its management, and its allocation and release behavior. We define four categories (system, shared, thread specific, and class loader specific), and discuss how SableVM takes advantage of them.

⁵ On some processors, this technique requires one line of assembly code to synchronize the instruction and data caches.

(a) Pure bytecode interpreter

    /* code */
    char code[] = {
        ICONST_2, ICONST_2,
        ICONST_1, IADD, ...
    };
    char *pc = code;

    /* dispatch loop */
    while (true) {
        switch (*pc++) {
        case ICONST_1: *++sp = 1; break;
        case ICONST_2: *++sp = 2; break;
        case IADD:
            sp[-1] += *sp; --sp; break;
        ...
        }
    }

(b) Threaded interpreter

    /* code */
    void *code[] = {
        &&ICONST_2, &&ICONST_2,
        &&ICONST_1, &&IADD, ...
    };
    void **pc = code;

    /* implementations */
    goto **(pc++);

    ICONST_1: *++sp = 1; goto **(pc++);
    ICONST_2: *++sp = 2; goto **(pc++);
    IADD:
        sp[-1] += *sp; --sp; goto **(pc++);
    ...

Figure 2: Pure and threaded interpreters

5.1 System Memory

System memory is the portion of memory over which we, as C developers, have essentially no direct control. It consists of the memory used to store executable machine code, native C stacks, the C heap (malloc() and free()), dynamically linked native libraries, and any other uncontrollable memory.

5.2 Shared Memory

Shared memory is managed by the virtual machine and potentially allocated and modified by many threads executing methods of various class loaders.

This memory consists primarily of the Java heap (which is garbage collected), and global JNI references. The allocation and release behavior of such memory is highly application dependent, with no general allocation or release pattern.

5.3 Thread Specific Memory

Thread specific memory is also managed by the virtual machine, but it is allocated specifically for internal management of each Java thread.

This memory consists primarily of Java stacks, JNI local reference frames for each stack, and internal structures storing thread specific data like stack information, JNI virtual table, and exception status.

This memory exhibits precise allocation and release patterns. Thread specific structures have a life time similar to their related thread. So, this memory can be allocated and freed (or recycled) at the time of respective creation and death of the underlying thread. Also, stacks have a regular pattern: they grow and shrink on one side only. This property is shared by JNI local reference frames.

5.4 Class Loader Specific Memory

Class loader specific memory is managed by the virtual machine and is allocated for internal management of each class loader.

This memory consists primarily of the internal data structures used to store class loader, class, and method data structures. This includes method bodies in their various forms like bytecode, direct threaded code, inlined threaded code, and potentially compiled code (in the presence of a JIT). It also includes normal and interface virtual tables.

This memory exhibits precise allocation and release patterns. This memory is allocated at class loading time, and at various preparation, verification, and resolution execution points. This memory differs significantly from stacks and the shared garbage collected heap in that once it is allocated, it must stay at a fixed location, and it is unlikely to be released soon. The Java virtual machine specification allows for potential unloading of all classes of a class loader as a group, if no direct or indirect references to the class loader, its classes, and their instances remain. In such a case, and if a virtual machine supports class unloading, all memory used by a class loader and its classes can be released at once.

5.5 SableVM Implementation

In SableVM, thread specific memory is managed independently from shared memory. SableVM allocates thread structures on thread creation but does not release them at thread death. Instead, it manages a free list to recycle this memory on future thread creation.

Java stacks are growing structures; a memory block is allocated at thread creation, and if later the stack proves too small, the memory block is expanded, possibly moving it to another location to keep the stack contiguous and avoid fragmentation.
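The growing stack described above might look like the following sketch. It is an assumption-laden illustration rather than SableVM's implementation: the `java_stack` type and function names are hypothetical, the doubling policy is arbitrary, and the sketch tracks the stack top as an index so that it stays valid when `realloc()` moves the block. The initial size is assumed to be greater than zero.

```c
#include <stdlib.h>
#include <stdint.h>

typedef struct java_stack {
    intptr_t *words;   /* contiguous block; may move when it grows */
    size_t capacity;   /* number of allocated words */
    size_t top;        /* index of the next free word; survives a move */
} java_stack;

/* initial_words must be greater than zero in this sketch. */
int stack_init(java_stack *s, size_t initial_words) {
    s->words = malloc(initial_words * sizeof *s->words);
    s->capacity = s->words ? initial_words : 0;
    s->top = 0;
    return s->words ? 0 : -1;
}

int stack_push(java_stack *s, intptr_t v) {
    if (s->top == s->capacity) {
        /* Stack proved too small: expand, possibly relocating the block. */
        size_t bigger = s->capacity * 2;
        intptr_t *moved = realloc(s->words, bigger * sizeof *moved);
        if (moved == NULL)
            return -1;
        s->words = moved;
        s->capacity = bigger;
    }
    s->words[s->top++] = v;
    return 0;
}

/* Pushes 0..999 onto a stack that starts with room for 4 words. */
long stack_demo(void) {
    java_stack s;
    long sum = 0;
    if (stack_init(&s, 4) != 0)
        return -1;
    for (intptr_t i = 0; i < 1000; i++) {
        if (stack_push(&s, i) != 0) {
            free(s.words);
            return -1;
        }
    }
    for (size_t i = 0; i < s.top; i++)
        sum += (long)s.words[i];
    free(s.words);
    return sum;
}
```

Because the block can move, any raw pointer into the stack would have to be recomputed after a growth step; storing offsets avoids that in this sketch.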

SableVM also manages class loader specific memory independently from other memory. Each class loader has its own memory manager that allocates memory (from the system) in relatively big chunks, then redistributes this memory in smaller fragments. This has many advantages.

It allows the allocation of many small memory blocks without the usual memory space overhead, as malloc() would use additional memory space to store the size of each allocated block in anticipation of future free() calls. In the class loader specific memory category, smaller fragments will only be returned to the system as a group (in case of class unloading), so we need not keep track of individual fragment sizes.

As a corollary, class unloading is more efficient using a dedicated memory manager than using regular malloc() and free() calls, as there is no need to incrementally aggregate small memory segments, as would happen with a sequence of free() calls.
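A per-class-loader memory manager of this kind can be sketched as a simple chunk allocator. The names `loader_arena`, `arena_alloc`, and `arena_release`, as well as the 4096-byte chunk size, are assumptions made for illustration; the paper does not give SableVM's actual interface. The sketch needs C11 for `max_align_t` and `_Alignas`.

```c
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096  /* assumed chunk size, illustrative only */

typedef struct chunk {
    struct chunk *next;
    size_t used;  /* bytes already handed out from data[] */
    _Alignas(max_align_t) unsigned char data[CHUNK_SIZE];
} chunk;

typedef struct loader_arena { chunk *head; } loader_arena;

/* Hands out a fragment; no per-fragment size header is stored. */
void *arena_alloc(loader_arena *a, size_t size) {
    size_t align = _Alignof(max_align_t);
    size = (size + align - 1) / align * align;  /* keep fragments aligned */
    if (size > CHUNK_SIZE)
        return NULL;  /* oversized requests are not handled in this sketch */
    if (a->head == NULL || a->head->used + size > CHUNK_SIZE) {
        chunk *c = malloc(sizeof *c);  /* one big request to the system */
        if (c == NULL)
            return NULL;
        c->next = a->head;
        c->used = 0;
        a->head = c;
    }
    void *p = a->head->data + a->head->used;
    a->head->used += size;
    return p;
}

/* Class unloading: release everything at once, chunk by chunk. */
void arena_release(loader_arena *a) {
    while (a->head != NULL) {
        chunk *next = a->head->next;
        free(a->head);
        a->head = next;
    }
}
```

Releasing an arena costs one `free()` per chunk rather than one per fragment, which is the efficiency argument made above.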

Using dedicated memory managers allows class parsing and decoding in one pass without memory overhead, by allocating many small memory blocks. This is usually not feasible, as it is not possible to estimate the memory requirement for storing internal class information before the end of the first pass.

Finally, and importantly, a dedicated memory manager allows for irregular memory management strategies: it is possible to return sub-areas of an allocated block to the memory manager, if these sub-areas are known not to be used. We take advantage of this to improve the representation of interface method lookup tables⁶.

6 Performance Enhancements

In this section, we introduce new layouts for object instances and interface virtual tables, as well as improvements to the thin lock algorithm, leading to high-level performance enhancements in the areas of garbage collection, interface method invocation, and synchronization.

We say high-level enhancements, because these techniques are applicable to any Java virtual machine, independently from its form: interpreter, just-in-time compiler, adaptive online feedback based systems, etc.

6.1 Bidirectional Object Layout

In this subsection, we propose a new object layout that optimizes the placement of reference fields to allow efficient gc tracing.

The Java heap is by definition a garbage collected area. A Java programmer has no control on the deallocation of an object. Garbage collectors can be divided into two major classes: tracing and non-tracing collectors. Non-tracing collectors (mainly reference counting) cannot reclaim cyclic data structures, are a poor fit for concurrent programming models, and have a high reference count maintenance overhead. For this reason, Java virtual machine designers usually opt for a tracing collector.

There exist many tracing collectors[22]. The simplest models are mark-and-sweep, copying, and mark-compact. The common point to all tracing collectors (including advanced generational, conservative and incremental techniques) is that they must trace a subset of the heap, starting from a root set, looking for reachable objects. Tracing is often one of the most expensive steps of garbage collection[22]. For every root, the garbage collector (gc) looks up the type of the object to find the offset of its reference fields, then it recursively visits the objects referenced by these fields.

⁶ See section 6.2.

To provide efficient field access, it is desirable to place fields at a constant offset from the object header, regardless of inheritance. This is easily achieved in Java as instance fields can only be declared in classes (not in interfaces), and classes are restricted to single inheritance. Fields are laid out consecutively after the object header, starting with super class fields then subclass fields, as shown in Figure 3(a). When tracing such an object, the garbage collector must access the object's class information to discover the offset of its reference fields, then access the superclass information to obtain the offset of its reference fields, and so on. As this process must be repeated for each traced object, it is quite expensive.

There are three improvements that are usually applied to this naive representation. Firstly, reference fields are grouped together in the layout of each class. Secondly, each class stores an array of offsets and counts of reference fields for itself and all its super classes. Thirdly, a one word bit array is used in the virtual table to represent the layout of reference fields in small objects (each bit being set if the object instance word, at the same index, is a reference). This is shown in Figure 3(b). For big objects, the number of memory accesses needed to trace an object is n + 3 + (2 × arraysize), where n is the number of references. Two nested loops (and loop variables) are required: one to traverse the array, and one for each array element (accessing the related number of references). For smaller objects, the gc needs to access the virtual table to retrieve the bit field word, then it needs to perform a set of shift and mask operations to find the offset of reference fields. Overall, using this layout, tracing an object is a relatively complex operation.

Tracing reference fields could be much simpler if they were simply grouped consecutively. The difficulty is to group them while keeping the constant offset property in presence of inheritance.

We introduce a bidirectional object instance layout that groups reference fields while maintaining the constant offset property. The left part of Figure 4 illustrates this new layout. In the bidirectional object instance layout, the instance starting point is possibly a reference field. The instance grows both ways from the object header, which is located in the middle of the instance. References are placed before the header, and other fields are placed after it. The right part of Figure 4 illustrates the layout of array instances. Array elements are placed before or after the array instance header, depending on whether the element type is a reference or a non-reference type, respectively.

The object header contains two words (three for arrays). The first is a lock word and the second is a virtual table pointer. We use a few low-order bits of the lock word to encode the following information:

• We set the last (lowest order) bit to one, to differentiate the lock word from the preceding reference fields (which are pointers to aligned objects, thus have their last bit set to zero).

• We use another bit to encode whether the instance is an object or an array.

• If it is an array, we use 4 bits to encode its element type (boolean, byte, short, char, int, long, float, double, or reference).

• If it is an object, we use a few bits to encode (1) the number of references and (2) the number of non-reference field words of the object (or special overflow values, if the object is too big).

We also use two words of the virtual table (see Figure 5) to encode the number of reference and non-reference field words of the object if the object is too big to encode this information in the header.

At this point, we must distinguish the two ways in which an object instance can be reached by a tracing collector. The first way is through an object reference that points to the object header (which is in the middle of the object). The second way is through its starting point, in the sweep phase of a mark-and-sweep gc, or in the tospace scanning of a copying gc. In both cases, our bidirectional layout allows the implementation of simple and elegant tracing algorithms.

In the first case, the gc accesses the lock word to get the number of references n (one shift, one mask). If n is the overflow value (big object), then n is retrieved from the virtual table. Finally, the gc simply traces n references in front of the object header.
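The first case can be sketched as follows. The paper only says that "a few" low-order bits are used, so the bit positions, the 3-bit reference count, the overflow value 7, and the `vtable` structure below are all assumptions chosen for illustration, as are the function names.

```c
#include <stdint.h>
#include <stddef.h>

typedef uintptr_t word;

/* Assumed encoding; the paper does not fix exact bit positions. */
#define LOCKWORD_TAG   0x1u  /* bit 0: always 1 in a lock word */
#define NREF_SHIFT     2
#define NREF_MASK      0x7u  /* 3 bits for the reference count */
#define NREF_OVERFLOW  0x7u  /* count stored in the virtual table instead */

/* Hypothetical virtual table prefix holding the overflow counts. */
typedef struct vtable { size_t nref; size_t nonref_words; } vtable;

/* hdr[0] is the lock word, hdr[1] the virtual table pointer. */
static size_t ref_count(const word *hdr) {
    word n = (hdr[0] >> NREF_SHIFT) & NREF_MASK;  /* one shift, one mask */
    if (n == NREF_OVERFLOW)
        n = ((const vtable *)hdr[1])->nref;       /* big object */
    return (size_t)n;
}

/* Visits the n reference fields stored just in front of the header. */
size_t trace_from_header(const word *hdr, void (*visit)(word ref)) {
    size_t n = ref_count(hdr);
    for (size_t i = 1; i <= n; i++) {
        word ref = *(hdr - i);
        if (ref != 0)
            visit(ref);
    }
    return n;
}

static size_t visited;
static void count_visit(word ref) { (void)ref; visited++; }

/* Builds one instance with two references and traces it from its header. */
size_t header_trace_demo(void) {
    word a = 0, b = 0;  /* dummy referents; aligned, so their low bit is 0 */
    word heap[5];
    heap[0] = (word)&a;                                /* reference field */
    heap[1] = (word)&b;                                /* reference field */
    heap[2] = LOCKWORD_TAG | ((word)2 << NREF_SHIFT);  /* lock word, n = 2 */
    heap[3] = 0;                                       /* vtable ptr, unused here */
    heap[4] = 42;                                      /* non-reference field */
    visited = 0;
    trace_from_header(&heap[2], count_visit);
    return visited;
}
```

Note that the common path touches only the lock word and the reference fields themselves; the virtual table is consulted only on overflow.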

In the second case, the object instance is reached from its starting point in memory, which might be either a reference field or the object header (if there are no reference fields in this instance). At this point, the gc must find out whether the initial word is a reference or a lock word. This is easy to determine: the gc simply checks the state of the last bit of the word. If it is one, then the word is a lock word. If it is zero, then the word is a reference.

Figure 3: Traditional layout. (a) Naive Object Layout: an instance of class C extends B extends A, with the object header (thin lock and other bits, virtual table pointer) followed by the fields of classes A, B, and C, reference and non-reference fields interleaved. (b) Usual Object Layout: reference fields grouped within each class, with per-class (number, offset) entries and a bit field in the virtual table.

Figure 4: Bidirectional layout. Object instance: reference fields of classes C, B, and A precede the object header (thin lock word, virtual table pointer), and non-reference fields of classes A, B, and C follow it. Array of reference elements: elements precede the header. Array of primitive elements: elements follow the header.

So, for example, a copying collector, while scanning the tospace, needs only read words consecutively, checking the last bit. When set to zero, the discovered reference is traced; when set to 1, the number of non-reference field words (encoded in the lock word itself, or in the virtual table on overflow) is used to find the starting point of the next instance.
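The scan just described can be sketched as a single loop. As before, the bit positions are assumptions, all names are hypothetical, and the sketch leaves out the overflow case and the extra size word of arrays; it only counts references where a real collector would trace or forward them.

```c
#include <stdint.h>
#include <stddef.h>

typedef uintptr_t word;

/* Assumed encoding; the paper does not fix exact bit positions. */
#define LOCKWORD_TAG  0x1u  /* bit 0: 1 for a lock word, 0 for a reference */
#define NONREF_SHIFT  5
#define NONREF_MASK   0x7u  /* 3 bits: number of non-reference field words */
#define HEADER_WORDS  2     /* lock word + virtual table pointer */

/* Scans [start, end) and returns the number of instances found. */
size_t scan_tospace(const word *start, const word *end, size_t *refs_seen) {
    size_t instances = 0;
    const word *p = start;
    while (p < end) {
        if ((*p & LOCKWORD_TAG) == 0) {
            /* A reference field: a real collector would trace it here. */
            ++*refs_seen;
            ++p;
        } else {
            /* A lock word: skip the header and the non-reference fields. */
            size_t nonref = (size_t)((*p >> NONREF_SHIFT) & NONREF_MASK);
            p += HEADER_WORDS + nonref;
            ++instances;
        }
    }
    return instances;
}

/* Three instances laid out back to back, with 2, 0, and 1 references. */
size_t scan_demo(size_t *refs_seen) {
    const word heap[] = {
        8, 16, LOCKWORD_TAG | (1u << NONREF_SHIFT), 0, 7,
        LOCKWORD_TAG, 0,
        24, LOCKWORD_TAG | (2u << NONREF_SHIFT), 0, 5, 9
    };
    *refs_seen = 0;
    return scan_tospace(heap, heap + sizeof heap / sizeof heap[0], refs_seen);
}
```

Non-reference fields are never inspected bit by bit, so their values (odd or even) cannot be mistaken for lock words or references.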

In summary, using our bidirectional layout, a gc only accesses the following memory locations while tracing: reference fields and lock word, for all instances (objects and arrays), and at most three additional accesses for objects with many fields (virtual table pointer and two words in the virtual table itself).

While our work on bidirectional objects for grouping references is new, we mention some previous related work. The idea of using a bidirectional object layout (without grouping references) has been investigated[25, 28] as a means to provide efficient access to instance data and dispatch information in languages supporting multiple inheritance (most specifically C++). In [14], Bartlett proposed a garbage collector which required grouping pointers at the head of structures; this was not achieved using bidirectional structs.

6.2 Sparse Interface Virtual Tables

In this subsection, we present a virtual table layout that eliminates the overhead of interface method lookup over normal virtual method lookup.

This enhancement addresses a problem raised by multiple inheritance of interfaces in Java. The virtual machine instruction set contains an invokeinterface instruction, used to invoke interface methods. A common technique to implement this instruction is to prepare multiple virtual tables for each class: a main virtual table used for normal virtual method invocation, and one additional virtual table for each interface directly or indirectly implemented by the class[5]. Each method declared in an interface is given an index within its virtual table. After preparation, each invokeinterface has two arguments: an interface number, and a method index. On execution, the invokeinterface instruction performs its method lookup in two steps. It first looks up the appropriate virtual table (using linear, binary, or hashed search), then it retrieves the method pointer in a single operation from the virtual table entry located at the given method index. This interface lookup procedure has the following overhead over normal virtual method lookup: it needs to do a search to find the appropriate virtual table. It would be possible to implement a constant time lookup using compact encoding[31], but unfortunately, dynamic class loading requires updating this information dynamically, which is difficult to do in a multi-threaded Java environment. Our approach is simple, and does not require dynamic recomputation of tables or code rewrite.

The idea of maintaining multiple virtual tables in case of multiple inheritance is reminiscent of C++ implementations[19]. But, Java's multiple inheritance has a major semantic difference: it only applies to interfaces which may only declare method signatures without providing an implementation. Furthermore, if a Java class implements two distinct interfaces which declare the same method signature, this class satisfies both interfaces by providing a single implementation of this method. C++ allows the inheritance of distinct implementations of the same method signature.

We take advantage of this important difference to rethink the appropriate data structure needed for efficient interface method lookup. Our ideas originate from previous work on efficient method lookup in dynamically typed OO languages using selector-indexed dispatch tables[12, 18, 30]. We assign a globally unique increasing index⁷ to each method signature declared in an interface. A method signature declared in multiple interfaces has a single index. When the virtual table of a class is created, we also create an interface virtual table that grows down from the normal virtual table. This interface virtual table has a size equal to the highest index of all methods declared in the direct and indirect super interfaces of the class. For every declared super interface method, the entry at its index is filled with the address of its implementation. Interface invocation is then encoded with the invokeinterface instruction, and a single interface method index. The execution of invokeinterface can then proceed at the exact same cost as an invokevirtual.

⁷ In reality, we use a decreasing index, starting at -1, to allow direct indexing in the interface virtual table.
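The layout described above, with interface entries at negative indices, can be sketched as below. The `vtable_block` type, the method indices, and the function names are hypothetical, and the sketch omits everything else a real virtual table holds (such as the class information pointer).

```c
#include <stdlib.h>
#include <stddef.h>

typedef int (*method_fn)(void);

typedef struct vtable_block {
    method_fn *base;  /* start of the allocation (lowest interface entry) */
    method_fn *vtbl;  /* virtual table pointer: vtbl[0..] are virtual methods,
                         vtbl[-1], vtbl[-2], ... are interface methods */
} vtable_block;

int make_vtable(vtable_block *vb, size_t n_iface, size_t n_virtual) {
    size_t total = n_iface + n_virtual;
    vb->base = malloc(total * sizeof *vb->base);
    if (vb->base == NULL)
        return -1;
    for (size_t i = 0; i < total; i++)
        vb->base[i] = NULL;        /* unfilled entries are holes */
    vb->vtbl = vb->base + n_iface; /* interface table grows down from here */
    return 0;
}

/* The same lookup serves invokevirtual (index >= 0) and
   invokeinterface (index < 0): one indexed load, no search. */
static int invoke(const vtable_block *vb, ptrdiff_t index) {
    return vb->vtbl[index]();
}

/* Hypothetical global interface method indices. */
enum { IDX_RUN = -1, IDX_OTHER = -2, IDX_CLOSE = -3 };

static int impl_run(void)      { return 10; }
static int impl_close(void)    { return 30; }
static int impl_virtual0(void) { return 1; }

int sparse_vtable_demo(void) {
    vtable_block vb;
    if (make_vtable(&vb, 3, 1) != 0)
        return -1;
    vb.vtbl[IDX_RUN] = impl_run;      /* class implements this signature */
    vb.vtbl[IDX_CLOSE] = impl_close;  /* and this one; IDX_OTHER stays a hole */
    vb.vtbl[0] = impl_virtual0;
    int sum = invoke(&vb, IDX_RUN) + invoke(&vb, IDX_CLOSE) + invoke(&vb, 0);
    int hole_is_free = (vb.vtbl[IDX_OTHER] == NULL);
    free(vb.base);
    return hole_is_free ? sum : -1;
}
```

The entry at `IDX_OTHER` is never used by this class. Holes of this kind are what make the table sparse.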

The interface virtual table is a sparse array of method pointers. As more interfaces are loaded, with many interface method signatures, the amount of free space in interface virtual tables grows. The traditional approach has been to use table compression techniques to reduce the amount of free space. However, these techniques are poorly adapted to concurrent and dynamic class loading environments like the Java virtual machine, as they require dynamic recompilation.

Our approach differs. Instead of compressing interface virtual tables, we simply return the free space in them to the related class loader memory manager (see section 5.4). This memory is then used to store all kinds of other class loader related data structures. In other words, we simply recycle the free space of sparse interface virtual tables within a class loader. The layout of interface virtual tables is illustrated in Figure 5.

As interface usage in most Java programs ranges from very low to moderate, we could argue that it is unlikely that the free space returned by interface virtual tables will grow faster than the rate at which it is recycled. However, in order to handle pathological cases, we also provide a very simple technique, which incurs no runtime overhead, to limit the maximal growth of interface virtual tables. To limit this growth to N entries, we stop allocating new interface method indices as soon as index N is given. Then, new interface method signatures are encoded using traditional techniques. The trick to make this work is to encode interface calls differently, based on whether the invoked method signature has been assigned an index or not. The traditional technique used to handle overflow can safely ignore all interface methods which have already been assigned an index.

6.3 Improved Thin Locks

Our final enhancement improves upon Onodera's bimodal field locking algorithm[26], a modified version of Bacon's thin lock algorithm[13] that avoids the busy-wait transition from light to heavy mode.

Bacon's thin lock algorithm can be summarized as follows. Each object instance has a lock word in its header⁸. To acquire the lock of an object, a thread uses the compare-and-swap atomic operation to compare the current lock value to zero, and replace it with its thread identifier. If the lock value isn't zero, this means that either the lock is already inflated, in which case a normal locking procedure is applied, or the lock is thin and is already acquired by some thread. In the latter case, if the owning thread is the current one, a nesting count (in the lock word) is increased. If the owning thread is not the current one, then there is contention, and Bacon's version of the algorithm busy-waits, spinning until it acquires the lock. When it is finally acquired, it is inflated. Unlocking non-inflated locks is simple. On each unlock operation, the nesting count is decreased. When it reaches 0, the lock byte is replaced by zero, releasing the lock.

⁸ Only 24 bits of that word are used for locking on 32 bit systems. 8 bits remain free for other uses.

The advantages of this algorithm are that a single atomic operation is needed to acquire a thin lock in the absence of contention and, more importantly, no atomic operation is required to unlock an object⁹.
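The following C11 sketch illustrates the thin lock protocol summarized above. It is not Bacon's or SableVM's code. The bit layout (`INFLATED_BIT`, `OWNER_SHIFT`, `COUNT_MASK`), the function names, and the `lock_result` codes are assumptions made for illustration. Instead of spinning or entering a fat monitor, the sketch simply reports contention, inflation, or count overflow to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout of the 24 lock bits (illustrative only):
 * bit 23 = inflated flag, bits 8..22 = owner thread id,
 * bits 0..7 = number of extra (nested) acquisitions. */
#define INFLATED_BIT (1u << 23)
#define OWNER_SHIFT  8
#define COUNT_MASK   0xFFu

typedef enum { LOCK_ACQUIRED, LOCK_INFLATED, LOCK_CONTENDED, LOCK_OVERFLOW } lock_result;

/* Try to acquire a thin lock for thread `tid` (non-zero, below 2^15). */
lock_result thin_lock(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = 0;
    assert(tid != 0 && tid < (1u << 15));

    /* Uncontended case: one compare-and-swap from 0 to "owned by tid". */
    if (atomic_compare_exchange_strong(word, &expected, tid << OWNER_SHIFT))
        return LOCK_ACQUIRED;

    /* The failed compare-and-swap left the observed value in `expected`. */
    if (expected & INFLATED_BIT)
        return LOCK_INFLATED;               /* caller uses the normal locking procedure */
    if ((expected >> OWNER_SHIFT) != tid)
        return LOCK_CONTENDED;              /* thin lock owned by another thread */
    if ((expected & COUNT_MASK) == COUNT_MASK)
        return LOCK_OVERFLOW;               /* nesting count full; caller would inflate */

    /* Nested acquisition by the owner: a plain store, only the owner writes here. */
    atomic_store_explicit(word, expected + 1, memory_order_relaxed);
    return LOCK_ACQUIRED;
}

/* Release one level of a thin lock owned by `tid`: loads and stores only,
 * no compare-and-swap. */
void thin_unlock(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t v = atomic_load_explicit(word, memory_order_relaxed);
    assert(!(v & INFLATED_BIT) && (v >> OWNER_SHIFT) == tid);
    if (v & COUNT_MASK)
        atomic_store_explicit(word, v - 1, memory_order_relaxed);  /* leave one nesting level */
    else
        atomic_store_explicit(word, 0, memory_order_release);      /* fully released */
}
```

A thread that locks an object twice must call `thin_unlock` twice before another thread's `thin_lock` can succeed. A second thread calling `thin_lock` in the meantime receives `LOCK_CONTENDED`.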

Onodera eliminates the busy-wait in case of contention on a thin lock, using a single additional bit in each object instance. The role of this contention bit is to indicate that some other thread is waiting to acquire the current thin lock. Onodera's algorithm differs from the previous algorithm at two points. First, when a thread fails to acquire a thin lock (because of contention), it acquires a fat monitor for the object, sets the contention bit, checks that the thin lock was not released, then puts itself in a waiting state. Second, when a thin lock is released (i.e., the lock word is replaced by zero), the releasing thread checks the contention bit. If it is set, it inflates the lock and notifies all waiting threads¹⁰.

⁹ Unlike Agesen's recent meta-lock algorithm[9], which requires an atomic operation for unlocking objects.

¹⁰ This is a simplified description. Please refer to the original paper[26] for details.

Figure 5: Virtual table layout. [Diagram of the virtual table of class X extends ... extends A, implements Y, Z. Labels: sparse interface method lookup table (interface method pointers at negative offsets such as −3 and −8, with unused entries returned to the class loader memory manager); class info pointer; traditional virtual table layout (method pointers for java.lang.Object, class A, and class X methods at positive offsets).]

The overhead of Onodera's algorithm over Bacon's is the contention bit test on unlocking, a fairly simple non-atomic operation, and the one bit per object. This bit has the following restriction: it must not reside in the lock word. This is a problem. It is important to keep the per-object space overhead as low as possible, as Java programs tend to allocate many small objects. It is now common practice to use two-word headers in object instances: one word for the virtual table pointer, and the second for the lock and other information. The contention bit cannot reside in either of these two words (putting it in the virtual table pointer word would add execution overhead to method invocation, field access, and any other operation dereferencing this pointer). As objects need to be aligned on a word multiple (for the atomic operation to work), this one-bit overhead might well translate into a whole-word overhead for small objects. Also, it is likely that the placement of this bit will be highly type dependent, which complicates the unlocking test.
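To make the alignment argument concrete, the following C sketch computes object sizes under assumed parameters: a two-word header, a word size supplied by the caller, and one extra byte as the smallest addressable home for the contention bit. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round `size` up to the next multiple of `align` (a power of two). */
size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Size of an object with a two-word header and `payload_bytes` of fields,
 * optionally carrying one extra byte that holds a per-object contention bit.
 * The result is padded to a whole number of words. */
size_t object_size(size_t word, size_t payload_bytes, int has_contention_byte)
{
    size_t raw = 2 * word + payload_bytes + (has_contention_byte ? 1 : 0);
    return align_up(raw, word);
}
```

With 4-byte words, an object holding a single 4-byte field grows from 12 to 16 bytes once the extra byte is added, a full word of overhead. An object with a 3-byte payload stays at 12 bytes, because the byte fits in padding that was already there. This is why the overhead "might well", but need not always, cost a whole word.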

Our solution to this problem is to put the contention bit in the thread structure, instead of in the object instance. This simple modification has the advantage of eliminating the per-object overhead while maintaining the key properties of the algorithm, namely, fast thin lock acquisition with a single atomic operation, fast thin lock unlocking without atomic operations, and no busy-wait in case of contention.

We modify Onodera's algorithm as follows. In SableVM, each thread has a related data structure containing various information, like stack information and exception status. In this structure, we add the contention bit, a contention lock¹¹, and a linked list of (waiting thread, object) tuples. Then we modify the lock and unlock operations as described in the following two subsections.

6.3.1 Modifications to the lock operation

The lock operation is only modified in the case of contention on a thin lock.

When a thread $x_t$ fails to acquire a thin lock on object $z_o$ due to contention (because thread $y_t$ already owns the thin lock), then (1) thread $x_t$ acquires the contention lock of the owning thread ($y_t$), (2) sets the contention bit of thread $y_t$, then (3) checks that the lock of object $z_o$ is still thin and owned by thread $y_t$. If the check fails, (4a) the contention bit is restored to its initial value, the contention lock is released, and the lock operation is repeated. If the check succeeds, (4b) the tuple ($x_t$, $z_o$) is added to the linked list of thread $y_t$, then thread $x_t$ is put in the waiting state (temporarily releasing the contention lock of thread $y_t$ while it sleeps). Later, when thread $x_t$ wakes up (because it was signaled), it releases the re-acquired contention lock and repeats the lock operation.

¹¹ The contention lock is a simple non-recursive mutex.

6.3.2 Modifications to the unlock operation

The unlock operation is modified to check the contention bit of the currently executing thread. This check is only done when a lock is actually released (as locks are recursive), after releasing the lock.

When the lock of object $b_o$ is released by thread $y_t$, and if the contention bit of thread $y_t$ is set, then (1) thread $y_t$ acquires its own contention lock, and (2) iterates over all the elements of its tuple linked list. For each tuple ($x_t$, $z_o$), if $z_o = b_o$, thread $x_t$ is simply signaled. If $z_o \neq b_o$, the lock of object $z_o$ is inflated¹² (if it is thin), then thread $x_t$ is signaled. Finally, (3) thread $y_t$ empties its tuple linked list, clears its contention bit, and releases its contention lock.
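The modified lock and unlock steps of Sections 6.3.1 and 6.3.2 can be sketched together in C with POSIX threads. This is an illustrative reading of the description, not SableVM's source. The type and function names, the lock word layout, the per-thread condition variable `wakeup`, and the placeholder `inflate` are all assumptions. As a conservative choice, the sketch uses sequentially consistent C11 atomics for the contention bit and the lock word; the paper does not discuss memory ordering.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define INFLATED_BIT (1u << 23)  /* assumed lock word layout */
#define OWNER_SHIFT  8           /* assumed position of the owner thread id */

typedef struct object { _Atomic uint32_t lock_word; } object;
typedef struct vm_thread vm_thread;

typedef struct waiter {          /* one (waiting thread, object) tuple */
    vm_thread *thread;
    object *obj;
    struct waiter *next;
} waiter;

struct vm_thread {
    uint32_t id;                     /* non-zero thread identifier */
    atomic_bool contention_bit;
    pthread_mutex_t contention_lock; /* simple non-recursive mutex */
    pthread_cond_t wakeup;           /* hypothetical: how this thread gets signaled */
    waiter *waiters;                 /* threads waiting on thin locks we own */
};

void vm_thread_init(vm_thread *t, uint32_t id)
{
    assert(id != 0);
    t->id = id;
    atomic_init(&t->contention_bit, false);
    pthread_mutex_init(&t->contention_lock, NULL);
    pthread_cond_init(&t->wakeup, NULL);
    t->waiters = NULL;
}

/* Placeholder: real inflation would install a fat monitor. */
static void inflate(object *o) { atomic_fetch_or(&o->lock_word, INFLATED_BIT); }

static bool thin_and_owned_by(object *o, const vm_thread *t)
{
    uint32_t w = atomic_load(&o->lock_word);
    return !(w & INFLATED_BIT) && (w >> OWNER_SHIFT) == t->id;
}

/* Lock side: x failed to take the thin lock of obj, which y appeared to own.
 * Returns after at most one wait; the caller then repeats the lock operation. */
void wait_for_owner(vm_thread *x, vm_thread *y, object *obj)
{
    pthread_mutex_lock(&y->contention_lock);               /* (1) */
    bool old = atomic_exchange(&y->contention_bit, true);  /* (2) */
    if (!thin_and_owned_by(obj, y)) {                      /* (3) */
        atomic_store(&y->contention_bit, old);             /* (4a) restore the bit */
    } else {
        waiter *t = malloc(sizeof *t);                     /* (4b) record the tuple */
        if (t == NULL)
            abort();
        t->thread = x;
        t->obj = obj;
        t->next = y->waiters;
        y->waiters = t;
        /* Sleeps with y's contention lock released; re-acquires it on wakeup. */
        pthread_cond_wait(&x->wakeup, &y->contention_lock);
    }
    pthread_mutex_unlock(&y->contention_lock);
}

/* Unlock side: y calls this right after it has actually released `released`. */
void after_release(vm_thread *y, object *released)
{
    if (!atomic_load(&y->contention_bit))
        return;                                            /* common case: nobody waits */
    pthread_mutex_lock(&y->contention_lock);               /* (1) */
    for (waiter *t = y->waiters, *next; t != NULL; t = next) {  /* (2) */
        next = t->next;
        /* Inflate other locks that are still thin (ownership re-checked as a guard). */
        if (t->obj != released && thin_and_owned_by(t->obj, y))
            inflate(t->obj);
        pthread_cond_signal(&t->thread->wakeup);
        free(t);
    }
    y->waiters = NULL;                                     /* (3) */
    atomic_store(&y->contention_bit, false);
    pthread_mutex_unlock(&y->contention_lock);
}
```

In this sketch, tuples are heap-allocated and freed by the owning thread, so a waiter that wakes early and retries leaves nothing dangling. A spurious wakeup only causes an extra retry of the lock operation.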

7 Experimentation

We are conducting the following experiments to evaluate the various memory management and performance enhancement strategies.

• Implement both standard and bidirectional instance layouts, and compare the tracing speed of SableVM's copying collector on both layouts.

• Measure the memory overhead of sparse interface method lookup tables in application benchmarks. Test, using micro benchmarks, SableVM's behavior in the presence of pathological cases.

• Measure the size of class loader memory fragments returned to the memory manager for recycling. Measure how much of this memory gets effectively reused. Explore the possibility of not managing these fragments if the storage they require is insignificant.

• Evaluate the relative speed of SableVM compared to the speed of other Java virtual machines running on Linux, using a set of standard benchmarks and some micro benchmarks.

¹² Notice that thread $y_t$ necessarily owns the lock of object $z_o$, as only one lock (on object $b_o$) has been released by thread $y_t$ since it last cleared its contention bit and emptied its tuple list.

The results of these experiments can be found on the following web site [6].

8 Conclusion and Future Work

In this paper, we have presented SableVM, a framework for testing high-level performance enhancements and extensions to the Java virtual machine. SableVM is written in portable C with minimal system dependencies.

The main goal of the SableVM project was the design and implementation of an open-source virtual machine suitable for research which is easy to modify, can simply handle language and bytecode extensions, and also provides a testbed for various implementation strategies.

In particular, we have introduced in this paper new high-level techniques usable by any Java virtual machine (including JITs and hybrid systems) to support efficient execution of Java bytecode.

More specifically, we have introduced a bidirectional object layout that groups reference fields, and we showed how this layout makes tracing objects and arrays simple and efficient.

We also introduced a sparse interface virtual table layout adapted to the dynamic class loading facility of Java, which reduces the cost of an invokeinterface instruction to that of an invokevirtual. We also demonstrated that the sparse representation need not waste memory, because unused holes in the interface table can be recycled and used by the class loader memory manager.

Our last performance enhancement technique was an improvement on thin locks. We introduced a simple algorithm and related data structures that eliminate busy-wait in case of contention on a thin lock. This strategy incurs no space overhead on object instances.

Other groups have expressed an interest in adding other components to the VM, including a JIT compiler. We encourage such collaboration. The SableVM source is publicly available at: http://www.sablevm.org/.

References

[1] Classpath. www.classpath.org/.

[2] GCJ. sources.redhat.com/java/.

[3] Harissa. www.irisa.fr/compose/harissa/harissa.html.

[4] HotSpot. java.sun.com/products/hotspot/whitepaper.html.

[5] Kaffe. www.kaffe.org/.

[6] SableVM. www.sablevm.org/.

[7] Toba. www.cs.arizona.edu/sumatra/toba/.

[8] Ali-Reza Adl-Tabatabai, Michał Cierniak, Guei-Yuan Lueh, Vishesh M. Parikh, and James M. Stichnoth. Fast, effective code generation in a Just-in-Time Java compiler. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 280–290. ACM Press, 1998.

[9] Ole Agesen, David Detlefs, Alex Garthwaite, Ross Knippel, Y. S. Ramakrishna, and Derek White. An efficient meta-lock for implementing ubiquitous synchronization. In Proceedings of the Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 207–222. ACM Press, November 1999.

[10] B. Alpern, C. R. Attanasio, J. J. Barton, M. G. Burke, P. Cheng, J.-D. Choi, A. Cocchi, S. J. Fink, D. Grove, M. Hind, S. F. Hummel, D. Lieber, V. Litvinov, M. F. Mergen, T. Ngo, J. R. Russell, V. Sarkar, M. J. Serrano, J. C. Shepherd, S. E. Smith, V. C. Sreedhar, H. Srinivasan, and J. Whaley. The Jalapeño virtual machine. IBM Systems Journal, 39(1):211–238, October 2000.

[11] Bowen Alpern, Dick Attanasio, Anthony Cocchi, Derek Lieber, Stephen Smith, Ton Ngo, and John J. Barton. Implementing Jalapeño in Java. In Proceedings of the 1999 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications (OOPSLA '99), volume 34.10 of ACM SIGPLAN Notices, pages 314–324. ACM Press, November 1999.

[12] B. Cox. Object-Oriented Programming: An Evolutionary Approach. Addison-Wesley, 1987.

[13] David F. Bacon, Ravi Konuru, Chet Murthy, and Mauricio Serrano. Thin locks: Featherweight synchronization for Java. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (PLDI), pages 258–268. ACM Press, June 1998.

[14] Joel F. Bartlett. Compacting garbage collection with ambiguous roots. Technical Report 88.2, Digital – Western Research Laboratory, 1988.

[15] C. Chambers, D. Ungar, and E. Lee. An efficient implementation of SELF, a dynamically-typed object-oriented language based on prototypes. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), volume 24, pages 49–70. ACM Press, October 1989.

[16] Michał Cierniak, Guei-Yuan Lueh, and James M. Stichnoth. Practicing JUDO: Java under dynamic optimizations. In Proceedings of the ACM SIGPLAN '00 Conference on Programming Language Design and Implementation, pages 13–26, Vancouver, British Columbia, June 2000. ACM Press.

[17] Peter Deutsch and Alan M. Schiffman. Efficient implementation of the Smalltalk-80 system. In Conference Record of the Eleventh Annual ACM Symposium on Principles of Programming Languages, pages 297–302. ACM Press, January 1984.

[18] Karel Driesen. Selector table indexing & sparse arrays. SIGPLAN Notices: Proc. 8th Annual Conf. Object-Oriented Programming Systems, Languages, and Applications, OOPSLA, 28(10):259–270, September 1993.

[19] Margaret A. Ellis and Bjarne Stroustrup. The Annotated C++ Reference Manual. Addison-Wesley, December 1990.

[20] Anton M. Ertl. A portable Forth engine. www.complang.tuwien.ac.at/forth/threaded-code.html.

[21] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language Specification, Second Edition. Addison-Wesley, 2000.

[22] Richard Jones and Rafael Lins. Garbage collection: algorithms for automatic dynamic memory management. Wiley, 1996.

[23] Andreas Krall. Efficient JavaVM Just-in-Time compilation. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques (PACT '98), pages 205–212. IEEE Computer Society Press, October 1998.

[24] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification. Addison-Wesley, second edition, 1999.

[25] Andrew C. Myers. Bidirectional object layout for separate compilation. In OOPSLA '95 Conference Proceedings: Object-Oriented Programming Systems, Languages, and Applications, pages 124–139. ACM Press, October 1995.

[26] Tamiya Onodera and Kiyokuni Kawachiya. A study of locking objects with bimodal fields. In Proceedings of the 1999 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications (OOPSLA '99), volume 34.10 of ACM SIGPLAN Notices, pages 223–237. ACM Press, November 1999.

[27] Ian Piumarta and Fabio Riccardi. Optimizing direct threaded code by selective inlining. In SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 291–300. ACM Press, June 1998.

[28] William Pugh and Grant Weddell. Two-directional record layout for multiple inheritance. ACM SIGPLAN Notices, 25(6):85–91, June 1990.

[29] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani. Overview of the IBM Java Just-in-Time Compiler. IBM Systems Journal, 39(1):175–193, 2000.

[30] Jan Vitek and R. Nigel Horspool. Taming message passing: Efficient method look-up for dynamically typed languages. In Object-Oriented Programming, Proceedings of the 8th European Conference ECOOP '94, volume 821 of Lecture Notes in Computer Science, pages 432–449. Springer, July 1994.

[31] Jan Vitek, R. Nigel Horspool, and Andreas Krall. Efficient type inclusion tests. In OOPSLA '97 Conference Proceedings: Object-Oriented Programming Systems, Languages, and Applications, pages 142–157. ACM Press, October 1997.

[32] Byung-Sun Yang, Soo-Mook Moon, Seongbae Park, Junpyo Lee, SeungIl Lee, Jinpyo Park, Yoo C. Chung, Suhyun Kim, Kemal Ebcioğlu, and Erik Altman. LaTTe: A Java VM Just-in-Time compiler with fast and efficient register allocation. In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques (PACT '99), pages 128–138. IEEE Computer Society Press, October 1999.
