Distributed Object Sharing for Cluster-based Java
Virtual Machine
Fang Weijian
A thesis submitted in partial fulfillment of the requirements for
the degree of Doctor of Philosophy at the University of Hong Kong
2004
Abstract of thesis entitled
“Distributed Object Sharing for Cluster-based Java Virtual Machine”
submitted by Fang Weijian
for the degree of Doctor of Philosophy
at the University of Hong Kong in 2004
Java has already become one of the most popular programming languages
since its debut. Recent advances in Java compilation and execution technolo-
gies have further pushed Java into the arena of high performance parallel and
distributed computing. On the other hand, the computer cluster has gradu-
ally been accepted as a scalable and affordable parallel computing platform
by both academia and industry in recent years.
We were therefore inspired to design a cluster-based Java Virtual Ma-
chine (JVM) that can run unmodified multi-threaded Java applications on a
computer cluster, where Java threads can be automatically distributed to dif-
ferent computer nodes to achieve high parallelism and leverage cluster-wide
resources such as memory and network bandwidth.
In a cluster-based JVM, the shared memory nature of Java threads calls
for a global object space (GOS) that “virtualizes” a single Java object heap
spanning the cluster to facilitate transparent distributed object sharing. The
performance of the cluster-based JVM hinges on the GOS’s ability to mini-
mize the communication and coordination overheads in maintaining the sin-
gle object heap illusion.
Unlike previous approaches to building a cluster-based JVM, we
build the GOS as an object-based distributed shared memory (DSM) service
embedded in the cluster-based JVM, which facilitates the exploitation of
abundant runtime information for performance improvement. Distributed-
shared objects (DSOs) that are reachable from threads at different nodes are
detected to facilitate efficient consistency maintenance and memory manage-
ment in the cluster-based JVM.
Furthermore, based on the concept of DSO, we propose a framework to
characterize object access patterns, along three orthogonal dimensions. With
this framework, we are able to effectively characterize the runtime memory access
patterns and dynamically apply an adaptive cache coherence protocol to
minimize the consistency maintenance overhead. The adaptation mechanisms include
an adaptive object home migration method that optimizes the single-writer
access pattern, synchronized method migration that allows the execution of
a synchronized method to take place remotely at the home node of its locked
object, and connectivity-based object pushing that uses object connectivity
information to optimize the producer-consumer access pattern. Extensive
experiments have demonstrated the effectiveness of our design.
Declarations
I hereby declare that the thesis entitled “Distributed Object Sharing for
Cluster-based Java Virtual Machine” represents my own work and has not
been previously submitted to this or any other institution for a degree,
diploma, or other qualification.
————————
Fang Weijian
2004
Acknowledgements
I would like to thank my supervisors, Dr. Cho-Li Wang and Dr. Francis C.
M. Lau, for their advice and help with my research and daily life, which were
endless, patient, and invaluable. It is their encouragement and support that
have brought this research to completion. In particular, the experiences of
intensively revising papers before deadlines with Dr. Wang were painful, but
remarkably rewarding. From them, I not only learned how to write papers
but also learned how to do research. Dr. Lau is inspiring and enlightening
in directing my research.
I also want to thank my internal and external examiners for their valuable
comments on my thesis.
It was my pleasure to work with Zhu Wenzhang in my PhD study. I am
full of gratitude for his suggestions and cooperation. It was also my pleasure
to hike with him. He is energetic in both hiking and research.
I would like to thank many colleagues at HKU. They are Wang Lian,
Wang Tianqi, Chen Weisong, Chen Lin, Chen Ge, Zhu DongLai, Li Wei, Yin
Kangkai, etc. I really enjoyed the time spent with them. I also want to thank
Benny Cheung, Roy Ho, and Anthony Tam, for their help on my research
and teaching work.
Finally, I want to express my deepest gratitude to my wife and my parents.
Contents
Declarations i
Acknowledgements ii
1 Introduction 1
1.1 Java and Java Virtual Machine . . . . . . . . . . . . . . . . . 1
1.2 Cluster Computing . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Cluster-based Java Virtual Machine . . . . . . . . . . . . . . . 3
1.4 Global Object Space . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . 7
1.7 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background 10
2.1 Software Distributed Shared Memory . . . . . . . . . . . . . . 10
2.1.1 Memory Consistency Model . . . . . . . . . . . . . . . 11
2.1.2 Classification Based on the Coherence Granularity . . . 12
2.2 Java Memory Model . . . . . . . . . . . . . . . . . . . . . . . 14
3 Memory Access Pattern 17
3.1 Memory Access Pattern Optimization in DSM . . . . . . . . . 17
3.1.1 Programmer Annotation . . . . . . . . . . . . . . . . . 18
3.1.2 Compiler Analysis . . . . . . . . . . . . . . . . . . . . 19
3.1.3 Runtime Adaptation . . . . . . . . . . . . . . . . . . . 21
3.2 Access Pattern Space . . . . . . . . . . . . . . . . . . . . . . . 24
4 Distributed-Shared Object 27
4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Benefits from DSO Detection . . . . . . . . . . . . . . . . . . 28
4.2.1 Benefits on Memory Consistency Maintenance . . . . . 28
4.2.2 Benefits on Memory Management . . . . . . . . . . . . 29
4.3 Lightweight DSO Detection and Reclamation . . . . . . . . . . 30
4.4 Basic Cache Coherence Protocol . . . . . . . . . . . . . . . . . 34
5 Adaptive Cache Coherence Protocol 38
5.1 Adaptive Object Home Migration . . . . . . . . . . . . . . . . 39
5.1.1 Home Migration Concepts . . . . . . . . . . . . . . . . 40
5.1.2 Home Migration with Adaptive Threshold . . . . . . . 43
5.2 Synchronized Method Migration . . . . . . . . . . . . . . . . . 48
5.3 Connectivity-based Object Pushing . . . . . . . . . . . . . . . 51
6 Object Access Pattern Visualization 53
6.1 Object Access Trace Generator . . . . . . . . . . . . . . . . . 56
6.2 Pattern Analysis Engine . . . . . . . . . . . . . . . . . . . . . 58
6.3 Pattern Visualization Component . . . . . . . . . . . . . . . . 59
7 Implementation 63
7.1 JIT Compiler Enabled Native Instrumentation . . . . . . . . . 63
7.2 Distributed Threading and Synchronization . . . . . . . . . . 66
7.2.1 Thread Distribution . . . . . . . . . . . . . . . . . . . 67
7.2.2 Thread Synchronization . . . . . . . . . . . . . . . . . 68
7.2.3 JVM Termination . . . . . . . . . . . . . . . . . . . . . 69
7.3 Non-Blocking I/O Support . . . . . . . . . . . . . . . . . . . . 70
7.4 Distributed Class Loading . . . . . . . . . . . . . . . . . . . . 71
7.5 Garbage Collection . . . . . . . . . . . . . . . . . . . . . . . . 73
7.5.1 Local Garbage Collection . . . . . . . . . . . . . . . . . 74
7.5.2 Distributed Garbage Collection . . . . . . . . . . . . . 75
8 Performance Evaluation 78
8.1 Experiment Environment . . . . . . . . . . . . . . . . . . . . . 78
8.2 Application Suite . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2.1 CPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2.2 ASP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2.3 SOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8.2.4 NBody . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.2.5 NSquared . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.2.6 TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.3 Application Performance . . . . . . . . . . . . . . . . . . . . . 83
8.3.1 Sequential Performance . . . . . . . . . . . . . . . . . . 84
8.3.2 Parallel Performance . . . . . . . . . . . . . . . . . . . 85
8.4 Effects of Adaptations . . . . . . . . . . . . . . . . . . . . . . 89
8.4.1 Adaptive Object Home Migration . . . . . . . . . . . . 91
8.4.2 Synchronized Method Migration . . . . . . . . . . . . . 95
8.4.3 Connectivity-based Object Pushing . . . . . . . . . . . 96
8.5 Sensitivity and Robustness Analysis for HM Protocol . . . . . 97
8.6 More on Synchronized Method Migration . . . . . . . . . . . . 105
9 Related Work 109
9.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
9.2 Augmenting Java for Parallel Computing . . . . . . . . . . . . 109
9.2.1 Language Augmentation . . . . . . . . . . . . . . . . . 110
9.2.2 Class Augmentation . . . . . . . . . . . . . . . . . . . 111
9.3 Cluster-based JVM . . . . . . . . . . . . . . . . . . . . . . . . 112
9.3.1 Jackal . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
9.3.2 Hyperion . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.3.3 JavaSplit . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.3.4 cJVM . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
9.3.5 JESSICA . . . . . . . . . . . . . . . . . . . . . . . . . 119
9.3.6 Java/DSM . . . . . . . . . . . . . . . . . . . . . . . . . 120
10 Conclusion 121
10.1 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
10.1.1 Effectiveness of the Adaptations . . . . . . . . . . . . . 121
10.1.2 Which Existing JVM is Based on . . . . . . . . . . . . 123
10.1.3 Thread Migration vs. Initial Placement . . . . . . . . . 123
10.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
10.2.1 Compiler Analysis to Reduce Software Checks . . . . . 124
10.2.2 Automatic Performance Bottleneck Detection . . . . . 125
10.2.3 High Performance Communication Substrate . . . . . . 126
A Appendix 127
A.1 Overheads of GOS Primitive Operations . . . . . . . . . . . . 127
A.2 ASP Code Segment . . . . . . . . . . . . . . . . . . . . . . . . 129
A.3 The Method for Parallel Performance Breakdown . . . . . . . 130
A.4 JIT Compilation vs. Interpretation . . . . . . . . . . . . . . . 131
List of Figures
3.1 The object access pattern space . . . . . . . . . . . . . . . . . 23
4.1 The detection of distributed-shared object . . . . . . . . . . . 33
4.2 The state transition graph depicting object lifecycle in the GOS 35
5.1 Home-based Protocol for LRC with multiple-writer support . . 41
5.2 Barrier class . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.1 PAT Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Memory access operations in GOS . . . . . . . . . . . . . . . . 57
6.3 Phase parallel paradigm . . . . . . . . . . . . . . . . . . . . . 60
6.4 The time lines window . . . . . . . . . . . . . . . . . . . . . . 61
6.5 The window of object access pattern analysis result (the bigger
one) and the window of the application’s source code (the
smaller one) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.1 Pseudo code for access check: using a function call . . . . . . . 65
7.2 Pseudo code for access check: by comparison . . . . . . . . . . 65
7.3 Detailed pseudo code for a read check . . . . . . . . . . . . . . 65
7.4 IA32 assembly code for a read check . . . . . . . . . . . . . . 66
7.5 Remote unlock of a DSO . . . . . . . . . . . . . . . . . . . . . 71
7.6 JVM’s dynamical loading, linking, and initialization of classes 72
7.7 Tolerating inconsistency in DGC . . . . . . . . . . . . . . . . 74
7.8 DSO reference diffusion tree . . . . . . . . . . . . . . . . . . . 75
8.1 The typical operation in SOR . . . . . . . . . . . . . . . . . . 80
8.2 Barnes-Hut tree for 2D space decomposition . . . . . . . . . . 81
8.3 Single node performance . . . . . . . . . . . . . . . . . . . . . 84
8.4 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.5 Breakdown of normalized execution time against number of
processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.6 The adaptive protocol vs. the basic protocol . . . . . . . . . . 90
8.7 Effects of adaptations w.r.t. execution time . . . . . . . . . . . 92
8.8 Effects of adaptations w.r.t. message number . . . . . . . . . . 93
8.9 Effects of adaptations w.r.t. network traffic . . . . . . . . . . . 94
8.10 The effect of object home migration on SOR . . . . . . . . . . 95
8.11 RCounter’s Source code skeleton run by each thread . . . . . . 98
8.12 Effects of home migration protocols against repetition of single-
writer pattern: normalized execution time (RCounter) . . . . . 100
8.13 Effects of home migration protocols against repetition of single-
writer pattern: normalized message number (RCounter) . . . . 100
8.14 DSOR’s Source code skeleton run by each thread . . . . . . . 102
8.15 Effects of home migration protocols against repetition of single-
writer pattern: normalized execution time (DSOR) . . . . . . 104
8.16 Effects of home migration protocols against repetition of single-
writer pattern: normalized message number (DSOR) . . . . . 104
8.17 Effect of synchronized method migration on the barrier oper-
ation against the number of processors . . . . . . . . . . . . . 106
8.18 ASP’s execution times on different problem sizes . . . . . . . . 107
9.1 JavaSplit’s code sample to send and receive objects . . . . . . 117
A.1 The source code to measure GOS primitive operations . . . . 128
A.2 JIT Compilation vs. Interpretation . . . . . . . . . . . . . . . 132
List of Tables
4.1 Coherence protocols according to object type . . . . . . . . . . 37
8.1 Communication effort on 16 processors . . . . . . . . . . . . . 89
A.1 Overheads (in microseconds) of primitive operations with re-
spect to different number of threads . . . . . . . . . . . . . . . 128
Chapter 1
Introduction
1.1 Java and Java Virtual Machine
In less than ten years, Java [31] has become one of the most popular pro-
gramming languages since its debut on May 23, 1995 at SunWorld ’95. The
following features of Java contribute to its success.
• Java adopts a simplified C++-like grammar, which makes it a simple
yet expressive object-oriented language.
• Java is also a concurrent programming language, as it supports multi-
threading.
• Java is by design a platform-independent language, through the intro-
duction of the bytecode. Java’s source code is first compiled to the
standard bytecode, which in turn can run on any platform where there
is a Java Virtual Machine (JVM) [60]. JVM is the runtime system
responsible for executing Java bytecode.
• JVM provides some very attractive runtime features, such as automatic
memory management through garbage collection [79], multi-threading
support, and runtime safety checks that include array boundary checks
as well as reference type checks.
• The Java Development Kit (JDK) provides abundant libraries, supporting
Collections, Sockets, Remote Method Invocation (RMI) [75], Object
Serialization [77], and more.
Although Java has long been considered a productive and universal lan-
guage, its performance was unsatisfactory due to the poor performance of
the JVM. However, recent advances in Java compilation and ex-
ecution technology, such as the just-in-time compiler [73], the hotspot tech-
nology [76], and the incremental garbage collection [52], add to the attrac-
tiveness of Java as a language for high performance scientific and engineering
computing [6]. As a consequence, more and more researchers are adopting
Java in high performance parallel and distributed computing [19][20].
1.2 Cluster Computing
A cluster is a type of parallel or distributed processing system, which consists
of a collection of interconnected stand-alone computers working together as
a single, integrated computing resource [32]. In recent years, the computer
cluster has been widely accepted as a scalable and affordable parallel com-
puting platform by both academia and industry [30, 26, 41]. For example,
in the TOP500 list [17] released in November 2003, 41.6% of the supercom-
puters, i.e., 208 systems, are clusters, and account for 49.8% of the aggregated
performance.
The prosperity of cluster computing is attributed to ever advancing com-
modity high performance microprocessors and high-speed networks, as well
as open-source cluster software, such as Rocks [12] for Linux cluster soft-
ware installation, Torque [18] for resource management, Maui [10] for job
scheduling, MPICH [11] for message passing programming, and Ganglia [2]
for cluster monitoring.
Nevertheless, cluster programming is still a challenging task. One of the
major programming paradigms on clusters is message passing, e.g., following
the MPI standard [16]. The message passing paradigm requires programmers
to write explicit code to send and receive data in order to coordinate processes
on different cluster nodes. With message passing, superior performance is
usually achievable by fine-tuning the timing and content of each message,
which, however, is widely believed to be a painful and error-prone process.
Alternatively, software Distributed Shared Memory (DSM) [1] promises
better programmability than the message passing paradigm, by providing a
globally shared memory abstraction across physically distributed memory
machines. In software DSM, programmers access distributed data in the
same way as local data. Special APIs are provided to synchronize parallel
processes. To improve the performance, a shared data unit can be replicated
on multiple nodes. Inconsistency among the replicas is resolved according to
memory consistency models [21]. Since the enforcement of data coherence is
done automatically by the DSM infrastructure, communication may happen
more frequently and involve more data traffic than necessary. For example,
the update or invalidation of a cached copy is unnecessary if the copy will
never be used again.
1.3 Cluster-based Java Virtual Machine
Motivated by both the programmability of Java and the ample availability
of clusters as a cost-effective parallel computing environment, the transpar-
ent and parallel execution of multi-threaded Java programs on clusters has
become a research hotspot [62, 78, 82, 24, 61, 42, 44].
In this work, we build a cluster-based Java Virtual Machine to tackle this
problem. A cluster-based JVM conforms to the JVM specification [60], but
runs on a cluster. With a cluster-based JVM, the Java threads created within
one program can be transparently distributed onto different cluster nodes to
achieve a higher degree of execution parallelism. In addition, cluster-wide
resources such as memory, I/O, and network bandwidth can be unified and
used as a whole to solve resource-demanding problems. A cluster-based JVM
is also called a distributed JVM.
A cluster-based JVM is composed of a group of collaborating daemons,
one on each cluster node. Each daemon is a standard JVM augmented with
cluster awareness and the capability to cooperate with the other daemons in
order to present a single system image (SSI) [53] of the cluster to Java
applications. The single system image is enabled through the global object
space that will be discussed in the next section.
The adoption of the cluster-based JVM for parallel Java computing can
boost cluster programming productivity. Given that the cluster-based
JVM conforms to the JVM specification, any Java program can run on the
cluster-based JVM without any modification. The steep learning curve can
thus be avoided since the programmers do not need to learn a new parallel
language, a new message passing library, or a new tool in order to develop
parallel programs on clusters. It is also convenient for program development
as multi-threaded programs can be implemented and tested on a non-parallel
computer before they are submitted to a cluster for execution. Finally, many
existing multi-threaded Java applications, especially server applications, can
be ported to clusters when a cost-effective parallel platform is sought.
1.4 Global Object Space
In a cluster-based JVM, as Java threads are distributed around the cluster,
the shared memory nature of Java threads calls for a global object space
(GOS) that “virtualizes” a single Java object heap spanning the cluster to
facilitate transparent distributed object sharing.
In GOS, object replication is encouraged to improve the data locality,
which raises the consistency issue. The memory consistency issue is solved ac-
cording to the Java memory model (Chapter 8 of the JVM specification [60]).
Particularly, memory consistency operations are triggered by thread synchro-
nization. GOS is responsible for enforcing the Java memory model, as well
as for handling thread distribution and location-transparent synchronization. In
addition, in order to completely comply with the JVM specification, GOS
needs to perform distributed garbage collection for automatic memory man-
agement.
GOS is indeed a DSM service with functionality extensions in an object-
oriented Java system. The performance of the cluster-based JVM hinges on
the GOS’s ability to minimize the communication and coordination overheads
in maintaining the single object heap illusion. It is challenging to design and
implement a GOS that is both complete in terms of functionality and efficient
in terms of performance.
1.5 Our Approach
We design a cluster-based JVM. Different from previous approaches [82, 61]
that leverage a page-based DSM as an underlying infrastructure to build
the GOS, we build a GOS embedded in the cluster-based JVM [43]. In this
architecture, GOS is able to exploit abundant runtime information in JVM,
particularly the object type information, to improve the performance.
We leverage the runtime object connectivity information to detect distributed-
shared objects (DSOs). DSOs are the objects that are reachable from at least
two threads located at different cluster nodes in a cluster-based JVM. The
identification of DSOs allows us to handle the memory consistency problem
more precisely and efficiently. For example, in Java, synchronization primi-
tives are not only used to protect critical sections but also to maintain the
memory consistency. Clearly, only synchronizations of DSOs may involve
multiple threads on different nodes. Thus, the identification of DSOs can
reduce the frequency of consistency-related memory operations. Moreover,
since only DSOs that are replicated on multiple nodes would be involved in
the consistency maintenance, the detection of DSOs leads to a more efficient
implementation of the cache coherence protocol. The identification
of DSOs also facilitates distributed garbage collection.
The choice of a good cache coherence protocol is often application-dependent.
That is, the particular memory access patterns in an application determine
which protocol is more suitable. That motivates us to pursue an adaptive proto-
col. An adaptive cache coherence protocol is able to detect the current access
pattern and adjust itself accordingly. We believe that adaptive protocols are
superior to non-adaptive ones due to their adaptability to object access pat-
terns in applications. In our design, we use an object-based adaptive cache
coherence protocol to implement the Java memory model.
The challenges of designing an effective and efficient adaptive cache co-
herence protocol are: (1) whether we can determine those important access
patterns that occur frequently or those that contribute a significant amount
of overhead to the GOS, and (2) whether the runtime system can efficiently
and correctly identify such target access patterns and apply the correspond-
ing adaptations in a timely fashion.
To further understand the first challenge and to overcome it, we propose
the access pattern space [44] as a framework to characterize object access
behavior. This space has three dimensions: number of writers, synchro-
nization, and repetition. We identify some basic access patterns along each
dimension: multiple-writer, single-writer, and read-only for the number-of-
writers dimension; mutual exclusion and condition for the synchronization
dimension; and patterns with different numbers of consecutive repetitions for
the repetition dimension. A combination of basic patterns along the
three dimensions then portrays an actual runtime memory access pattern.
This 3-D access pattern space serves as a foundation on which we can iden-
tify those important object access patterns in the distributed JVM. We can
then choose the right adaptations to match with these access patterns and
improve the overall performance of the GOS.
To meet the second challenge, we take advantage of the fact that the GOS
is embedded in the cluster-based JVM. Our adaptive protocol can leverage
all runtime object type and access information to efficiently
and accurately identify the access patterns worthy of special focus.
We apply three different protocol adaptations to the basic home-based
multiple-writer cache coherence protocol in three respective situations in the
access pattern space: (1) adaptive object home migration which optimizes
the single-writer access pattern by moving the object’s home to the writing
node according to the access history; (2) synchronized method migration
which chooses between default object (data) movement and optional method
(control flow) movement in order to optimize the execution of critical section
methods according to some prior knowledge; (3) connectivity-based object
pushing which scales the transfer unit to optimize the producer-consumer
access pattern according to the object connectivity information.
1.6 Contributions of the Thesis
We summarize the contributions of this thesis as follows:
1. We design a global object space embedded in a cluster-based JVM
that exploits Java’s runtime information to improve the performance.
In particular, distributed-shared objects are identified at run time to
reduce the overhead of memory consistency maintenance and to facili-
tate the distributed garbage collection.
2. We propose an object access pattern space as a framework to charac-
terize the object access behavior.
3. We propose a novel object home migration protocol that optimizes the
single-writer access pattern. The protocol demonstrates both the sensi-
tivity to the lasting single-writer pattern and the robustness against the
transient single-writer pattern. In the latter case, the protocol inhibits
home migration in order to reduce the home notification overhead.
4. We propose other optimizations in our GOS, including synchronized
method migration that allows the execution of a synchronized method
to take place remotely at the home node of its locked object, and
connectivity-based object pushing that uses object connectivity infor-
mation to optimize the producer-consumer access pattern.
5. We design and implement a visualization tool called PAT (Pattern
Analysis Tool) that can be used to visualize object access traces and
analyze object access patterns in our GOS.
6. We have prototyped a cluster-based JVM with our GOS design and all
optimizations incorporated. Extensive experiments demonstrate the
performance of our GOS and the effectiveness of the optimizations.
1.7 Thesis Organization
Chapter 2 introduces the background of this research. Chapter 3 elaborates
the memory access patterns in DSM and GOS. Chapter 4 presents the con-
cept of the distributed-shared object and how we leverage it to improve the
GOS’s performance. Chapter 5 elaborates the adaptations we have adopted. Chap-
ter 6 presents our pattern analysis tool used to visualize object access pat-
terns. Chapter 7 discusses some implementation details in our cluster-based
JVM. Chapter 8 reports the experiments we conduct to measure the perfor-
mance of the prototype based on our design. Chapter 9 discusses the related
work and compares them with this work. Chapter 10 gives the conclusion
and presents a possible agenda for future work.
Chapter 2
Background
To support the truly parallel execution of Java threads on a cluster, we need a
global object space for transparent distributed object accesses. The concept
of global object space is rooted in software distributed shared memory, which
is a well-established research area in cluster computing. In this chapter, the
concepts of distributed shared memory will be introduced. We will also
discuss Java’s special constraints on the global object space, i.e., the Java
Memory Model.
2.1 Software Distributed Shared Memory
Software distributed shared memory (DSM1) [1] promises higher programma-
bility than the message passing paradigm, by providing a globally shared
memory abstraction across physically distributed memory machines. To
improve the performance, the replication of shared data is allowed. The
data consistency issue is solved by well-defined memory consistency mod-
els [21].
1 In this thesis, DSM denotes software distributed shared memory.
2.1.1 Memory Consistency Model
The memory consistency model of a DSM system provides a formal speci-
fication of how the memory system will appear to the programmer [21]. It
defines the restrictions on the legal values that a read can return among the
writes performed by other processors.
From the viewpoint of programmers, sequential consistency [58] is the
most intuitive model, which requires the memory accesses within each indi-
vidual process follow program order and writes be made atomically visible to
all the processes. Though intuitive, sequential consistency suffers from poor
performance. Sequential consistency not only prohibits some common com-
piler optimizations, such as reordering memory accesses to different memory
locations, but also results in excessive data communication on the distributed
shared memory platform [59].
In order to improve the efficiency of DSM, it has been considered to relax
the memory order constraints imposed by sequential consistency. Lazy re-
lease consistency (LRC) [56] is one of the state-of-the-art relaxed consistency
models widely used in software DSM systems. LRC distinguishes synchro-
nization variables from normal shared variables. LRC defines two operations
on synchronization variables, namely acquire and release. Acquire operations
are used to tell the memory system that a critical region is about to be en-
tered. Release operations are used to tell that a critical region is about to
be exited. In LRC, when a process P1 acquires a synchronization variable
that was most recently released by another process P2, all the writes that
are visible to P2 at the time of releasing the synchronization variable become
visible to P1.
LRC allows common compiler optimizations. LRC also allows the write
propagations to be postponed and batched until the synchronization points.
Moreover, correctly synchronized LRC programs that are data-race-free have
sequentially consistent behavior [22]. Thus, it is intuitive for programmers to
reason about the execution of a data-race-free LRC program.
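For concreteness, the following Java sketch (our own illustration; Java
monitors stand in for LRC’s acquire and release operations on a
synchronization variable) shows the one visibility guarantee LRC makes: the
write performed by P2 inside its critical region becomes visible to P1 because
P1 subsequently acquires the synchronization variable that P2 released.

    // Illustrative sketch only: synchronized blocks play the role of
    // LRC's acquire/release operations on a synchronization variable.
    class LrcExample {
        static int data = 0;                      // an ordinary shared variable
        static final Object sync = new Object(); // the synchronization variable

        public static void main(String[] args) throws InterruptedException {
            Thread p2 = new Thread(() -> {
                synchronized (sync) {             // acquire
                    data = 42;                    // write inside the critical region
                }                                 // release: the write becomes
            });                                   // attached to the release of sync
            Thread p1 = new Thread(() -> {
                synchronized (sync) {             // acquire after P2's release
                    System.out.println(data);     // guaranteed to print 42
                }
            });
            p2.start(); p2.join();                // force P2's release to precede
            p1.start(); p1.join();                // P1's acquire
        }
    }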
2.1.2 Classification Based on the Coherence Granular-
ity
According to the coherence granularity in DSM, there are three kinds of
DSM systems: page-based DSMs, whose granularity is a virtual memory
page; object-based DSMs, whose granularity is a variable-sized structured
data unit defined by the application; and fine-grain DSMs, whose granularity
is a fixed-sized memory block that is much smaller than a virtual memory
page.
Page-based DSM
Page-based DSMs’ coherence granularity is the virtual memory page. The
page-based DSM leverages the memory management unit (MMU) to inter-
cept the faulting access on a shared page that is not locally available, because
it is either obsolete or not cached at all. Then the page-based DSM fetches
the valid copy from the other nodes according to the memory consistency
model and resumes the faulting access. The advantage of page-based DSM is
that, by using the MMU, only faulting accesses are trapped, while all
non-faulting accesses proceed at full speed. However, a virtual memory page
is as large as 4K bytes, which raises the false sharing prob-
lem. The false sharing problem happens when two processes independently
access different parts of the same page. The page-based DSM’s effort for
the two processes to have the same view of the page is unnecessary for the
correctness of the program. The false sharing problem could be a serious
performance issue in page-based DSMs, particularly for applications with
fine-grain sharing characteristics.
TreadMarks [57] is a page-based DSM which adopts a homeless cache
coherence protocol to implement lazy release consistency [56]. TreadMarks
uses twin and diff techniques to support multiple processes writing on the
same shared virtual memory page simultaneously due to false sharing. On a
write fault to a local cached page, a copy of that page, called twin, is created.
Later, the diff, which records the local updates performed so far, can be
computed by comparing the current page with the previously saved twin.
The protocol is considered homeless because the diffs are saved and managed
at each process.
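The twin/diff technique can be sketched in a few lines. The following Java
fragment is our own illustration; the names makeTwin and computeDiff are
not TreadMarks’ actual API, and real systems typically compare at word
rather than byte granularity.

    // Save a copy (the twin) of a page on the first write fault; at a
    // synchronization point, derive the diff by comparing the current
    // page with the saved twin.
    class DiffSketch {
        static byte[] makeTwin(byte[] page) {
            return page.clone();
        }

        static java.util.List<int[]> computeDiff(byte[] page, byte[] twin) {
            java.util.List<int[]> diff = new java.util.ArrayList<>();
            for (int i = 0; i < page.length; i++) {
                if (page[i] != twin[i]) {
                    diff.add(new int[] { i, page[i] });  // (offset, new value)
                }
            }
            return diff;  // shipped to wherever the update must be applied
        }
    }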
Comparatively, HLRC [55] uses a home-based protocol to implement LRC.
In a home-based protocol, each shared coherence unit has a home to which
all writes (diffs) are propagated and from which all copies are derived. It
has been shown that the home-based protocol is more scalable than the homeless
protocol because the home-based protocol maintains a simpler state, sends
fewer messages, has a lower diff overhead, and consumes a much smaller
memory [55].
Object-based DSM
Having observed that the false sharing problem is rooted in the sharing gran-
ularity mismatch between the page-based DSM systems and the applications,
some researchers introduced the concept of object-based DSM. Object-based
DSMs’ coherence granularity is an object, which is a structured data unit
defined by the applications. Most existing object-based DSM systems are
language-based. They are either new parallel programming languages (e.g.,
Orca [28] and Jade [70]), or modifications of programming languages such
as C (e.g., Munin [33] and Midway [83]). In both cases, the compiler or the
preprocessor is leveraged to hook the source code with the routines in the
corresponding object-based DSM library.
Object-based DSMs reduce the false sharing problem due to the relatively
small probability that two processes independently access the different parts
of a shared object. However, they raise another performance issue. Since the
MMU cannot be used to trap faulting accesses on arbitrary-sized objects,
software checks must be inserted before the memory accesses to guarantee
the accessed objects are in the right access state. The software access checks
could introduce a large overhead in object-based DSMs.
Fine-grain DSM
The fine-grain DSM is a trade-off between the page-based DSM and the
object-based DSM. The fine-grain DSM provides a shared memory address
space just as the page-based DSM does. The copies of the same shared data
reside at the same virtual memory address on all nodes, which eases the
memory management and the data transfer among nodes. To reduce the
false sharing problem, the coherence granularity of fine-grain DSM is much
smaller than that of the page-based DSM. For example, the fine-grain DSM
Shasta [71] has a variable-sized coherence granularity, called a block, which is
a multiple of a line of 64 or 128 bytes. Software checks
are inserted before memory accesses to guarantee that the shared data are
in the right state, as in object-based DSMs. Shasta has demonstrated a set
of techniques to reduce the software checks. Jackal [78] also uses a fine-grain
DSM to build the GOS for a cluster-based JVM.
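To make the cost of such checks concrete, here is a minimal Java sketch of a
software read check of the kind inserted before shared accesses. The state
names and the fetch routine are illustrative only, not Shasta’s or Jackal’s
actual code; the access checks used in our own GOS appear in Chapter 7.

    // A software read check guarding a shared access.
    class AccessCheckSketch {
        enum State { INVALID, READ_ONLY, WRITABLE }

        static class SharedObject {
            volatile State state = State.INVALID;
            int field;
        }

        static void fetch(SharedObject obj) {
            // ...request an up-to-date copy from the object's home node...
            obj.state = State.READ_ONLY;
        }

        static int readField(SharedObject obj) {
            if (obj.state == State.INVALID) {  // the inserted software check
                fetch(obj);                    // fault in a valid copy
            }
            return obj.field;                  // non-faulting path costs only
        }                                      // one compare and one branch
    }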
2.2 Java Memory Model
Java is a programming language incorporating multi-threading support. Java
threads interact with each other through a shared memory, i.e., the object
heap. It is necessary to define the rules describing which values may be
seen by a read of shared memory that is updated by multiple threads. The
Java memory model (JMM) (chapter 8 of JVM specification [60]) defines the
memory consistency semantics of multi-threaded Java programs.
There is a lock associated with each object in Java. The Java language
provides the synchronized keyword, used in either a synchronized method or
a synchronized statement, for synchronization among multiple threads. En-
tering or exiting a synchronized block corresponds to acquiring or releasing
the lock of the specified object. A synchronized method or a synchronized
statement is used not only to guarantee exclusive accesses in the critical sec-
tion, but also to maintain memory consistency of objects among all threads
that have performed synchronization operations on the same lock.
An abstract machine is defined in JMM to describe threads’ memory
behavior. All threads share a main memory, which contains the master
copies of all variables. A variable is an object field, an array element, or a
static field. Each thread has its own working memory, which is its private
cache for all the variables it uses. A use of a variable in the main memory
causes it to be cached in the thread’s working memory.
JMM defines: before a thread releases a lock, it must copy all assigned
values in its working memory back to the main memory; before a thread
acquires a lock, it must flush (invalidate) all variables in its working mem-
ory. In this way, subsequent uses will load the up-to-date values from the
main memory. In addition, with respect to a lock, the acquire and release op-
erations performed by all threads are sequentially consistent. The acquire
and release operations have their embodiments in the Java bytecode set, i.e.,
monitorenter and monitorexit.
JMM resembles LRC in that acquire/release operations are used to es-
tablish a partial order between the memory actions performed by multiple
threads. We follow the operations defined in the JVM specification to imple-
ment JMM.
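These rules can be read off a small example. In the sketch below (our own
illustration), a thread that calls get() after another thread’s add() has
released the lock is guaranteed to observe the updated value.

    // How JMM ties visibility to the synchronized construct.
    class Counter {
        private int value;               // a variable (an object field)

        synchronized void add(int n) {   // monitorenter (acquire): flush the
            value += n;                  // working memory, then assign
        }                                // monitorexit (release): copy assigned
                                         // values back to the main memory

        synchronized int get() {         // acquire: later uses reload the
            return value;                // up-to-date value from main memory
        }
    }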
Revising JMM
Some researchers argue that the current JMM is not well designed because
it prohibits some common compiler optimizations, causes some counterintu-
itive behavior, and even makes some well known design patterns unsafe [68].
Currently, the JMM is under active revision through the JCP’s procedures [8].
The JCP (Java Community Process) is the standard procedure for evolving
Java technology through community effort under the supervision of Sun
Microsystems. Hopefully, a new JMM will be introduced in the Tiger (1.5)
release of Java to replace the original one. The latest information on the
proposed JMM can be found at Pugh’s website [15] and is still under constant
revision.
The detailed comparison between the current JMM and the proposed one
is beyond the scope of this thesis. Here we simply list some major changes
made in the proposed JMM:
The semantics of volatile variables have been strengthened to have acquire
and release semantics. A read to a volatile field has the acquire semantics
and a write to a volatile field has the release semantics.
The semantics of final fields have been strengthened to allow for thread-
safe immutability. A read on a final field will always return the correctly
initialized value as long as the object reference is not exposed during the
object construction.
In addition, the proposed JMM states that the useless synchronization
has no memory semantics. A synchronization action is useless in a number of
situations, including acquiring/releasing a lock of thread-local objects, and
re-acquiring an already acquired lock. This statement is very reasonable.
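The first two changes can be illustrated with a small publication example,
written against the proposed semantics (our own sketch):

    class Publication {
        static final class Point {
            final int x, y;              // final fields: thread-safe immutability,
            Point(int x, int y) {        // provided 'this' does not escape
                this.x = x; this.y = y;  // during construction
            }
        }

        static Point point;
        static volatile boolean ready;   // a volatile write has release semantics

        static void producer() {
            point = new Point(1, 2);
            ready = true;                // release: publishes the write to point
        }

        static void consumer() {
            if (ready) {                 // a volatile read has acquire semantics
                System.out.println(point.x + "," + point.y);
            }
        }
    }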
Based on our understanding of the current and proposed JMM, we believe
that although our cluster-based JVM mainly follows the current JMM, it can
be quickly adapted to the proposed JMM once it is officially approved.
Chapter 3
Memory Access Pattern
Our cluster-based JVM is distinguished by its adaptability to object access
patterns. In this chapter, we first survey various memory access pattern
optimizations in the area of DSM. Then we propose an access pattern space
as a framework to characterize object access behavior, which serves as a
foundation for designing effective adaptations.
3.1 Memory Access Pattern Optimization in
DSM
Although the DSM paradigm promises higher programmability than the
message passing paradigm, it may involve more communication than necessary.
For example, the update or invalidation of a cached copy is unnecessary if
the copy will never be used again. To make the performance of DSM
applications comparable to that of their message passing counterparts,
researchers are investigating various ways to reduce the communication in DSMs.
In DSM systems, many cache coherence protocols have been proposed
to implement various memory consistency models. The home-based proto-
col [55] assigns a home node to each shared data object from which all copies
are derived. It is widely believed that the home-based protocol is more scal-
able than the homeless protocol [57], because the former consumes less
memory and can eliminate diff accumulation. The home in a
home-based protocol can be either fixed [55] or mobile [35]. There are also
variations for the coherence operations, such as a multiple-writer protocol, or
a single-writer protocol. The single-writer protocol allows only one process to
write on a shared data unit at a time. In order to become the writer, a
process needs to acquire the write permission from the previous writer. The
multiple-writer protocol introduced in Munin [33] supports concurrent writes
on different copies of the same object by using the diff technique. It may
however incur heavy diff overhead compared with conventional single-writer
protocols. Another choice is between the update protocol (e.g., Orca [28])
and the invalidate protocol. The latter is used in many page-based DSM
systems such as TreadMarks [57] and JUMP [35]. The update protocol can
prefetch the data before the access, but it may send much more unneeded
data when compared with the invalidate protocol.
A promising approach to further improve the performance of DSM sys-
tems is to design adaptive cache coherence protocols that are able to de-
tect and optimize memory access patterns. The rationale is that the
particular memory access patterns in an application determine which
protocol is more suitable; that is, the choice of a good coherence protocol
is often application-dependent. This motivates the pursuit of adaptive
protocols.
In this section, we discuss three approaches to memory access pattern
optimization, namely programmer annotation, compiler analysis, and
runtime adaptation.
3.1.1 Programmer Annotation
The programmer annotation approach requires programmers to explicitly
annotate the shared data objects with pattern declarations. Strictly speak-
ing, this approach does not use an adaptive cache coherence protocol. Nev-
ertheless, it manages to optimize some memory access patterns.
Munin [33] follows the programmer annotation approach. Munin allows
programmers to explicitly annotate the object with pattern declarations,
which include conventional, read-only, migratory, and write-shared. Each
pattern has its own protocol that will be used by Munin at runtime. Munin
applies a multiple-writer protocol to the write-shared pattern, and a single-
writer protocol to the conventional pattern. For the migratory pattern, the
objects are migrated from machine to machine as critical regions are entered
and exited. The read-only data are replicated on demand without further
consistency maintenance, but a runtime error will be generated if some pro-
cess tries to write read-only data.
SAM [72] is an object-based DSM runtime system that supports the
optimization of some object access patterns, such as the producer-consumer
and accumulator patterns. An accumulator represents a piece of data that
must be updated in a critical section. SAM provides synchronization
primitives to let users explicitly tie the patterns to the object accesses. SAM
automatically migrates the accumulator data, and prefetches the producer-
consumer data before they are consumed.
3.1.2 Compiler Analysis
The programmer annotation approach allows programmers to choose the
most suitable cache coherence protocol among a set of candidates for an object
presenting a particular access pattern. Although this approach helps to im-
prove the performance, it is inconvenient for programmers and error-prone.
The compiler analysis approach tries to overcome the shortcoming of the
programmer annotation approach by leveraging compiler analysis techniques
to automatically extract the access pattern information from the programs.
Orca [28] is a language-based DSM system. At runtime, a shared object
can be either replicated on all processors, or not replicated at all. For the
replicated objects, broadcast is used to deliver updates to all replicas. For
the non-replicated objects, remote procedure calls are used to access the
objects. The actual replication policy for each object is determined
by both the compiler and the runtime system. Orca’s compiler estimates the
expected read to write ratio of each shared object in the program. For ex-
ample, an object with a large read/write ratio on a cheap broadcast network
will be replicated on all processors. Orca’s runtime system can also collect
the actual read/write information to amend the compiler derived decisions.
Although the compiler analysis approach is able to automatically extract
access pattern information from the programs, it has several shortcomings
inherited from compiler analysis techniques. Firstly, since the input of the
compiler is the source code of the program, the compiler analyzes the pro-
grams based on allocation sites. An allocation site is the location in the
program source code where object instances are created at runtime. Though
the compiler analysis works well for the situation where all object instances
created from the same allocation site present the same access pattern, it
may be difficult to distinguish among the object instances of different access
patterns from the same allocation site.
Secondly, the compiler analysis approach may find it difficult to detect
access pattern changes. Even if it is able to notice the possible changes, it
may be difficult to predict the actual time of change.
Thirdly, the compiler analysis cannot precisely predict the access patterns
without knowledge of the actual thread-to-node mapping in a multi-threaded
setting. For example, suppose two threads concurrently write to a large
shared object, which the compiler can detect. If the two threads reside on
different nodes at runtime, a multiple-writer protocol is suitable for the
shared object, with the twin and diff techniques used to support the
concurrent writers. However, if the two threads are on the same node at
runtime, all the twin and diff overheads are simply wasted.
3.1.3 Runtime Adaptation
To overcome the shortcomings of the compiler analysis approach, people are
investigating the runtime approach to optimize the memory access patterns,
called the runtime adaptation approach. It leverages the adaptive cache co-
herence protocol to detect and adapt to some particular access patterns. It
is transparent to the programmers. Since all the runtime access information
is accessible, precise and prompt access pattern optimization is possible.
Usually the runtime adaptation approach speculatively detects access
patterns based on some heuristics; false speculation can be corrected at
runtime.
Currently, most work on the runtime adaptation approach is done on
page-based DSMs. In the context of page-based DSMs, accesses to different
objects residing on the same page are mingled at the page level, so it is
difficult to detect access patterns in applications with fine-grain sharing.
Some homeless page-based DSM systems use adaptive cache coherence
protocols to optimize memory access patterns. Adaptive TreadMarks [23]
can adapt between the single-writer protocol and the multiple-writer protocol.
The single-writer protocol does not use the twin and diff technique. Instead,
a process must get the ownership of a shared page before writing on it.
Adaptive TreadMarks switches to the single-writer protocol when it observes
that the overhead of requesting and applying diffs is larger than that of
requesting the whole page. It can also perform dynamic page aggregation,
which groups several pages together as a coherence unit. When one page of
the group is faulted in, the whole group of pages is faulted in, too.
ADSM [64] can also adapt between the single-writer protocol and the
multiple-writer protocol, based on the approximate association between locks
and the data they protect. Initially, all pages are in the initial state, valid at
and owned by process 0. Any access fault will place a page in the migratory
state, until a write fault by another process happens; the page is then placed
in the multiple-writer state. The single-writer protocol is used for pages in
the migratory state, and the multiple-writer protocol is used for pages in the
multiple-writer state. From time to time, pages in the multiple-writer state
can be reset to the initial state to allow continuous adaptation.
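The adaptation just described amounts to a small per-page state machine.
The following condensed Java sketch is our own rendering of it; the state
and method names are not ADSM’s code.

    class PageStateSketch {
        enum PState { INIT, MIGRATORY, MULTI_WRITER }

        PState state = PState.INIT;  // initially valid at and owned by process 0
        int owner = 0;

        void onAccessFault(int process) {
            if (state == PState.INIT) {
                state = PState.MIGRATORY;     // single-writer protocol applies
                owner = process;
            }
        }

        void onWriteFault(int process) {
            if (state == PState.MIGRATORY && process != owner) {
                state = PState.MULTI_WRITER;  // switch to the multiple-writer
            }                                 // protocol
        }

        void reset() {
            state = PState.INIT;  // periodic reset allows continuous adaptation
        }
    }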
The asymmetry between the home copy and non-home copies in home-
based protocols raises the home assignment problem. In home-based proto-
cols, the home copy is always valid. Accesses on the home node never incur
communication overhead, while accesses on non-home nodes trigger
communication with the home node. Therefore, which node acts as the
home changes the coherence data communication pattern, and thus influ-
ences the application performance. In fact, the optimal home assignment is
determined by the memory access pattern of the application. This inspires
some dynamic home assignment protocols able to adapt to runtime memory
access patterns.
In JiaJia [51], which is a page-based DSM system, those pages that are
written by only one process between two barriers are recognized by the bar-
rier manager and their homes are migrated to the single writing process.
New home notifications are piggybacked on barrier messages. JiaJia’s home
migration protocol only optimizes the single-writer pattern. Since JiaJia’s
approach relies on barrier synchronization, it will not work if the appli-
cation does not use barriers or the DSM infrastructure does not expose the
barrier function. For example, in our case, the Java programmers have to im-
plement the barrier by using more primitive synchronization operations such
as lock/unlock/wait. Furthermore, since all the single-writer detection work
is done centrally at the barrier manager, it may cause considerable overhead
when there are a fair number of processes as well as shared pages.
JUMP [35] adopts a migrating-home protocol in which the process requesting
the page becomes the new home. The new home notification is broadcast to
other processes at synchronization points. Although this approach results in
fewer diff operations because the writes probably happen at the home node,
the home migration decision ignores the inherent memory access patterns of
the application. If the accesses by the process at the new home do not persist,
home migration will not improve the performance; instead, it could suffer
from heavy home notification overhead. The worst case happens when the
shared page is written by processes sequentially, which produces numerous
home notification messages.

Figure 3.1: The object access pattern space. (The figure plots three axes:
number of writers, with read only, single writer, and multiple writers;
synchronization, with no synchronization, condition (assignment), and mutual
exclusion (accumulator); and repetition, with the adaptation point marked on
the repetition axis.)
3.2 Access Pattern Space
According to JMM, an object’s access behavior can be described as a set
of reads and writes performed on the object, with interleaving synchroniza-
tion actions such as locks and unlocks. Locks and unlocks on the same
object are executed sequentially. Three orthogonal dimensions capturing the
characteristics of object access behavior can be defined: number of writers,
synchronization, and repetition. They form a 3-dimensional access pattern
space, as shown in Fig. 3.1.
Number of writers
Among all the accesses from different threads, a happen-before-1 [22] partial
ordering, denoted by →hb1, can be established:
• If a1 and a2 are two memory actions by the same thread, and a1 occurs
before a2 in program order, then a1 →hb1 a2.
• If a1 is an unlock by thread t1, and a2 is the following lock on the same
object by thread t2, then a1 →hb1 a2.
• If a1 →hb1 a2 and a2 →hb1 a3, then a1 →hb1 a3.
A write w1 is a concurrent write if there exists another write w2 such that
• w1 and w2 are issued by different threads; and
• w1 and w2 are on the same object; and
• neither w1 →hb1 w2 nor w2 →hb1 w1 holds.
We also say that w1 is concurrent with respect to w2, denoted by w1 ‖ w2.
A write w1 is a sequential write if there does not exist another write w2
such that w1 ‖ w2.
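A concrete Java instance of this ordering (our own illustration): whichever
of the two threads below locks o second, the unlock-lock pair orders the two
writes, so both are sequential writes; with the synchronized blocks removed,
w1 ‖ w2 would hold instead.

    class Hb1Example {
        static final Object o = new Object();
        static int x;

        static void t1() {
            synchronized (o) { x = 1; }  // w1, followed by an unlock of o
        }

        static void t2() {
            synchronized (o) { x = 2; }  // w2; the unlock-to-lock edge on o
        }                                // yields w1 ->hb1 w2 or w2 ->hb1 w1,
    }                                    // so neither write is concurrent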
On the dimension of number of writers, we distinguish three cases:
• Multiple writers : the object is written by multiple threads. Specifically,
if w is a concurrent write on this object, the object presents the multiple-
writer pattern when w happens. The multiple-writer pattern is not the
same as a data race: the accesses of a data race happen on the same
variable, while the accesses of the multiple-writer pattern happen on the
same object. The multiple-writer pattern implies false sharing.
• Single writer : the object is written by a single thread. Specifically, if
w is a sequential write on this object, this object presents single-writer
pattern when w happens. Exclusive access is a special case where the
object is accessed (written and read) by only one thread.
• Read only : no thread writes to the object.
Synchronization
This characterizes the execution order of accesses by different threads. When
the object is accessed by multiple threads and at least one thread is a writer,
the threads should be well synchronized to avoid data races. There are three
cases:
• Accumulator : the object accesses are mutually exclusive. The object is
updated by multiple threads concurrently, and therefore all the updates
should happen in a critical section. That is, the read/write should be
preceded by a lock and followed by an unlock. Java provides the syn-
chronized block and the synchronized method to implement the accumu-
lator pattern (see the sketch after this list).
• Assignment : the object accesses obey a precedence constraint. The
object is used to safely transfer a value from one thread to another.
The source thread writes to the object first, followed by the destination
thread reading it. Synchronization actions should be used to enforce
that the write happens before the read according to the memory model.
Java provides the wait and notify methods in the Object class to help
implement the assignment pattern.
• No synchronization: synchronization is unnecessary.
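The two synchronized patterns can be written down in a few lines of Java.
The classes below are our own illustration, not code from the application
suite of Chapter 8.

    // Accumulator: concurrent updates, each inside a critical section.
    class Sum {
        private int total;
        synchronized void add(int n) { total += n; }  // lock ... unlock
    }

    // Assignment: safely transfer a value from a source thread to a
    // destination thread; wait/notify enforce write-before-read.
    class Slot {
        private Integer value;
        synchronized void put(int v) {      // the source thread writes first
            value = v;
            notify();                       // wake the destination thread
        }
        synchronized int take() throws InterruptedException {
            while (value == null) wait();   // block until the write has happened
            return value;
        }
    }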
Repetition
This indicates the number of consecutive repetitions of an access pattern. It
is desirable that an access pattern repeat a number of times, so that the
GOS is able to detect the pattern based on the history information and
then apply the optimization on the re-occurrence of the pattern. Such a
pattern will appear on the right side of the adaptation point along the
repetition axis. The adaptation point is an internal threshold parameter in
the GOS. When the pattern repeats more times than the adaptation point
indicates, the corresponding adaptation will be automatically performed.
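Operationally, the adaptation point can be read as a simple counter
threshold, as in the following sketch; the threshold value and all names here
are illustrative only.

    class AdaptationPoint {
        static final int ADAPTATION_POINT = 3;       // illustrative threshold
        private int repetitions;

        void observeRepetition() {                   // called when the GOS sees
            if (++repetitions > ADAPTATION_POINT) {  // the same pattern again
                applyAdaptation();                   // e.g., migrate the home
            }
        }

        private void applyAdaptation() { /* protocol-specific action */ }
    }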
On the other hand, some important patterns appear on the left of the
adaptation point, such as the producer-consumer pattern. The producer-
consumer pattern is also called single assignment: the write must happen
before the read. However, in the producer-consumer pattern, after the object
is created, it is written and read only once, and then turns into garbage.
Chapter 4
Distributed-Shared Object
This chapter presents how the memory consistency issue is efficiently solved
by leveraging the concept of distributed-shared objects. We define distributed-
shared object and discuss the benefits it brings to our GOS. We then present
a lightweight mechanism for the detection of DSOs and the basic cache co-
herence protocol used in the GOS.
4.1 Definitions
In the JVM, connectivity exists between two Java objects if one object con-
tains a reference to the other. Therefore, we can view the whole object heap
as a connectivity graph, where vertices represent objects and edges represent
references. Reachability describes the transitive referential relationship
between a Java thread and an object based on the connectivity graph. An
object is reachable from a thread if its reference resides in the thread’s stack,
or if some path exists in the connectivity graph between this object and some
known reachable object.
By the escape analysis technique [38], an object reachable from only one thread is called a thread-local object. The opposite is a thread-escaping object, which is reachable from multiple threads. Thread-local objects can be separated from thread-escaping objects at compile time using escape analysis. The benefits of escape analysis are that the synchronization operations on thread-local objects can be safely removed, and that thread-local objects can be allocated on their threads' stacks instead of the heap to reduce the heap overhead.
In a distributed JVM, Java threads are distributed to different nodes, so
we need to extend the concepts of thread-local object and thread-escaping
object. We define the following.
• A node-local object (NLO) is an object reachable only from the thread(s) on a single node. It is either a thread-local object or a thread-escaping object.
• A distributed-shared object (DSO) is an object reachable from at least
two threads located at different nodes.
4.2 Benefits from DSO Detection
We introduce the concept of DSO to address both the memory consistency issue and the memory management issue in the GOS. We argue that the identification of DSOs benefits both memory consistency maintenance and memory management, i.e., distributed garbage collection.
4.2.1 Benefits on Memory Consistency Maintenance
The detection of DSOs can help reduce the memory consistency maintenance
overhead. According to the JVM specification, there are two memory consis-
tency problems in a distributed JVM. The first one, local consistency, exists
among working memories of threads and the main memory inside one node.
The second one, distributed consistency, exists among multiple main mem-
ories of different nodes. The issue of local consistency should be addressed
by any JVM implementation, whereas the issue of distributed consistency
is only present in the distributed JVM. The cost of maintaining distributed consistency is much higher than that of its local counterpart because of the communication incurred. As we have mentioned before, synchronization in Java is used not only to protect critical sections but also to enforce memory consistency. However, synchronization actions on NLOs do not need to trigger distributed consistency maintenance: all the threads that are able to acquire or release the lock of an NLO must reside on the same node, and therefore never observe distributed inconsistency.
Only DSOs are involved in distributed consistency maintenance since they
have multiple copies in different nodes. With the detection of DSOs, only
DSOs need to be visited to make sure that they are in a consistent state
during distributed consistency maintenance.
4.2.2 Benefits on Memory Management
According to the JVM specification, one vital responsibility of the GOS is
to perform automatic memory management in the distributed environment,
i.e., distributed garbage collection (DGC) [67]. The detection of DSOs also
helps improve the memory management in the GOS.
Since we detect DSOs at runtime, we are able to perform pointer translation across node boundaries, i.e., between local object addresses and objects' globally unique identifications (GUIDs), so that objects can reside at different memory addresses on different nodes.
In this way, the heap management of the nodes is totally decoupled. Each node performs independent memory management: the local garbage collector on each node can collect garbage objects asynchronously and independently, so global garbage collections can be postponed or reduced.
Moreover, all the nodes are coordinated to present a huge virtual heap.
We can calculate the aggregated heap size of our distributed JVM with the
following formula:
H = (1− d)hn + dh (4.1)
H — The aggregated heap size;
h — The heap size on each node;
n — The number of nodes;
d — The ratio of the local heap space that DSOs occupy to the total local
heap size. We presume DSOs will be replicated on all nodes.
Obviously, when the ratio of DSOs, i.e., d, is small, H ≈ hn.
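For illustration, with assumed (not measured) values h = 512 MB per node, n = 8 nodes, and d = 0.1, equation (4.1) gives H = 0.9 × 512 MB × 8 + 0.1 × 512 MB ≈ 3.7 GB, close to the full aggregate of 4 GB.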
4.3 Lightweight DSO Detection and Reclamation
In the distributed JVM, whether an object is a DSO or an NLO is determined by the relative locations of the object and the threads reaching it. Compile-time solutions, such as escape analysis, are not applicable, since the locations of objects and threads can only be determined at runtime. We propose a lightweight runtime DSO detection scheme that leverages Java's runtime type information to unambiguously identify pointers, i.e., object references in the Java context.
Java is a strongly typed language. Each variable, either an object field
that is in the heap, or a thread-local variable in some Java thread’s stack, is
associated with a type. The type is either a reference type or a primitive type
such as integer, char, or float. The type information is known at compile time
and written into class files generated by the compiler. At runtime, the class
subsystem builds up the type information from the class files. By looking
up the runtime type information, we can identify those variables that are
of the reference type. Therefore, object connectivity can be determined at
runtime. The object connectivity graph is dynamic since the connectivity
between objects may change from time to time through the reassignment of object fields.
DSO detection is performed when some JVM runtime data are to be transmitted across node boundaries: thread stack contexts for thread relocation, object contents for remote object access, or diff data for update propagation. On both the sending and the receiving sides, these data are examined to identify object references. A transmitted object reference indicates that the object is a DSO, since it becomes reachable from threads located at different nodes.
On the sending side, if the object behind an identified object reference has not been marked as a DSO, it is marked at this moment, and a globally unique identification (GUID) is assigned to it as its global name in the cluster-based JVM. Before being sent, all the object references are replaced by their GUIDs: since the copies of DSOs reside at different memory addresses on different nodes, local object references, i.e., memory addresses, do not make sense on other nodes.
In sending an object, the type information of all its fields can usually be determined from its class data structure. However, in some situations additional type information must be sent along to describe a field unambiguously: (a) if the field is an array, the array's size is sent, since the size is needed to shape the array and is a special field of the array object in Java; (b) if the field holds an instance of a subclass of the class type declared in the class, the subclass's type is sent, since Java allows a conversion from any class type S to any class type T provided that S is a subclass of T, and the subclass type cannot be determined from the type information loaded from the class file; (c) if the field is declared as an interface type, the field's actual implementing type is sent.
On the receiving side, all the GUIDs are replaced by their corresponding local object references. The receiver knows where a GUID should appear according to the type information. When a GUID first emerges, an empty object of the corresponding type is created and associated with it, so that the reference will not become a dangling pointer. The object's access state is set to invalid; when it is accessed later, its up-to-date content will be faulted in.
In this scheme, only those objects whose references appear in multiple
nodes will be identified as DSOs.
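The sending-side translation can be sketched as follows. This is a minimal illustration of ours (the actual GOS data structures may differ): assigning a GUID for the first time is exactly what marks an object as a DSO.

import java.util.IdentityHashMap;
import java.util.Map;

class GuidTable {
    private final Map<Object, Long> guids = new IdentityHashMap<>();
    private long nextGuid = 1;

    // Returns the GUID that replaces the local reference in the outgoing
    // message; the first call for an object marks it as a DSO.
    synchronized long guidFor(Object obj) {
        return guids.computeIfAbsent(obj, o -> nextGuid++);
    }

    synchronized boolean isDSO(Object obj) {
        return guids.containsKey(obj);
    }
}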
We detect DSOs in a lazy fashion. Since at any time it is unknown whether an object will ever be accessed by a reaching thread in the future, we choose to postpone the detection to as close to the actual access as possible, thus making the detection scheme lightweight.
To correctly reflect the sharing status of objects in the GOS, we rely on
distributed garbage collection to convert a DSO back to an NLO. If all the
cached copies of a DSO have become garbage, the DSO can be converted
back to an NLO. The distributed garbage collection will be discussed in
section 7.5.
An Example
Examining the case in Figure 4.1, a thread T1 prepares an object tree and then passes the reference of object c to another thread T2, as shown in the reachability graph (Figure 4.1.a).
When T2 is distributed to another cluster node, i.e., node 1, all the objects reachable from object c become DSOs. Objects a, b, and d are not DSOs since they are thread-local to T1. Instead of detecting all these objects as DSOs in one blow, we detect object c as a DSO and send it to node 1. Because objects e and f are directly connected with object c, we also detect them as DSOs but do not send them to node 1 (Figure 4.1.b). On node 1, we create two objects whose types are exactly the same as those of objects e and f. Since the contents of objects e and f are not available, we set their access states to invalid.
Next time when object f is accessed by T2 on node 1 (Figure 4.1.c), an
object fault will occur. An object request message will be sent to node 0.
Figure 4.1: The detection of distributed-shared object. (a) Reachability graph; (b) after thread T2 is distributed to Node 1; (c) access on f by T2 triggers detection of i.
This event will trigger the detection of object i as a DSO. The up-to-date content of object f is copied from node 0 to node 1. The details of how to maintain the coherence of objects replicated on multiple nodes are discussed in the next section. If object e is never accessed by T2, it remains invalid on node 1, and objects g and h will never be detected as DSOs.
4.4 Basic Cache Coherence Protocol
Our basic cache coherence protocol is a home-based, multiple-writer cache
coherence protocol. Figure 4.2 shows a state transition graph depicting the
lifecycle of an object from its creation to possible collection based on the
proposed DSO concept. At the right of the figure, the state transition graph
of the cache coherence protocol for DSOs at non-home nodes is shown. The
read/write arrows represent those happening on this object. The lock/unlock
arrows represent those happening on any object because lock/unlock actions
on other objects will also influence this object’s state according to JMM.
The lower part of the figure illustrates the interaction between the garbage
collection and the object’s states, which will be discussed in section 7.5.
When a DSO is detected, the node where the object is first created is
made its home node. The home copy of a DSO is always valid. A non-home
copy of a DSO can be in one of three possible access states: invalid, read
(read-only), or write (writable). Accesses to invalid copies of DSOs will fault
in the contents from their home node. Upon releasing a lock of a DSO,
all updated values to non-home copies of DSOs should be written to their
corresponding home nodes. Upon acquiring a lock, a flush action is required
to invalidate the non-home copies of DSOs, which guarantees that the most
up-to-date contents will be faulted in from the home nodes when they are
accessed later. Before the flush, all updated values to non-home copies of
DSOs should be written to the corresponding home nodes. In this way, a
thread is able to see the up-to-date contents of the DSOs after it acquires
the proper lock.

Figure 4.2: The state transition graph depicting object lifecycle in the GOS
A multiple-writer protocol permits concurrent writing to the copies of a
DSO, which is implemented using the twin and diff techniques [57]. On the
first write to a non-home copy of the DSO, a twin will be created, which is
an exact copy of the object. On lock acquiring and releasing, the diff, i.e.,
the modified portion of the object, is created by comparing the twin with
the current object content word by word, and sent to the home node. On
releasing a lock, after the diffs are sent out, the access states of the updated
objects should be changed from write to read to capture the future writes.
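The twin and diff operations can be sketched as follows, treating an object's content as an array of 32-bit words (a simplification of ours):

import java.util.ArrayList;
import java.util.List;

class TwinDiff {
    // Each diff entry is a (word index, new value) pair.
    static List<int[]> createDiff(int[] twin, int[] current) {
        List<int[]> diff = new ArrayList<>();
        for (int i = 0; i < current.length; i++)
            if (current[i] != twin[i])           // word-by-word comparison
                diff.add(new int[] { i, current[i] });
        return diff;                             // propagated to the home node
    }

    // Executed at the home node to merge the remote writes into the home copy.
    static void applyDiff(int[] homeCopy, List<int[]> diff) {
        for (int[] entry : diff)
            homeCopy[entry[0]] = entry[1];
    }
}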
Since a lock can be considered a special field of an object, all the operations on a lock, including acquire and release, as well as wait and notify, which are methods of the Object class, are executed at the object's home node. Thus, the object's home node acts as the object's lock manager. The
detailed design and implementation of distributed synchronization will be
discussed in section 7.2.
With the availability of object type information, it is possible to invoke
different coherence protocols according to the type of the objects, as shown in
table 4.1. For example, immutable objects, such as instances of class String,
Integer, and Float, can be simply replicated and treated as an NLO. Some
objects are considered node-dependent resources, such as instances of class
File. When node-dependent objects are detected as DSOs, object replication
should be prohibited. Instead, accesses to them should be transparently
redirected to their home nodes. This is an important issue in the provision
of a complete single system image to Java applications.
Type                        Characteristics                  Protocol
java.lang.Thread            Represents Java thread.          On creation, choose a running
                                                             node for load balance.
java.lang.String,           Immutable object.                Simply replicated and treated
java.lang.Integer, etc.                                      as an NLO.
java.io.File, etc.          Represents node-dependent        No replication. Accesses will be
                            resources.                       transparently redirected to
                                                             their home node.
Primitive array, such as    Contains no object references.   DSO detection is disabled.
float[ ], int[ ], etc.

Table 4.1: Coherence protocols according to object type
Chapter 5
Adaptive Cache Coherence
Protocol
Scientific applications usually exhibit diverse memory access patterns, and the performance of various cache coherence protocols is application-dependent: the application's inherent memory access patterns determine the most suitable protocol. This inspires us to pursue an adaptive cache coherence protocol to further improve the performance of our GOS. An adaptive cache coherence protocol is able to detect the current access pattern and adjust itself accordingly.
Based on the access pattern space, we present several adaptations incor-
porated into our basic home-based multiple-writer cache coherence protocol
in three respective situations in the access pattern space: (1) object home
migration [45] which optimizes the single-writer access pattern by moving
the object’s home to the writing node according to the access history; (2)
synchronized method migration which chooses between default object (data)
movement and optional method (control flow) movement in order to optimize
the execution of critical section methods according to some prior knowledge;
(3) connectivity-based object pushing which scales the transfer unit to opti-
mize the producer-consumer access pattern according to the object connec-
tivity information.
5.1 Adaptive Object Home Migration
As a state-of-the-art DSM system, TreadMarks [57] adopts a multiple-writer cache coherence protocol to implement lazy release consistency (LRC). TreadMarks uses the twin and diff techniques to support multiple processes writing simultaneously to the same shared virtual memory page in the presence of false sharing. The protocol is considered homeless because the diffs are saved and managed at each process.
Although TreadMarks' homeless protocol can greatly alleviate the false sharing problem, it may still suffer from heavy communication and protocol overheads. To serve a page fault, the faulting process has to fetch the diffs from every process that has updated the page before the fault, according to LRC, which causes multiple round-trip messages. Each diff needs to be applied once at every process that fetches it, which amounts to a large overhead. In addition, the diffs could consume a lot of memory, and cleaning up the useless diffs may trigger a global garbage collection.
In order to address the above problems, a home-based protocol implementing LRC, called HLRC, was proposed [55]. In the home-based protocol, each shared coherence unit has a home to which all writes (diffs) are propagated and from which all copies are derived. It has been shown that the home-based protocol is more scalable than the homeless protocol: it maintains a simpler state, sends fewer messages, has a lower diff overhead, and consumes much less memory.
The asymmetry between the home copy and non-home copies in home-
based protocols raises the home assignment problem. In home-based proto-
cols, the home copy is always valid. Accesses at the home node never incur
communication overhead, while accesses at non-home nodes will trigger com-
munication with the home node. Therefore, the choice of the home node changes the coherence data communication pattern and influences the application performance. In fact, the optimal home assignment is determined
by the memory access patterns of the application. This inspires some dy-
namic home assignment protocols that are able to adapt to runtime memory
access patterns [51, 35, 78, 44].
In DSM applications, the single-writer access pattern occurs when the shared coherence unit is updated by only one process for a certain period; this does not prohibit the shared coherence unit from being read by multiple processes at the same time. A few research projects [51, 23, 64] have demonstrated that the single-writer pattern is common in DSM applications. In our GOS, we propose a novel home migration protocol to optimize the single-writer pattern. We target only the single-writer pattern because home migration makes little difference in the multiple-writer situation as long as the home node is one of the writers.
At runtime, an object can exhibit different access patterns during its lifetime. For example, an object can be updated by multiple writers concurrently and then by a single writer exclusively; or an object can be updated by different writers sequentially, each persisting for some time. Since home migration requires that the other processes be informed of the new home, improper home migrations degrade performance by introducing a host of new-home notification messages. Therefore, it is a challenge to exploit the single-writer property as much as possible while maintaining an acceptable level of home migration overhead.
5.1.1 Home Migration Concepts
Figure 5.1 illustrates the home-based multiple-writer protocol that imple-
ments LRC. In the figure, X represents some shared coherence unit, which
could be either an object or a virtual memory page. Its home is at the proces-
sor where process P2 resides. Assuming the write on X performed by process
P1 causes a fault, because either the local cached copy is outdated according
to LRC or X is not cached at all, P1 will then fault in the valid copy from X's home, P2. Before P1 can write to the newly fetched copy, it needs to create a twin, which is simply a copy of X. Later, when P1 releases the lock, it eagerly creates the diff, i.e., the difference between the current X and the previously saved twin, and sends the diff to the home, where it is applied to the home copy of X.

Figure 5.1: Home-based Protocol for LRC with multiple-writer support
If P1 is the only writer of X, we can migrate X's home from P2 to P1 to avoid the communication overhead (faulting in the shared data and propagating the diff), the diff overhead (creating and applying the diff), and the memory consumption caused by the twin and the diff. On the other hand, if both P1 and P2 write to X, it does not matter which node becomes the home.
Home Location Notification Mechanism
We assume there is a way to determine the initial home for each unit; for example, all units are initially assigned a home node by a well-known hash function. If the home of a shared coherence unit is subject to migration, a home miss could happen, i.e., a process may visit an obsolete home. Therefore, we need some mechanism to inform the other nodes of the new home location. There are three mechanisms: broadcast, home manager, and forwarding pointer.
Broadcast After home migration, the new home location is broadcast to all
the nodes.
Home Manager The most updated home location of a unit is always recorded
in a designated manager node, which is known to all nodes. On home
migration, the new home location is posted to the manager node. On
home miss, a process can visit the manager node to find out where the
current home is.
Forwarding Pointer On home migration, a forwarding pointer is left in
the former home to point to the new home. On home miss, a process
can always be redirected to the current home via the given forwarding
pointer.
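The redirection behavior of the forwarding pointer mechanism can be sketched as follows; this is a simplified single-process model of ours, with the distributed lookup collapsed into a map from each node to the node it believes is the home:

import java.util.Map;

class ForwardingLookup {
    // forward.get(n) is the node that n believes is the home;
    // the true home points to itself.
    static int findHome(Map<Integer, Integer> forward, int start) {
        int node = start;
        int redirections = 0;
        while (forward.get(node) != node) { // follow the forwarding chain
            node = forward.get(node);
            redirections++;                 // redirection accumulation
        }
        return node;
    }
}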
With the broadcast and home manager mechanisms, it is possible that the broadcast, or the update to the manager, happens only after some node has tried to fault in a copy from the home node: the former home is already obsolete, but the new home is not yet known. This situation needs to be handled carefully, for example by waiting for some time before retrying the fault-in. Notice that this situation cannot happen with the forwarding pointer mechanism.
Which of the three mechanisms is superior depends on the memory access patterns of the application and on how frequently home migration occurs. If, after a home migration, all the other nodes need to visit the new home, the broadcast mechanism is superior to the others, because a well-implemented broadcast operation is an efficient way of notifying all nodes; otherwise, the broadcast may cause a large overhead. The merit of the forwarding pointer mechanism is that it does not need to broadcast the new home location on home migration. However, the redirection effect may cascade: multiple home migrations may form a distributed chain of home forwarding pointers, so a process may be redirected multiple times before coming upon the current home. This effect, called redirection accumulation, can cause significant overhead when home migration happens frequently. The manager mechanism strikes a balance between the home notification cost and the home miss cost. However, on a home miss, the process needs to visit the old home, the manager, and the new home in sequence, which is heavyweight compared with the broadcast mechanism and with the forwarding pointer mechanism in the absence of redirection accumulation.
5.1.2 Home Migration with Adaptive Threshold
In order to detect the single-writer access pattern, the GOS monitors all home accesses as well as non-home accesses at the home node. Under the cache coherence protocol, an object request can be considered a remote read, and a diff received at a synchronization point a remote write. To monitor the home accesses, the access state of the home copy is set to invalid on acquiring a lock and to read-only on releasing a lock, so that home access faults can be trapped and control returned after the access is recorded. We call the write fault at the home node the home write, and the read fault at the home node the home read.
At the home node, we define an object’s consecutive remote writes to be
those issued from the same remote node and not interleaved with the writes
from either the home node or other remote nodes. Note that under the Java
memory model, the remote writes are only reflected to the home node on
synchronization points. Therefore the number of consecutive writes is the
number of synchronizations during which the object is only updated by that
node. At runtime, the GOS continuously monitors consecutive remote writes
for each object. We also introduce a predefined home migration threshold
that represents some prior knowledge of the single-writer pattern. We follow the heuristic that an object is in the single-writer pattern if the number of consecutive remote writes exceeds the home migration threshold. Once the single-writer pattern is detected, the next time the object is requested by the writing node, not only is the object sent in reply, but its home is also migrated. We adopt the forwarding pointer mechanism to notify the other nodes of the new home location: when an obsolete home node is requested for an object, it simply replies with the location of the valid home node.
However, this protocol is still not satisfactory. Above all, it is difficult to choose the fixed home migration threshold. If it is too large, implying a lazy migration policy, the home migration will be less sensitive to the single-writer pattern, causing unnecessary remote access overhead; if the home could be migrated earlier, more remote accesses could be transformed into local accesses. On the contrary, if the threshold is too small, it implies an eager migration policy: although sensitive to the single-writer pattern, it is less capable of avoiding unnecessary home migrations. If the single-writer pattern is transient, i.e., it repeats only a very limited number of times, the threads on the new home node may not perform any more accesses after the home migration; the home migration then gains no performance improvement but suffers the home redirection overhead. We observe that the transient single-writer pattern is not worth a home migration: the home migration protocol should capitalize on the lasting single-writer pattern.
The challenge here is to choose a threshold that yields both sensitivity and robustness with respect to the single-writer pattern. By robustness we mean taking no migration action on the transient single-writer pattern; by sensitivity we mean responding promptly to the lasting single-writer pattern. Furthermore, different objects may well have different access behaviors, so it is more reasonable to use different thresholds for different objects.
Based on the above discussion, we propose a home migration protocol
with an adaptive threshold. The adaptive threshold decreases monotonically with increased likelihood that an object presents the lasting single-writer pattern; a lower threshold allows home migration to happen more quickly. The adaptive threshold is continuously adjusted at runtime according to the feedback of previous home migration decisions for each object.
Runtime Feedback
In order to measure the feedback of previous home migration decisions, the
GOS will observe exclusive home writes and redirected object requests at
runtime.
We define an exclusive home write to be a home write with no remote write between it and the previous home write. Clearly, exclusive home writes reflect the single-writer pattern happening at the home node, so they represent positive feedback on previous home migration decisions.
A redirected object request reflects the home redirection effect due to home migration; it represents negative feedback on previous home migration decisions. Redirected object requests take redirection accumulation into account: for example, if an object request is redirected three times before reaching the current home node, it counts as three redirected object requests instead of one.
In addition, exclusive home writes and redirected object requests are associated with different costs. The home redirection overhead, measured by redirected object requests, equals the round-trip time of a unit-sized message. The benefit of home migration comes from eliminated pairs of object fault-ins and diff propagations; it is measured by exclusive home writes and is related to the object size. We therefore introduce the home access coefficient, the overhead ratio of one eliminated pair of object fault-in and diff propagation to one home redirection. Here we mainly consider the communication overhead.
Formalization
We formalize the idea of object home migration with adaptive threshold as
follows. For each object, we have:
• $C_i$: the number of consecutive remote writes since the $(i-1)$th home migration.
• $T_i$: the value of the adaptive home migration threshold since the $(i-1)$th home migration.
• $T_{\mathrm{init}}$: the initial threshold, which is set to 1.
• $R_i$: the number of redirected object requests since the $(i-1)$th home migration.
• $E_i$: the number of exclusive home writes since the $(i-1)$th home migration.
• $\alpha$: the home access coefficient.
• $m_{1/2}$: the half-peak length in bytes, which is the message length required to achieve half of the asymptotic bandwidth [50].
A home migration decision is taken when the following condition is met:
$$C_i = T_i \quad (5.1)$$
The adaptive home migration threshold $T_i$ is calculated by
$$T_i = \max\{\,T_{i-1} + R_i - \alpha E_i,\ T_{\mathrm{init}}\,\} \quad (5.2)$$
where
$$T_0 = T_{\mathrm{init}} = 1 \quad (5.3)$$
and
$$\alpha \approx 2 + \left\lfloor \frac{\mathrm{sizeof}(object)}{m_{1/2}} \right\rfloor \quad (5.4)$$
Equation (5.2) is the core of the above equations; it determines the adaptive home migration threshold. Both the positive feedback (exclusive home writes) and the negative feedback (redirected object requests) of previous home migrations affect the current threshold. The positive feedback tends to indicate that the object presents a lasting single-writer pattern and thus decreases the threshold; remember that the threshold decreases monotonically with increased likelihood of the lasting single-writer pattern. The negative feedback tends to indicate that the object presents a transient single-writer pattern and thus increases the threshold. We also take the home access coefficient into account. Whenever the home migration condition, i.e., Equation (5.1), is met, a home migration takes place. All these computations are done by the GOS at the home node of the object.
The initial threshold is set to 1 in order to speed up the initial data
relocation if possible. It is possible that the initial data layout is not opti-
mal with respect to the data access behavior, particularly when the writing
nodes of single-writer objects are not their home nodes. A small initial home
migration threshold could alleviate this situation. We rely on the adaptive
threshold mechanism to adjust the threshold automatically after the initial
home migration.
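The per-object bookkeeping implied by equations (5.1) through (5.3) can be sketched as follows; the event hooks and the simplified handling of interleaved writers are our own assumptions:

class HomeMigrationState {
    static final int T_INIT = 1;

    private int threshold = T_INIT;      // T_i
    private int consecutiveRemoteWrites; // C_i
    private int redirectedRequests;      // R_i, negative feedback
    private int exclusiveHomeWrites;     // E_i, positive feedback
    private int lastWriterNode = -1;

    void onRedirectedRequest()  { redirectedRequests++; }
    void onExclusiveHomeWrite() { exclusiveHomeWrites++; }

    // Called at the home node when a diff from a remote node arrives;
    // returns true when the home should migrate to that node.
    boolean onRemoteWrite(int writerNode, int alpha) {
        if (writerNode != lastWriterNode) { // interleaved writer: restart count
            lastWriterNode = writerNode;
            consecutiveRemoteWrites = 0;
        }
        consecutiveRemoteWrites++;
        if (consecutiveRemoteWrites < threshold)
            return false;
        // Condition C_i = T_i met: migrate, then compute the next
        // threshold from the feedback counters, as in equation (5.2).
        threshold = Math.max(threshold + redirectedRequests
                - alpha * exclusiveHomeWrites, T_INIT);
        consecutiveRemoteWrites = 0;
        redirectedRequests = 0;
        exclusiveHomeWrites = 0;
        return true;
    }
}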
The Deduction of Home Access Coefficient
Hockney [50] has proposed a model to characterize the communication time
(in µs) for a point-to-point operation as follows, where the communication
overhead, $t(m)$, is a linear function of the message length $m$ (in bytes):
$$t(m) = t_0 + \frac{m}{r_\infty} \quad (5.5)$$
$t_0$ is the start-up time in µs; $r_\infty$ is the asymptotic bandwidth in MB/s.
Recall that home access coefficient is the overhead ratio of one eliminated
pair of object fault-in and diff propagation to one home redirection. Here we
mainly consider the communication overhead. We assume the object size is
o, the diff size is d, and the home redirection is a unit-sized message. Then
we have:
$$\alpha = \frac{(t_0 + o/r_\infty) + (t_0 + d/r_\infty)}{t_0 + 1/r_\infty} \quad (5.6)$$
$$= \frac{2 t_0 r_\infty + (o + d)}{t_0 r_\infty + 1} \quad (5.7)$$
The half-peak length, denoted by $m_{1/2}$ bytes, is the message length required to achieve half of the asymptotic bandwidth. It can be derived using the relationship:
$$m_{1/2} = t_0 r_\infty \quad (5.8)$$
Based on $m_{1/2} \gg 1$ and $o > d$, we derive equation (5.4). We restate it here:
$$\alpha \approx 2 + \left\lfloor \frac{o}{m_{1/2}} \right\rfloor \quad (5.9)$$
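For illustration, with assumed values $m_{1/2} = 4$ KB and an object of size $o = 10$ KB, equation (5.9) gives $\alpha \approx 2 + \lfloor 10240/4096 \rfloor = 4$; one eliminated pair of object fault-in and diff propagation for this object is then worth about four home redirections.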
5.2 Synchronized Method Migration
Synchronized method migration is not meant to directly optimize synchronization-related access patterns such as assignment and accumulator. Instead, it optimizes the execution of the synchronized method itself, which is usually related to those access patterns.
Java’s synchronization primitives, including synchronized block, as well
as the wait and notify methods of the Object class, are originally designed
for thread synchronization in a shared memory environment. The synchronization constructs built upon them are inefficient in a distributed JVM implemented on a distributed memory architecture such as a cluster.
Figure 5.2 shows the skeleton of a Java implementation of the barrier func-
tion. The execution cannot continue until all the threads have invoked the
barrier method. We assume the instance object is a DSO and the node
invoking barrier is not its home node. On entering and exiting the syn-
chronized barrier method, the invoking node will acquire and then release
the lock of the barrier object, while maintaining distributed consistency. In
line 8, the barrier object will be faulted in. It is a common behavior that
the locked object’s fields will be accessed in a synchronized method. In line 9
and line 11, the synchronization requests wait and notifyAll respectively,
will be issued. The wait method will also trigger an operation to main-
tain distributed consistency according to the JMM.1 Therefore, there are
four synchronization or object requests sent to the home node, and multiple distributed consistency maintenance operations involved.
We propose synchronized method migration to reduce the communication and consistency maintenance overhead of executing synchronized methods at non-home nodes. On synchronized method migration, instead of invoking the method locally, the synchronized object's GUID, the method's index in the dispatch table, and the arguments of the method are sent to the home node of the synchronized object, and the method is executed there. The method's return value, if any, is sent back so that the execution at the non-home node can continue.
While object shipping is the default behavior in the GOS, we apply
method shipping particularly to the execution of synchronized methods of
DSOs. With the detection of DSOs, this adaptation is feasible in our GOS.
The synchronized method migration code is generated at JIT compilation
time. All the non-synchronized methods are untouched so that they can go
1According to the JMM, wait behaves as if the lock is released first and acquired later.
class Barrier {
    private int count;   // the number of threads to barrier
    private int arrived; // currently arrived threads

    public Barrier(int numOfThreads) {
        count = numOfThreads;
        arrived = 0;
    }

    public synchronized void barrier() {
        try {
            if (++arrived < count)
                wait();
            else {
                notifyAll();
                arrived = 0;
            }
        } catch (Exception e) {
            // handle the synchronization exception
        }
    }
}
Figure 5.2: Barrier class
at full speed. A code stub is inserted at the beginning of each synchronized method; it includes a condition check to decide whether the current execution needs migration, and the actual code to perform the synchronized method migration.
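The logic of the stub can be sketched as follows. The runtime interface below is hypothetical (the real stub is emitted as native code by the JIT compiler), but it captures the condition check and the shipping step:

interface GosRuntime {
    boolean isDSO(Object o);
    boolean isHomeLocal(Object o);
    long guidOf(Object o);
    Object invokeAtHome(long guid, int methodIndex, Object[] args);
}

class SyncMethodStub {
    // Invoked at the head of a synchronized method on object self: if the
    // object is a DSO whose home is remote, ship the GUID, the method's
    // dispatch-table index, and the arguments to the home node.
    static Object maybeMigrate(GosRuntime gos, Object self,
                               int methodIndex, Object[] args) {
        if (gos.isDSO(self) && !gos.isHomeLocal(self))
            return gos.invokeAtHome(gos.guidOf(self), methodIndex, args);
        return null; // sentinel meaning: fall through to the local body
    }
}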
The method shipping will cause the workload to be redistributed among
the nodes. However, the synchronized methods are usually short in terms
of the execution time; therefore, synchronized method migration will not
significantly affect the load distribution in the distributed JVM.
5.3 Connectivity-based Object Pushing
Some important patterns, such as the single-writer pattern, tend to repeat for
a considerable number of times, therefore giving the GOS the opportunity
to detect the pattern using history information. However, there are some
significant access patterns that do not repeat, which cannot be detected by
using access history information.
Connectivity-based object pushing is applied in our GOS to the situations where no history information is available. Essentially, object pushing is a prefetching strategy that takes advantage of object connectivity information to pre-store, with good accuracy, the objects to be accessed by a remote thread, thereby minimizing the network delay in subsequent remote object accesses.
Connectivity-based object pushing actually improves reference locality. It is useful in applications with fine-grained object sharing.
The producer-consumer pattern is one of the patterns that can be opti-
mized by connectivity-based object pushing. Similar to the assignment pat-
tern, the producer-consumer pattern obeys the precedence constraint. The
write must happen before the read. However, in the producer-consumer pat-
tern, after the object is created, it is written and read only once, and then
turned into garbage. Therefore, producer-consumer is single-assignment.
The producer-consumer pattern is popular in Java programs. Usually, in a
producer-consumer pattern, one thread produces an object tree, and prompts
another consuming thread to access the tree. In the distributed JVM, the
consuming thread suffers from network delay when requesting objects one by
one from the node where the object tree resides.
In order to apply connectivity-based object pushing, we follow the heuristic that after an object is accessed by a remote thread, all the objects reachable from it in the connectivity graph may be “consumed” by that thread afterwards. Therefore, upon a request for a specific DSO in the object tree, the home node pushes all the objects reachable from it to the requesting node.
Object pushing is better than pull-based prefetching, which relies on the requesting node to specify explicitly which objects to pull according to the object connectivity information. A fatal drawback of pull-based prefetching is that the connectivity information contained in an invalidated object may be obsolete, so the prefetching accuracy is not guaranteed: unneeded objects, even garbage objects, may be prefetched, wasting communication bandwidth. On the contrary, object pushing gives more accurate prefetching, since the home node has the up-to-date copies of the objects and the connectivity information at the home node is always valid.
In our implementation, we rely on an optimal message length, the preferred aggregate size of the objects to be delivered to the requesting node. Objects reachable from the requested object are copied to the message buffer until the current message length exceeds the optimal message length. We use a breadth-first search algorithm to select the objects to be pushed, as sketched below. If the pushed objects are not DSOs yet, they will be detected as such; in this way, DSOs are eagerly detected during object pushing.
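The selection can be sketched as follows; the Node view of a heap object is our own simplification:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class ObjectPusher {
    interface Node {
        int size();              // object size in bytes
        List<Node> references(); // directly referenced objects
    }

    // Breadth-first selection bounded by the optimal message length.
    static List<Node> selectForPush(Node requested, int optimalMessageLength) {
        List<Node> toPush = new ArrayList<>();
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(requested);
        seen.add(requested);
        int messageLength = 0;
        while (!queue.isEmpty() && messageLength <= optimalMessageLength) {
            Node obj = queue.poll();
            toPush.add(obj);                // detected as a DSO if not one yet
            messageLength += obj.size();
            for (Node ref : obj.references())
                if (seen.add(ref))          // visit each object at most once
                    queue.add(ref);
        }
        return toPush;
    }
}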
Since object connectivity information does not guarantee that future accesses are bound to happen, object pushing also risks sending unneeded objects. We disable object pushing upon a request for an array of reference type, e.g., a multi-dimensional array, since such an array usually represents some workload shared among threads, with each thread accessing only a part of it.
Chapter 6
Object Access Pattern
Visualization
We design and implement a visualization tool called PAT (Pattern Analysis
Tool) that can be used to visualize object access traces and analyze object
access patterns in our GOS.
PAT is useful in two respects. For protocol designers, such a tool can expose the inherent memory access patterns inside a benchmark application, and thus enables the evaluation of the adaptive protocol's effectiveness in reducing the number of network-related memory operations, as well as of the protocol's pattern detection mechanism. It can reveal how frequently a particular memory access pattern appears in an application, and how well a particular adaptation optimizes a target memory access pattern.
On the other hand, it can help the application developer plan the initial data layout and runtime data relocation. Since DSM systems tend to hide the communication details from application developers, performance tuning is rather difficult, if not impossible. With PAT, the parallel application developer is able to discover the performance bottlenecks in the application by observing its memory access behavior. He may then redesign
the algorithm to avoid some heavyweight memory access patterns. In this aspect, PAT plays the role of a profiling tool.

Figure 6.1: PAT architecture. At runtime, each node of the distributed JVM writes its own log; postmortem, the logs are merged into one object access events log, which the pattern analysis engine (lifetime pattern analyzer, global phase pattern analyzer, producer-consumer analyzer, and other pattern analyzers) processes; the pattern visualization component presents the results through a pattern window, a timeline window, and a source code window, mapping patterns to access events and to allocation sites.
PAT comprises three components: the object access trace generator (OATG)
that is plugged into the distributed JVM, the pattern analysis engine (PAE),
and the pattern visualization component (PVC), as shown in figure 6.1.
OATG gathers object access information at runtime. Improper runtime logging could introduce intolerable overhead and interruptions to the application being traced, making the logging unacceptable; for example, the recorded memory access behavior could differ considerably from the behavior without logging because of the interruptions caused by heavyweight logging. To tackle this problem, OATG was designed to be lightweight. It activates recording only on DSOs. Logs are stored in a memory structure and flushed to the local disk at synchronization points or when the buffer is full. The just-in-time compiler is used to instrument only the methods the user is interested in; all the other methods execute at full speed.
PAE is used to discover knowledge concerning patterns from the raw ac-
cess information collected by OATG. After an application’s execution, the
global (of all the processes) and complete (the entire lifetime of the applica-
tion) access information can be compiled, based on which an analysis of the
object access patterns is carried out precisely and thoroughly.
PVC uses a pattern-centric representation to visualize object access pat-
terns. It can display the global and complete access pattern information. In
addition, for objects of interest to the user, it can associate access patterns
with the source code lines that create the corresponding objects—referred
to as allocation sites. The object access patterns can be further mapped to
low-level object access operations.
StormWatch [37] is a profiling tool that visualizes the execution of DSM systems and links it to the program's source code. StormWatch provides three linked graphic views: trace, communication, and source. The trace and communication views together reflect the low-level access operations in the execution. The major difference between our tool and StormWatch is that StormWatch focuses only on the low-level access operations, which may not provide straightforward and intuitive information to the user, whereas our pattern analysis and visualization system provides access pattern knowledge that, as high-level information, is more helpful to the user.
Xu et al. described a profiling approach for DSM systems in [81]. It can detect and visualize some cache-block-level access patterns. However, as an online tool, it suffers from the memory and time constraints of runtime analysis. For example, it can only show the lifetime access pattern that a certain cache block presents over the whole execution time; pattern changes cannot be expressed, because recording each pattern change per cache block would be too expensive in memory. This is neither flexible nor precise. On the contrary, our approach is postmortem, so we can invest as much effort as affordable to precisely and thoroughly analyze the access patterns
after the execution.
6.1 Object Access Trace Generator
OATG uses several techniques to achieve lightweight runtime logging of memory access information.
Firstly, it relies on the Java memory model to carefully choose the memory
access operations to be logged. Figure 6.2 shows all the memory access
operations in the GOS, with only those access types in bold font being logged.
In the GOS, we focus on DSOs since only they will incur communica-
tion overheads. Consequently, we are only interested in the access patterns
presented by DSOs. On non-home nodes, the object faulting-in and diff prop-
agation can represent the reads and writes on the cached copy, respectively.
Similarly, the home read fault and home write fault can represent all the
reads and writes happening in the home node, respectively. All these remote
and home reads/writes, together with synchronization operations on objects
and synchronized methods, constitute the object’s access behavior.
Secondly, we are usually interested not only in the access operations themselves, but also in the relationship between them and other program states. For example, we may want to know what the object access behavior is inside a particular Java method, or we may want to log a method that implements barrier synchronization among all threads, so as to observe the object access operations against the barrier synchronization.
To address the above requirement, OATG leverages the just-in-time compiler in the distributed JVM to dynamically instrument translated Java method code to log the operations of interest. PAT allows the user to provide a list of Java method signatures1 of interest to the distributed JVM. During the just-in-time compilation, the signature of the method to be translated is
compared against the user-provided list. If there is a match, the just-in-time compiler will insert the log code at both the start and the end of the method.
1The format of Java method signature is defined in the JVM specification.

Figure 6.2: Memory access operations in the GOS. On node-local objects: read, write, and synchronization (lock, unlock, wait, notify). On distributed-shared objects, reads and writes issued on non-home nodes comprise the remote read (object faulting-in from the home node), the remote write (diff propagation to the home node), and reads/writes on the cached copy; reads and writes issued on the home node comprise the home read (home read fault), the home write (home write fault), and other reads/writes on the home copy; synchronization (lock, unlock, wait, notify) and synchronized methods complete the picture. The logged operations are the remote and home reads/writes, the synchronization operations on DSOs, and synchronized methods.
In doing so, the user is able to choose which method operations to log. All the other methods are left untouched and operate at full speed. If the just-in-time compiler were not used, we would have to instrument every method in advance, since each method could potentially be an operation of interest to the user, and the overall slowdown could be significant.
We make use of some source code of the logging facility in MPE (the Multi-Processing Environment of MPICH) [34] for collecting the access logs. However, our logging facility does not require MPI support during logging. It is implemented as a library and linked against the distributed JVM. At runtime, each process of the distributed JVM independently generates its own log. The log records are first buffered in local memory and then dumped to the local disk at synchronization points or when the memory buffer is full. After the multi-threaded Java program exits, an MPI program merges all the local logs into one log file according to the time stamps. We rely on the Network Time Protocol (NTP) [63] to synchronize the clocks of the cluster nodes; the time offset between cluster nodes can be adjusted to less than one millisecond. When merging the node-local logs, the time stamps are further tuned by calculating the current time offset.
6.2 Pattern Analysis Engine
In the analysis engine, many independent modules can sequentially read the same log. Each module is responsible for detecting one or several related access patterns. The access pattern analysis results from all the modules are fed into the pattern visualization component, which will be discussed in the next section. The engine is extensible in the sense that we can plug in new modules to detect any precisely defined access pattern. Currently there are two analysis modules in place: the lifetime pattern analyzer and the global phase pattern analyzer.
The lifetime pattern analyzer detects, for each DSO, the access pattern that is fixed over the object's whole lifetime: it checks whether an object presents the read-only, single-writer, or multiple-writer pattern throughout its lifetime.
The global phase pattern analyzer works for applications adopting the phase parallel paradigm (section 12.1.1 of [54]), as shown in figure 6.3. In this paradigm, every thread does some computation before arriving at a barrier; after all the threads arrive at the barrier, they can continue to the next computation phase. Two consecutive barriers define a global synchronization phase agreed on by all threads. This is a very common paradigm in parallel programming. The global phase pattern analyzer checks whether an object presents the read-only, single-writer, or multiple-writer pattern in each global synchronization phase. The barrier, as a synchronized Java method, is logged as a special operation at runtime. If the application does not follow the phase parallel paradigm, i.e., no barrier operations are found in the log, the global phase pattern analyzer simply ignores the log. Detecting the read-only, single-writer, and multiple-writer patterns in the log is done straightforwardly by counting the number of writers among all the accesses on the object during each phase, as sketched below.
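The per-phase classification can be sketched as follows (our own illustration):

import java.util.Set;

class PhasePatternClassifier {
    enum Pattern { READ_ONLY, SINGLE_WRITER, MULTIPLE_WRITER }

    // writerNodes: the distinct nodes that wrote to the object during
    // one global synchronization phase.
    static Pattern classify(Set<Integer> writerNodes) {
        if (writerNodes.isEmpty())   return Pattern.READ_ONLY;
        if (writerNodes.size() == 1) return Pattern.SINGLE_WRITER;
        return Pattern.MULTIPLE_WRITER;
    }
}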
6.3 Pattern Visualization Component
The presentation consists of three windows: a time lines window displaying the low-level access operations, a pattern result window revealing the object access patterns, and a source code window displaying the application's source code. The time lines window also reflects the overall access operations incurred in the execution.
The time lines window, as shown in figure 6.4, provides a complete exe-
cution picture on 8 cluster nodes for an application called SOR. The x-axis
represents the time. In the y-axis direction, there are 8 time lines in the fig-
ure, representing 8 threads, one thread on each node in this experiment.

Figure 6.3: Phase parallel paradigm (each thread computes, all threads meet at a barrier synchronization, and the cycle repeats for the next phase)

The rectangles on the time lines show some states, e.g., barrier synchronization
in this case. The arrows show the object access operations: those in green are writes and those in white are reads. Furthermore, an arrow that starts on one thread's time line and ends on another thread's time line represents a remote read (object faulting-in) or a remote write (diff propagation): it is issued by the thread at the arrow's starting time line, and the corresponding home node is the node at the arrow's ending time line. Arrows overlapping with the time lines are home reads or home writes. We can click any arrow to see detailed information about that object access, e.g., the class name, size, and ID of the object. The time lines can be zoomed out to get an overall picture of the access behavior, or zoomed in to examine particular object accesses. We implement the
time lines window by modifying Jumpshot in MPE [34].
Moreover, clicking the “Pattern Analysis” button in the time lines window
will trigger the pop-up of the pattern result window, as shown in figure 6.5.
As SOR is a barrier synchronized application, the global phase pattern ana-
lyzer can provide the pattern analysis result for each object.

Figure 6.4: The time lines window

The objects are first sorted by their allocation sites, i.e., the places in the source code where they are created. Each allocation site may create many objects at runtime. For each
object, its access pattern at each phase is displayed. As observed from the
analysis result, most objects in SOR present the single-writer access pattern.
For example, in figure 6.5, the object being observed presents the read-only
and the single-writer pattern in alternating phases.
The pattern result window is at the center of the visualization. Inside this window, we can choose any object to highlight its accesses in the time lines window. Thus we provide a convenient association between the high-level access pattern knowledge and the low-level access operation details.
Since the objects are sorted by their allocation sites in the pattern analysis
result window, we can map any object to its actual allocation site in the
application’s source code by clicking it, as shown in figure 6.5. Note that
the highlighted line in the source code window is the actual position of the highlighted allocation site in the pattern analysis result window.

Figure 6.5: The window of the object access pattern analysis result (the bigger one) and the window of the application's source code (the smaller one)

Thus we
provide a convenient association between the object access pattern and the
object’s corresponding allocation site in the source code.
In such a design, our visualization tool not only helps us, the GOS de-
signer, to visually evaluate the effectiveness of the adaptive protocol being
applied, but also helps the multi-threaded Java application programmer to
better understand the access behavior inherent in the program.
Chapter 7
Implementation
In this chapter, we discuss several implementation details in the cluster-based
JVM.
7.1 JIT Compiler Enabled Native Instrumentation
In DSM, shared data units have different access states, such as invalid, read (read-only), and write (writable). A faulting access triggers some operation according to the cache coherence protocol; for example, an access to an invalid data unit causes the data to be faulted in, and a write to a read-only data unit causes its twin to be created under the multiple-writer protocol. It is therefore the responsibility of the DSM system to trap all the faulting accesses. Unlike page-based DSMs, which rely on the MMU hardware to trap faulting accesses, object-based DSMs need to insert software checks before memory accesses in order to trap the possible faulting ones. So does our GOS.
The GOS provides transparent object accesses for Java threads distributed
among different nodes in a cluster-based JVM. The GOS needs to insert software checks before all the bytecodes accessing the heap in Java programs,
which include:
• GETFIELD/PUTFIELD: load/store object fields.
• GETSTATIC/PUTSTATIC: load/store static fields.
• XALOAD/XASTORE1: load/store array elements.
In a JVM, the bytecode execution engine is the processor for Java bytecode; it can be an interpreter or a just-in-time (JIT) compiler. An interpreter emulates the behavior of the bytecodes one by one, while a JIT compiler translates a Java method from bytecode to native code the first time the method is invoked. Usually, a JIT compiler improves JVM performance by one order of magnitude compared with an interpreter. Since the cluster-based JVM is targeted at high performance scientific and engineering computing, the JIT compiler is our choice for the execution engine of the cluster-based JVM.
Under the JIT compiler mode, a heap access operation takes only one native machine instruction, so the check code must be designed to be as lightweight as possible.
A straightforward solution is to insert a function call before each heap access, as shown in figure 7.1; the object state is checked and the necessary protocol operation is performed inside the function. Although simple, this approach is very heavyweight, because a function call causes a lot of overhead, such as saving registers before the call, preparing a new stack frame, and restoring registers after the call. The function call should therefore be avoided as much as possible.
A more efficient way is to make a comparison to check the access state,
as illustrated in figure 7.2. If the object has the proper access state, the
function call can be avoided.
1X represents a type indicator, e.g., A (reference), B (byte), C (char), etc.
call gos_check(object1);
access object1;
Figure 7.1: Pseudo code for access check: using a function call
if (object1 does not have the proper state)
call gos_check(object1);
access object1;
Figure 7.2: Pseudo code for access check: by comparison
Since we classify all objects into either DSOs or NLOs, and only DSOs have access states, we can easily come up with a straightforward algorithm for a read operation, as shown in figure 7.3. In this way, two comparisons are required for each read operation.
In order to reduce the comparisons, we let NLOs also carry an access state, namely write (writable). Thus only one comparison is necessary to check the access state of an object. We have patched the JIT compiler engine to perform native instrumentation, inserting the access state check before each heap
access. Figure 7.4 shows the Intel assembly code for a read access after the
native instrumentation by the JIT compiler in our distributed JVM. Register
esi is used for the object reference, register ecx for the object access state,
and register eax for the object field to read. When the object is readable, only three machine instructions are needed to check the access state: one memory read, one comparison, and one jump.
if (object1 is DSO)
if (object1 is invalid)
call gos_check(object1);
read object1;
Figure 7.3: Detailed pseudo code for a read check
0x08eac045: mov 0xc(%esi),%ecx // load access state
0x08eac04b: cmp $0x20000000,%ecx // make a comparison
0x08eac051: jge 0x8eac076 // go to access
0x08eac057: mov %ecx,0xffffffac(%ebp) // save register
0x08eac05d: mov %esi,0xffffffb0(%ebp) // save register
0x08eac063: push %esi // push argument
0x08eac065: call 0x8a3da0 <checkRead> // call gos_check
0x08eac06a: add $0x4,%esp // pop argument
0x08eac070: mov 0xffffffb0(%ebp),%esi // restore register
0x08eac076: mov 0x80(%esi),%eax // read object field
Figure 7.4: IA32 assembly code for a read check
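For illustration only, the same fast path can be written in Java-like form as below. This is a minimal sketch: the state field, the READABLE threshold constant, and the gosCheckRead method are assumed names mirroring the assembly above, not the actual implementation.

    // Java-like sketch of the one-comparison fast path of figure 7.4.
    class GosObject {
        int state;    // the access state word loaded into ecx above
        int field;    // the object field to be read
    }

    class AccessCheck {
        static final int READABLE = 0x20000000;  // threshold used by the cmp

        static int readField(GosObject obj) {
            if (obj.state < READABLE) {   // one load, one compare, one branch
                gosCheckRead(obj);        // slow path: fault the object in
            }
            return obj.field;             // the actual heap read
        }

        static void gosCheckRead(GosObject obj) { /* protocol action */ }
    }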
7.2 Distributed Threading and Synchronization
In the cluster-based JVM, threads within one Java application are automat-
ically distributed among cluster nodes to achieve parallel execution. Thus,
we need to extend the threading subsystem inside the JVM to the cluster
scope, and to virtualize a single thread space across machine boundaries. In
particular, we need to solve the following technical issues:
Thread distribution The threads need to be efficiently distributed among
the nodes of the cluster-based JVM to achieve maximum parallelism.
Thread synchronization Even when running on different nodes, the threads can still interact and coordinate with each other through the methods provided in class java.lang.Thread, and through synchronization operations on any Java object according to the JMM.
JVM termination As we mentioned in the introduction, from the perspective of system architecture, a cluster-based JVM is composed of a group of collaborating daemons, one on each cluster node. Each cluster-based JVM daemon can exit if and only if the multi-threaded Java application terminates.
In a standard JVM, all threads can be classified into either user threads created by the application or daemon threads created by the JVM itself. Any Java application will create at least one user thread, i.e., the main thread. Daemon threads include, e.g., the gc thread for performing garbage collection, and the finalizer thread for performing the finalization work on unreachable objects before their collection. So the JVM is a multi-threaded system even if the running Java application is single-threaded. The whole JVM exits when all user threads have exited. The thread subsystem performs the tasks of thread scheduling and thread synchronization. The thread subsystem also provides non-blocking I/O interfaces.
7.2.1 Thread Distribution
In our cluster-based JVM, we classify all the nodes into two types, the master
node and the slave node. The master node is where the Java application is
started. The slave nodes accept threads distributed from the master node
to share the workload of the application. A daemon thread, called gosd, is
created on each node, which sits in a big loop to handle the cache coherence
protocol requests such as object fault-ins, diff propagations, and synchro-
nization operations.
We follow an initial placement approach to distribute user threads to
slave nodes. Upon the creation of a user thread on the master node, if there is an underloaded slave node, the information of the thread, which includes the thread class name and the thread object, is sent to that slave node.
The gosd thread on the slave node then creates a new user thread based
on the thread information, and invokes the start() method of the thread
object to run the thread. The slave node is made the home of the thread
object to improve the access locality.
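A minimal sketch of this initial placement logic might look as follows. ThreadInfo, SlaveNode, and pickUnderloadedSlave are hypothetical names introduced for illustration, and in the actual system the thread object travels through the GOS rather than plain serialization.

    // Sketch: ship the thread class name and thread object to a slave,
    // whose gosd thread recreates the thread and invokes start() on it.
    class ThreadInfo implements java.io.Serializable {
        final String threadClassName;
        final Object threadObject;    // the java.lang.Thread instance (a DSO)
        ThreadInfo(Thread t) {
            this.threadClassName = t.getClass().getName();
            this.threadObject = t;
        }
    }

    interface SlaveNode { void send(ThreadInfo info); }

    void placeThread(Thread t) {
        SlaveNode slave = pickUnderloadedSlave();   // assumed helper
        if (slave != null) {
            slave.send(new ThreadInfo(t));  // the slave becomes the thread object's home
        } else {
            t.start();                      // no underloaded slave: run on the master
        }
    }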
Our cluster-based JVM does not support dynamic thread distribution
mechanisms such as thread migration, by which a thread can be migrated
from one node to another during its execution.
7.2.2 Thread Synchronization
After the threads are distributed among the cluster nodes in a cluster-based
JVM, they should be able to coordinate with each other during the execu-
tion. This can be achieved through synchronization operations on any Java
object, such as lock, unlock, wait, and notify. Since each object has a home,
all the synchronization operations are executed in the object’s home node.
The object’s home node also acts as the object’s lock manager. If the ob-
ject’s home is not local, synchronization requests are sent to the home node,
which are handled by the gosd thread there. Some synchronization requests
are blocking, such as lock and wait. A lock request will suspend till the
corresponding lock is acquired. Since the gosd thread can not be blocked
anytime, upon receiving a synchronization request, it arranges another kind
of daemon thread, the monitorproxy thread, to actually process and reply
it.
The monitorproxy thread performs the synchronization operation indicated in the request on behalf of the requesting remote thread. Since synchronization is stateful, the gosd thread will always assign the same monitorproxy thread to the requests from the same remote thread. For example, after a monitorproxy thread MP has acquired a lock on an object as requested by a remote thread T, it sends a notice to T. Then T continues its execution as if it had acquired the lock itself. When T requests to release this lock, it must be MP, and no other monitorproxy thread, that processes this request. Only after MP no longer holds any lock state can it process synchronization requests from remote threads other than T.
In the startup phase of the cluster-based JVM, a number of monitorproxy threads are created. When a new synchronization request arrives, the gosd thread tries to pick an available monitorproxy thread; if none is available, a new monitorproxy thread is created to process the request.
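To make this stateful dispatching concrete, the bookkeeping could be sketched as below; the map-based binding and all names are our assumptions, not the actual data structures.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch: the gosd thread binds each remote thread to one monitorproxy
    // for as long as that proxy holds lock state on the thread's behalf.
    class ProxyDispatcher {
        private final Map<Long, MonitorProxy> bound = new HashMap<>();
        private final Deque<MonitorProxy> idle = new ArrayDeque<>();

        MonitorProxy proxyFor(long remoteThreadId) {
            MonitorProxy mp = bound.get(remoteThreadId);
            if (mp == null) {
                // pick an available proxy, or create a new one if none is idle
                mp = idle.isEmpty() ? new MonitorProxy() : idle.pop();
                bound.put(remoteThreadId, mp);
            }
            return mp;
        }

        // called once the proxy no longer holds any lock state for the thread
        void unbind(long remoteThreadId, MonitorProxy mp) {
            bound.remove(remoteThreadId);
            idle.push(mp);
        }
    }

    class MonitorProxy extends Thread { /* performs lock/unlock/wait/notify */ }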
Java threads can also coordinate with each other through the methods of class java.lang.Thread. For example, an invocation of join will block until the callee thread finishes. Like other methods of java.lang.Thread, join is built on the synchronization operations on the thread object, so it is also implemented through the mechanism discussed above.
7.2.3 JVM Termination
A cluster-based JVM is composed of a group of collaborating JVM daemons,
one on each cluster node. Each cluster-based JVM daemon can exit if and
only if the multi-threaded Java application terminates. If a JVM daemon
exits earlier, the home-based cache coherence protocol will be violated for
those DSOs whose homes are there. If a JVM daemon exits later, it becomes
unattended and wastes system resources.
A termination protocol is designed to coordinate all cluster-based JVM
daemons to exit when the Java application terminates.
1. When a slave node is started up, a main thread is also started there,
which will wait on an internal lock, called slavemain.
2. On the master node, a counter is increased by one whenever a user
thread is created in the Java application, and it is decreased by one
whenever a user thread exits.
3. A user thread could be distributed to a slave node. When it exits
there, a notice will be sent back to the master node. The master node
decreases the user thread counter accordingly. Thus the counter reflects
the number of currently live user threads.
4. When the counter reaches zero, which means all user threads have exited, the master node can safely exit. Before exiting, the master node sends notices to all slave nodes, informing them to exit (the counter logic is sketched after this list).
5. Upon receiving the notice to exit, the slave node wakes up the main thread waiting on the slavemain lock. The main thread then exits. Since all user threads have now exited, the slave node will also exit. At this point, the cluster-based JVM terminates and all its JVM daemons have exited.
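A sketch of the master-side counter behind steps 2 to 4 follows; the class and method names are ours, and the broadcast is left abstract.

    // Sketch: counts live user threads cluster-wide on the master node.
    class TerminationCounter {
        private int live = 0;

        synchronized void onUserThreadCreated() {   // step 2
            live++;
        }

        // called for a local exit, or on receipt of an exit notice
        // from a slave node (step 3)
        synchronized void onUserThreadExited() {
            if (--live == 0) {
                broadcastExitNotices();             // step 4
            }
        }

        private void broadcastExitNotices() { /* notify all slave nodes */ }
    }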
7.3 Non-Blocking I/O Support
There are multiple threads on each node of the distributed JVM. Non-
blocking I/O support is a must in the distributed JVM so that a thread
doing I/O will not block the whole node.
We use the remote unlock operation on a DSO as an example to illustrate
non-blocking I/O support in the GOS, as shown in figure 7.5. After the
requesting thread sends out the unlock request, it will be switched off to
give the CPU to other runnable threads. The multi-threading nature of the cluster-based JVM calls for non-blocking I/O processing; otherwise a thread performing I/O would block the whole JVM. Therefore, the receiving thread should never busy-wait for an incoming message. Instead, it gives up the CPU, and the signal SIGIO is later caught to switch the corresponding I/O-waiting thread back on. This introduces significant signal processing overhead. On the requested node, the currently running thread may be some thread other than the GOS daemon thread that takes care of all the GOS request messages, so a thread switch is needed to switch on the GOS daemon thread. The GOS daemon thread will schedule a proper monitorproxy thread to process this unlock request; the proper monitorproxy thread is the one that currently holds the lock. Here another thread switch is incurred. A similar situation happens on the requesting node when it receives the unlock reply message. Along the critical path of a remote unlock, the unlock operation itself, two signal processings, and three thread switches are incurred.
[Figure: message sequence of a remote unlock between the requesting node and the requested node, showing the signal processings and thread switches on both sides]
Figure 7.5: Remote unlock of a DSO
7.4 Distributed Class Loading
A Java class file defines a single class’ static/instance fields, static/instance
methods, and the constant pool that serves a function similar to that of
a symbol table in conventional programming languages. At runtime, the
JVM dynamically loads, links, and initializes classes when they appear in
the application, as illustrated in figure 7.6.
Loading is the process of finding the Java class file and reading it into the
memory. Linking is the process of combining it into the runtime state of the
Java virtual machine. During the linking phase, the bytecode will be verified,
the static fields are allocated and initialized to their default values, and all the symbolic references in the constant pool are resolved. Initialization is the process of executing the class initialization code. The Java class finally stays in the method area of the JVM.

[Figure: Java class file → Loading → Linking → Initialization → class in the method area]

Figure 7.6: The JVM's dynamic loading, linking, and initialization of classes
Our cluster-based JVM provides the dynamic class loading capability defined in the JVM specification. Since Java classes contain readonly definitions of fields and methods, they are allowed to be loaded independently by each node. However, two particular issues need to be addressed to maintain the single system image:
• Although each cluster-based JVM daemon can load Java classes independently, it must be guaranteed that they load Java classes from the same source. In other words, they must load the same Java classes.

• Although each cluster-based JVM daemon can load Java classes independently, a Java class can only be initialized once, and the static variables must be kept consistent according to the JMM during the execution.
To address the first issue, we have configured a Network File System
(NFS) [74] for the cluster-based JVM so that each JVM daemon sees the
same file system hierarchy where the Java class files are stored. Since NFS is a very popular file system in cluster environments, such a configuration does not impair the portability of our cluster-based JVM.
To address the second issue, we let the master node maintain a centralized table recording all the cluster-wide initialized classes. For each initialized class, the table also records where it was initialized, together with a lock to prevent a race condition on the initialization. In the GOS, a class is also considered an object, which contains static fields and has a home; the node initializing the class becomes its home node. Whenever a JVM daemon loads a class, it checks the table to see whether the class has already been initialized. If yes, the JVM daemon skips the initialization and fetches the current content of the static fields from the home node of the class. If not, the class is initialized locally. Before the initialization, the corresponding lock in the table must be acquired; having acquired the lock, the JVM daemon double-checks whether the class has been initialized. The lock is released after the initialization is done. Since the static fields are allowed to be replicated on different nodes, they are also handled by the cache coherence protocol to maintain consistency according to the JMM.
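The centralized table could be sketched as below. All names are assumed; in the real system the table lives on the master node and is consulted through protocol messages rather than direct method calls.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: one entry per cluster-wide initialized class.
    class ClassInitTable {
        static final class Entry {
            final Object lock = new Object();  // serializes initialization
            boolean initialized;
            int homeNode;                      // node that ran the initializer
        }
        private final Map<String, Entry> entries = new HashMap<>();

        synchronized Entry entryFor(String className) {
            Entry e = entries.get(className);
            if (e == null) {
                e = new Entry();
                entries.put(className, e);
            }
            return e;
        }
    }

A loading daemon would then acquire entry.lock, double-check entry.initialized, initialize the class locally if needed (recording itself as the home node), and otherwise fetch the current static fields from entry.homeNode.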
7.5 Garbage Collection
In this section, we will discuss distributed garbage collection in GOS. There
are two GC algorithms in place in GOS, one for local garbage collection
that will be discussed in section 7.5.1, and the other for distributed garbage
collection of DSOs that will be discussed in section 7.5.2.
[Figure: node 0 (home) holds DSO a in its root set; node 1 holds a cached copy of a into which a reference to a new object b has been installed]
Figure 7.7: Tolerating inconsistency in DGC
7.5.1 Local Garbage Collection
An adapted uniprocessor garbage collector, a mark-sweep collector [79] in our case, can function independently on each node in our cluster-based JVM. The challenge here is to put the right objects into the root set to ensure the correctness of GC. The home copy of a DSO should always be put into the root set, since the collector has no idea whether its non-home siblings are still alive. As long as there are non-home siblings, the home copy must be kept because of its special role in the home-based cache coherence protocol.
The inconsistency among the copies of a DSO introduces a new problem in DGC, one which does not exist when there is no consistency issue involved [46]. Figure 7.7 gives an example. Node 0 is the home of DSO a. Node 1 cached a and modified it by installing a reference to object b in a; now the copies of a are inconsistent. If a becomes unreachable on node 1, and node 1 performs a local GC, both a and b will be mistakenly collected. Therefore, when each node performs an independent local GC, all the non-home copies of DSOs that are inconsistent with their home copies, i.e., those in the write access state, should be put into the root set.
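The root-set rule above can be sketched as follows; HeapObject, its predicates, and AccessState are assumed names, not the actual collector interfaces.

    // Sketch: augment the local root set before a mark-sweep pass.
    interface HeapObject {
        boolean isDSO();
        boolean isHomeCopy();
        AccessState accessState();
    }
    enum AccessState { INVALID, READ, WRITE }

    void augmentRootSet(Iterable<HeapObject> localHeap,
                        java.util.Set<HeapObject> rootSet) {
        for (HeapObject o : localHeap) {
            if (o.isDSO()
                    && (o.isHomeCopy()                          // home copies: always roots
                        || o.accessState() == AccessState.WRITE)) {  // dirty non-home copies
                rootSet.add(o);
            }
        }
    }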
[Figure: a diffusion tree rooted at home node 0 with edges 0→1, 0→2, 2→3, and 2→4. Node 0: export list {1, 2}, import list {null}; node 1: export list {null}, import list {0}; node 2: export list {3, 4}, import list {0}; nodes 3 and 4: export list {null}, import list {2}]
Figure 7.8: DSO reference diffusion tree
7.5.2 Distributed Garbage Collection
A DGC algorithm, Indirect Reference Listing [66], is adopted to collect
garbage DSOs.
Essentially, the indirect reference listing (IRL) algorithm maintains a distributed reference diffusion tree for each DSO. In the GOS, a reference to a DSO can be transmitted either from the home node to a non-home node or between two non-home nodes; the former is referred to as reference creation and the latter as reference duplication. With IRL, both the home and non-home copies of a DSO maintain two lists: an import list recording where its reference comes from, and an export list recording where its reference has been sent. In a DSO's reference diffusion tree, every vertex represents a node possessing one of its copies, and the root of the tree is its home node. An edge in the tree represents that the reference has been transmitted from one node to another. The sending node adds the receiving node to its export list, while the receiving node adds the sending node to its import list; if the node to be added is already in the list, the addition has no effect. Fig. 7.8 gives an example, where the figure in each circle is the node number.
When a non-home copy of a DSO is found to meet the following two conditions, it can be reclaimed locally and a garbage notice is sent to its parent in the diffusion tree: (1) its export list is empty; (2) it is not reachable from the local root set, which can be determined by the local collector. When a node receives a garbage notice for a DSO, it removes the sending node from the DSO's export list. When the export list of the home copy of a DSO becomes empty, the DSO is reverted to an NLO. IRL requires that the local collector also put those non-home copies of DSOs with non-empty export lists into the root set.
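For concreteness, the reclamation test could be sketched as below; the field and helper names are assumed, and the notice handling mirrors the description above.

    // Sketch: applied to each non-home copy after local marking.
    void tryReclaim(DsoCopy copy) {
        if (copy.exportList.isEmpty() && !copy.locallyReachable) {
            sendGarbageNotice(copy.parentNode, copy.id);  // to its diffusion-tree parent
            reclaimLocally(copy);
        }
    }
    // On receiving a garbage notice, the parent removes the sender from
    // the DSO's export list; when the home copy's export list becomes
    // empty, the DSO reverts to an NLO.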
The transmission path of a DSO reference may form a cycle among the nodes. In that case the export list on every node in the cycle is non-empty and all the copies will be put into the local root sets, so the DSO can never be reclaimed even if it is not reachable from any local root set. In order to prevent such cycles from polluting the structure of the diffusion tree, we ensure that each node can only have one valid parent in the tree. If a DSO reference arrives from a node different from the current parent, the sender is not added to the import list. Instead, the receiver prepares a pseudo garbage notice for the sender, since the sender has already added the receiver to its export list. Having received the pseudo garbage notice, the sender can remove the receiver from its export list.
IRL inherits the idempotency property from reference listing [67]: the effect of multiple transmissions of a DSO reference between two nodes is the same as that of a single transmission. This property is very helpful in the GOS since DSOs will be transmitted many times by the cache coherence protocol. The indirect nature of IRL avoids the race condition in reference listing when reference deletion and duplication happen at the same time [67]. IRL cannot collect cycles of garbage DSOs whose home nodes are different; however, this is usually not a serious problem.
The major overheads of IRL are maintaining import and export lists for every DSO as well as sending garbage notices. The list maintenance coexists with the reference transmission; compared with the transmission itself, the maintenance overhead, which is simply bitmap setting, is negligible. The garbage notices can be batched and piggybacked on coherence messages. So IRL does not contribute a significant overhead to the GOS.
Chapter 8
Performance Evaluation
8.1 Experiment Environment
We conducted the performance evaluation on the HKU Gideon cluster [14].
Each node has an Intel 2GHz P4 CPU and 512M memory, running Linux
kernel 2.4.22. A Network File System (NFS) is set up and mounted on all the cluster nodes so that the user has the same view of the home directory on all nodes. All the cluster nodes are connected by two Fast Ethernet networks,
one for NFS, the other for high performance communication such as MPI.
Our cluster-based JVM is implemented based on the Kaffe JVM [9] which
is an open-source JVM. A Java application is started on the master node.
When a Java thread is created, it is automatically dispatched to a cluster
node to achieve parallel execution. Unless specified otherwise, the number
of computation threads created is the same as the number of cluster nodes
in all the experiments.
In our implementation, we leverage TCP/IP Socket interface for all the
communications. We use Netperf [40] to evaluate the TCP/IP performance
of Gideon cluster. It takes 114 microseconds to send a one-byte request
message and get a one-byte response message. The network throughput is
94.05Mb/s when the message size is 4096 bytes.
8.2 Application Suite
In this section, the application suite used to evaluate the performance of our
cluster-based JVM will be presented. The application suite contains CPI,
ASP, SOR, NBody, NSquared, and TSP.
8.2.1 CPI
CPI is a multi-threaded Java program to calculate π, which is computed by

    π = ∫₀¹ 4/(1 + x²) dx    (8.1)

The program follows a fork-and-join parallelism style. The integral intervals are equally divided among the threads.
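A minimal single-threaded sketch of this integration (midpoint rule, our variable names) is:

    // Sketch: approximate pi by summing 4/(1+x^2) over n sub-intervals.
    double computePi(long n) {
        double h = 1.0 / n;
        double sum = 0.0;
        for (long i = 0; i < n; i++) {
            double x = h * (i + 0.5);      // midpoint of the ith sub-interval
            sum += 4.0 / (1.0 + x * x);
        }
        return h * sum;
    }

In the multi-threaded version, each thread sums its own share of the sub-intervals, and the partial sums are combined at the join.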
8.2.2 ASP
The All-pairs Shortest-Path (ASP) problem is to find the shortest path be-
tween all pairs of vertices in a graph. ASP is an important problem in
graph theory and has applications in communications, transportation, and
electronics problems [47].
A graph can be represented as a distance matrix D in which each element (i, j) represents the distance between vertex i and vertex j. We assume that for any i and j, Dij exists, so that 0 ≤ Dij < ∞; also, Dij = Dji and Dii = 0. Floyd gives a sequential algorithm for ASP. It solves a graph of N vertices in N steps, constructing an intermediate matrix I(k) containing the best-known shortest distance between each pair of vertices at step k. Initially, I(0) is set to D. The kth step of the algorithm considers each Iij in turn and determines whether the best-known path from i to j is longer than the combined lengths of the best-known paths from i to k and from k to j. If so, the entry Iij is updated to reflect the shorter path.
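For concreteness, the kth step can be sketched in Java as follows (our variable names):

    // Sketch: the kth step of Floyd's algorithm on an n-vertex graph;
    // I is the intermediate matrix described above.
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (I[i][k] + I[k][j] < I[i][j]) {
                I[i][j] = I[i][k] + I[k][j];
            }
        }
    }

In the parallel version described below, each Worker thread runs this step only over its assigned rows i, with a barrier after each k.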
We design a parallel version of Floyd's algorithm by making a row-wise domain decomposition of the distance matrix D and the intermediate matrix I among threads. Appendix A.2 shows the run method of the Worker thread in our ASP. The instances of the Worker thread perform the actual computation. At step k, all threads need the value of the kth row of the distance matrix. There is a barrier at the end of each iteration. The workload is distributed equally among the Worker threads. The rows of D are allocated among cluster nodes in a round-robin manner initially.

    black[j][k] = (red[j-1][k] + red[j+1][k] + red[j][k-1]
                   + red[j][k+1]) / (float)4.0;

Figure 8.1: The typical operation in SOR
8.2.3 SOR
The red-black Successive Over-Relaxation (SOR) method is used to solve partial differential equations of the form

    ∂²f/∂x² + ∂²f/∂y² = 0    (8.2)
A matrix is created with the perimeter elements initialized to the boundary conditions of a given mathematical problem. The interior elements are repeatedly computed as the average of their top, bottom, left, and right neighbors until the computed values are sufficiently close to the values computed in the last iteration.
Two matrices, a red one and a black one, are used in SOR. In each iteration the elements are read from one matrix, and the computed values are written to the other; after the iteration finishes, the roles of the two matrices are swapped. Figure 8.1 shows the typical operation in SOR.
We partition the red and black matrices among threads in a row-wise way. Each thread computes the parts of the matrices it has been assigned; thus, the workload is equally partitioned among the threads. Each thread needs to access not only its own sub-matrices but also the neighboring rows in the matrices, which are computed by other threads. After each iteration, all threads are synchronized through a barrier operation. The rows of the red and black matrices are allocated among cluster nodes in a round-robin manner initially.

(a) Space decomposition (b) Barnes-Hut tree

Figure 8.2: Barnes-Hut tree for 2D space decomposition
8.2.4 NBody
NBody is used to simulate the motion of particles under their mutual gravitational forces. The Barnes-Hut method [29] is a well-known hierarchical NBody algorithm. In the Barnes-Hut method, a physical space is recursively divided into sub-domains until each sub-domain contains at most one body. The space decomposition is based on the spatial distribution of the bodies. Figure 8.2 (a) gives an example of space decomposition in 2D space. Initially, the space is equally divided into four sub-domains. If there is more than one body in a sub-domain, the sub-domain is further decomposed into four smaller sub-domains. A Barnes-Hut tree is built based on the space decomposition, as figure 8.2 (b) shows.
In the Barnes-Hut tree, the bodies reside at the leaves. Inner cells in the tree correspond to the sub-domains, and each represents the center of mass of the bodies beneath it. The force computation is performed by traversing the tree, which is built at the beginning of each iteration. If a body is far enough from a cell, no further traversal is made beneath the cell: the force influence from the bodies below the cell can be computed as the force influence from the cell itself, i.e., from its center of mass. Otherwise, the traversal proceeds to the children of the cell. After the force computation, each body updates its position in the space as the result of the force influences, which ends one simulation loop. The tree is rebuilt at the beginning of the next iteration to reflect the new body distribution in the space.
We parallelize the Barnes-Hut method by dividing the bodies equally among threads. The workload of the threads is not balanced, as the computation load associated with each body differs. The tree construction is not parallelized. In the NBody application, there is a main thread responsible for the tree construction, and a number of worker threads responsible for computing the forces and the resulting body movements. During each iteration, after the main thread has built the tree, it wakes all the waiting worker threads. A barrier operation synchronizes the worker threads after they finish their computation; then the main thread is notified to begin the tree construction of the next iteration. A large number of Java objects, which describe the bodies' positions, velocities, and forces, are created during the tree construction.
8.2.5 NSquared
NSquared solves the NBody problem with O(n²) complexity, just like Water-NSquared in the Splash-2 benchmark suite [80].

All n bodies are stored in an array. The workload is evenly partitioned among threads by assigning an identical number of bodies to each thread. A thread is responsible for calculating the force on each of its assigned bodies, and for updating the bodies' positions accordingly. To calculate the force on one body, we need to combine the interactions between this body and each of the other n − 1 bodies.
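One thread's force phase might be sketched as below (assumed names), directly reflecting the O(n²) pairwise structure:

    // Sketch: each thread handles bodies [from, to) out of n.
    for (int i = from; i < to; i++) {
        for (int j = 0; j < n; j++) {
            if (j != i) {
                accumulateForce(bodies[i], bodies[j]);  // pairwise interaction
            }
        }
        updatePosition(bodies[i]);  // apply the combined force
    }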
8.2.6 TSP
The Traveling Salesman Problem (TSP) is to find the cheapest way of visiting
all the cities and returning to the starting point. Our TSP finds the optimal
solution instead of an approximate one by searching the entire solution space.
Our TSP follows a branch-and-bound approach: it prunes large parts of the solution space by ignoring partial routes that are already longer than the current best solution.

The program divides the whole solution space into many small sub-spaces to build up a job queue in the beginning; a sub-space contains all the routes sharing the same prefix. A number of worker threads are created initially. Every thread repeatedly takes a sub-space from the job queue and searches it for the optimal solution until the queue is empty. The workload of the threads is not balanced, and a large number of objects are created during the search.
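The pruning rule might be sketched as below (assumed names); best holds the shared current optimum.

    // Sketch: depth-first branch-and-bound over a sub-space (route prefix).
    void search(int[] route, int fixed, int lengthSoFar) {
        if (lengthSoFar >= best.length()) {
            return;                        // prune: already worse than the best tour
        }
        if (fixed == route.length) {       // complete tour: close the cycle
            best.update(route, lengthSoFar + dist(route[fixed - 1], route[0]));
            return;
        }
        for (int i = fixed; i < route.length; i++) {
            swap(route, fixed, i);         // try each remaining city next
            search(route, fixed + 1,
                   lengthSoFar + dist(route[fixed - 1], route[fixed]));
            swap(route, fixed, i);         // undo
        }
    }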
8.3 Application Performance
In our experiments, unless stated otherwise, CPI makes the integration on 100,000,000 sub-intervals, ASP solves a graph of 1024 vertices, SOR performs the successive over-relaxation on a 2-D matrix of 2048 by 2048 for 30 iterations, NBody simulates the motion of 2048 particles over 30 steps, NSquared simulates the motion of 2048 particles over 10 steps, and TSP solves a problem of 12 cities.
[Figure: normalized execution time of Kaffe vs. GOS/1 for ASP, SOR, NSquared, NBody, TSP, and CPI]
Figure 8.3: Single node performance
8.3.1 Sequential Performance
Our cluster-based JVM is based on Kaffe JVM. When running our cluster-
based JVM on only one processor, we can measure its sequential performance.
The major GOS overhead incurred in the sequential performance is caused
by software checks inserted before object accesses, which are used to en-
sure the corresponding objects are in the right access states, as discussed in
section 7.1. By comparing the sequential performance of our cluster-based
JVM with the performance of original Kaffe, we can measure the overhead
of software checks.
Figure 8.3 compares the performances of the cluster-based JVM and Kaffe
using our application suite. In the figure, GOS/1 denotes the cluster-based
JVM on one processor. Both the Kaffe JVM and our cluster-based JVM run
in the just-in-time mode.
Among all applications, ASP, SOR, and NSquared incur a heavy check
overhead due to their intensive array object accesses. NBody's and TSP's check overheads are well contained, at less than 10%. In CPI, most of the time is spent on calculation, and the object accesses are very few.

[Figure: speedup against the number of processors (up to 16) for CPI, TSP, NBody, NSquared, SOR, and ASP]

Figure 8.4: Speedup
8.3.2 Parallel Performance
We measure the speedup for all applications on up to 16 processors as an
overall performance evaluation for our cluster-based JVM. Figure 8.4 shows
the speedup curves. In the experiments, n threads will be created when
running on n processors. The sequential time on 1 processor is measured
on the original Kaffe JVM where only one thread is created. All the cache
coherence protocol optimizations are enabled. Both the Kaffe JVM and our
cluster-based JVM run in the just-in-time mode.
The applications’ parallel performances are determined by their computation-
to-communication ratios. Among all the applications, TSP and CPI are com-
putationally intensive programs. Therefore, they are able to achieve speedups
85
of more than 13 on 16 processors. NBody and NSquared also achieves accept-
able speedups on 16 processors. SOR and ASP’s performances are embar-
rassing. They achieve speedups less than 3.5 on 8 processors. Their speedup
curves drop on 16 processors.
In order to further investigate the factors contributing to the applications' performance, we break down the execution time into various parts: Comp denotes the computation time; Obj, the object access time to fault in up-to-date copies of invalid objects; Syn, the time spent on synchronization operations such as lock, unlock, wait, notify, and migrated synchronized methods; and GC, the garbage collection overhead. We instrument internal functions of our cluster-based JVM to measure the accumulated overheads of Obj, Syn, and GC. The Comp time is computed by subtracting all the other parts from the total time.

All the breakdown data are normalized to the total execution time, as displayed in figure 8.5. How we obtain the breakdown data is discussed in appendix A.3 in detail. In spite of a certain imprecision, figure 8.5 helps us gain insight into the executions.
Notice that not every application requires GC. Obj and Syn portions are
the GOS overhead to maintain a global view of a virtual object heap shared
by physically distributed threads. Obj and Syn portions not only include
the necessary local management cost and the time spent on the wire for
moving the protocol-related data, but also the possible waiting time on the
requested node. The percentage of Comp roughly reflects the efficiency of
parallel executions.
ASP requires n iterations to solve an n-node graph problem. There is a barrier at the end of each iteration, which requires the participation of all threads. When ASP runs on more processors, the computation workload of each thread decreases; in contrast, the Syn part increases when more processors join. The Obj part also increases with the number of processors.
On the ith iteration, all threads need to access the ith row of the distance matrix. When the number of processors increases, the home node of the ith row needs to serve more requests; thus the waiting time of each request increases correspondingly. When scaled up to a large number of processors, ASP's performance is hindered by the intensive data communication and synchronization overheads.

[Figure: breakdown of normalized execution time into Comp, Syn, Obj, and GC on 2, 4, 8, and 16 processors for ASP, SOR, NBody, TSP, NSquared, and CPI]

Figure 8.5: Breakdown of normalized execution time against number of processors
The situation of SOR is similar to that of ASP. In SOR, there are two barriers in each iteration, and the Syn part contributes a significant portion of the execution time when scaled to a large number of processors. The absolute time of Obj stays roughly constant because each thread only needs to access the neighboring rows of the rows it manages in the matrices; the data to be accessed do not increase with the number of processors. However, the percentage of Obj in the total time increases because each thread's computation load is reduced when SOR runs on more processors. Similar to ASP, SOR's performance is hindered by the intensive data communication and synchronization overheads when scaled up to a large number of processors.
NBody also involves synchronization in each simulation step. The synchronization overhead becomes a significant part of the overall execution time when we increase the number of processors. The absolute time of Obj decreases as the number of processors increases, but more slowly than the absolute time of Comp does, so the percentage of Obj with respect to the total time increases. NBody is a memory intensive application and therefore triggers garbage collection. With our distributed garbage collection mechanism in place, the GC overhead is highly parallelized: the absolute time of GC is inversely proportional to the number of processors. The breakdown of NSquared is similar to that of NBody.
TSP is a computationally intensive application, and the GOS overhead
accounts for less than 1% of the total execution time. TSP is also a mem-
ory intensive application. The absolute times of GC and Obj are inversely
proportional to the number of processors. Nevertheless, their percentages in the total time stay constant on various numbers of processors. CPI is a computation-intensive application; most of its time is Comp.

              Parameters                    Messages    Traffic (KB)
    CPI       100,000,000 sub-intervals          255              12
    ASP       A graph of 1024 vertices       169,130         347,425
    SOR       A 2048 by 2048 matrix           35,999          93,286
              for 30 iterations
    NBody     2048 particles over 30 steps   752,878         321,505
    NSquared  2048 particles over 10 steps   698,192          74,230
    TSP       12 cities                        4,849             411

Table 8.1: Communication effort on 16 processors
8.4 Effects of Adaptations
In this section, we evaluate the effectiveness of the adaptations discussed in chapter 5: adaptive object home migration, synchronized method migration, and connectivity-based object pushing.
All applications except TSP and CPI incur a lot of communication during
the parallel executions. Table 8.1 shows their communication effort when
running on 16 processors. The measurements are made after all the cache
coherence protocol optimizations are enabled.
Figure 8.6 shows the overall performance improvement due to the adaptations for the four benchmark applications. We do not show the figures for CPI and TSP because they are computationally intensive applications and incur little communication; the adaptations have no obvious effect on them. In the figures, Basic represents the basic cache coherence protocol with the three adaptations disabled, and Adaptive represents the adaptive cache coherence protocol with all adaptations enabled. We display the applications' execution times against the number of processors. The cluster-based JVM runs in the JIT compilation mode.

[Figure: execution time (seconds) against number of processors, Basic vs. Adaptive, for (a) ASP, (b) SOR, (c) NBody, and (d) NSquared]

Figure 8.6: The adaptive protocol vs. the basic protocol
We can observe from the figures that the adaptive cache coherence protocol greatly improves the performance of ASP and SOR. For example, 76% to 89.7% of ASP's execution time is eliminated when the adaptive protocol is enabled. The adaptive protocol also improves the performance of NBody and NSquared considerably; for example, as seen from figure 8.6 (c), 23.8% of NBody's execution time is eliminated on 16 nodes when the adaptive protocol is enabled.
In order to further investigate the effectiveness of the various adaptations, we break down their effects. In the experiments, all adaptations are disabled initially, and we then enable the planned adaptations incrementally. Figure 8.7 shows the effects of the adaptations on the execution time, figure 8.8 shows their effects on the number of messages generated during the execution, and figure 8.9 shows their effects on the network traffic generated during the execution. All data are normalized to those with none of the adaptations enabled, and are presented against different numbers of processors. In the legend, No denotes no adaptation enabled, HM denotes adaptive object home migration, SMM denotes synchronized method migration, and Push denotes connectivity-based object pushing.

We elaborate on the effectiveness of each adaptation in the following sub-sections.
8.4.1 Adaptive Object Home Migration
Among the four applications, adaptive object home migration improves the performance of ASP and SOR substantially, as seen in figure 8.7 (a) and (b). In ASP and SOR, the data reside in 2-D matrices shared by all threads. In Java, a 2-D matrix is implemented as an array object whose elements are also array objects. Many of these array objects exhibit the single-writer access pattern after they are initialized. The shared data are allocated to different cluster nodes in a round-robin manner initially, so their original homes are not the writing nodes. The home migration protocol automatically makes the writing node the home node to eliminate remote accesses. As seen in figure 8.8 (a) and figure 8.9 (a) for ASP, and figure 8.8 (b) and figure 8.9 (b) for SOR, home migration greatly reduces the messages and network traffic generated during the executions of ASP and SOR, which explains the performance improvement.
[Figure: normalized execution time against number of processors under No, HM, HM+SMM, and HM+SMM+Push, for (a) ASP, (b) SOR, (c) NBody, and (d) NSquared]
Figure 8.7: Effects of adaptations w.r.t. execution time
As a further demonstration, figure 8.10 visualizes the effect of object home migration on SOR using the PAT discussed in chapter 6. Figure 8.10 (a) is the time line window without home migration; there are four global phases, each taking approximately the same amount of time. Figure 8.10 (b) is the time line window with home migration enabled. Three global phases are marked in the figure: "Before Home Migration", "Home Migrating", and "After Home Migration". Before home migration takes effect, we observe that a lot of remote reads and writes are sent to their home node, node 0. (The shared objects are intentionally allocated on node 0 to simplify the visualization view.)
[Figure: normalized message number against number of processors under No, HM, HM+SMM, and HM+SMM+Push, for (a) ASP, (b) SOR, (c) NBody, and (d) NSquared]
Figure 8.8: Effects of adaptations w.r.t. message number
During the home migrating phase, we observe that although the reads (white arrows) are still sent to the original home node, the writes (gray arrows) are performed locally, meaning the home has already migrated to the local node by that moment. We can also observe that the phase after home migration takes much less time than the phase before home migration, since most remote reads and writes are eliminated by object home migration. As can be observed, the effect of home migration is to change remote reads/writes into home reads/writes.
[Figure: normalized network traffic against number of processors under No, HM, HM+SMM, and HM+SMM+Push, for (a) ASP, (b) SOR, (c) NBody, and (d) NSquared]
Figure 8.9: Effects of adaptations w.r.t. network traffic
Home migration also improves the performance of NSquared. In NSquared, the data of the particles are stored in an array, and the particles are evenly distributed among threads. Each thread only updates its assigned particles. Thus the particle objects present the single-writer pattern, and the communication is reduced by migrating the homes of the particle objects to their respective updating threads.

Home migration has little impact on the performance of NBody because NBody lacks the single-writer pattern, as seen in figure 8.7 (c). This also indicates that our home migration protocol has little negative side effect because of its lightweight design.

[Figure: time line windows (a) without home migration and (b) with home migration; in (b) three phases are marked: Before Home Migration, Home Migrating, and After Home Migration]

Figure 8.10: The effect of object home migration on SOR
8.4.2 Synchronized Method Migration
Synchronized method migration optimizes the execution of a synchronized
method of a non-home DSO. Although it does not reduce the network traffic,
it reduces the number of messages and the protocol overheads, as we discussed
in section 5.2.
ASP requires n barriers for all the threads in order to solve an n-node graph. SOR requires two barriers in each iteration. NSquared requires one barrier in each simulation step. The barrier operation is implemented as a synchronized method. We see in figure 8.8 (a) and (b) that synchronized method migration reduces the messages generated during the executions of ASP and SOR; for example, on 16 processors, 35% of ASP's messages are eliminated by enabling synchronized method migration. Consequently, ASP's and SOR's overall performance improves to some extent, particularly when running on a large number of processors, as observed in figure 8.7 (a) and (b).
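For reference, a barrier written as a synchronized method, which is what lets synchronized method migration ship the whole call to the barrier object's home node, could look like the following sketch (our names, not the actual GOS code):

    // Sketch: a reusable barrier built on Java monitor operations.
    class Barrier {
        private final int parties;
        private int arrived = 0;
        private int generation = 0;

        Barrier(int parties) { this.parties = parties; }

        synchronized void await() throws InterruptedException {
            int gen = generation;
            if (++arrived == parties) {
                arrived = 0;
                generation++;      // release this round
                notifyAll();
            } else {
                while (gen == generation) {
                    wait();        // block until the round completes
                }
            }
        }
    }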
However, there is an exception: running on 4 processors, ASP's execution time increases by 8.2% after enabling synchronized method migration. A detailed analysis of this phenomenon is given in section 8.6.

Synchronized method migration has a very limited effect on NSquared, because the synchronization-related messages are only a very small percentage of the total messages; most messages are object fault-ins. Synchronized method migration has no effect on NBody because NBody uses synchronized blocks instead of synchronized methods.
8.4.3 Connectivity-based Object Pushing
Connectivity-based object pushing is a prefetching strategy which takes ad-
vantage of the object connectivity information to improve the reference lo-
cality. Particularly, it improves the producer-consumer pattern greatly.
NBody is a typical application with the producer-consumer pattern. In NBody, a quadtree is constructed by one thread and then accessed by all other threads in each iteration. The quadtree consists of a large number of small-sized objects. Object pushing greatly reduces the number of messages for NBody, as seen in figure 8.8 (c). Since object pushing may push unneeded objects as well, the amount of communication increases slightly, as seen in figure 8.9 (c). The improvement in execution time due to object pushing is also significant in NBody, as seen in figure 8.7 (c). When NBody
runs on a large number of processors, the percentage of object fault-in time
in the total execution time increases, as shown in figure 8.5. Thus the effect
of object pushing is amplified correspondingly.
Object pushing improves the reference locality in NSquared, too. In NSquared, the particle object contains multiple sub-objects, describing its coordinates, its velocity, and the integrated forces on it. Object pushing aggregates multiple objects in one message.
Compared with NBody and NSquared, most DSOs in ASP and SOR are
array objects of reference type and primitive type, and object pushing is not
performed on them to reduce the impact of pushing unneeded objects.
8.5 Sensitivity and Robustness Analysis for the HM Protocol
In order to clearly examine the performance difference between home migration protocols with different fixed thresholds and the protocol with an adaptive threshold, we carefully designed some synthetic benchmark programs that predominantly present the single-writer pattern. This lets us rule out any other factors that influence the performance and concentrate on the effectiveness of our home migration protocol on the single-writer pattern.
In our object access pattern space, there are two major basic patterns along the synchronization dimension: accumulator and assignment. To avoid data race conditions, object accesses presenting the single-writer pattern can be coordinated using either accumulator synchronization or assignment synchronization. With accumulator synchronization, the objects of the single-writer pattern are accessed inside the critical section; with assignment synchronization, they are accessed outside the critical section. Proper synchronization guarantees that the reads happen after the writes so that the values can be safely transferred from one thread to another.
We have carefully designed two benchmark programs: RCounter, i.e., repeated counter operations, representing the single-writer pattern under the accumulator synchronization; and DSOR, i.e., dynamic SOR, representing the single-writer pattern under the assignment synchronization. In addition, RCounter demonstrates the behavior of the home migration protocol on small-sized objects, while DSOR demonstrates it on relatively large-sized objects. RCounter and DSOR represent the two most important situations of the single-writer pattern in real applications, and are therefore well suited to evaluating our adaptive home migration protocol. Below we present and analyze their experimental results respectively.

In the experiments, we start with eight working threads, all running on the slave nodes. All synchronization operations are distributed ones that are sent to the master node, so all the performance differences come from the effects of the different home migration protocols.

    while (true) {
        synchronized (lock0) {
            if (counter.internal >= n) {
                break;
            }
            counter.internal++;
            for (int j = 0; j < r - 1; j++) {
                synchronized (lock1) {
                    counter.internal++;
                }
            }
        }
        // Some simple arithmetic
        // computation goes here.
    }

Figure 8.11: RCounter's source code skeleton run by each thread
RCounter
Figure 8.11 shows Rcounter’s source code skeleton run by each thread. In the
benchmark, after a thread acquires the lock of object lock0, it will update a
shared counter for a number of times, which we refer to as the repetition of the
single-writer pattern. It is represented by r in the code. The home migration
protocols try to change the home of this shared counter object to improve the
performance. In order to reflect these updates to the home copy, each update
is enclosed in a synchronized block. Notice after this thread releases lock0, it
may acquire it again, or another thread may get the chance to acquire it. For
example, if the repetition of single-writer pattern is 4, the actual consecutive
writing times could be a multiple of 4, such as 8 and 16. This happens
randomly at runtime. We also embed some computation in the benchmark
to make it more realistic. We measure the performance of different home
migration protocols against different repetitions of the single-writer pattern.
Figure 8.12 shows the normalized execution time against different repe-
titions of the single-writer pattern. NM denotes no home migration. FT1
denotes home migration with a fixed threshold of 1. FT2 denotes home mi-
gration with a fixed threshold of 2. FT1 always performs home migration
more eagerly than FT2. AT denotes the home migration protocol with an
adaptive threshold. For each repetition, the execution times are normalized
to the largest one among them.
Figure 8.13 shows the normalized message number against different repe-
titions of the single-writer pattern. For each repetition, the message numbers
are normalized to the largest one among them. We further break down the
messages into four categories: obj denotes normal object fault-in without
home migration happening at the same time, mig denotes object fault-in
with home migration, diff denotes diff propagation, and redir denotes ob-
ject home redirection. We do not consider synchronization messages because
they are invariable in all cases as mentioned before.
In the message breakdown, the communication overhead without home migration includes obj and diff; these are the overheads that the home migration protocol tries to reduce. With home migration, the total number of object fault-ins equals obj plus mig, and redir is the negative impact of home migration.

[Figure: normalized execution time against repetition of the single-writer pattern (2, 4, 8, 16) under NM, FT1, FT2, and AT]

Figure 8.12: Effects of home migration protocols against repetition of single-writer pattern: normalized execution time (RCounter)

[Figure: normalized message number, broken into redir, diff, mig, and obj, against repetition of the single-writer pattern (2, 4, 8, 16) under NM, FT1, FT2, and AT]

Figure 8.13: Effects of home migration protocols against repetition of single-writer pattern: normalized message number (RCounter)
We have several observations from figures 8.12 and 8.13. First, when the repetition of the single-writer pattern is large enough, e.g., 16, the benefit from home migration is quite obvious: 87.2% of object fault-ins and diff propagations are eliminated by FT1. In other words, remote reads/writes change to home reads/writes. We can expect even better performance improvement from home migration when the repetition is larger.
Second, when the repetition of the single-writer pattern is not large enough, the benefit from home migration may not pay off against the home redirection overhead. In particular, when the object's home and the lock's home are on the same node, as in the situation without home migration, the diff propagation can be piggybacked on synchronization messages. This explains why the home migration protocols incur far fewer messages but still perform roughly the same as the protocol without home migration when the repetition of the single-writer pattern is 8.
Third, in all cases, FT1 is more sensitive than FT2 to the single-writer pattern, in that the numbers of object fault-in and diff propagation messages in FT1 are lower than those in FT2: FT1 changes more remote reads/writes to local reads/writes. When the repetition is relatively large, such as 8 or 16, AT performs as well as FT1 in this respect. This fact confirms our claim that AT presents good sensitivity to the lasting single-writer pattern.
Finally, when the repetition is relatively small, such as 2 or 4, i.e., under the transient single-writer pattern, fixed-threshold home migration protocols incur a lot of redirection overhead. This shows that fixed-threshold protocols are usually not robust against the transient single-writer pattern, except in some individual cases; for example, FT2 prohibits home migration when the repetition is two. As we can see, AT demonstrates better robustness than the fixed-threshold protocols in this respect. AT is able to detect the transient single-writer pattern and strike a good balance between performing home migration to reduce remote accesses and prohibiting home migration to reduce the redirection overhead. When the repetition is relatively small, such as 2 or 4, AT greatly reduces the home redirection messages.

    for (int cur = 0; cur < iteration; cur++) {
        for (int i = from; i < to; i++) {
            for (int j = 0; j < sizeOfMatrix; j++) {
                matrix[i][j] = average(i, j);
            }
        }
        if ((cur + 1) % repetition == 0) {
            from = from + sizeOfMatrix / numOfThreads;
            to = to + sizeOfMatrix / numOfThreads;
            if (from >= sizeOfMatrix) {
                from -= sizeOfMatrix;
                to -= sizeOfMatrix;
            }
        }
    }

Figure 8.14: DSOR's source code skeleton run by each thread
DSOR
DSOR is a variant of the SOR discussed in section 8.2.3. In SOR, the computation workload of the matrix is evenly distributed among threads in a row-wise manner; each thread is the only writer of its assigned rows, which are array objects. These single-writer patterns are fixed throughout the execution. In DSOR, however, after an adjustable number of iterations, we reassign the
array objects to threads in a circular way. The threads still conduct the
single-writer pattern on their newly assigned array objects. Therefore, each
array object presents a changeable single-writer pattern, where the writer
varies from time to time. Figure 8.14 shows DSOR’s source code skeleton
run by each thread. Here we calculate a 1024 by 1024 matrix for 64 iterations.
We can specify the number of iterations during which a particular thread writes a given set of array objects, i.e., the repetition of the single-writer pattern, represented by repetition in figure 8.14. Analogous to figure 8.12, figure 8.15 shows the normalized execution time against different repetitions of the single-writer pattern; analogous to figure 8.13, figure 8.16 shows the normalized message number against different repetitions of the single-writer pattern.
Many observations from RCounter can also be made in DSOR. For example, FT1 is more sensitive than FT2 to the single-writer pattern, so FT1 converts more remote accesses to local accesses than FT2. In this respect, AT is as good as FT1, as shown when the repetition of the single-writer pattern is 4, 8, or 16.
Since the shared objects presenting the single-writer pattern are relatively
large, the benefit from home migration is obvious even when the repetition
is not very large, e.g., 2 and 4. In RCounter, by contrast, the home migration
benefit is convincing only when the repetition is 16, because of the small
size of the shared object. Recall that in our protocol, the object size is
taken into account when calculating the home access coefficient, which is the
overhead ratio of one eliminated pair of object fault-in and diff propagation to
one home redirection. Although not shown in the figure, we also experimented
with a fixed home access coefficient that does not take the object size into
account. When the repetition is 4, the protocol with a fixed home access
coefficient increases the execution time by 10.2% and reduces the number of
home migrations by more than half. Clearly, the fixed home access coefficient
does not correctly account for the benefit from home migrations.
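To make the role of the home access coefficient concrete, the following
sketch expresses the migration decision as code. It is only illustrative:
the class name, cost constants, and counters are hypothetical, and the
actual protocol maintains such statistics inside the JVM runtime.

// Illustrative sketch of an adaptive home migration decision.
// All names and constants are hypothetical.
class HomeMigrationPolicy {
    // Assumed cost model: the fixed and size-dependent parts of one
    // object fault-in plus diff propagation, relative to the cost of
    // one home redirection (all in the same arbitrary unit).
    static final double REDIRECTION_COST = 1.0;
    static final double FIXED_FAULT_COST = 1.0;
    static final double PER_BYTE_COST = 0.01;

    // The home access coefficient: the overhead ratio of one eliminated
    // pair of object fault-in and diff propagation to one home
    // redirection. It grows with the object size.
    static double homeAccessCoefficient(int objectSizeInBytes) {
        return (FIXED_FAULT_COST + PER_BYTE_COST * objectSizeInBytes)
                / REDIRECTION_COST;
    }

    // Migrate the home to the single writer only if the expected saving
    // from eliminated remote accesses outweighs the redirection cost.
    static boolean shouldMigrate(int eliminatedRemoteAccesses,
                                 int expectedRedirections,
                                 int objectSizeInBytes) {
        return eliminatedRemoteAccesses
                * homeAccessCoefficient(objectSizeInBytes)
                > expectedRedirections;
    }
}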
[Figure: normalized execution time (0-100%) against repetition of the
single-writer pattern (2, 4, 8, 16) for NM, FT1, FT2, and AT]
Figure 8.15: Effects of home migration protocols against repetition of single-writer pattern: normalized execution time (DSOR)
[Figure: normalized message number (0-100%) for NoMig, FT1, FT2, and AT
at repetitions 2, 4, 8, and 16 of the single-writer pattern; message
categories: redirection (redir), diff, migration (mig), and object (obj)]
Figure 8.16: Effects of home migration protocols against repetition of single-writer pattern: normalized message number (DSOR)
In DSOR, the most interesting behavior appears when the repetition is 2:
AT performs better than both FT1 and FT2. FT2 prevents most of the home
migrations, so it behaves almost the same as NoMig. FT1 allows home migration
on the remote read coming from the last writer, so it incurs many home
migrations as well as many home redirections. AT strikes a good balance
between FT1 and FT2. By considering the home redirection overhead, AT incurs
only about 10% of the home migrations that FT1 incurs, and thus greatly
reduces home redirections, while still managing to eliminate some remote
accesses through home migrations.
To sum up, RCounter and DSOR, which represent the two most important
situations of the single-writer pattern in real applications, both show that
our adaptive home migration protocol is sensitive to the lasting single-writer
pattern, and at the same time robust against the transient single-writer pat-
tern.
8.6 More on Synchronized Method Migration
As shown in figure 8.7 (a), running on 4 processors, ASP’s execution time
increases by 8.2% after enabling synchronized method migration. Since the
computer cluster is dedicated to the experiments and we run each test many
times to take an average, we have removed most, if not all, unpre-
dictable factors in the execution. So there must be some reason behind this
exception.
As a first step to investigate the reason, we measure the effect of synchro-
nized method migration on a barrier test benchmark. In the benchmark, all
working threads repeatedly perform barrier operations. Figure 8.17 shows
the effect of synchronized method migration on the barrier operation against
the number of processors. The barriers are repeated 10,000 times in the
experiments. As seen from the figure, the effect of synchronized method
migration on the barrier operation is very clear, and it increases with the
number of processors.

[Figure: execution time in seconds (0-100) against the number of processors
(up to 16) for the barrier benchmark, with and without synchronized method
migration (w/ SMM, w/o SMM)]
Figure 8.17: Effect of synchronized method migration on the barrier operation against the number of processors
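The structure of the barrier benchmark can be sketched as a small
stand-alone program. Here java.util.concurrent.CyclicBarrier merely stands
in for the GOS's distributed barrier, and the thread and repetition counts
are parameters of the experiment.

import java.util.concurrent.CyclicBarrier;

// Minimal sketch of the barrier micro-benchmark: every worker repeats
// the barrier operation REPEAT times and the total time is reported.
class BarrierBench {
    static final int THREADS = 4;
    static final int REPEAT = 10000;

    public static void main(String[] args) throws InterruptedException {
        CyclicBarrier bar = new CyclicBarrier(THREADS);
        Thread[] workers = new Thread[THREADS];
        long start = System.currentTimeMillis();
        for (int t = 0; t < THREADS; t++) {
            workers[t] = new Thread(() -> {
                try {
                    for (int i = 0; i < REPEAT; i++) bar.await();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        System.out.println((System.currentTimeMillis() - start) + " ms");
    }
}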
However, when a thread does both computation and synchronization,
synchronized method migration will complicate the situation. The effect of
synchronized method migration is to migrate the load from the requesting
node to the requested node, and at the same time to aggregate multiple
synchronization requests into one. So its positive effect is to reduce the
processing and transmission overhead of the requesting node, while its
negative effect is to possibly overload the requested node so as to increase
the waiting time on the requesting node. Which one becomes the dominant
factor decides the effect of synchronized method migration.
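The aggregation effect can be illustrated with a minimal sketch in which a
single-threaded executor plays the role of the home node of the synchronized
object. The names are ours and do not correspond to the GOS's internal
interfaces; the point is that one shipped request replaces the separate
lock, object fault-in, diff propagation, and unlock messages.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of synchronized method migration with the home node
// simulated by a single-threaded executor. All names are hypothetical.
class SmmSketch {
    static final ExecutorService homeNode = Executors.newSingleThreadExecutor();
    static int sharedCounter = 0;   // state that lives at the home node

    // Ship the body of a synchronized method to the home node; the
    // executor serializes requests, so one submission stands in for the
    // whole lock/fault-in/diff/unlock message exchange.
    static int shipSynchronized(Callable<Integer> body) throws Exception {
        return homeNode.submit(body).get();
    }

    public static void main(String[] args) throws Exception {
        int result = shipSynchronized(() -> ++sharedCounter);
        System.out.println("counter = " + result);
        homeNode.shutdown();
    }
}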
Figure 8.18 shows ASP’s performances on different problem sizes with
various configurations of the cluster-based JVM. We scale the size of the
distance matrix. There are 4 working threads in all cases. All-Slave denotes
that all working threads run on the slave nodes, so that the master node
is dedicated to the synchronization operations.

[Figure: execution time in seconds against problem size (256 to 1536) for
the HM, HM+SMM, HM All-Slave, and HM+SMM All-Slave configurations]
Figure 8.18: ASP's execution times on different problem sizes
The performance comparison between HM and HM+SMM reveals the influence
of computation load on the effect of synchronized method migration. When
the problem sizes are small, the synchronization overhead is relatively large.
The positive effect of synchronized method migration is dominant. When
the problem size increases, for each thread, the relative computation load
compared with the synchronization load increases, too. Synchronized method
migration then moves workload from the slave nodes to the master node,
worsening the load imbalance. Here the negative effect of synchronized
method migration plays the major role. When we run ASP with the same
problem size on a large number of processors, the relative computation
load compared with the synchronization load for each thread decreases. So
the positive effect of synchronized method migration becomes dominant and
is amplified when we increase the number of processors, as seen in figure 8.7
(a).
When we run all working threads on the slave nodes, as shown by HM
All-Slave and HM+SMM All-Slave in figure 8.18, the master node is dedi-
cated to the synchronization workload. The negative effect of synchronized
method migration is gone. As we see in the figure, the effect of synchronized
method migration is always positive.
In conclusion, in situations with relatively heavy synchronization over-
head, synchronized method migration can be effective; in situations with
relatively light synchronization overhead, it may not be helpful and may
even worsen the performance by unbalancing the workload. Dedicating a
processor to synchronization is shown to be effective, particularly when
synchronized method migration is enabled.
Chapter 9
Related Work
9.1 Overview
This chapter presents a survey of the works related to the thesis. Firstly, we
discuss some works on high performance parallel Java computing that do not
follow the cluster-based JVM approach. Then we focus in more detail on the
works following the cluster-based JVM approach.
9.2 Augmenting Java for Parallel Computing
As a network-centric language, Java has already provided some capabilities
facilitating distributed computing, such as sockets and RMI [75]. Similar to re-
mote procedure call (RPC), RMI is used to build distributed applications in
which an object is able to invoke the methods of a remote object.
However, it is the consensus of the parallel computing community that
the official Java distribution, i.e., Sun JDK (Java Development Kit), is still
not capable enough to carry out high performance parallel computing. There-
fore, many works aim to augment Java in different ways to promote Java’s
capability for parallel computing.
In this section, we discuss two approaches to augmenting Java for parallel
computing, language augmentation and class augmentation. We will discuss
three representative systems below. Each of them belongs to a different
programming paradigm: JavaParty follows RMI style, HPJava follows data
parallel paradigm, and mpiJava follows message passing paradigm. Although
they have facilitated parallel programming to some extent, they still impose
some considerable programming complexity.
9.2.1 Language Augmentation
Some researchers take a language augmentation approach by introducing
new keywords or syntax extensions into the Java language. Because new syntax
features are incorporated, a customized Java compiler or preprocessor is re-
quired to translate augmented Java source code to standard bytecode. No
modification to the JVM or the Java bootstrap classes is required.
JavaParty
JavaParty [65] is a case of the language augmentation approach. The motiva-
tion behind JavaParty is to overcome the programming complexity of RMI
while still using RMI as the communication method in a cluster environment.
JavaParty introduces only one new class modifier, remote, to the Java
language. The new modifier indicates that the instances of the modified class
represent some form of parallelism and thus may reside at remote
cluster nodes. The invocations of remote instances' methods are through
RMI. JavaParty provides a preprocessor that translates JavaParty’s source
code to RMI implementations in pure Java, which are further compiled to
Java bytecode by a standard Java compiler.
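For illustration, a minimal JavaParty class might look as follows. The
remote modifier is JavaParty's language extension, so this code is
translated by JavaParty's preprocessor rather than compiled directly by
javac; the class and method names here are our own.

// A JavaParty remote class: its instances may reside on remote cluster
// nodes, and method calls on them go through generated RMI code.
public remote class Worker {
    public int compute(int x) {
        return x * x;
    }
}

// Usage looks like ordinary Java; placement is handled by the runtime:
//   Worker w = new Worker();   // may be created on a remote node
//   int r = w.compute(7);      // transparently a remote invocation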
JavaParty is location transparent, i.e., the programmers do not need to
map instances of remote classes to some specific nodes. JavaParty’s runtime
system will handle the distribution of remote instances. The runtime system
may also schedule object migration to enhance locality, in which the object
is moved to where the invocations take place.
Although JavaParty reduces RMI’s programming complexity and pro-
vides some desirable features, it has some shortcomings. Firstly, JavaParty
may cause memory inconsistency. In method invocations, RMI passes non-
remote objects by copy. If the receiving method further changes the repli-
cated objects, inconsistency arises. The inconsistency can only be resolved
by the programmer, by either declaring the passed objects to be remote or
ensuring that the replicated objects will not be modified. Secondly,
because all the remote interactions are based on RMI, which is heavyweight,
JavaParty is not suitable for fine-grained sharing situations in parallel com-
puting.
HPJava
Like HPF [3], HPJava [4] is a data parallel programming language exploiting
parallelism at the data level. HPJava introduces new constructs into Java for
describing how data is distributed among processors and how each process
runs program segments over different data sets simultaneously. For example,
the overall construct defines a parallel, distributed loop, in which each
process will work on a defined separate data range.
In the data parallel paradigm, some advanced compilation techniques
can be applied to direct the mapping from data to processors and to generate
efficient communication code accessing non-local data [69]. However, the data
parallel paradigm cannot effectively address irregular problems [48].
9.2.2 Class Augmentation
Observing that the requirements of parallel computing cannot be effec-
tively addressed by the bootstrap Java classes that are distributed along with
Sun JDK, some researchers take a class augmentation approach by creating
Java classes specially designed for parallel computing. The Java program-
mers leverage them by creating their instances and invoking their methods.
No modification to the Java compiler or the JVM is required.
mpiJava
Message Passing Interface (MPI) [16] is a widely accepted message pass-
ing standard used in parallel computing. Compared with the socket interface,
MPI defines higher level abstraction and routines for parallel computing.
For example, MPI defines various blocking and non-blocking point-to-point
communication routines, as well as collective communication routines, such
as broadcast and reduction. The collective communication takes place in a
communicator context, which is a collection of communicating processes.
mpiJava [27] is a collection of Java APIs following the MPI standard.
mpiJava does not reinvent the MPI implementation. Instead, through the Java
Native Interface (JNI), mpiJava provides a set of Java wrappers to native
MPI packages, such as MPICH [11], which is one of the most popular open
source MPI implementations. Thus, mpiJava should be portable to any plat-
form that provides compatible Java runtime and native MPI environments.
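For illustration, a minimal mpiJava program might look as follows. It
assumes an installed mpiJava environment providing the mpi package, and
simply reports each process's rank.

import mpi.MPI;
import mpi.MPIException;

// Minimal mpiJava program: each process prints its rank in COMM_WORLD.
class HelloMPI {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();
        System.out.println("Hello from process " + rank + " of " + size);
        MPI.Finalize();
    }
}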
9.3 Cluster-based JVM
In the above approaches to using Java for high performance parallel com-
puting, the Java programmers are supposed to learn and use either new
Java language extensions in the cases of JavaParty and HPJava, or new Java
classes in the case of mpiJava. This is not transparent for Java programmers
and creates a non-trivial entry barrier to the field of parallel programming.
In fact, Java already provides a built-in mechanism that can be
used for parallel computing: Java has inherent multi-threading support.
The state-of-the-art JVMs have already been able to leverage multiple
available processors in symmetric multiprocessing (SMP) computers to sched-
ule Java threads, where Java threads can be distributed to different proces-
sors to achieve speedup. However, the achievable speedup is limited by
the number of processors in SMPs.
The cluster-based JVM approach aims to run unmodified multi-
threaded Java applications on computer clusters, which are considered to be
more scalable and affordable than SMPs. With this approach, the threads
within a Java application can be automatically distributed onto different
cluster nodes to achieve parallelism or leverage cluster-wide resources such
as memory and network bandwidth. Programming a parallel computer,
e.g., a computer cluster, thus becomes almost equivalent to program-
ming a sequential computer, except that parallel programming raises new
performance issues, such as minimizing communication and synchronization
overheads.
Since Java threads within the same application share the object heap,
the cluster-based JVM calls for a global object space, which is a distributed-
shared memory service to provide transparent object accesses and synchro-
nizations for distributed threads.
In this section, we discuss several research works following the cluster-
based JVM approach. The cluster-based JVM is also called distributed JVM.
9.3.1 Jackal
Jackal [78] directly compiles multi-threaded Java programs’ source code or
bytecode into distributed native machine code, which can directly run on a
cluster. Therefore, the JVM, the standard Java runtime system, is not required
in this approach. This means the static compilation approach needs to provide
an alternative runtime system to meet the runtime requirements of Java
programs, such as garbage collection, dynamic class loading and compilation, as well
as exception handling.
Most of its effort to improve performance is done at compile time. Jackal
incorporates some compiler optimizations to remove object access checks,
which are used to guarantee that the objects are in the right access states
according to the cache coherence protocol. In addition, Jackal’s compiler per-
forms two optimizations to reduce the communication caused by distributed
object accesses and synchronizations: object-graph aggregation and automatic
computation migration.
Object-graph aggregation uses a heap approximation algorithm [49] to
identify connected objects. If an object’s field contains a reference
to another object, connectivity exists between the two objects. At
runtime, when the root object of the object graph is faulted in, the whole
object graph could be prefetched together to improve reference locality.
Automatic computation migration generates remote procedure call
(RPC)-like code to move a part of the computation and its state to a remote
cluster node at runtime. This may be more efficient than executing the
computation at the local node, which may involve a lot of communication of
objects and synchronizations. For example, migrating a synchronized block
or method to the node where the synchronized object resides can aggregate
multiple lock/unlock/object request and reply messages into one message
round trip.
Jackal uses a fine-grain DSM to build the GOS. The coherence unit is a
fixed-size region of 256 bytes. Like page-based DSM systems, all the replicas
of an object reside at the same virtual address on different nodes.
Since the virtual address space allocatable to applications is limited in 32-
bit operating systems and all nodes share the same virtual memory address
range, Jackal may not leverage all the available physical memory across the
cluster. In fact, this drawback is inherited from page-based DSMs running
on 32-bit operating systems. Unlike page-based DSM systems, the object
access states are enforced through software checks.
Jackal’s runtime system enables an optimization called lazy flushing. The
home of a shared coherence unit is fixed. However, if a unit is not shared by
any other node and some node requests a copy for write access, the requesting
node becomes the exclusive owner. Later reads and writes are then performed
locally just as if they were at the home. If other nodes want to share the
unit, the current exclusive owner needs to be notified. The drawback of lazy
flushing is that it ignores the application's inherent access patterns. Frequent
transitions to and from exclusive ownership cause a lot of communication
overhead, so the number of transitions is capped at five in Jackal.
9.3.2 Hyperion
Hyperion [62] compiles Java bytecode to C source code, which is then compiled
to native code. To build the GOS, Hyperion introduces a redirection table,
which is replicated in each node. Each object has an entry in the table,
storing the actual memory address of the object. Its index represents the
global identification for the object. Each node manages a portion of the
table. It can create new objects in its own portion without synchronizing
with other nodes.
The introduction of a redirection table not only impairs the performance
by incurring a redirection overhead on object accessing, but also complicates
the memory management. The size of the table bounds the maximum number
of objects in one application. Even if the object heap is not yet fully
occupied, the full occupation of the table entries on some node may trigger a
distributed garbage collection. Without knowing the actual number of objects
in the application, it is difficult to choose a proper redirection table size
and to assign the portions of the table to the nodes.
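The following sketch illustrates object addressing through a redirection
table. The layout and names are hypothetical simplifications: Hyperion's
actual table stores native memory addresses, while the sketch stores Java
references.

// Sketch of a Hyperion-style redirection table. Each node manages its
// own portion [base, limit) and allocates there without global sync.
class RedirectionTable {
    private final Object[] entries;   // global index -> local reference
    private final int limit;          // end of this node's portion
    private int next;                 // next free slot in the portion

    RedirectionTable(int tableSize, int base, int limit) {
        this.entries = new Object[tableSize];
        this.limit = limit;
        this.next = base;
    }

    // Create a new object locally; the returned index is its global ID.
    int allocate(Object obj) {
        if (next >= limit)   // portion exhausted: may trigger distributed GC
            throw new IllegalStateException("table portion full");
        entries[next] = obj;
        return next++;
    }

    // Every object access pays one extra indirection through the table.
    Object dereference(int globalId) {
        return entries[globalId];
    }
}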
9.3.3 JavaSplit
JavaSplit [42] rewrites and instruments the bytecode of a multi-threaded Java
application. The result is a distributed application in pure Java that is linked
with some runtime classes and can run on a cluster of standard JVMs. Each
of them runs on a cluster node.
The execution effect of the Java application after the bytecode instru-
mentation is equivalent to that of the original multi-threaded Java applica-
tion. However, each thread of the original multi-threaded Java application
is transformed to a standalone sub-application that can run on a standard
JVM. With the help of some runtime classes, all the sub-applications coop-
erate with each other to accomplish the job of the original multi-threaded
application.
JavaSplit uses the Byte Code Engineering Library (BCEL) [13] to instru-
ment and rewrite the bytecode. JavaSplit provides some runtime classes to
build the GOS and to facilitate remote thread creation.
In the bytecode of a multi-threaded Java program, JavaSplit will inter-
cept and instrument object field and array element accesses to maintain the
memory consistency among replicated objects by adding software checks and
calls to fault-in routines. In addition, JavaSplit will intercept and instru-
ment thread creations and synchronization operations to implement remote
thread creation and distributed synchronization operations according to Java
memory model semantics.
JavaSplit argues that the bytecode instrumentation approach maintains
Java’s cross-platform portability by making use of the standard JVMs as the
execution platforms. JavaSplit can also rely on the advancement of standard
JVMs for performance improvement.
On the other hand, JavaSplit’s approach has some drawbacks. Firstly,
JavaSplit needs to instrument all the Java bootstrap classes because they
may be dynamically loaded by JavaSplit's applications. However, this process
cannot be fully automated because some bootstrap classes contain native
methods. Currently, JavaSplit versions of those bootstrap classes with native
methods must be created manually.
Secondly, because the instrumentation is done at the bytecode level, the
resulting bytecode size could be considerably enlarged, and the performance
class javasplit.A extends javasplit.somepackage.C {
    // fields
    private int myIntField;
    public char myCharField;

    // Send this object
    public void DSM_serialize(DSM_ByteOutputStream out) {
        super.DSM_serialize(out);
        out.writeInt(myIntField);
        out.writeChar(myCharField);
    }

    // Receive this object
    public void DSM_deserialize(DSM_ByteInputStream in) {
        super.DSM_deserialize(in);
        myIntField = in.readInt();
        myCharField = in.readChar();
    }
}
Figure 9.1: JavaSplit’s code sample to send and receive objects
could be impaired.
For example, figure 9.1 illustrates how JavaSplit sends and receives an
object. To enhance the readability, the instrumented bytecode is described
in Java source code. The method DSM_serialize is used to send this object,
and DSM_deserialize to receive it. To send or receive each field, a
method call is made, which will further call other methods to finally deliver
the value to the output stream or retrieve a value from the input stream.
The total overhead is much more expensive than a simple memory copy.
Consequently, the data communication in JavaSplit involves large overhead.
JavaSplit's performance evaluation [42] confirms our observation.
Note that this performance issue is inherent in the bytecode instrumen-
tation approach. Because Java is a strongly typed language, bytecode cannot
directly operate on memory. Instead, it operates on data items with type
semantics. Thus bytecode operations cannot efficiently perform the memory
copies that take place in the communication.
9.3.4 cJVM
cJVM [24] follows the cluster-based JVM approach. It uses a master-proxy
object model and a method shipping approach to implement the GOS. A
proxy object is created locally on accessing a remote object, and the
remote object becomes the master object. Method invocations on the proxy
object as well as field accesses to the proxy object are shipped to the node
where the master object resides. No consistency issue is involved in this
approach. Moreover, the object is usually not replicated to improve access
locality. Since the method shipping approach forwards the execution flow
to the node where the master object resides, the workload distribution is
determined by the distribution of master objects in cJVM. Load balancing
may be difficult to achieve without an effective strategy enforced by either the
programmer or some runtime mechanism. Method shipping also causes the
thread stack to be scattered across multiple nodes, which complicates exception
handling. cJVM is evaluated on up to 4 nodes only, which cannot fully reflect
the scalability of cJVM's master-proxy object model and method shipping
approach.
Several optimization techniques are applied to reduce the amount of ac-
cess and method shipping [25]. They are a combination of many simple opti-
mizations, such as caching read-only fields or objects, locally executing state-
less methods that do not write on the heap, and single chance object migra-
tion. Some optimizations are possible through exploiting Java’s semantics.
For example, objects of type java.lang.String and java.lang.Integer
are read-only.
9.3.5 JESSICA
The JESSICA system [61], which stands for Java-Enabled Single System Im-
age Computing Architecture, uses some page-based DSM systems, such as
JUMP [35] and TreadMarks [57], to build the GOS. All objects are allocated
in the distributed shared memory. Each node manages a segment of shared
memory and creates new objects in its own segment independently.
Although this approach greatly alleviates the burden of constructing the
GOS because all the cache coherence issues, such as object addressing, fault-
ing, replication, and update propagation, can be managed by the page-based
DSM, it suffers from certain problems. Firstly, the sharing granularity of
Java and that of the page-based DSM are incompatible. Java organizes data
into variable-sized objects, while the page-based DSM enforces coherence at
the granularity of virtual memory page. If two threads are updating differ-
ent objects coincidentally residing on the same memory page, it will cause
communication. This phenomenon is called the false sharing problem. In
addition, if one object happens to reside across the memory page boundary,
faulting in it will incur two memory page fault-ins, which is quite heavy-
weight.
Secondly, as a low-level supportive layer, the page-based DSM is not
aware of the abundant runtime information in the JVM, e.g., the object type
information, which makes it difficult to look for opportunities to improve
the performance of the GOS. Particularly, the accesses to different objects
residing at the same memory page are mingled at the page level. Therefore,
it is difficult to detect access patterns in applications that exhibit fine-grain
sharing.
The detailed analysis of various factors contributing to the efficiency of
using a page-based DSM to build the GOS can be found in [36].
A unique feature of JESSICA is its support for transparent thread mi-
gration. At runtime, a Java thread can be preemptively migrated from one
node to another node, e.g., from an overloaded node to an underloaded node.
Thread migration could be useful to achieve load balance or fault tolerance
in cluster-based JVMs.
JESSICA employs a master-slave thread model. In the very beginning,
all threads reside at the master node. Then some threads will be migrated to
the slave nodes according to the migration policy. For each migrated thread,
a thread still remains at the master node to handle thread synchronization
and I/O redirection. The migrated thread is called the slave thread, and
the corresponding thread staying at the master node is called the master
thread. If the slave thread needs to do some synchronization, it will inform
the corresponding master thread to perform the actual synchronization on
its behalf. Similarly, all the I/O operations issued by the slave thread will be
redirected to the corresponding master thread. Although simple, this master-
slave thread model makes the master node the performance bottleneck.
9.3.6 Java/DSM
Java/DSM [82] has also built its GOS on top of a page-based DSM, i.e.,
TreadMarks [57]. Since Java/DSM is intended to work on a heterogeneous
cluster, data conversion between heterogeneous hardware architectures is re-
quired. However, using TreadMarks as the infrastructure for building the GOS
contradicts this goal, because TreadMarks does not perform data conversion
across machine boundaries. Although the data conversion function could be
added to TreadMarks, it would incur considerable overhead. To the best of
our knowledge, Java/DSM's attempt to provide a heterogeneous DSM has
not been fully achieved, and no performance results have been reported.
Chapter 10
Conclusion
10.1 Discussions
10.1.1 Effectiveness of the Adaptations
We have presented three adaptations, namely, adaptive object home migra-
tion, synchronized method migration, and connectivity-based object pushing.
The adaptive object home migration is definitely useful for a home-based
cache coherence protocol. Without home migration, the fixed home node for
an object could become a performance bottleneck. Two factors contribute to
the effectiveness of our adaptive object home migration protocol. Firstly, the
GOS is object-based and we are able to separate the accesses to different objects
so that the access behavior of a certain object can be precisely observed.
Secondly, we rely on the runtime feedback to continuously adapt to objects’
access behavior. As a result, the experiments show that our adaptive home
migration protocol demonstrates both the sensitivity to the lasting single-
writer pattern and the robustness against the transient single-writer pattern.
In the latter case, the protocol inhibits home migration in order to reduce
the home redirection overhead.
Synchronized method migration can have both positive and negative ef-
fects on the performance. On one hand, it can significantly reduce the
number of messages and the protocol overheads. On the
other hand, it moves the workload to the home node of the synchronized
object, which could cause load imbalance if the computation load becomes
more dominant than the synchronization load. Currently, we have no way
to determine whether a synchronized method migration is profitable or not.
However, we show an arrangement where a particular node is dedicated to
the global synchronization. This arrangement consistently improves perfor-
mance, and synchronized method migration has only a positive effect under
this arrangement. In addition, synchronized methods are not the only meth-
ods worth migrating. By migrating a method to the
home node of the receiver object, it is possible to aggregate multiple object
fault-ins inside the method into one message round trip. Nevertheless, the
GOS is subject to load imbalance in doing this. A smart switching between
object shipping and method shipping needs a thorough investigation.
Connectivity-based object pushing is essentially a prefetching technique
to improve the reference locality. It is impossible in previous cluster-based
JVMs that use a page-based DSM system to build the GOS. It is enabled by
our GOS design that intentionally exploits Java's runtime information. For
applications where reference locality prevails, the effect of object push-
ing is obvious. In particular, object pushing optimizes the producer-consumer
pattern based on connectivity information.
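The essence of connectivity-based pushing can be sketched as a bounded
traversal of the object graph rooted at the requested object. The interface
and depth bound below are our own simplifications, not the GOS's actual
mechanism.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of connectivity-based object pushing: when a root object is
// requested, objects reachable from it (up to a depth bound) are
// collected and pushed in the same reply.
class ObjectPusher {
    interface Node {
        List<Node> references();   // objects this object points to
    }

    static List<Node> collectPushSet(Node root, int maxDepth) {
        List<Node> pushSet = new ArrayList<>();
        Set<Node> seen = new HashSet<>();
        Deque<Node> frontier = new ArrayDeque<>();
        frontier.add(root);
        seen.add(root);
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Deque<Node> next = new ArrayDeque<>();
            for (Node n : frontier) {
                pushSet.add(n);                    // will be pushed
                for (Node ref : n.references())
                    if (seen.add(ref)) next.add(ref);
            }
            frontier = next;
        }
        return pushSet;
    }
}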
All these adaptations work on different dimensions of
the access pattern space. Theoretically, they are orthogonal and thus do not
interact with each other. As shown in our experiments, all these adaptations
share the property that they have little negative impact when they do not
take effect.
10.1.2 Which Existing JVM to Base On
All current works on building a cluster-based JVM are based on modifying an
existing JVM. For example, cJVM is based on Sun JDK 1.2, JESSICA is based
on Kaffe 0.9.1, and our cluster-based JVM is based on Kaffe 1.0.6. cJVM
and JESSICA run in the interpretation mode, while ours runs in
the JIT mode.
We chose Kaffe because it was one of the popular open source JVMs
at that time and we did not have access to Sun JDK’s source code. Kaffe
was designed to work in an embedded environment. Its JIT engine and
garbage collector do not provide the level of performance needed for parallel
computing.
Since our cluster-based JVM is based on Kaffe, its performance suffers
from the unsatisfactory performance of Kaffe's execution engine and GC subsys-
tem. However, the contributions of our research are to demonstrate that a
considerable speedup can be achieved by building a cluster-based JVM and
the design of the GOS is the key to the performance of cluster-based JVMs.
Thus the GOS techniques proved effective in this research can be applied
to any existing high performance JVM in order to build a high performance
cluster-based JVM.
Recently, a JVM called Jikes RVM (research virtual machine) [7] has
drawn many researchers' attention. RVM is open source, high performance,
specially designed for research, and written in Java. RVM could be a good
candidate for the foundation of a high performance cluster-based JVM.
10.1.3 Thread Migration vs. Initial Placement
Some cluster-based JVMs, such as JESSICA, support thread migration. They
can preemptively move a thread from one cluster node to another during the
execution. Thread migration is supposed to provide the functions of load
balance and fault tolerance.
Our cluster-based JVM does not support thread migration. Instead, we
follow an initial placement approach to distribute user threads to different
nodes, i.e., a thread can be remotely created on another node.
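As an illustration of how lightweight initial placement can be, a
round-robin placement policy fits in a few lines. The sketch below is
hypothetical and not our JVM's actual placement code.

// Sketch of initial thread placement: threads are assigned to nodes
// round-robin at creation time and never migrate afterwards.
class InitialPlacement {
    private final int numNodes;
    private int nextNode = 0;

    InitialPlacement(int numNodes) {
        this.numNodes = numNodes;
    }

    // Choose the node on which a newly created thread will run.
    synchronized int placeThread() {
        int node = nextNode;
        nextNode = (nextNode + 1) % numNodes;
        return node;
    }
}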
Compared with thread migration, initial placement is lightweight, easier
to implement, and able to balance the load well for regular, structured
problems. Initial placement lacks the dynamic load balancing ability that
thread migration has. However, thread migration has its own problems.
Firstly, after a thread is migrated, it may need to access the objects at the
source node of the migration. In this way thread migration generates new
remote accesses. It is possible to carry along all the objects reachable from the
migrated thread during migration. But this could cause a large migration
overhead, and the threads staying at the source node may still need to access
those objects. Secondly, thread migration is an intra-application load balance
mechanism, in which a thread is the minimal workload to be moved. The
workload of a thread is difficult to predict. Sometimes a thread carries too
much workload, so that the destination node of the migration becomes the
new performance bottleneck. One possible solution is to create many more
threads than the number of nodes so that each thread carries a relatively
small workload that is suitable to balance the workload differences between
two nodes. But in this way, the thread creation and synchronization overhead
could be significant. The benefits of thread migration in cluster-based JVMs
need to be carefully justified.
10.2 Future Work
10.2.1 Compiler Analysis to Reduce Software Checks
In JIT mode, software checking of object access states will likely be a sig-
nificant overhead, as shown in figure 8.3. The check overhead is particularly
serious when there are array accesses inside loops. Compiler optimization
techniques can be applied during JIT compilation to hoist the array object
check outside the loop if there is no synchronization in the loop. That is, a
check is made before entering the loop, and no more checks are made inside
the loop. For normal object accesses, it is possible to aggregate the checks
before the accesses to different fields of the same object to further improve
the performance. These techniques have already been demonstrated in the
software fine-grain DSM system Shasta [71]. The reduction of software checks
is important for a high performance cluster-based JVM.
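Schematically, the transformation looks as follows; checkWriteAccess is a
hypothetical stand-in for the GOS's software access check.

// Sketch of hoisting a software access check out of a loop.
class CheckHoisting {
    static void checkWriteAccess(int[] obj) {
        // In the real GOS this would fault in the object if necessary
        // and mark it writable.
    }

    // Before: the check runs on every iteration.
    static void unoptimized(int[] a) {
        for (int j = 0; j < a.length; j++) {
            checkWriteAccess(a);
            a[j] = j * j;
        }
    }

    // After: one check before the loop suffices when the loop body
    // contains no synchronization point that could change access states.
    static void hoisted(int[] a) {
        checkWriteAccess(a);
        for (int j = 0; j < a.length; j++) {
            a[j] = j * j;
        }
    }
}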
10.2.2 Automatic Performance Bottleneck Detection
The cluster-based JVM supports a “sequential development parallel execu-
tion” method for parallel programming. That is, a multi-threaded Java pro-
gram can be coded and debugged on any existing JVM, and submitted to
a cluster-based JVM for high performance execution after it is proved func-
tionally correct on one computer node. The cluster-based JVM promises a
correct execution result and takes the responsibility to optimize the parallel
performance through the GOS.
However, the performance on the cluster-based JVM may not satisfy the
programmers' expectations. Programmers may then want to know what
actually happens on the cluster-based JVM, so as to revise the algorithm to
avoid some performance bottlenecks in the program.
Therefore, the cluster-based JVM not only needs to transparently im-
prove the parallel execution performance, e.g., through the adaptive cache
coherence protocol researched in this thesis, but also needs to provide the
programmers with automatic performance reports, e.g., listing serious
performance bottlenecks.
The PAT can be a good starting point for research on automatic per-
formance bottleneck detection in the cluster-based JVM. PAT is lightweight,
and provides some preliminary functions for access pattern analysis. Access
pattern analysis can be a useful approach towards detecting performance
bottlenecks in GOS. Future research should put emphasis on both the quan-
titative analysis of the access behavior and an intuitive visualization of object
access patterns.
10.2.3 High Performance Communication Substrate
Our research shows that the performance of GOS is very important for a
high performance cluster-based JVM. Advances in JIT techniques do not
help reduce the communication and synchronization overheads required by
the parallel execution of multi-threaded Java programs.
Our research focuses on designing an intelligent cache coherence protocol
to improve the performance of the GOS. Another way to improve the perfor-
mance of the GOS is to use a high performance communication substrate,
e.g., lightweight protocols such as Directed Point [39] instead of the TCP/IP
we are using, or high performance network technologies such as InfiniBand [5]
instead of the Fast Ethernet we are using.
mance communication substrate is very important for DSM systems where
communications are triggered by memory accesses that are fine-grained and
frequent.
It is interesting to investigate how to efficiently map the interface exposed
by the lightweight protocol to the GOS operations. It is also interesting to
investigate the usage of new features in advanced network technologies, such
as remote DMA, in the GOS implementation.
Appendix A
Appendix
A.1 Overheads of GOS Primitive Operations
We have measured the overheads of three primitive operations in our GOS:
(1) the remote lock of a DSO, (2) the remote unlock of a DSO, and (3) the
fault-in of a DSO. They help us evaluate the performance of the GOS at a
microscopic level.
Figure A.1 shows the source code segment of the multi-threaded Java
program used to measure the overheads of those primitive GOS operations.
Each Worker thread repeatedly acquires the lock of the object synObj,
updates its only integer field, and then releases the lock, n times in total.
synObj is a small object with only one integer field. The remote lock and
unlock happen at the entry and exit of the synchronized block. Since the
locally cached DSOs are flushed at lock time, the update of synObj
faults in its up-to-date content from its home node. All the overheads
are measured inside our cluster-based JVM by instrumenting the internal
functions performing those tasks.
Table A.1 shows the overheads of primitive operations with respect to
different number of threads. In the experiments, all the Worker threads run
on the slave nodes of the cluster-based JVM, while the object synObj’s home
class Worker extends Thread {
    int n;            // number of operations
    SynObj synObj;

    Worker(int n, SynObj synObj) {
        this.n = n;
        this.synObj = synObj;
    }

    public void run() {
        for (int i = 0; i < n; i++) {
            synchronized (synObj) {
                ++synObj.elapsed;
            }
        }
    }
}
Figure A.1: The source code to measure GOS primitive operations
Number of        Overhead   Overhead    Overhead of
Worker threads   of lock    of unlock   object fault-in
      1           178.77     162.16      166.46
      2           365.34     170.20      176.71
      4           759.39     171.29      191.07
      8          1544.75     175.79      190.13
     16          3119.53     185.49      190.44

Table A.1: Overheads (in microseconds) of primitive operations with respect to different numbers of threads
is at the master node. Thus all the locks, unlocks, and object fault-ins go to
the master node. When the number of threads is larger than 1, we sum up the
overheads of all threads and take an average for each operation respectively.
When the number of threads is 1, there is no synchronization contention
contributing to the operation overheads. Let’s take the unlock operation as
an example. The unlock request message contains 16 bytes, which include a
4-byte requesting node ID, a 4-byte message type ID, a 4-byte payload length,
and the GUID of the object to be unlocked. The successful reply message
also contains 16 bytes. The round-trip time to send and receive a 16-byte
message through a TCP socket, as measured by Netperf, is 122.49 microseconds.
The difference between the overhead of the unlock operation and the Netperf
time is due to the GOS's non-blocking I/O support.
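For illustration, the 16-byte unlock request described above could be
encoded as follows. The message type constant is hypothetical, and a 4-byte
GUID is assumed, as the byte counts in the text imply.

import java.nio.ByteBuffer;

// Sketch of the 16-byte unlock request layout; field order follows the
// text, and the type constant is hypothetical.
class UnlockMessage {
    static final int MSG_UNLOCK = 3;   // hypothetical message type ID

    static ByteBuffer encode(int requestingNodeId, int payloadLength,
                             int objectGuid) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putInt(requestingNodeId);  // 4-byte requesting node ID
        buf.putInt(MSG_UNLOCK);        // 4-byte message type ID
        buf.putInt(payloadLength);     // 4-byte payload length
        buf.putInt(objectGuid);        // GUID of the object to unlock
        buf.flip();                    // ready for sending
        return buf;
    }
}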
The object fault-in incurs fewer thread switches than lock/unlock, since
the GOS daemon thread can reply to it directly. However, the object fault-in
incurs the DSO detection overhead. The overheads of both object fault-in
and unlock increase with the number of Worker threads, due to the
longer waiting time for each request when the number of requests increases.
The lock overhead includes synchronization contention when the num-
ber of threads is larger than 1. Not surprisingly, the average lock overhead
is roughly proportional to the number of threads.
A.2 ASP Code Segment
The following code shows the run method of the Worker thread in ASP.
public void run() {
    int i, j, k;
    float cur[][], next[][];
    // n: the number of vertices; wsize: the number of threads;
    // id: thread id.
    int from = n / wsize * id;
    int to = n / wsize * (id + 1);
    for (k = 0; k < n - 1; k++) {
        if (k % 2 == 0) { // d0, d1: intermediate matrixes.
            cur = d0; next = d1;
        } else {
            cur = d1; next = d0;
        }
        for (i = from; i < to; i++) {
            for (j = 0; j < n-1; j++) {
                if (cur[i][j] <= cur[i][k] + cur[k][j])
                    next[i][j] = cur[i][j];
                else
                    next[i][j] = cur[i][k] + cur[k][j];
            }
        }
        // barrier synchronization among all threads.
        bar.barrier();
    }
}
A.3 The Method for Parallel Performance Break-
down
The multi-threading characteristics of the JVM make it difficult for us to
perform a precise breakdown. For example, when we measure an oper-
ation, it is possible that the current thread is switched off and later switched
on. Therefore what we have measured includes not only the overhead of the
operation itself but also the thread scheduling overhead and the CPU time
spent by another thread.
To carefully control the impreciseness caused by the JVM's multi-threading,
we measure the breakdown only at the slave nodes for the following reasons:
Firstly, there is one working thread on each slave node. Since all bench-
mark applications’ workload is fairly balanced, the breakdown of the working
thread can represent that of the application to a great extent.
Secondly, for all the benchmark applications, there are three threads on
each slave node, i.e., the application’s working thread, the GOS daemon
thread gosd, and the garbage collection thread gc. The measurement impre-
ciseness comes from the CPU time taken by gosd. However, this imprecise-
ness can be tolerated because (1) it is negligible, as in TSP, or (2) it probably
overlaps with the idle time in Obj and Syn for ASP, SOR, and NBody. For
NBody, almost all the time taken by gosd overlaps with the idle time in
Syn, and the measurement impreciseness is negligible. We admit that the
measurement impreciseness for ASP and SOR does exist so that the actual
computation time should be a little smaller than the Comp time since Comp
contains some time taken by gosd.
Thirdly, we abandon the breakdown data on the master node because they
are distorted by the complicated multi-threading situation on the master
node. On the master node, gosd takes more workload than its counterparts
on the slave nodes. Moreover, gosd will schedule the monitor proxy threads
to do the synchronization on behalf of the remote threads.
We average the breakdown data on all slave nodes if the number of slave
nodes is larger than 1.
A.4 JIT Compilation vs. Interpretation
Our GOS can be integrated with both the interpretation mode and the JIT
compilation mode of the cluster-based JVM. By comparing the performances
of our cluster-based JVM under these two execution modes, we can reveal
how the JIT compilation technology improves the performance of cluster-
based JVMs.

[Figure: four panels, (a) ASP, (b) SOR, (c) NBody, (d) TSP, each plotting
execution time in seconds against the number of processors (up to 16) under
the interpretation and JIT compilation modes]
Figure A.2: JIT Compilation vs. Interpretation
Figure A.2 compares the performances under the JIT compilation mode
and the interpretation mode for the four applications. The per-
formances are measured against the number of processors. All the cache
coherence protocol optimizations are enabled.
For all the applications, we observe that the performance improvement
due to the JIT compilation decreases with the number of processors. The
JIT compilation technique aims to improve the computation performance.
When the application is running on a larger number of processors, the com-
putation load on each processor decreases, while the communication and
synchronization loads increase. Since the JIT compilation technique reduces
the computation-to-communication ratio, the applications’ speedups signifi-
cantly drop on a larger number of processors.
JIT compilation has proven to be a key technique for improving the
performance of JVMs. In order to maintain a GOS, cluster-based JVMs
incur communication and synchronization overheads that are absent in non-
distributed JVMs. Based on these observations, we consider the JIT compi-
lation and GOS techniques as two orthogonal key techniques to improve the
performance of cluster-based JVMs.
Bibliography
[1] Distributed Shared Memory Homepage.
http://www.ics.uci.edu/~javid/dsm.html.
[2] Ganglia distributed monitoring and execution system.
http://ganglia.sourceforge.net/.
[3] High Performance Fortran (HPF) Forum.
http://www.crpc.rice.edu/HPFF/.
[4] HPJava Home Page. http://www.npac.syr.edu/projects/pcrc/HPJava/.
[5] InfiniBand Trade Association. http://www.infinibandta.org/home.
[6] Java Grande Forum. http://www.javagrande.org/.
[7] Jikes RVM. http://www-124.ibm.com/developerworks/oss/jikesrvm/.
[8] JSR-000133 Java Memory Model and Thread Specification Revision.
http://www.jcp.org/aboutJava/communityprocess/review/jsr133/.
[9] Kaffe Java Virtual Machine. http://www.kaffe.org.
[10] Maui Scheduler. http://www.supercluster.org/maui/.
[11] MPICH-A Portable Implementation of MPI. http://www-
unix.mcs.anl.gov/mpi/mpich/.
[12] Rocks Cluster Distribution. http://rocks.npaci.edu/Rocks/.
[13] The Byte Code Engineering Library. http://jakarta.apache.org/bcel/.
[14] The HKU Gideon 300 Cluster. http://www.csis.hku.hk/~clwang/gideon300-
main.html.
[15] The Java Memory Model. http://www.cs.umd.edu/users/pugh/java/memoryModel/.
[16] The Message Passing Interface (MPI) standard. http://www-
unix.mcs.anl.gov/mpi/.
[17] TOP500 Supercomputer Sites. http://www.top500.org/.
[18] Torque Resource Manager. http://www.supercluster.org/projects/torque/.
[19] Joint ACM Java Grande - ISCOPE 2001 Conference, Stanford Univer-
sity, California, USA, June 2001.
[20] Joint ACM Java Grande - ISCOPE 2002 Conference, Seattle, Washing-
ton, USA, November 2002.
[21] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models:
A Tutorial. IEEE Computer, 29(12):66–76, 1996.
[22] S. V. Adve and M. D. Hill. A Unified Formalization of Four Shared-
Memory Models. IEEE Trans. on Parallel and Distributed Systems,
4(6):613–624, 1993.
[23] C. Amza, A.L. Cox, S. Dwarkadas, L.-J. Jin, K. Rajamani, and
W. Zwaenepoel. Adaptive Protocols for Software Distributed Shared
Memory. In Proceedings of IEEE, Special Issue on Distributed Shared
Memory, volume 87, pages 467–475, March 1999.
[24] Y. Aridor, M. Factor, and A. Teperman. cJVM: a Single System Image
of a JVM on a Cluster. In Proc. of International Conference on Parallel
Processing, 1999.
[25] Yariv Aridor, Michael Factor, Avi Teperman, Tamar Eilam, and Assaf
Schuster. Transparently Obtaining Scalability for Java Applications on
a Cluster. Journal of Parallel and Distributed Computing, 60, Oct. 2000.
[26] Mark Baker. Cluster Computing White Paper. Technical report, IEEE
Task Force on Cluster Computing, December 2000.
[27] Mark Baker, Bryan Carpenter, Geoffrey Fox, Sung Hoon Ko, and Sang
Lim. mpiJava: An Object-Oriented Java Interface to MPI. In In-
ternational Workshop on Java for Parallel and Distributed Computing,
IPPS/SPDP 1999, April 1999.
[28] H. E. Bal, R. Bhoedjang, R. Hofman, C. Jacobs, K. Langendoen,
T. Ruhl, and M. F. Kaashoek. Performance Evaluation of the Orca
Shared Object System. ACM Transactions on Computer Systems, 16(1),
February 1998.
[29] J. Barnes and P. Hut. A Hierarchical O (N log N) Force-Calculation
Algorithm. Nature, 324(4):446–449, 1986.
[30] G. Bell and J. Gray. High Performance Computing: Crays, Clusters and
Centers. What Next?, 2001.
[31] Gilad Bracha, James Gosling, Bill Joy, and Guy Steele. The Java Lan-
guage Specification, Second Edition. Addison Wesley, 2000.
[32] Rajkumar Buyya, editor. High Performance Cluster Computing: Archi-
tecture and System, volume 1. Prentics Hall PTR, 1999.
[33] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Techniques for
Reducing Consistency-Related Communication in Distributed Shared-
Memory Systems. ACM Transactions on Computer Systems, 13(3):205–
243, 1995.
[34] Anthony Chan, William Gropp, and Ewing Lusk. User’s Guide for MPE:
Extensions for MPI Programs.
[35] B. Cheung, C.L. Wang, and Kai Hwang. A Migrating-Home Protocol for
Implementing Scope Consistency Model on a Cluster of Workstations. In
International Conference on Parallel and Distributed Processing Tech-
niques and Applications, pages 821–827, 1999.
[36] W.L. Cheung, C.L. Wang, and F.C.M. Lau. Annual Review of Scal-
able Computing, volume 4, chapter Building a Global Object Space for
Supporting Single System Image on a Cluster. World Scientific, 2002.
[37] Trishul M. Chilimbi, Thomas Ball, Stephen G. Eick, and James R.
Larus. StormWatch: A Tool for Visualizing Memory System Protocols.
In Supercomputing ’95, December 1995.
[38] Jong-Deok Choi, Manish Gupta, Mauricio J. Serrano, Vugranam C.
Sreedhar, and Samuel P. Midkiff. Escape Analysis for Java. In Pro-
ceedings of the Conference on Object-Oriented Programming Systems,
Languages, and Applications (OOPSLA), pages 1–19, 1999.
[39] C.L. Wang and A. Tam and B. Cheung and W. Zhu and D. Lee. Directed
Point: High Performance Communication Subsystem for Gigabit Net-
working in Clusters. Journal of Future Generation Computer Systems,
pages 401–420, 2002.
[40] Information Networks Division. Netperf: A Network Performance
Benchmark. Hewlett-Packard Company, revision 2.1 edition, February
1996.
[41] Jack Dongarra, Thomas Sterling, Horst Simon, and Erich Strohmaier.
High Performance Computing: Clusters, Constellations, MPPs, and Fu-
ture Directions. submitted to CACM for publication, June 2003.
[42] Michael Factor, Assaf Schuster, and Konstantin Shagin. JavaSplit: A
Runtime for Execution of Monolithic Java Programs on Heterogeneous
Collections of Commodity Workstations. In IEEE International Con-
ference on Cluster Computing, page 110, December 2003.
[43] Weijian Fang, Cho-Li Wang, and Francis Lau. Efficient Global Object
Space Support for Distributed JVM on Cluster. In the 2002 Interna-
tional Conference on Parallel Processing, August 2002.
[44] Weijian Fang, Cho-Li Wang, and Francis C.M. Lau. On the Design of
Global Object Space for Efficient Multi-threading Java Computing on
Clusters. Parallel Computing, 29:1563–1587, November 2003.
[45] Weijian Fang, Cho-Li Wang, Wenzhang Zhu, and Francis C. M. Lau.
A Novel Adaptive Home Migration Protocol in Home-based DSM. In
the 2004 IEEE International Conference on Cluster Computing (Cluster
2004), September 2004.
[46] Paulo Ferreira and Marc Shapiro. Garbage Collection and DSM Con-
sistency. In First Symposium on Operating Systems Design and Imple-
mentation, pages 229–241, Monterey, CA, 1994. ACM Press.
[47] Ian Foster. Designing and Building Parallel Programs: Concepts and
Tools for Parallel Software Engineering. Addison-Wesley, 1995.
[48] Thierry Gautier, Jean-Louis Roch, and Gilles Villard. Regular versus
Irregular Problems and Algorithms. In Parallel Algorithms for Irreg-
ularly Structured Problems, Second International Workshop, IRREGU-
LAR ’95, 1995.
[49] Rakesh Ghiya and Laurie J. Hendren. Putting Pointer Analysis to Work.
In 25th Annual ACM SIGACT-SIGPLAN Symposium on the Principles
of Programming Languages, pages 121–133, January 1998.
[50] R.W. Hockney. A Framework for Benchmark Performance Analysis.
Supercomputer, IX-2(48):9–22, 1992.
[51] Weiwu Hu, Weisong Shi, and Zhimin Tang. Home Migration in Home-
based Software DSMs. In Proc. of the 1st Workshop on Software Dis-
tributed Shared Memory (WSDSM’99), 1999.
[52] Richard L. Hudson and J. Eliot B. Moss. Incremental Collection of
Mature Objects. In International Workshop on Memory Management,
1992.
[53] K. Hwang, H. Jin, E. Chow, C.L. Wang, and Z. Xu. Designing SSI
Clusters with Hierarchical Checkpointing and Single I/O Space. IEEE
Concurrency Magazine, 7(1):60–69, Jan-Mar 1999.
[54] Kai Hwang and Zhiwei Xu. Scalable Parallel Computing. McGraw-Hill,
1998.
[55] L. Iftode. Home-based Shared Virtual Memory. PhD thesis, Princeton
University, August 1998.
[56] P. Keleher. Lazy Release Consistency for Distributed Shared Memory.
PhD thesis, 1994.
[57] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks:
Distributed Shared Memory on Standard Workstations and Operating
Systems. In Proc. of the Winter 1994 USENIX Conference, pages 115–
131, 1994.
[58] Leslie Lamport. How to Make a Multiprocessor Computer That Cor-
rectly Executes Multiprocess Programs. IEEE Transactions on Com-
puters, September 1979.
[59] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Memory
Systems. ACM Transactions on Computer Systems, 7(4), November
1989.
[60] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specifica-
tion, Second Edition. Addison Wesley, 1999.
[61] Matchy J. M. Ma, Cho-Li Wang, and Francis C. M. Lau. JESSICA:
Java-Enabled Single-System-Image Computing Architecture. Journal
of Parallel and Distributed Computing, 60(10):1194–1222, Oct. 2000.
[62] M. MacBeth, K. McGuigan, and P. Hatcher. Executing Java Threads in
Parallel in a Distributed-Memory Environment. In Proc. of IBM Center
for Advanced Studies Conference, 1998.
[63] David L. Mills. RFC 1305 - Network Time Protocol (Version 3) Speci-
fication, Implementation, March 1992.
[64] L. R. Monnerat and R. Bianchini. Efficiently Adapting to Sharing Pat-
terns in Software DSMs. In the 4th IEEE International Symposium on
High-Performance Computer Architecture, Feb 1998.
[65] Michael Philippsen and Matthias Zenger. JavaParty — Transpar-
ent Remote Objects in Java. Concurrency: Practice and Experience,
9(11):1225–1242, 1997.
[66] Jose M. Piquer and Ivana Visconti. Indirect Reference Listing: A Robust
Distributed GC. In Euro-Par ’98 Parallel Processing, September 1998.
[67] David Plainfossé and Marc Shapiro. A Survey of Distributed Garbage
Collection Techniques. In Proc. of Interational Workshop on Memory
Management, 1995.
[68] William Pugh. The Java Memory Model is Fatally Flawed. Concurrency:
Practice and Experience, 12(1):1–11, 2000.
[69] J. Ramanujam, S. Dutta, A. Venkatachar, and A. Thirumalai. Advanced
Compilation Techniques for HPF. In Proc. 7th International Workshop
on Compilers for Parallel Computers, Linkoping, Sweden, June 1998.
[70] M. C. Rinard, D. J. Scales, and M. S. Lam. Jade: A High Level Machine-
Independent Language for Parallel Programming. Computer, 26(6):28–
38, 1993.
[71] D. J. Scales, K. Gharachorloo, and C. A. Thekkath. Shasta: A Low
Overhead, Software-Only Approach for Supporting Fine-Grain Shared
Memory. In Proc. of the 7th Symp. on Architectural Support for Pro-
gramming Languages and Operating Systems (ASPLOSVII), pages 174–
185, 1996.
[72] Daniel J. Scales and Monica S. Lam. The Design and Evaluation of a
Shared Object System for Distributed Memory Machines. In Operating
Systems Design and Implementation, pages 101–114, 1994.
[73] Kazuyuki Shudo. Performance comparison of JITs.
http://www.shudo.net/jit/perf/, 2002.
[74] Sun Microsystems, Inc. RFC 1094 - NFS: Network File System Protocol
specification, March 1989.
[75] Sun Microsystems, Inc. Java Remote Method Invocation Specification.
1999.
[76] Sun Microsystems, Inc. The Java Hotspot Performance Engine Archi-
tecture, Oct. 1999.
[77] Sun Microsystems, Inc. Java Object Serialization Specification. 2001.
[78] Ronald Veldema, Rutger F. H. Hofman, Raoul Bhoedjang, and Henri E.
Bal. Runtime Optimizations for a Java DSM Implementation. In Java
Grande, pages 153–162, 2001.
[79] Paul R. Wilson. Uniprocessor Garbage Collection Techniques. In
Proc. Int. Workshop on Memory Management, number 637, Saint-Malo
(France), 1992. Springer-Verlag.
[80] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal
Singh, and Anoop Gupta. The SPLASH-2 programs: Characteriza-
tion and methodological considerations. In Proceedings of the 22th In-
ternational Symposium on Computer Architecture, pages 24–36, Santa
Margherita Ligure, Italy, 1995.
[81] Zhichen Xu, James R. Larus, and Barton P. Miller. Shared Memory
Performance Profiling. In Principles Practice of Parallel Programming,
pages 240–251, 1997.
[82] W. Yu and A. Cox. Java/DSM: A Platform for Heterogeneous Comput-
ing. In Proc. of ACM 1997 Workshop on Java for Science and Engi-
neering Computation, 1997.
[83] Matthew J. Zekauskas, Wayne A. Sawdon, and Brian N. Bershad. Soft-
ware Write Detection for a Distributed Shared Memory. In Proceedings
of the First Symposium on Operating Systems Design and Implementa-
tion (OSDI), 1994.