Distributed Object Sharing for Cluster-based Java
Virtual Machine
Fang Weijian
A thesis submitted in partial fulfillment of the requirements for
the degree of Doctor of Philosophy at the University of Hong Kong
2004
Abstract of thesis entitled
“Distributed Object Sharing for Cluster-based Java Virtual Machine”
submitted by Fang Weijian
for the degree of Doctor of Philosophy
at the University of Hong Kong in 2004
Java has already become one of the most popular programming languages
since its debut. Recent advances in Java compilation and execution technolo-
gies have further pushed Java into the arena of high performance parallel and
distributed computing. On the other hand, the computer cluster has gradu-
ally been accepted as a scalable and affordable parallel computing platform
by both academia and industry in recent years.
We were therefore inspired to design a cluster-based Java Virtual Ma-
chine (JVM) that can run unmodified multi-threaded Java applications on a
computer cluster, where Java threads can be automatically distributed to dif-
ferent computer nodes to achieve high parallelism and leverage cluster-wide
resources such as memory and network bandwidth.
In a cluster-based JVM, the shared memory nature of Java threads calls
for a global object space (GOS) that “virtualizes” a single Java object heap
spanning the cluster to facilitate transparent distributed object sharing. The
performance of the cluster-based JVM hinges on the GOS’s ability to mini-
mize the communication and coordination overheads in maintaining the sin-
gle object heap illusion.
Unlike previous approaches to building a cluster-based JVM, we
build the GOS as an object-based distributed shared memory (DSM) service
embedded in the cluster-based JVM, which facilitates the exploitation of
abundant runtime information for performance improvement. Distributed-
shared objects (DSOs) that are reachable from threads at different nodes are
detected to facilitate efficient consistency maintenance and memory manage-
ment in the cluster-based JVM.
Furthermore, based on the concept of DSO, we propose a framework to
characterize object access patterns, along three orthogonal dimensions. With
this framework, we are able to effectively characterize the runtime memory access
patterns and dynamically apply an adaptive cache coherence protocol to
minimize the consistency maintenance overhead. The adaptation mechanisms include
an adaptive object home migration method that optimizes the single-writer
access pattern, synchronized method migration that allows the execution of
a synchronized method to take place remotely at the home node of its locked
object, and connectivity-based object pushing that uses object connectivity
information to optimize the producer-consumer access pattern. Extensive
experiments have demonstrated the effectiveness of our design.
Declarations
I hereby declare that the thesis entitled “Distributed Object Sharing for
Cluster-based Java Virtual Machine” represents my own work and has not
been previously submitted to this or any other institution for a degree,
diploma, or other qualification.
————————
Fang Weijian
2004
Acknowledgements
I would like to thank my supervisors, Dr. Cho-Li Wang and Dr. Francis C.
M. Lau, for their advice and help with my research and daily life, which were
endless, patient, and invaluable. It is their encouragement and support that
have brought this research to completion. In particular, the experiences of
intensively revising papers before deadlines with Dr. Wang were painful, but
remarkably rewarding. From them, I not only learned how to write papers
but also learned how to do research. Dr. Lau is inspiring and enlightening
in directing my research.
I also want to thank my internal and external examiners for their valuable
comments on my thesis.
It was my pleasure to work with Zhu Wenzhang in my PhD study. I am
full of gratitude for his suggestions and cooperation. It was also my pleasure
to hike with him. He is energetic in both hiking and research.
I would like to thank many colleagues at HKU. They are Wang Lian,
Wang Tianqi, Chen Weisong, Chen Lin, Chen Ge, Zhu DongLai, Li Wei, Yin
Kangkai, etc. I really enjoyed the time spent with them. I also want to thank
Benny Cheung, Roy Ho, and Anthony Tam, for their help on my research
and teaching work.
Finally, I want to express my deepest gratitude to my wife and my parents.
Contents
Declarations i
Acknowledgements ii
1 Introduction 1
1.1 Java and Java Virtual Machine . . . . . . . . . . . . . . . . . 1
1.2 Cluster Computing . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Cluster-based Java Virtual Machine . . . . . . . . . . . . . . . 3
1.4 Global Object Space . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . 7
1.7 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background 10
2.1 Software Distributed Shared Memory . . . . . . . . . . . . . . 10
2.1.1 Memory Consistency Model . . . . . . . . . . . . . . . 11
2.1.2 Classification Based on the Coherence Granularity . . . 12
2.2 Java Memory Model . . . . . . . . . . . . . . . . . . . . . . . 14
3 Memory Access Pattern 17
3.1 Memory Access Pattern Optimization in DSM . . . . . . . . . 17
3.1.1 Programmer Annotation . . . . . . . . . . . . . . . . . 18
3.1.2 Compiler Analysis . . . . . . . . . . . . . . . . . . . . 19
3.1.3 Runtime Adaptation . . . . . . . . . . . . . . . . . . . 21
3.2 Access Pattern Space . . . . . . . . . . . . . . . . . . . . . . . 24
4 Distributed-Shared Object 27
4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Benefits from DSO Detection . . . . . . . . . . . . . . . . . . 28
4.2.1 Benefits on Memory Consistency Maintenance . . . . . 28
4.2.2 Benefits on Memory Management . . . . . . . . . . . . 29
4.3 Lightweight DSO Detection and Reclamation . . . . . . . . . . 30
4.4 Basic Cache Coherence Protocol . . . . . . . . . . . . . . . . . 34
5 Adaptive Cache Coherence Protocol 38
5.1 Adaptive Object Home Migration . . . . . . . . . . . . . . . . 39
5.1.1 Home Migration Concepts . . . . . . . . . . . . . . . . 40
5.1.2 Home Migration with Adaptive Threshold . . . . . . . 43
5.2 Synchronized Method Migration . . . . . . . . . . . . . . . . . 48
5.3 Connectivity-based Object Pushing . . . . . . . . . . . . . . . 51
6 Object Access Pattern Visualization 53
6.1 Object Access Trace Generator . . . . . . . . . . . . . . . . . 56
6.2 Pattern Analysis Engine . . . . . . . . . . . . . . . . . . . . . 58
6.3 Pattern Visualization Component . . . . . . . . . . . . . . . . 59
7 Implementation 63
7.1 JIT Compiler Enabled Native Instrumentation . . . . . . . . . 63
7.2 Distributed Threading and Synchronization . . . . . . . . . . 66
7.2.1 Thread Distribution . . . . . . . . . . . . . . . . . . . 67
7.2.2 Thread Synchronization . . . . . . . . . . . . . . . . . 68
7.2.3 JVM Termination . . . . . . . . . . . . . . . . . . . . . 69
7.3 Non-Blocking I/O Support . . . . . . . . . . . . . . . . . . . . 70
7.4 Distributed Class Loading . . . . . . . . . . . . . . . . . . . . 71
7.5 Garbage Collection . . . . . . . . . . . . . . . . . . . . . . . . 73
7.5.1 Local Garbage Collection . . . . . . . . . . . . . . . . . 74
7.5.2 Distributed Garbage Collection . . . . . . . . . . . . . 75
8 Performance Evaluation 78
8.1 Experiment Environment . . . . . . . . . . . . . . . . . . . . . 78
8.2 Application Suite . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2.1 CPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2.2 ASP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2.3 SOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8.2.4 NBody . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.2.5 NSquared . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.2.6 TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.3 Application Performance . . . . . . . . . . . . . . . . . . . . . 83
8.3.1 Sequential Performance . . . . . . . . . . . . . . . . . . 84
8.3.2 Parallel Performance . . . . . . . . . . . . . . . . . . . 85
8.4 Effects of Adaptations . . . . . . . . . . . . . . . . . . . . . . 89
8.4.1 Adaptive Object Home Migration . . . . . . . . . . . . 91
8.4.2 Synchronized Method Migration . . . . . . . . . . . . . 95
8.4.3 Connectivity-based Object Pushing . . . . . . . . . . . 96
8.5 Sensitivity and Robustness Analysis for HM Protocol . . . . . 97
8.6 More on Synchronized Method Migration . . . . . . . . . . . . 105
9 Related Work 109
9.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
9.2 Augmenting Java for Parallel Computing . . . . . . . . . . . . 109
9.2.1 Language Augmentation . . . . . . . . . . . . . . . . . 110
9.2.2 Class Augmentation . . . . . . . . . . . . . . . . . . . 111
9.3 Cluster-based JVM . . . . . . . . . . . . . . . . . . . . . . . . 112
9.3.1 Jackal . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
9.3.2 Hyperion . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.3.3 JavaSplit . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.3.4 cJVM . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
9.3.5 JESSICA . . . . . . . . . . . . . . . . . . . . . . . . . 119
9.3.6 Java/DSM . . . . . . . . . . . . . . . . . . . . . . . . . 120
10 Conclusion 121
10.1 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
10.1.1 Effectiveness of the Adaptations . . . . . . . . . . . . . 121
10.1.2 Which Existing JVM is Based on . . . . . . . . . . . . 123
10.1.3 Thread Migration vs. Initial Placement . . . . . . . . . 123
10.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
10.2.1 Compiler Analysis to Reduce Software Checks . . . . . 124
10.2.2 Automatic Performance Bottleneck Detection . . . . . 125
10.2.3 High Performance Communication Substrate . . . . . . 126
A Appendix 127
A.1 Overheads of GOS Primitive Operations . . . . . . . . . . . . 127
A.2 ASP Code Segment . . . . . . . . . . . . . . . . . . . . . . . . 129
A.3 The Method for Parallel Performance Breakdown . . . . . . . 130
A.4 JIT Compilation vs. Interpretation . . . . . . . . . . . . . . . 131
List of Figures
3.1 The object access pattern space . . . . . . . . . . . . . . . . . 23
4.1 The detection of distributed-shared object . . . . . . . . . . . 33
4.2 The state transition graph depicting object lifecycle in the GOS 35
5.1 Home-based Protocol for LRC with multiple-writer support . . 41
5.2 Barrier class . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.1 PAT Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Memory access operations in GOS . . . . . . . . . . . . . . . . 57
6.3 Phase parallel paradigm . . . . . . . . . . . . . . . . . . . . . 60
6.4 The time lines window . . . . . . . . . . . . . . . . . . . . . . 61
6.5 The window of object access pattern analysis result (the bigger
one) and the window of the application’s source code (the
smaller one) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.1 Pseudo code for access check: using a function call . . . . . . . 65
7.2 Pseudo code for access check: by comparison . . . . . . . . . . 65
7.3 Detailed pseudo code for a read check . . . . . . . . . . . . . . 65
7.4 IA32 assembly code for a read check . . . . . . . . . . . . . . 66
7.5 Remote unlock of a DSO . . . . . . . . . . . . . . . . . . . . . 71
7.6 JVM’s dynamical loading, linking, and initialization of classes 72
7.7 Tolerating inconsistency in DGC . . . . . . . . . . . . . . . . 74
7.8 DSO reference diffusion tree . . . . . . . . . . . . . . . . . . . 75
8.1 The typical operation in SOR . . . . . . . . . . . . . . . . . . 80
8.2 Barnes-Hut tree for 2D space decomposition . . . . . . . . . . 81
8.3 Single node performance . . . . . . . . . . . . . . . . . . . . . 84
8.4 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.5 Breakdown of normalized execution time against number of
processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.6 The adaptive protocol vs. the basic protocol . . . . . . . . . . 90
8.7 Effects of adaptations w.r.t. execution time . . . . . . . . . . . 92
8.8 Effects of adaptations w.r.t. message number . . . . . . . . . . 93
8.9 Effects of adaptations w.r.t. network traffic . . . . . . . . . . . 94
8.10 The effect of object home migration on SOR . . . . . . . . . . 95
8.11 RCounter’s Source code skeleton run by each thread . . . . . . 98
8.12 Effects of home migration protocols against repetition of single-
writer pattern: normalized execution time (RCounter) . . . . . 100
8.13 Effects of home migration protocols against repetition of single-
writer pattern: normalized message number (RCounter) . . . . 100
8.14 DSOR’s Source code skeleton run by each thread . . . . . . . 102
8.15 Effects of home migration protocols against repetition of single-
writer pattern: normalized execution time (DSOR) . . . . . . 104
8.16 Effects of home migration protocols against repetition of single-
writer pattern: normalized message number (DSOR) . . . . . 104
8.17 Effect of synchronized method migration on the barrier oper-
ation against the number of processors . . . . . . . . . . . . . 106
8.18 ASP’s execution times on different problem sizes . . . . . . . . 107
9.1 JavaSplit’s code sample to send and receive objects . . . . . . 117
A.1 The source code to measure GOS primitive operations . . . . 128
A.2 JIT Compilation vs. Interpretation . . . . . . . . . . . . . . . 132
List of Tables
4.1 Coherence protocols according to object type . . . . . . . . . . 37
8.1 Communication effort on 16 processors . . . . . . . . . . . . . 89
A.1 Overheads (in microseconds) of primitive operations with re-
spect to different number of threads . . . . . . . . . . . . . . . 128
Chapter 1
Introduction
1.1 Java and Java Virtual Machine
In less than ten years, Java [31] has become one of the most popular pro-
gramming languages since its debut on May 23, 1995 at SunWorld ’95. The
following features of Java contribute to its success.
• Java adopts a simplified C++-like grammar, which makes it a simple
yet expressive object-oriented language.
• Java is also a concurrent programming language, as it supports multi-
threading.
• Java is by design a platform-independent language, through the intro-
duction of the bytecode. Java’s source code is first compiled to the
standard bytecode, which in turn can run on any platform where there
is a Java Virtual Machine (JVM) [60]. JVM is the runtime system
responsible for executing Java bytecode.
• JVM provides some very attractive runtime features, such as automatic
memory management through garbage collection [79], multi-threading
support, and runtime safety checks that include array boundary checks
as well as reference type checks.
• The Java Development Kit (JDK) provides abundant libraries, supporting
Collections, Sockets, Remote Method Invocation (RMI) [75], Object
Serialization [77], and more.
Although Java has long been considered a productive and universal lan-
guage, its performance was unsatisfactory due to the poor performance of
the JVM. However, recent advances in Java compilation and ex-
ecution technology, such as the just-in-time compiler [73], the hotspot tech-
nology [76], and the incremental garbage collection [52], add to the attrac-
tiveness of Java as a language for high performance scientific and engineering
computing [6]. As a consequence, more and more researchers are adopting
Java in high performance parallel and distributed computing [19][20].
1.2 Cluster Computing
A cluster is a type of parallel or distributed processing system, which consists
of a collection of interconnected stand-alone computers working together as
a single, integrated computing resource [32]. In recent years, the computer
cluster has been widely accepted as a scalable and affordable parallel com-
puting platform by both academia and industry [30, 26, 41]. For example,
in the TOP500 list [17] released in November 2003, 41.6% of the supercom-
puters, i.e., 208 systems, are clusters, and account for 49.8% of the aggregated
performance.
The prosperity of cluster computing is attributed to ever advancing com-
modity high performance microprocessors and high-speed networks, as well
as open-source cluster software, such as Rocks [12] for Linux cluster soft-
ware installation, Torque [18] for resource management, Maui [10] for job
scheduling, MPICH [11] for message passing programming, and Ganglia [2]
for cluster monitoring.
Nevertheless, cluster programming is still a challenging task. One of the
major programming paradigms on clusters is message passing, e.g., following
the MPI standard [16]. The message passing paradigm requires programmers
to write explicit code to send and receive data in order to coordinate processes
on different cluster nodes. With message passing, superior performance is
usually achievable by fine-tuning the timing and content of each message,
which, however, is widely believed to be a painful and error-prone process.
Alternatively, software Distributed Shared Memory (DSM) [1] promises
better programmability than the message passing paradigm, by providing a
globally shared memory abstraction across physically distributed memory
machines. In software DSM, programmers access distributed data in the
same way as local data. Special APIs are provided to synchronize parallel
processes. To improve the performance, a shared data unit can be replicated
on multiple nodes. Inconsistency among the replicas is resolved according to
memory consistency models [21]. Since the enforcement of data coherence is
done automatically by the DSM infrastructure, communication may happen
more frequently and involve more data traffic than necessary. For example,
the update or invalidation of a cached copy is unnecessary if the copy will
never be used again.
1.3 Cluster-based Java Virtual Machine
Motivated by both the programmability of Java and the ample availability
of clusters as a cost-effective parallel computing environment, the transpar-
ent and parallel execution of multi-threaded Java programs on clusters has
become a research hotspot [62, 78, 82, 24, 61, 42, 44].
In this work, we build a cluster-based Java Virtual Machine to tackle this
problem. A cluster-based JVM conforms to the JVM specification [60], but
runs on a cluster. With a cluster-based JVM, the Java threads created within
one program can be transparently distributed onto different cluster nodes to
achieve a higher degree of execution parallelism. In addition, cluster-wide
resources such as memory, I/O, and network bandwidth can be unified and
used as a whole to solve resource-demanding problems. A cluster-based JVM
is also called a distributed JVM.
A cluster-based JVM is composed of a group of collaborating daemons,
one on each cluster node. Each daemon is a standard JVM augmented with
cluster awareness and the capability to cooperate with the other daemons in
order to present a single system image (SSI) [53] of the cluster to Java
applications. The single system image is enabled through the global object
space that will be discussed in the next section.
The adoption of the cluster-based JVM for parallel Java computing can
boost cluster programming productivity. Given that the cluster-based
JVM conforms to the JVM specification, any Java program can run on the
cluster-based JVM without any modification. The steep learning curve can
thus be avoided since the programmers do not need to learn a new parallel
language, a new message passing library, or a new tool in order to develop
parallel programs on clusters. It is also convenient for program development
as multi-threaded programs can be implemented and tested on a non-parallel
computer before they are submitted to a cluster for execution. Finally, many
existing multi-threaded Java applications, especially server applications, can
be ported to clusters when a cost-effective parallel platform is sought.
1.4 Global Object Space
In a cluster-based JVM, as Java threads are distributed around the cluster,
the shared memory nature of Java threads calls for a global object space
(GOS) that “virtualizes” a single Java object heap spanning the cluster to
facilitate transparent distributed object sharing.
In GOS, object replication is encouraged to improve the data locality,
which raises the consistency issue. The memory consistency issue is solved ac-
cording to the Java memory model (Chapter 8 of the JVM specification [60]).
Particularly, memory consistency operations are triggered by thread synchro-
nization. GOS is responsible for enforcing the Java memory model, as well
as for handling thread distribution and location-transparent synchronization. In
addition, in order to completely comply with the JVM specification, GOS
needs to perform distributed garbage collection for automatic memory man-
agement.
GOS is indeed a DSM service with functionality extensions in an object-
oriented Java system. The performance of the cluster-based JVM hinges on
the GOS’s ability to minimize the communication and coordination overheads
in maintaining the single object heap illusion. It is challenging to design and
implement a GOS that is both complete in terms of functionality and efficient
in terms of performance.
1.5 Our Approach
We design a cluster-based JVM. Different from previous approaches [82, 61]
that leverage a page-based DSM as an underlying infrastructure to build
the GOS, we build a GOS embedded in the cluster-based JVM [43]. In this
architecture, GOS is able to exploit abundant runtime information in JVM,
particularly the object type information, to improve the performance.
We leverage the runtime object connectivity information to detect distributed-
shared objects (DSOs). DSOs are the objects that are reachable from at least
two threads located at different cluster nodes in a cluster-based JVM. The
identification of DSOs allows us to handle the memory consistency problem
more precisely and efficiently. For example, in Java, synchronization primi-
tives are not only used to protect critical sections but also to maintain the
memory consistency. Clearly, only synchronizations of DSOs may involve
multiple threads on different nodes. Thus, the identification of DSOs can
reduce the frequency of consistency-related memory operations. Moreover,
since only DSOs that are replicated on multiple nodes would be involved in
the consistency maintenance, the detection of DSOs leads to a more efficient
implementation of the cache coherence protocol. The identification
of DSOs also facilitates distributed garbage collection.
The choice of a good cache coherence protocol is often application-dependent.
That is, the particular memory access patterns in an application determine
which protocol is more suitable. That motivates us to pursue an adaptive proto-
col. An adaptive cache coherence protocol is able to detect the current access
pattern and adjust itself accordingly. We believe that adaptive protocols are
superior to non-adaptive ones due to their adaptability to object access pat-
terns in applications. In our design, we use an object-based adaptive cache
coherence protocol to implement the Java memory model.
The challenges of designing an effective and efficient adaptive cache co-
herence protocol are: (1) whether we can determine those important access
patterns that occur frequently or those that contribute a significant amount
of overhead to the GOS, and (2) whether the runtime system can efficiently
and correctly identify such target access patterns and apply the correspond-
ing adaptations in a timely fashion.
To further understand the first challenge and to overcome it, we propose
the access pattern space [44] as a framework to characterize object access
behavior. This space has three dimensions: number of writers, synchro-
nization, and repetition. We identify some basic access patterns along each
dimension: multiple-writer, single-writer, and read-only for the number-of-
writers dimension; mutual exclusion and condition for the synchronization
dimension; and patterns with different numbers of consecutive repetitions for
the repetition dimension. A combination of basic patterns along the
three dimensions then portrays an actual runtime memory access pattern.
This 3-D access pattern space serves as a foundation on which we can iden-
tify those important object access patterns in the distributed JVM. We can
then choose the right adaptations to match with these access patterns and
improve the overall performance of the GOS.
To meet the second challenge, we take advantage of the fact that the GOS
is embedded in the cluster-based JVM. Our adaptive protocol can leverage
all runtime object type and access information to efficiently
and accurately identify the access patterns worthy of special focus.
We apply three different protocol adaptations to the basic home-based
multiple-writer cache coherence protocol in three respective situations in the
access pattern space: (1) adaptive object home migration which optimizes
the single-writer access pattern by moving the object’s home to the writing
node according to the access history; (2) synchronized method migration
which chooses between default object (data) movement and optional method
(control flow) movement in order to optimize the execution of critical section
methods according to some prior knowledge; (3) connectivity-based object
pushing which scales the transfer unit to optimize the producer-consumer
access pattern according to the object connectivity information.
1.6 Contributions of the Thesis
We summarize the contributions of this thesis as follows:
1. We design a global object space embedded in a cluster-based JVM
that exploits Java’s runtime information to improve the performance.
In particular, distributed-shared objects are identified at run time to
reduce the overhead of memory consistency maintenance and to facili-
tate the distributed garbage collection.
2. We propose an object access pattern space as a framework to charac-
terize the object access behavior.
3. We propose a novel object home migration protocol that optimizes the
single-writer access pattern. The protocol demonstrates both the sensi-
tivity to the lasting single-writer pattern and the robustness against the
transient single-writer pattern. In the latter case, the protocol inhibits
home migration in order to reduce the home notification overhead.
4. We propose other optimizations in our GOS, including synchronized
method migration that allows the execution of a synchronized method
to take place remotely at the home node of its locked object, and
connectivity-based object pushing that uses object connectivity infor-
mation to optimize the producer-consumer access pattern.
5. We design and implement a visualization tool called PAT (Pattern
Analysis Tool) that can be used to visualize object access traces and
analyze object access patterns in our GOS.
6. We have prototyped a cluster-based JVM with our GOS design and all
optimizations incorporated. Extensive experiments demonstrate the
performance of our GOS and the effectiveness of the optimizations.
1.7 Thesis Organization
Chapter 2 introduces the background of this research. Chapter 3 elaborates
the memory access patterns in DSM and GOS. Chapter 4 presents the con-
cept of the distributed-shared object and how we leverage it to improve the
GOS’s performance. Chapter 5 elaborates the adaptations we have adopted. Chap-
ter 6 presents our pattern analysis tool used to visualize object access pat-
terns. Chapter 7 discusses some implementation details in our cluster-based
JVM. Chapter 8 reports the experiments we conduct to measure the perfor-
mance of the prototype based on our design. Chapter 9 discusses the related
work and compares them with this work. Chapter 10 gives the conclusion
and presents a possible agenda for future work.
Chapter 2
Background
To support the truly parallel execution of Java threads on a cluster, we need a
global object space for transparent distributed object accesses. The concept
of global object space is rooted in software distributed shared memory, which
is a well-established research area in cluster computing. In this chapter, the
concepts of distributed shared memory will be introduced. We will also
discuss Java’s special constraints on the global object space, i.e., the Java
Memory Model.
2.1 Software Distributed Shared Memory
Software distributed shared memory (DSM1) [1] promises higher programma-
bility than the message passing paradigm, by providing a globally shared
memory abstraction across physically distributed memory machines. To
improve the performance, the replication of shared data is allowed. The
data consistency issue is solved by well-defined memory consistency mod-
els [21].
1 In this thesis, DSM denotes software distributed shared memory.
2.1.1 Memory Consistency Model
The memory consistency model of a DSM system provides a formal speci-
fication of how the memory system will appear to the programmer [21]. It
defines the restrictions on the legal values that a read can return among the
writes performed by other processors.
From the viewpoint of programmers, sequential consistency [58] is the
most intuitive model, which requires the memory accesses within each indi-
vidual process follow program order and writes be made atomically visible to
all the processes. Though intuitive, sequential consistency suffers from poor
performance. Sequential consistency not only prohibits some common com-
piler optimizations, such as reordering memory accesses to different memory
locations, but also results in excessive data communication on the distributed
shared memory platform [59].
In order to improve the efficiency of DSM, it has been considered to relax
the memory order constraints imposed by sequential consistency. Lazy re-
lease consistency (LRC) [56] is one of the state-of-the-art relaxed consistency
models widely used in software DSM systems. LRC distinguishes synchro-
nization variables from normal shared variables. LRC defines two operations
on synchronization variables, namely acquire and release. Acquire operations
are used to tell the memory system that a critical region is about to be en-
tered. Release operations are used to tell that a critical region is about to
be exited. In LRC, when a process P1 acquires a synchronization variable
that was most recently released by another process P2, all the writes that
are visible to P2 at the time of releasing the synchronization variable become
visible to P1.
LRC allows common compiler optimizations. LRC also allows the write
propagations to be postponed and batched until the synchronization points.
Moreover, correctly synchronized LRC programs that are data-race-free have
sequentially consistent behavior [22]. Thus, it is intuitive for programmers to
reason about the execution of a data-race-free LRC program.
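For concreteness, the following Java sketch (our own illustration; Java
monitors stand in for LRC’s acquire and release operations on a
synchronization variable) shows the one visibility guarantee LRC makes: the
write performed by P2 inside its critical region becomes visible to P1 because
P1 subsequently acquires the synchronization variable that P2 released.

    // Illustrative sketch only: synchronized blocks play the role of
    // LRC's acquire/release operations on a synchronization variable.
    class LrcExample {
        static int data = 0;                      // an ordinary shared variable
        static final Object sync = new Object(); // the synchronization variable

        public static void main(String[] args) throws InterruptedException {
            Thread p2 = new Thread(() -> {
                synchronized (sync) {             // acquire
                    data = 42;                    // write inside the critical region
                }                                 // release: the write becomes
            });                                   // attached to the release of sync
            Thread p1 = new Thread(() -> {
                synchronized (sync) {             // acquire after P2's release
                    System.out.println(data);     // guaranteed to print 42
                }
            });
            p2.start(); p2.join();                // force P2's release to precede
            p1.start(); p1.join();                // P1's acquire
        }
    }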
2.1.2 Classification Based on the Coherence Granular-
ity
According to the coherence granularity in DSM, there are three kinds of
DSM systems: page-based DSMs, whose granularity is a virtual memory
page; object-based DSMs, whose granularity is a variable-sized structured
data unit defined by the application; and fine-grain DSMs, whose granularity
is a fixed-sized memory block that is much smaller than a virtual memory
page.
Page-based DSM
Page-based DSMs’ coherence granularity is the virtual memory page. The
page-based DSM leverages the memory management unit (MMU) to inter-
cept the faulting access on a shared page that is not locally available, because
it is either obsolete or not cached at all. Then the page-based DSM fetches
the valid copy from the other nodes according to the memory consistency
model and resumes the faulting access. The advantage of page-based DSM is
that, by using the MMU, only faulting accesses are trapped, while all
non-faulting accesses proceed at full speed. However, a virtual memory page
is as large as 4K bytes, which raises the false sharing prob-
lem. The false sharing problem happens when two processes independently
access different parts of the same page. The page-based DSM’s effort for
the two processes to have the same view of the page is unnecessary for the
correctness of the program. The false sharing problem could be a serious
performance issue in page-based DSMs, particularly for applications with
fine-grain sharing characteristics.
TreadMarks [57] is a page-based DSM which adopts a homeless cache
coherence protocol to implement lazy release consistency [56]. TreadMarks
uses twin and diff techniques to support multiple processes writing on the
same shared virtual memory page simultaneously due to false sharing. On a
write fault to a local cached page, a copy of that page, called twin, is created.
Later, the diff, which records the local updates performed so far, can be
computed by comparing the current page with the previously saved twin.
The protocol is considered homeless because the diffs are saved and managed
at each process.
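The twin/diff technique can be sketched in a few lines. The following Java
fragment is our own illustration; the names makeTwin and computeDiff are
not TreadMarks’ actual API, and real systems typically compare at word
rather than byte granularity.

    // Save a copy (the twin) of a page on the first write fault; at a
    // synchronization point, derive the diff by comparing the current
    // page with the saved twin.
    class DiffSketch {
        static byte[] makeTwin(byte[] page) {
            return page.clone();
        }

        static java.util.List<int[]> computeDiff(byte[] page, byte[] twin) {
            java.util.List<int[]> diff = new java.util.ArrayList<>();
            for (int i = 0; i < page.length; i++) {
                if (page[i] != twin[i]) {
                    diff.add(new int[] { i, page[i] });  // (offset, new value)
                }
            }
            return diff;  // shipped to wherever the update must be applied
        }
    }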
Comparatively, HLRC [55] uses a home-based protocol to implement LRC.
In a home-based protocol, each shared coherence unit has a home to which
all writes (diffs) are propagated and from which all copies are derived. It
has been shown that the home-based protocol is more scalable than the homeless
protocol because the home-based protocol maintains a simpler state, sends
fewer messages, has a lower diff overhead, and consumes a much smaller
memory [55].
Object-based DSM
Having observed that the false sharing problem is rooted in the sharing gran-
ularity mismatch between the page-based DSM systems and the applications,
some researchers introduced the concept of object-based DSM. Object-based
DSMs’ coherence granularity is an object, which is a structured data unit
defined by the applications. Most existing object-based DSM systems are
language-based. They are either new parallel programming languages (e.g.,
Orca [28] and Jade [70]), or modifications of programming languages such
as C (e.g., Munin [33] and Midway [83]). In both cases, the compiler or the
preprocessor is leveraged to hook the source code with the routines in the
corresponding object-based DSM library.
Object-based DSMs reduce the false sharing problem due to the relatively
small probability that two processes independently access the different parts
of a shared object. However, they raise another performance issue. Since the
MMU cannot be used to trap faulting accesses on arbitrary-sized objects,
software checks must be inserted before the memory accesses to guarantee
the accessed objects are in the right access state. The software access checks
could introduce a large overhead in object-based DSMs.
Fine-grain DSM
The fine-grain DSM is a trade-off between the page-based DSM and the
object-based DSM. The fine-grain DSM provides a shared memory address
space just as the page-based DSM does. The copies of the same shared data
reside at the same virtual memory address on all nodes, which eases the
memory management and the data transfer among nodes. To reduce the
false sharing problem, the coherence granularity of fine-grain DSM is much
smaller than that of the page-based DSM. For example, the fine-grain DSM
Shasta [71] has a variable-sized coherence granularity, called a block, which is
a multiple of a line of 64 or 128 bytes. Software checks
are inserted before memory accesses to guarantee that the shared data are
in the right state, as in object-based DSMs. Shasta has demonstrated a set
of techniques to reduce the software checks. Jackal [78] also uses a fine-grain
DSM to build the GOS for a cluster-based JVM.
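To make the cost of such checks concrete, here is a minimal Java sketch of a
software read check of the kind inserted before shared accesses. The state
names and the fetch routine are illustrative only, not Shasta’s or Jackal’s
actual code; the access checks used in our own GOS appear in Chapter 7.

    // A software read check guarding a shared access.
    class AccessCheckSketch {
        enum State { INVALID, READ_ONLY, WRITABLE }

        static class SharedObject {
            volatile State state = State.INVALID;
            int field;
        }

        static void fetch(SharedObject obj) {
            // ...request an up-to-date copy from the object's home node...
            obj.state = State.READ_ONLY;
        }

        static int readField(SharedObject obj) {
            if (obj.state == State.INVALID) {  // the inserted software check
                fetch(obj);                    // fault in a valid copy
            }
            return obj.field;                  // non-faulting path costs only
        }                                      // one compare and one branch
    }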
2.2 Java Memory Model
Java is a programming language incorporating multi-threading support. Java
threads interact with each other through a shared memory, i.e., the object
heap. It is necessary to define the rules describing which values may be
seen by a read of shared memory that is updated by multiple threads. The
Java memory model (JMM) (chapter 8 of JVM specification [60]) defines the
memory consistency semantics of multi-threaded Java programs.
There is a lock associated with each object in Java. The Java language
provides the synchronized keyword, used in either a synchronized method or
a synchronized statement, for synchronization among multiple threads. En-
tering or exiting a synchronized block corresponds to acquiring or releasing
the lock of the specified object. A synchronized method or a synchronized
statement is used not only to guarantee exclusive accesses in the critical sec-
tion, but also to maintain memory consistency of objects among all threads
that have performed synchronization operations on the same lock.
An abstract machine is defined in JMM to describe threads’ memory
behavior. All threads share a main memory, which contains the master
copies of all variables. A variable is an object field, an array element, or a
static field. Each thread has its own working memory, which is its private
cache for all the variables it uses. A use of a variable in the main memory
causes it to be cached in the thread’s working memory.
JMM defines: before a thread releases a lock, it must copy all assigned
values in its working memory back to the main memory; before a thread
acquires a lock, it must flush (invalidate) all variables in its working mem-
ory. In this way, subsequent uses will load the up-to-date values from the
main memory. In addition, with respect to a lock, the acquire and release op-
erations performed by all threads are sequentially consistent. The acquire
and release operations have their embodiments in the Java bytecode set, i.e.,
monitorenter and monitorexit.
JMM resembles LRC in that acquire/release operations are used to es-
tablish a partial order between the memory actions performed by multiple
threads. We follow the operations defined in the JVM specification to imple-
ment JMM.
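These rules can be read off a small example. In the sketch below (our own
illustration), a thread that calls get() after another thread’s add() has
released the lock is guaranteed to observe the updated value.

    // How JMM ties visibility to the synchronized construct.
    class Counter {
        private int value;               // a variable (an object field)

        synchronized void add(int n) {   // monitorenter (acquire): flush the
            value += n;                  // working memory, then assign
        }                                // monitorexit (release): copy assigned
                                         // values back to the main memory

        synchronized int get() {         // acquire: later uses reload the
            return value;                // up-to-date value from main memory
        }
    }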
Revising JMM
Some researchers argue that the current JMM is not well designed because
it prohibits some common compiler optimizations, causes some counterintu-
itive behavior, and even makes some well known design patterns unsafe [68].
Currently, the JMM is under active revision through the JCP’s procedures [8].
The JCP (Java Community Process) is the standard procedure for evolving
Java technology through community effort under the supervision of Sun
Microsystems. Hopefully, a new JMM will be introduced in the Tiger (1.5)
release of Java to replace the original one. The latest information on the
proposed JMM can be found at Pugh’s website [15] and is still under constant
revision.
The detailed comparison between the current JMM and the proposed one
is beyond the scope of this thesis. Here we simply list some major changes
made in the proposed JMM:
The semantics of volatile variables have been strengthened to have acquire
and release semantics. A read to a volatile field has the acquire semantics
and a write to a volatile field has the release semantics.
The semantics of final fields have been strengthened to allow for thread-
safe immutability. A read on a final field will always return the correctly
initialized value as long as the object reference is not exposed during the
object construction.
In addition, the proposed JMM states that the useless synchronization
has no memory semantics. A synchronization action is useless in a number of
situations, including acquiring/releasing a lock of thread-local objects, and
re-acquiring an already acquired lock. This statement is very reasonable.
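The first two changes can be illustrated with a small publication example,
written against the proposed semantics (our own sketch):

    class Publication {
        static final class Point {
            final int x, y;              // final fields: thread-safe immutability,
            Point(int x, int y) {        // provided 'this' does not escape
                this.x = x; this.y = y;  // during construction
            }
        }

        static Point point;
        static volatile boolean ready;   // a volatile write has release semantics

        static void producer() {
            point = new Point(1, 2);
            ready = true;                // release: publishes the write to point
        }

        static void consumer() {
            if (ready) {                 // a volatile read has acquire semantics
                System.out.println(point.x + "," + point.y);
            }
        }
    }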
Based on our understanding of the current and proposed JMM, we believe
that although our cluster-based JVM mainly follows the current JMM, it can
be quickly adapted to the proposed JMM once it is officially approved.
Chapter 3
Memory Access Pattern
Our cluster-based JVM is distinguished by its adaptability to object access
patterns. In this chapter, we first survey various memory access pattern
optimizations in the area of DSM. Then we propose an access pattern space
as a framework to characterize object access behavior, which serves as a
foundation for designing effective adaptations.
3.1 Memory Access Pattern Optimization in
DSM
Although the DSM paradigm promises higher programmability than the
message passing paradigm, it may involve more communication than necessary.
For example, the update or invalidation of a cached copy is unnecessary if
the copy will never be used again. To make the performance of DSM
applications comparable to that of their message passing counterparts,
researchers are investigating various ways to reduce the communication in DSMs.
In DSM systems, many cache coherence protocols have been proposed
to implement various memory consistency models. The home-based proto-
col [55] assigns a home node to each shared data object from which all copies
are derived. It is widely believed that the home-based protocol is more scal-
able than the homeless protocol [57], because the former consumes less
memory and can eliminate diff accumulation. The home in a
home-based protocol can be either fixed [55] or mobile [35]. There are also
variations for the coherence operations, such as a multiple-writer protocol, or
a single-writer protocol. The single-writer protocol allows only one process to
write on a shared data unit at a time. In order to become the writer, a
process needs to acquire the write permission from the previous writer. The
multiple-writer protocol introduced in Munin [33] supports concurrent writes
on different copies of the same object by using the diff technique. It may
however incur heavy diff overhead compared with conventional single-writer
protocols. Another choice is between the update protocol (e.g., Orca [28])
and the invalidate protocol. The latter is used in many page-based DSM
systems such as TreadMarks [57] and JUMP [35]. The update protocol can
prefetch the data before the access, but it may send much more unneeded
data when compared with the invalidate protocol.
A promising approach to further improve the performance of DSM sys-
tems is to design adaptive cache coherence protocols that are able to de-
tect and optimize memory access patterns. The rationale is that the
particular memory access patterns in an application determine which
protocol is more suitable; that is, the choice of a good coherence protocol
is often application-dependent. This motivates the pursuit of adaptive
protocols.
In this section, we discuss three approaches to memory access pattern
optimization, namely programmer annotation, compiler analysis, and
runtime adaptation.
3.1.1 Programmer Annotation
The programmer annotation approach requires programmers to explicitly
annotate the shared data objects with pattern declarations. Strictly speak-
ing, this approach does not use an adaptive cache coherence protocol. Nev-
ertheless, it manages to optimize some memory access patterns.
Munin [33] follows the programmer annotation approach. Munin allows
programmers to explicitly annotate the object with pattern declarations,
which include conventional, read-only, migratory, and write-shared. Each
pattern has its own protocol that will be used by Munin at runtime. Munin
applies a multiple-writer protocol to the write-shared pattern, and a single-
writer protocol to the conventional pattern. For the migratory pattern, the
objects are migrated from machine to machine as critical regions are entered
and exited. The read-only data are replicated on demand without further
consistency maintenance, but a runtime error will be generated if some pro-
cess tries to write read-only data.
SAM [72] is an object-based DSM runtime system that supports the
optimization of some object access patterns, such as the producer-consumer
and accumulator patterns. An accumulator represents a piece of data that
must be updated in a critical section. SAM provides synchronization
primitives to let users explicitly tie the patterns to the object accesses. SAM
automatically migrates the accumulator data, and prefetches the producer-
consumer data before they are consumed.
3.1.2 Compiler Analysis
The programmer annotation approach allows programmers to choose the
most suitable cache coherence protocol among a set of candidates for an object
presenting a particular access pattern. Although this approach helps to im-
prove the performance, it is inconvenient for programmers and error-prone.
The compiler analysis approach tries to overcome the shortcoming of the
programmer annotation approach by leveraging compiler analysis techniques
to automatically extract the access pattern information from the programs.
Orca [28] is a language-based DSM system. At runtime, a shared object
can be either replicated on all processors, or not replicated at all. For the
replicated objects, broadcast is used to deliver updates to all replicas. For
the non-replicated objects, remote procedure calls are used to access the
objects. The actual replication policy for each object is determined
by both the compiler and the runtime system. Orca’s compiler estimates the
expected read to write ratio of each shared object in the program. For ex-
ample, an object with a large read/write ratio on a cheap broadcast network
will be replicated on all processors. Orca’s runtime system can also collect
the actual read/write information to amend the compiler derived decisions.
Although the compiler analysis approach is able to automatically extract
access pattern information from the programs, it has several shortcomings
inherited from compiler analysis techniques. Firstly, since the input of the
compiler is the source code of the program, the compiler analyzes the pro-
grams based on allocation sites. An allocation site is the location in the
program source code where object instances are created at runtime. Though
the compiler analysis works well for the situation where all object instances
created from the same allocation site present the same access pattern, it
may be difficult to distinguish among the object instances of different access
patterns from the same allocation site.
Secondly, the compiler analysis approach may find it difficult to detect
access pattern changes. Even if it is able to notice the possible changes, it
may be difficult to predict the actual time of change.
Thirdly, the compiler analysis cannot precisely predict the access patterns
without knowledge of the actual thread-to-node mapping in a multi-threaded
setting. For example, suppose two threads concurrently write to a large
shared object, which the compiler can detect. If the two threads reside on
different nodes at runtime, a multiple-writer protocol is suitable for the
shared object, with the twin and diff techniques used to support the
concurrent writers. However, if the two threads are on the same node at
runtime, all the twin and diff overheads are simply wasted.
3.1.3 Runtime Adaptation
To overcome the shortcomings of the compiler analysis approach, people are
investigating the runtime approach to optimize the memory access patterns,
called the runtime adaptation approach. It leverages the adaptive cache co-
herence protocol to detect and adapt to some particular access patterns. It
is transparent to the programmers. Since all the runtime access information
is accessible, precise and prompt access pattern optimization is possible.
Usually the runtime adaptation approach speculatively detects access
patterns based on some heuristics; false speculation can be corrected at
runtime.
Currently, most work on the runtime adaptation approach is done on
page-based DSMs. In the context of page-based DSMs, accesses to different
objects residing on the same page are mingled at the page level, so it is
difficult to detect access patterns in applications with fine-grain sharing.
Some homeless page-based DSM systems use adaptive cache coherence
protocols to optimize memory access patterns. Adaptive TreadMarks [23]
can adapt between the single-writer protocol and the multiple-writer protocol.
The single-writer protocol does not use the twin and diff technique. Instead,
a process must get the ownership of a shared page before writing on it.
Adaptive TreadMarks switches to the single-writer protocol when it observes
that the overhead of requesting and applying diffs is larger than that of
requesting the whole page. It can also perform dynamic page aggregation,
which groups several pages together as a coherence unit. When one page of
the group is faulted in, the whole group of pages is faulted in, too.
ADSM [64] can also adapt between the single-writer protocol and the
multiple-writer protocol, based on the approximate association between locks
and the data they protect. Initially, all pages are in the initial state, valid at
and owned by process 0. Any access fault will place a page in the migratory
state, until a write fault by another process happens; the page is then placed
in the multiple-writer state. The single-writer protocol is used for pages in
the migratory state, and the multiple-writer protocol is used for pages in the
multiple-writer state. From time to time, pages in the multiple-writer state
can be reset to the initial state to allow continuous adaptation.
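The adaptation just described amounts to a small per-page state machine.
The following condensed Java sketch is our own rendering of it; the state
and method names are not ADSM’s code.

    class PageStateSketch {
        enum PState { INIT, MIGRATORY, MULTI_WRITER }

        PState state = PState.INIT;  // initially valid at and owned by process 0
        int owner = 0;

        void onAccessFault(int process) {
            if (state == PState.INIT) {
                state = PState.MIGRATORY;     // single-writer protocol applies
                owner = process;
            }
        }

        void onWriteFault(int process) {
            if (state == PState.MIGRATORY && process != owner) {
                state = PState.MULTI_WRITER;  // switch to the multiple-writer
            }                                 // protocol
        }

        void reset() {
            state = PState.INIT;  // periodic reset allows continuous adaptation
        }
    }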
The asymmetry between the home copy and non-home copies in home-
based protocols raises the home assignment problem. In home-based proto-
cols, the home copy is always valid. Accesses on the home node never incur
communication overhead, while accesses on non-home nodes trigger
communication with the home node. Therefore, which node acts as the
home changes the coherence data communication pattern, and thus influ-
ences the application performance. In fact, the optimal home assignment is
determined by the memory access pattern of the application. This inspires
some dynamic home assignment protocols able to adapt to runtime memory
access patterns.
In JiaJia [51], which is a page-based DSM system, those pages that are
written by only one process between two barriers are recognized by the bar-
rier manager and their homes are migrated to the single writing process.
New home notifications are piggybacked on barrier messages. JiaJia’s home
migration protocol only optimizes the single-writer pattern. Since JiaJia’s
approach relies on barrier synchronization, it will not work if the appli-
cation does not use barriers or the DSM infrastructure does not expose the
barrier function. For example, in our case, the Java programmers have to im-
plement the barrier by using more primitive synchronization operations such
as lock/unlock/wait. Furthermore, since all the single-writer detection work
is done centrally at the barrier manager, it may cause considerable overhead
when there are a fair number of processes as well as shared pages.
JUMP [35] adopts a migrating-home protocol in which the process requesting
the page becomes the new home. The new home notification is broadcast to
other processes at synchronization points. Although this approach results in
fewer diff operations because the writes probably happen at the home node,
the home migration decision ignores the inherent memory access patterns of
the application. If the accesses by the process at the new home do not persist,
home migration will not improve the performance; instead, it could suffer
from heavy home notification overhead. The worst case happens when the
shared page is written by processes sequentially, which produces numerous
home notification messages.

Figure 3.1: The object access pattern space. (The figure plots three axes:
number of writers, with read only, single writer, and multiple writers;
synchronization, with no synchronization, condition (assignment), and mutual
exclusion (accumulator); and repetition, with the adaptation point marked on
the repetition axis.)
3.2 Access Pattern Space
According to JMM, an object’s access behavior can be described as a set
of reads and writes performed on the object, with interleaving synchroniza-
tion actions such as locks and unlocks. Locks and unlocks on the same
object are executed sequentially. Three orthogonal dimensions capturing the
characteristics of object access behavior can be defined: number of writers,
synchronization, and repetition. They form a 3-dimensional access pattern
space, as shown in Fig. 3.1.
Number of writers
Among all the accesses from different threads, a happen-before-1 [22] partial
ordering, denoted by →hb1, can be established:
• If a1 and a2 are two memory actions by the same thread, and a1 occurs
before a2 in program order, then a1 →hb1 a2.
• If a1 is an unlock by thread t1, and a2 is the following lock on the same
object by thread t2, then a1 →hb1 a2.
• If a1 →hb1 a2 and a2 →hb1 a3, then a1 →hb1 a3.
A write w1 is a concurrent write if there exists another write w2 such that
• w1 and w2 are issued by different threads; and
• w1 and w2 are on the same object; and
• neither w1 →hb1 w2 nor w2 →hb1 w1 holds.
We also say that w1 is concurrent with respect to w2, denoted by w1 ‖ w2.
A write w1 is a sequential write if there does not exist another write w2
such that w1 ‖ w2.
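A concrete Java instance of this ordering (our own illustration): whichever
of the two threads below locks o second, the unlock-lock pair orders the two
writes, so both are sequential writes; with the synchronized blocks removed,
w1 ‖ w2 would hold instead.

    class Hb1Example {
        static final Object o = new Object();
        static int x;

        static void t1() {
            synchronized (o) { x = 1; }  // w1, followed by an unlock of o
        }

        static void t2() {
            synchronized (o) { x = 2; }  // w2; the unlock-to-lock edge on o
        }                                // yields w1 ->hb1 w2 or w2 ->hb1 w1,
    }                                    // so neither write is concurrent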
On the dimension of number of writers, we distinguish three cases:
• Multiple writers : the object is written by multiple threads. Specifically,
if w is a concurrent write on this object, the object presents the multiple-
writer pattern when w happens. The multiple-writer pattern is not the
same as a data race: the accesses of a data race happen on the same
variable, while the accesses of the multiple-writer pattern happen on the
same object. The multiple-writer pattern implies false sharing.
• Single writer : the object is written by a single thread. Specifically, if
w is a sequential write on this object, this object presents single-writer
pattern when w happens. Exclusive access is a special case where the
object is accessed (written and read) by only one thread.
• Read only : no thread writes to the object.
Synchronization
This characterizes the execution order of accesses by different threads. When
the object is accessed by multiple threads and at least one thread is a writer,
the threads should be well synchronized to avoid data races. There are three
cases:
• Accumulator : the object accesses are mutually exclusive. The object is
updated by multiple threads concurrently, and therefore all the updates
should happen in a critical section. That is, the read/write should be
preceded by a lock and followed by an unlock. Java provides the syn-
chronized block and the synchronized method to implement the accumu-
lator pattern (see the sketch after this list).
• Assignment : the object accesses obey a precedence constraint. The
object is used to safely transfer a value from one thread to another.
The source thread writes to the object first, followed by the destination
thread reading it. Synchronization actions should be used to enforce
that the write happens before the read according to the memory model.
Java provides the wait and notify methods in the Object class to help
implement the assignment pattern.
• No synchronization: synchronization is unnecessary.
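The two synchronized patterns can be written down in a few lines of Java.
The classes below are our own illustration, not code from the application
suite of Chapter 8.

    // Accumulator: concurrent updates, each inside a critical section.
    class Sum {
        private int total;
        synchronized void add(int n) { total += n; }  // lock ... unlock
    }

    // Assignment: safely transfer a value from a source thread to a
    // destination thread; wait/notify enforce write-before-read.
    class Slot {
        private Integer value;
        synchronized void put(int v) {      // the source thread writes first
            value = v;
            notify();                       // wake the destination thread
        }
        synchronized int take() throws InterruptedException {
            while (value == null) wait();   // block until the write has happened
            return value;
        }
    }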
Repetition
This indicates the number of consecutive repetitions of an access pattern. It
is desirable that an access pattern repeat a number of times, so that the
GOS is able to detect the pattern based on the history information and
then apply the optimization on the re-occurrence of the pattern. Such a
pattern will appear on the right side of the adaptation point along the
repetition axis. The adaptation point is an internal threshold parameter in
the GOS. When the pattern repeats more times than the adaptation point
indicates, the corresponding adaptation will be automatically performed.
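Operationally, the adaptation point can be read as a simple counter
threshold, as in the following sketch; the threshold value and all names here
are illustrative only.

    class AdaptationPoint {
        static final int ADAPTATION_POINT = 3;       // illustrative threshold
        private int repetitions;

        void observeRepetition() {                   // called when the GOS sees
            if (++repetitions > ADAPTATION_POINT) {  // the same pattern again
                applyAdaptation();                   // e.g., migrate the home
            }
        }

        private void applyAdaptation() { /* protocol-specific action */ }
    }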
On the other hand, some important patterns appear on the left of the
adaptation point, such as the producer-consumer pattern. The producer-
consumer pattern is also called single assignment: the write must happen
before the read. However, in the producer-consumer pattern, after the object
is created, it is written and read only once, and then turns into garbage.
Chapter 4
Distributed-Shared Object
This chapter presents how the memory consistency issue is efficiently solved
by leveraging the concept of distributed-shared objects. We define distributed-
shared object and discuss the benefits it brings to our GOS. We then present
a lightweight mechanism for the detection of DSOs and the basic cache co-
herence protocol used in the GOS.
4.1 Definitions
In the JVM, connectivity exists between two Java objects if one object con-
tains a reference to the other. Therefore, we can view the whole object heap
as a connectivity graph, where vertices represent objects and edges represent
references. Reachability describes the transitive referential relationship
between a Java thread and an object based on the connectivity graph. An
object is reachable from a thread if its reference resides in the thread’s stack,
or if some path exists in the connectivity graph between this object and some
known reachable object.
By the escape analysis technique [38], an object reachable from only one thread is called a thread-local object. The opposite is a thread-escaping object, which is reachable from multiple threads. Thread-local objects can be separated from thread-escaping objects at compile time using escape analysis. The benefits of escape analysis are that the synchronization operations on thread-local objects can be safely removed, and that thread-local objects can be allocated on their threads' stacks instead of the heap to reduce the heap overhead.
In a distributed JVM, Java threads are distributed to different nodes, so
we need to extend the concepts of thread-local object and thread-escaping
object. We define the following.
• A node-local object (NLO) is an object reachable only from the thread(s) on a single node. It is either a thread-local object or a thread-escaping object.
• A distributed-shared object (DSO) is an object reachable from at least
two threads located at different nodes.
4.2 Benefits from DSO Detection
We introduce the concept of DSO to address both the memory consistency issue and the memory management issue in the GOS. We argue that the identification of DSOs benefits both memory consistency maintenance and memory management, i.e., distributed garbage collection.
4.2.1 Benefits on Memory Consistency Maintenance
The detection of DSOs can help reduce the memory consistency maintenance
overhead. According to the JVM specification, there are two memory consis-
tency problems in a distributed JVM. The first one, local consistency, exists
among working memories of threads and the main memory inside one node.
The second one, distributed consistency, exists among multiple main mem-
ories of different nodes. The issue of local consistency should be addressed
by any JVM implementation, whereas the issue of distributed consistency
is only present in the distributed JVM. The cost of maintaining distributed consistency is much higher than that of its local counterpart because of the communication incurred. As we have mentioned before, synchronization in Java is used not only to protect critical sections but also to enforce memory consistency. However, synchronization actions on NLOs do not need to trigger distributed consistency maintenance: all the threads that are able to acquire or release the lock of an NLO must reside on the same node, and therefore never observe distributed inconsistency.
Only DSOs are involved in distributed consistency maintenance since they
have multiple copies in different nodes. With the detection of DSOs, only
DSOs need to be visited to make sure that they are in a consistent state
during distributed consistency maintenance.
4.2.2 Benefits on Memory Management
According to the JVM specification, one vital responsibility of the GOS is
to perform automatic memory management in the distributed environment,
i.e., distributed garbage collection (DGC) [67]. The detection of DSOs also
helps improve the memory management in the GOS.
Since we detect DSOs at runtime, we are able to perform pointer translation across node boundaries, i.e., between local object addresses and objects' globally unique identifications (GUIDs), so that objects can reside at different memory addresses on different nodes.
In this way, the heap management of the nodes is totally decoupled. Each node performs independent memory management: the local garbage collector on each node can collect garbage objects asynchronously and independently, so global garbage collections can be postponed or reduced.
Moreover, all the nodes are coordinated to present a huge virtual heap.
We can calculate the aggregated heap size of our distributed JVM with the
following formula:
H = (1− d)hn + dh (4.1)
H — The aggregated heap size;
h — The heap size on each node;
n — The number of nodes;
d — The ratio of the local heap space that DSOs occupy to the total local
heap size. We presume DSOs will be replicated on all nodes.
Obviously, when the ratio of DSOs, i.e., d, is small, H ≈ hn.
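For illustration, with assumed (not measured) values h = 512 MB per node, n = 8 nodes, and d = 0.1, equation (4.1) gives H = 0.9 × 512 MB × 8 + 0.1 × 512 MB ≈ 3.7 GB, close to the full aggregate of 4 GB.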
4.3 Lightweight DSO Detection and Reclamation
In the distributed JVM, whether an object is a DSO or an NLO is determined by the relative locations of the object and the threads reaching it. Compile-time solutions, such as escape analysis, are not applicable, since the locations of objects and threads can only be determined at runtime. We propose a lightweight runtime DSO detection scheme that leverages Java's runtime type information to unambiguously identify pointers, i.e., object references in the Java context.
Java is a strongly typed language. Each variable, either an object field
that is in the heap, or a thread-local variable in some Java thread’s stack, is
associated with a type. The type is either a reference type or a primitive type
such as integer, char, or float. The type information is known at compile time
and written into class files generated by the compiler. At runtime, the class
subsystem builds up the type information from the class files. By looking
up the runtime type information, we can identify those variables that are
of the reference type. Therefore, object connectivity can be determined at
runtime. The object connectivity graph is dynamic since the connectivity
between objects may change from time to time through the reassignment of object fields.
DSO detection is performed when some JVM runtime data are to be transmitted across node boundaries: thread stack contexts for thread relocation, object contents for remote object access, or diff data for update propagation. On both the sending and the receiving sides, these data are examined to identify object references. A transmitted object reference indicates that the object is a DSO, since it becomes reachable from threads located at different nodes.
On the sending side, if the object behind an identified object reference has not been marked as a DSO, it is marked at this moment, and a globally unique identification (GUID) is assigned to it as its global name in the cluster-based JVM. Before being sent, all the object references are replaced by their GUIDs: since the copies of DSOs reside at different memory addresses on different nodes, local object references, i.e., memory addresses, do not make sense on other nodes.
In sending an object, the type information of all its fields can usually be determined from its class data structure. However, in some situations additional type information must be sent along to describe a field unambiguously: (a) if the field is an array, the array's size is sent, since the size is needed to shape the array and is a special field of the array object in Java; (b) if the field holds an instance of a subclass of the class type declared in the class, the subclass's type is sent, since Java allows a conversion from any class type S to any class type T provided that S is a subclass of T, and the subclass type cannot be determined from the type information loaded from the class file; (c) if the field is declared as an interface type, the field's actual implementing type is sent.
On the receiving side, all the GUIDs are replaced by their corresponding local object references. The receiver knows where a GUID should appear according to the type information. When a GUID first emerges, an empty object of the corresponding type is created and associated with it, so that the reference will not become a dangling pointer. The object's access state is set to invalid; when it is accessed later, its up-to-date content will be faulted in.
In this scheme, only those objects whose references appear in multiple
nodes will be identified as DSOs.
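The sending-side translation can be sketched as follows. This is a minimal illustration of ours (the actual GOS data structures may differ): assigning a GUID for the first time is exactly what marks an object as a DSO.

import java.util.IdentityHashMap;
import java.util.Map;

class GuidTable {
    private final Map<Object, Long> guids = new IdentityHashMap<>();
    private long nextGuid = 1;

    // Returns the GUID that replaces the local reference in the outgoing
    // message; the first call for an object marks it as a DSO.
    synchronized long guidFor(Object obj) {
        return guids.computeIfAbsent(obj, o -> nextGuid++);
    }

    synchronized boolean isDSO(Object obj) {
        return guids.containsKey(obj);
    }
}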
We detect DSOs in a lazy fashion. Since at any time it is unknown whether an object will ever be accessed by a reaching thread in the future, we choose to postpone the detection to as close to the actual access as possible, thus making the detection scheme lightweight.
To correctly reflect the sharing status of objects in the GOS, we rely on
distributed garbage collection to convert a DSO back to an NLO. If all the
cached copies of a DSO have become garbage, the DSO can be converted
back to an NLO. The distributed garbage collection will be discussed in
section 7.5.
An Example
Examining the case in Figure 4.1, a thread T1 prepares an object tree and then passes the reference of object c to another thread T2, as shown in the reachability graph (Figure 4.1.a).
When T2 is distributed to another cluster node, i.e., node 1, all the objects reachable from object c become DSOs. Objects a, b, and d are not DSOs since they are thread-local to T1. Instead of detecting all these objects as DSOs in one blow, we detect object c as a DSO and send it to node 1. Because objects e and f are directly connected with object c, we also detect them as DSOs but do not send them to node 1 (Figure 4.1.b). On node 1, we create two objects whose types are exactly the same as those of objects e and f. Since the contents of objects e and f are not available, we set their access states to invalid.
Next time when object f is accessed by T2 on node 1 (Figure 4.1.c), an
object fault will occur. An object request message will be sent to node 0.
Figure 4.1: The detection of distributed-shared object. (a) Reachability graph; (b) after thread T2 is distributed to Node 1; (c) access on f by T2 triggers detection of i.
This event will trigger the detection of object i as a DSO. The up-to-date content of object f is copied from node 0 to node 1. The details of how to maintain the coherence of objects replicated on multiple nodes are discussed in the next section. If object e is never accessed by T2, it remains invalid on node 1, and objects g and h will never be detected as DSOs.
4.4 Basic Cache Coherence Protocol
Our basic cache coherence protocol is a home-based, multiple-writer cache
coherence protocol. Figure 4.2 shows a state transition graph depicting the
lifecycle of an object from its creation to possible collection based on the
proposed DSO concept. At the right of the figure, the state transition graph
of the cache coherence protocol for DSOs at non-home nodes is shown. The
read/write arrows represent those happening on this object. The lock/unlock
arrows represent those happening on any object because lock/unlock actions
on other objects will also influence this object’s state according to JMM.
The lower part of the figure illustrates the interaction between the garbage
collection and the object’s states, which will be discussed in section 7.5.
When a DSO is detected, the node where the object is first created is
made its home node. The home copy of a DSO is always valid. A non-home
copy of a DSO can be in one of three possible access states: invalid, read
(read-only), or write (writable). Accesses to invalid copies of DSOs will fault
in the contents from their home node. Upon releasing a lock of a DSO,
all updated values to non-home copies of DSOs should be written to their
corresponding home nodes. Upon acquiring a lock, a flush action is required
to invalidate the non-home copies of DSOs, which guarantees that the most
up-to-date contents will be faulted in from the home nodes when they are
accessed later. Before the flush, all updated values to non-home copies of
DSOs should be written to the corresponding home nodes. In this way, a
thread is able to see the up-to-date contents of the DSOs after it acquires
the proper lock.

Figure 4.2: The state transition graph depicting object lifecycle in the GOS
A multiple-writer protocol permits concurrent writing to the copies of a
DSO, which is implemented using the twin and diff techniques [57]. On the
first write to a non-home copy of the DSO, a twin will be created, which is
an exact copy of the object. On lock acquiring and releasing, the diff, i.e.,
the modified portion of the object, is created by comparing the twin with
the current object content word by word, and sent to the home node. On
releasing a lock, after the diffs are sent out, the access states of the updated
objects should be changed from write to read to capture the future writes.
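The twin and diff operations can be sketched as follows, treating an object's content as an array of 32-bit words (a simplification of ours):

import java.util.ArrayList;
import java.util.List;

class TwinDiff {
    // Each diff entry is a (word index, new value) pair.
    static List<int[]> createDiff(int[] twin, int[] current) {
        List<int[]> diff = new ArrayList<>();
        for (int i = 0; i < current.length; i++)
            if (current[i] != twin[i])           // word-by-word comparison
                diff.add(new int[] { i, current[i] });
        return diff;                             // propagated to the home node
    }

    // Executed at the home node to merge the remote writes into the home copy.
    static void applyDiff(int[] homeCopy, List<int[]> diff) {
        for (int[] entry : diff)
            homeCopy[entry[0]] = entry[1];
    }
}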
Since a lock can be considered a special field of an object, all the operations on a lock, including acquire and release, as well as wait and notify, which are methods of the Object class, are executed at the object's home node. Thus, the object's home node acts as the object's lock manager. The
detailed design and implementation of distributed synchronization will be
discussed in section 7.2.
With the availability of object type information, it is possible to invoke
different coherence protocols according to the type of the objects, as shown in
table 4.1. For example, immutable objects, such as instances of class String,
Integer, and Float, can be simply replicated and treated as an NLO. Some
objects are considered node-dependent resources, such as instances of class
File. When node-dependent objects are detected as DSOs, object replication
should be prohibited. Instead, accesses to them should be transparently
redirected to their home nodes. This is an important issue in the provision
of a complete single system image to Java applications.
Type                        Characteristics                  Protocol
java.lang.Thread            Represents Java thread.          On creation, choose a running
                                                             node for load balance.
java.lang.String,           Immutable object.                Simply replicated and treated
java.lang.Integer, etc.                                      as an NLO.
java.io.File, etc.          Represents node-dependent        No replication. Accesses will be
                            resources.                       transparently redirected to
                                                             their home node.
Primitive array, such as    Contains no object references.   DSO detection is disabled.
float[ ], int[ ], etc.

Table 4.1: Coherence protocols according to object type
Chapter 5
Adaptive Cache Coherence
Protocol
Scientific applications usually exhibit diverse memory access patterns, and the performance of various cache coherence protocols is application-dependent: the application's inherent memory access patterns determine the most suitable protocol. This inspires us to pursue an adaptive cache coherence protocol to further improve the performance of our GOS. An adaptive cache coherence protocol is able to detect the current access pattern and adjust itself accordingly.
Based on the access pattern space, we present several adaptations incor-
porated into our basic home-based multiple-writer cache coherence protocol
in three respective situations in the access pattern space: (1) object home
migration [45] which optimizes the single-writer access pattern by moving
the object’s home to the writing node according to the access history; (2)
synchronized method migration which chooses between default object (data)
movement and optional method (control flow) movement in order to optimize
the execution of critical section methods according to some prior knowledge;
(3) connectivity-based object pushing which scales the transfer unit to opti-
mize the producer-consumer access pattern according to the object connec-
tivity information.
5.1 Adaptive Object Home Migration
As a state-of-the-art DSM system, TreadMarks [57] adopts a multiple-writer cache coherence protocol to implement lazy release consistency (LRC). TreadMarks uses the twin and diff techniques to support multiple processes writing simultaneously to the same shared virtual memory page in the presence of false sharing. The protocol is considered homeless because the diffs are saved and managed at each process.
Although TreadMarks' homeless protocol can greatly alleviate the false sharing problem, it may still suffer from heavy communication and protocol overheads. To serve a page fault, the faulting process has to fetch the diffs from every process that has updated the page before the fault, according to LRC, which causes multiple round-trip messages. Each diff needs to be applied once at every process that fetches it, which amounts to a large overhead. In addition, the diffs could consume a lot of memory, and cleaning up the useless diffs may trigger a global garbage collection.
In order to address the above problems, a home-based protocol implementing LRC, called HLRC, was proposed [55]. In the home-based protocol, each shared coherence unit has a home to which all writes (diffs) are propagated and from which all copies are derived. It has been shown that the home-based protocol is more scalable than the homeless protocol: it maintains a simpler state, sends fewer messages, has a lower diff overhead, and consumes much less memory.
The asymmetry between the home copy and non-home copies in home-
based protocols raises the home assignment problem. In home-based proto-
cols, the home copy is always valid. Accesses at the home node never incur
communication overhead, while accesses at non-home nodes will trigger com-
munication with the home node. Therefore, the choice of the home node changes the coherence data communication pattern and influences the application performance. In fact, the optimal home assignment is determined
by the memory access patterns of the application. This inspires some dy-
namic home assignment protocols that are able to adapt to runtime memory
access patterns [51, 35, 78, 44].
In DSM applications, the single-writer access pattern occurs when the shared coherence unit is updated by only one process for a certain period; this does not prohibit the shared coherence unit from being read by multiple processes at the same time. A few research projects [51, 23, 64] have demonstrated that the single-writer pattern is common in DSM applications. In our GOS, we propose a novel home migration protocol to optimize the single-writer pattern. We target only the single-writer pattern because home migration makes little difference in the multiple-writer situation as long as the home node is one of the writers.
At runtime, an object can exhibit different access patterns during its lifetime. For example, an object can be updated by multiple writers concurrently and then by a single writer exclusively; or an object can be updated by different writers sequentially, each persisting for some time. Since home migration requires that the other processes be informed of the new home, improper home migrations degrade performance by introducing a host of new-home notification messages. Therefore, it is a challenge to exploit the single-writer property as much as possible while maintaining an acceptable level of home migration overhead.
5.1.1 Home Migration Concepts
Figure 5.1 illustrates the home-based multiple-writer protocol that imple-
ments LRC. In the figure, X represents some shared coherence unit, which
could be either an object or a virtual memory page. Its home is at the proces-
sor where process P2 resides. Assuming the write on X performed by process
P1 causes a fault, because either the local cached copy is outdated according
to LRC or X is not cached at all, P1 will then fault in the valid copy from X's home, P2. Before P1 can write to the newly fetched copy, it needs to create a twin, which is simply a copy of X. Later, when P1 releases the lock, it eagerly creates the diff, i.e., the difference between the current X and the previously saved twin, and sends the diff to the home, where it is applied to the home copy of X.

Figure 5.1: Home-based Protocol for LRC with multiple-writer support
If P1 is the only writer of X, we can migrate X's home from P2 to P1 to avoid the communication overhead (faulting in the shared data and propagating the diff), the diff overhead (creating and applying the diff), and the memory consumption caused by the twin and the diff. On the other hand, if both P1 and P2 write to X, it does not matter which node becomes the home.
Home Location Notification Mechanism
We assume there is a way to determine the initial home for each unit; for example, all units are initially assigned a home node by a well-known hash function. If the home of a shared coherence unit is subject to migration, a home miss could happen, i.e., a process may visit an obsolete home. Therefore, we need some mechanism to inform the other nodes of the new home location. There are three mechanisms: broadcast, home manager, and forwarding pointer.
Broadcast After home migration, the new home location is broadcast to all
the nodes.
Home Manager The most updated home location of a unit is always recorded
in a designated manager node, which is known to all nodes. On home
migration, the new home location is posted to the manager node. On
home miss, a process can visit the manager node to find out where the
current home is.
Forwarding Pointer On home migration, a forwarding pointer is left in
the former home to point to the new home. On home miss, a process
can always be redirected to the current home via the given forwarding
pointer.
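The redirection behavior of the forwarding pointer mechanism can be sketched as follows; this is a simplified single-process model of ours, with the distributed lookup collapsed into a map from each node to the node it believes is the home:

import java.util.Map;

class ForwardingLookup {
    // forward.get(n) is the node that n believes is the home;
    // the true home points to itself.
    static int findHome(Map<Integer, Integer> forward, int start) {
        int node = start;
        int redirections = 0;
        while (forward.get(node) != node) { // follow the forwarding chain
            node = forward.get(node);
            redirections++;                 // redirection accumulation
        }
        return node;
    }
}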
With the broadcast and home manager mechanisms, it is possible that the broadcast, or the update to the manager, happens only after some node has tried to fault in a copy from the home node: the former home is already obsolete, but the new home is not yet known. This situation needs to be handled carefully, for example by waiting for some time before retrying the fault-in. Notice that this situation cannot happen with the forwarding pointer mechanism.
Which of the three mechanisms is superior depends on the memory access patterns of the application and on how frequently home migration occurs. If, after a home migration, all the other nodes need to visit the new home, the broadcast mechanism is superior to the others, because a well-implemented broadcast operation is an efficient way of notifying all nodes; otherwise, the broadcast may cause a large overhead. The merit of the forwarding pointer mechanism is that it does not need to broadcast the new home location on home migration. However, the redirection effect may cascade: multiple home migrations may form a distributed chain of home forwarding pointers, so a process may be redirected multiple times before coming upon the current home. This effect, called redirection accumulation, can cause significant overhead when home migration happens frequently. The manager mechanism strikes a balance between the home notification cost and the home miss cost. However, on a home miss, the process needs to visit the old home, the manager, and the new home in sequence, which is heavyweight compared with the broadcast mechanism and with the forwarding pointer mechanism in the absence of redirection accumulation.
5.1.2 Home Migration with Adaptive Threshold
In order to detect the single-writer access pattern, the GOS monitors all home accesses as well as non-home accesses at the home node. Under the cache coherence protocol, an object request can be considered a remote read, and a diff received at a synchronization point a remote write. To monitor the home accesses, the access state of the home copy is set to invalid on acquiring a lock and to read-only on releasing a lock, so that home access faults can be trapped and control returned after the access is recorded. We call the write fault at the home node the home write, and the read fault at the home node the home read.
At the home node, we define an object’s consecutive remote writes to be
those issued from the same remote node and not interleaved with the writes
from either the home node or other remote nodes. Note that under the Java
memory model, the remote writes are only reflected to the home node on
synchronization points. Therefore the number of consecutive writes is the
number of synchronizations during which the object is only updated by that
node. At runtime, the GOS continuously monitors consecutive remote writes
for each object. We also introduce a predefined home migration threshold
that represents some prior knowledge of the single-writer pattern. We follow the heuristic that an object is in the single-writer pattern if the number of consecutive remote writes exceeds the home migration threshold. Once the single-writer pattern is detected, the next time the object is requested by the writing node, not only is the object sent in reply, but its home is also migrated. We adopt the forwarding pointer mechanism to notify the other nodes of the new home location: when an obsolete home node is requested for an object, it simply replies with the location of the valid home node.
However, this protocol is still not satisfactory. Above all, it is difficult to choose the fixed home migration threshold. If it is too large, implying a lazy migration policy, the home migration will be less sensitive to the single-writer pattern, causing unnecessary remote access overhead; if the home could be migrated earlier, more remote accesses could be transformed into local accesses. On the contrary, if the threshold is too small, it implies an eager migration policy: although sensitive to the single-writer pattern, it is less capable of avoiding unnecessary home migrations. If the single-writer pattern is transient, i.e., it repeats only a very limited number of times, the threads on the new home node may not perform any more accesses after the home migration; the home migration then gains no performance improvement but suffers the home redirection overhead. We observe that the transient single-writer pattern is not worth a home migration: the home migration protocol should capitalize on the lasting single-writer pattern.
The challenge here is to choose a threshold that yields both sensitivity and robustness with respect to the single-writer pattern. By robustness we mean taking no migration action on the transient single-writer pattern; by sensitivity we mean responding promptly to the lasting single-writer pattern. Furthermore, different objects may well have different access behaviors, so it is more reasonable to use different thresholds for different objects.
Based on the above discussion, we propose a home migration protocol
with an adaptive threshold. The adaptive threshold decreases monotonically with increased likelihood that an object presents the lasting single-writer pattern; a lower threshold allows home migration to happen more quickly. The adaptive threshold is continuously adjusted at runtime according to the feedback of previous home migration decisions for each object.
Runtime Feedback
In order to measure the feedback of previous home migration decisions, the
GOS will observe exclusive home writes and redirected object requests at
runtime.
We define an exclusive home write to be a home write with no remote write between it and the previous home write. Clearly, exclusive home writes reflect the single-writer pattern happening at the home node, so they represent positive feedback on previous home migration decisions.
A redirected object request reflects the home redirection effect due to home migration; it represents negative feedback on previous home migration decisions. Redirected object requests take redirection accumulation into account: for example, if an object request is redirected three times before reaching the current home node, it counts as three redirected object requests instead of one.
In addition, exclusive home writes and redirected object requests are associated with different costs. The home redirection overhead, measured by redirected object requests, equals the round-trip time of a unit-sized message. The benefit of home migration comes from eliminated pairs of object fault-ins and diff propagations; it is measured by exclusive home writes and is related to the object size. We therefore introduce the home access coefficient, the overhead ratio of one eliminated pair of object fault-in and diff propagation to one home redirection. Here we mainly consider the communication overhead.
Formalization
We formalize the idea of object home migration with adaptive threshold as
follows. For each object, we have:
• $C_i$: the number of consecutive remote writes since the $(i-1)$th home migration.
• $T_i$: the value of the adaptive home migration threshold since the $(i-1)$th home migration.
• $T_{\mathrm{init}}$: the initial threshold, which is set to 1.
• $R_i$: the number of redirected object requests since the $(i-1)$th home migration.
• $E_i$: the number of exclusive home writes since the $(i-1)$th home migration.
• $\alpha$: the home access coefficient.
• $m_{1/2}$: the half-peak length in bytes, which is the message length required to achieve half of the asymptotic bandwidth [50].
A home migration decision is taken when the following condition is met:
$$C_i = T_i \quad (5.1)$$
The adaptive home migration threshold $T_i$ is calculated by
$$T_i = \max\{\,T_{i-1} + R_i - \alpha E_i,\ T_{\mathrm{init}}\,\} \quad (5.2)$$
where
$$T_0 = T_{\mathrm{init}} = 1 \quad (5.3)$$
and
$$\alpha \approx 2 + \left\lfloor \frac{\mathrm{sizeof}(object)}{m_{1/2}} \right\rfloor \quad (5.4)$$
Equation (5.2) is the core of the above equations; it determines the adaptive home migration threshold. Both the positive feedback (exclusive home writes) and the negative feedback (redirected object requests) of previous home migrations affect the current threshold. The positive feedback tends to indicate that the object presents a lasting single-writer pattern and thus decreases the threshold; remember that the threshold decreases monotonically with increased likelihood of the lasting single-writer pattern. The negative feedback tends to indicate that the object presents a transient single-writer pattern and thus increases the threshold. We also take the home access coefficient into account. Whenever the home migration condition, i.e., Equation (5.1), is met, a home migration takes place. All these computations are done by the GOS at the home node of the object.
The initial threshold is set to 1 in order to speed up the initial data
relocation if possible. It is possible that the initial data layout is not opti-
mal with respect to the data access behavior, particularly when the writing
nodes of single-writer objects are not their home nodes. A small initial home
migration threshold could alleviate this situation. We rely on the adaptive
threshold mechanism to adjust the threshold automatically after the initial
home migration.
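The per-object bookkeeping implied by equations (5.1) through (5.3) can be sketched as follows; the event hooks and the simplified handling of interleaved writers are our own assumptions:

class HomeMigrationState {
    static final int T_INIT = 1;

    private int threshold = T_INIT;      // T_i
    private int consecutiveRemoteWrites; // C_i
    private int redirectedRequests;      // R_i, negative feedback
    private int exclusiveHomeWrites;     // E_i, positive feedback
    private int lastWriterNode = -1;

    void onRedirectedRequest()  { redirectedRequests++; }
    void onExclusiveHomeWrite() { exclusiveHomeWrites++; }

    // Called at the home node when a diff from a remote node arrives;
    // returns true when the home should migrate to that node.
    boolean onRemoteWrite(int writerNode, int alpha) {
        if (writerNode != lastWriterNode) { // interleaved writer: restart count
            lastWriterNode = writerNode;
            consecutiveRemoteWrites = 0;
        }
        consecutiveRemoteWrites++;
        if (consecutiveRemoteWrites < threshold)
            return false;
        // Condition C_i = T_i met: migrate, then compute the next
        // threshold from the feedback counters, as in equation (5.2).
        threshold = Math.max(threshold + redirectedRequests
                - alpha * exclusiveHomeWrites, T_INIT);
        consecutiveRemoteWrites = 0;
        redirectedRequests = 0;
        exclusiveHomeWrites = 0;
        return true;
    }
}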
The Deduction of Home Access Coefficient
Hockney [50] has proposed a model to characterize the communication time
(in µs) for a point-to-point operation as follows, where the communication
overhead, $t(m)$, is a linear function of the message length $m$ (in bytes):
$$t(m) = t_0 + \frac{m}{r_\infty} \quad (5.5)$$
$t_0$ is the start-up time in µs; $r_\infty$ is the asymptotic bandwidth in MB/s.
Recall that home access coefficient is the overhead ratio of one eliminated
pair of object fault-in and diff propagation to one home redirection. Here we
mainly consider the communication overhead. We assume the object size is
o, the diff size is d, and the home redirection is a unit-sized message. Then
we have:
$$\alpha = \frac{(t_0 + o/r_\infty) + (t_0 + d/r_\infty)}{t_0 + 1/r_\infty} \quad (5.6)$$
$$= \frac{2 t_0 r_\infty + (o + d)}{t_0 r_\infty + 1} \quad (5.7)$$
The half-peak length, denoted by $m_{1/2}$ bytes, is the message length required to achieve half of the asymptotic bandwidth. It can be derived using the relationship:
$$m_{1/2} = t_0 r_\infty \quad (5.8)$$
Based on $m_{1/2} \gg 1$ and $o > d$, we derive equation (5.4). We restate it here:
$$\alpha \approx 2 + \left\lfloor \frac{o}{m_{1/2}} \right\rfloor \quad (5.9)$$
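For illustration, with assumed values $m_{1/2} = 4$ KB and an object of size $o = 10$ KB, equation (5.9) gives $\alpha \approx 2 + \lfloor 10240/4096 \rfloor = 4$; one eliminated pair of object fault-in and diff propagation for this object is then worth about four home redirections.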
5.2 Synchronized Method Migration
Synchronized method migration is not meant to directly optimize synchronization-related access patterns such as assignment and accumulator. Instead, it optimizes the execution of the synchronized method itself, which is usually related to those access patterns.
Java’s synchronization primitives, including synchronized block, as well
as the wait and notify methods of the Object class, are originally designed
for thread synchronization in a shared memory environment. The synchronization constructs built upon them are inefficient in a distributed JVM implemented on a distributed memory architecture such as a cluster.
Figure 5.2 shows the skeleton of a Java implementation of the barrier func-
tion. The execution cannot continue until all the threads have invoked the
barrier method. We assume the instance object is a DSO and the node
invoking barrier is not its home node. On entering and exiting the syn-
chronized barrier method, the invoking node will acquire and then release
the lock of the barrier object, while maintaining distributed consistency. In
line 8, the barrier object will be faulted in. It is a common behavior that
the locked object’s fields will be accessed in a synchronized method. In line 9
and line 11, the synchronization requests wait and notifyAll respectively,
will be issued. The wait method will also trigger an operation to main-
tain distributed consistency according to the JMM.1 Therefore, there are
four synchronization or object requests sent to the home node, and multiple distributed consistency maintenance operations involved.
We propose synchronized method migration to reduce the communication and consistency maintenance overhead of executing synchronized methods at non-home nodes. On synchronized method migration, instead of invoking the method locally, the synchronized object's GUID, the method's index in the dispatch table, and the arguments of the method are sent to the home node of the synchronized object, and the method is executed there. The method's return value, if any, is sent back so that the execution at the non-home node can continue.
While object shipping is the default behavior in the GOS, we apply
method shipping particularly to the execution of synchronized methods of
DSOs. With the detection of DSOs, this adaptation is feasible in our GOS.
The synchronized method migration code is generated at JIT compilation
time. All the non-synchronized methods are untouched so that they can go
1According to the JMM, wait behaves as if the lock is released first and acquired later.
class Barrier {
    private int count;   // the number of threads to barrier
    private int arrived; // currently arrived threads

    public Barrier(int numOfThreads) {
        count = numOfThreads;
        arrived = 0;
    }

    public synchronized void barrier() {
        try {
            if (++arrived < count)
                wait();
            else {
                notifyAll();
                arrived = 0;
            }
        } catch (Exception e) {
            // handle the synchronization exception
        }
    }
}
Figure 5.2: Barrier class
at full speed. A code stub is inserted at the beginning of each synchronized method; it includes a condition check to decide whether the current execution needs migration, and the actual code to perform the synchronized method migration.
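The logic of the stub can be sketched as follows. The runtime interface below is hypothetical (the real stub is emitted as native code by the JIT compiler), but it captures the condition check and the shipping step:

interface GosRuntime {
    boolean isDSO(Object o);
    boolean isHomeLocal(Object o);
    long guidOf(Object o);
    Object invokeAtHome(long guid, int methodIndex, Object[] args);
}

class SyncMethodStub {
    // Invoked at the head of a synchronized method on object self: if the
    // object is a DSO whose home is remote, ship the GUID, the method's
    // dispatch-table index, and the arguments to the home node.
    static Object maybeMigrate(GosRuntime gos, Object self,
                               int methodIndex, Object[] args) {
        if (gos.isDSO(self) && !gos.isHomeLocal(self))
            return gos.invokeAtHome(gos.guidOf(self), methodIndex, args);
        return null; // sentinel meaning: fall through to the local body
    }
}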
The method shipping will cause the workload to be redistributed among
the nodes. However, the synchronized methods are usually short in terms
of the execution time; therefore, synchronized method migration will not
significantly affect the load distribution in the distributed JVM.
5.3 Connectivity-based Object Pushing
Some important patterns, such as the single-writer pattern, tend to repeat for
a considerable number of times, therefore giving the GOS the opportunity
to detect the pattern using history information. However, there are some
significant access patterns that do not repeat, which cannot be detected by
using access history information.
Connectivity-based object pushing is applied in our GOS to the situations where no history information is available. Essentially, object pushing is a prefetching strategy that takes advantage of object connectivity information to pre-store, with good accuracy, the objects to be accessed by a remote thread, thereby minimizing the network delay in subsequent remote object accesses.
Connectivity-based object pushing actually improves reference locality. It is useful in applications with fine-grained object sharing.
The producer-consumer pattern is one of the patterns that can be opti-
mized by connectivity-based object pushing. Similar to the assignment pat-
tern, the producer-consumer pattern obeys the precedence constraint. The
write must happen before the read. However, in the producer-consumer pat-
tern, after the object is created, it is written and read only once, and then
turned into garbage. Therefore, producer-consumer is single-assignment.
The producer-consumer pattern is popular in Java programs. Usually, in a
producer-consumer pattern, one thread produces an object tree, and prompts
another consuming thread to access the tree. In the distributed JVM, the
consuming thread suffers from network delay when requesting objects one by
one from the node where the object tree resides.
In order to apply connectivity-based object pushing, we follow the heuristic that after an object is accessed by a remote thread, all the objects reachable from it in the connectivity graph may be “consumed” by that thread afterwards. Therefore, upon a request for a specific DSO in the object tree, the home node pushes all the objects reachable from it to the requesting node.
Object pushing is better than pull-based prefetching, which relies on the requesting node to specify explicitly which objects to pull according to the object connectivity information. A fatal drawback of pull-based prefetching is that the connectivity information contained in an invalidated object may be obsolete, so the prefetching accuracy is not guaranteed: unneeded objects, even garbage objects, may be prefetched, wasting communication bandwidth. On the contrary, object pushing gives more accurate prefetching, since the home node has the up-to-date copies of the objects and the connectivity information at the home node is always valid.
In our implementation, we rely on an optimal message length, the preferred aggregate size of the objects to be delivered to the requesting node. Objects reachable from the requested object are copied to the message buffer until the current message length exceeds the optimal message length. We use a breadth-first search algorithm to select the objects to be pushed, as sketched below. If the pushed objects are not DSOs yet, they will be detected as such; in this way, DSOs are eagerly detected during object pushing.
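The selection can be sketched as follows; the Node view of a heap object is our own simplification:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class ObjectPusher {
    interface Node {
        int size();              // object size in bytes
        List<Node> references(); // directly referenced objects
    }

    // Breadth-first selection bounded by the optimal message length.
    static List<Node> selectForPush(Node requested, int optimalMessageLength) {
        List<Node> toPush = new ArrayList<>();
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(requested);
        seen.add(requested);
        int messageLength = 0;
        while (!queue.isEmpty() && messageLength <= optimalMessageLength) {
            Node obj = queue.poll();
            toPush.add(obj);                // detected as a DSO if not one yet
            messageLength += obj.size();
            for (Node ref : obj.references())
                if (seen.add(ref))          // visit each object at most once
                    queue.add(ref);
        }
        return toPush;
    }
}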
Since object connectivity information does not guarantee that future accesses are bound to happen, object pushing also risks sending unneeded objects. We disable object pushing upon a request for an array of reference type, e.g., a multi-dimensional array, since such an array usually represents some workload shared among threads, with each thread accessing only a part of it.
Chapter 6
Object Access Pattern
Visualization
We design and implement a visualization tool called PAT (Pattern Analysis
Tool) that can be used to visualize object access traces and analyze object
access patterns in our GOS.
PAT is useful in two respects. For protocol designers, such a tool can expose the inherent memory access patterns inside a benchmark application, and thus enables the evaluation of the adaptive protocol's effectiveness in reducing the number of network-related memory operations, as well as of the protocol's pattern detection mechanism. It can reveal how frequently a particular memory access pattern appears in an application, and how well a particular adaptation optimizes a target memory access pattern.
On the other hand, it can help the application developer plan the initial data layout and runtime data relocation. Since DSM systems tend to hide the communication details from application developers, performance tuning is rather difficult, if not impossible. With PAT, the parallel application developer is able to discover the performance bottlenecks in the application by observing its memory access behavior. He may then redesign
the algorithm to avoid some heavyweight memory access patterns. In this aspect, PAT plays the role of a profiling tool.

Figure 6.1: PAT architecture. At runtime, each node of the distributed JVM writes its own log; postmortem, the logs are merged into one object access events log, which the pattern analysis engine (lifetime pattern analyzer, global phase pattern analyzer, producer-consumer analyzer, and other pattern analyzers) processes; the pattern visualization component presents the results through a pattern window, a timeline window, and a source code window, mapping patterns to access events and to allocation sites.
PAT comprises three components: the object access trace generator (OATG)
that is plugged into the distributed JVM, the pattern analysis engine (PAE),
and the pattern visualization component (PVC), as shown in figure 6.1.
OATG gathers object access information at runtime. Improper runtime logging could introduce intolerable overhead and interruptions to the application being traced, making the logging unacceptable; for example, the recorded memory access behavior could differ considerably from the behavior without logging because of the interruptions caused by heavyweight logging. To tackle this problem, OATG was designed to be lightweight. It activates recording only on DSOs. Logs are stored in a memory structure and flushed to the local disk at synchronization points or when the buffer is full. The just-in-time compiler is used to instrument only the methods the user is interested in; all the other methods execute at full speed.
PAE is used to discover knowledge concerning patterns from the raw ac-
cess information collected by OATG. After an application’s execution, the
global (of all the processes) and complete (the entire lifetime of the applica-
tion) access information can be compiled, based on which an analysis of the
object access patterns is carried out precisely and thoroughly.
PVC uses a pattern-centric representation to visualize object access pat-
terns. It can display the global and complete access pattern information. In
addition, for objects of interest to the user, it can associate access patterns
with the source code lines that create the corresponding objects—referred
to as allocation sites. The object access patterns can be further mapped to
low-level object access operations.
StormWatch [37] is a profiling tool that visualizes the execution of DSM systems and links it to the program's source code. StormWatch provides three linked graphic views: trace, communication, and source. The trace and communication views together reflect the low-level access operations in the execution. The major difference between our tool and StormWatch is that StormWatch focuses only on the low-level access operations, which may not provide straightforward and intuitive information to the user, whereas our pattern analysis and visualization system provides access pattern knowledge that, as high-level information, is more helpful to the user.
Xu et al. described a profiling approach for DSM systems in [81]. It can detect and visualize some cache-block-level access patterns. However, as an online tool, it suffers from the memory and time constraints of runtime analysis. For example, it can only show the lifetime access pattern that a certain cache block presents over the whole execution time; pattern changes cannot be expressed, because recording each pattern change per cache block would be too expensive in memory. This is neither flexible nor precise. On the contrary, our approach is postmortem, so we can invest as much effort as affordable to precisely and thoroughly analyze the access patterns
after the execution.
6.1 Object Access Trace Generator
OATG uses several techniques to achieve lightweight runtime logging of memory access information.
Firstly, it relies on the Java memory model to carefully choose the memory
access operations to be logged. Figure 6.2 shows all the memory access
operations in the GOS, with only those access types in bold font being logged.
In the GOS, we focus on DSOs since only they will incur communica-
tion overheads. Consequently, we are only interested in the access patterns
presented by DSOs. On non-home nodes, the object faulting-in and diff prop-
agation can represent the reads and writes on the cached copy, respectively.
Similarly, the home read fault and home write fault can represent all the
reads and writes happening in the home node, respectively. All these remote
and home reads/writes, together with synchronization operations on objects
and synchronized methods, constitute the object’s access behavior.
Secondly, we are usually interested not only in the access operations themselves, but also in the relationship between them and other program states. For example, we may want to know what the object access behavior is inside a particular Java method, or we may want to log a method that implements barrier synchronization among all threads, so as to observe the object access operations against the barrier synchronization.
To address the above requirement, OATG leverages the just-in-time compiler in the distributed JVM to dynamically instrument translated Java method code to log the operations of interest. PAT allows the user to provide a list of Java method signatures1 of interest to the distributed JVM. During the just-in-time compilation, the signature of the method to be translated is
compared against the user-provided list. If there is a match, the just-in-time compiler will insert the log code at both the start and the end of the method.
1The format of Java method signature is defined in the JVM specification.

Figure 6.2: Memory access operations in the GOS. On node-local objects: read, write, and synchronization (lock, unlock, wait, notify). On distributed-shared objects, reads and writes issued on non-home nodes comprise the remote read (object faulting-in from the home node), the remote write (diff propagation to the home node), and reads/writes on the cached copy; reads and writes issued on the home node comprise the home read (home read fault), the home write (home write fault), and other reads/writes on the home copy; synchronization (lock, unlock, wait, notify) and synchronized methods complete the picture. The logged operations are the remote and home reads/writes, the synchronization operations on DSOs, and synchronized methods.
In doing so, the user is able to choose which method operations to log. All the other methods are left untouched and operate at full speed. If the just-in-time compiler were not used, we would have to instrument every method in advance, since each method could potentially be an operation of interest to the user, and the overall slowdown could be significant.
We make use of some source code of the logging facility in MPE (the Multi-Processing Environment of MPICH) [34] for collecting the access logs. However, our logging facility does not require MPI support during logging. It is implemented as a library and linked against the distributed JVM. At runtime, each process of the distributed JVM independently generates its own log. The log records are first buffered in local memory and then dumped to the local disk at synchronization points or when the memory buffer is full. After the multi-threaded Java program exits, an MPI program merges all the local logs into one log file according to the time stamps. We rely on the Network Time Protocol (NTP) [63] to synchronize the clocks of the cluster nodes; the time offset between cluster nodes can be adjusted to less than one millisecond. When merging the node-local logs, the time stamps are further tuned by calculating the current time offset.
6.2 Pattern Analysis Engine
In the analysis engine, many independent modules can sequentially read the same log. Each module is responsible for detecting one or several related access patterns. The access pattern analysis results from all the modules are fed into the pattern visualization component, which will be discussed in the next section. The engine is extensible in the sense that we can plug in new modules to detect any precisely defined access pattern. Currently there are two analysis modules in place: the lifetime pattern analyzer and the global phase pattern analyzer.
The lifetime pattern analyzer detects, for each DSO, the access pattern that is fixed over the object's whole lifetime: it checks whether an object presents the read-only, single-writer, or multiple-writer pattern throughout its lifetime.
The global phase pattern analyzer works for applications adopting the phase parallel paradigm (section 12.1.1 of [54]), as shown in figure 6.3. In this paradigm, every thread does some computation before arriving at a barrier; after all the threads arrive at the barrier, they can continue to the next computation phase. Two consecutive barriers define a global synchronization phase agreed on by all threads. This is a very common paradigm in parallel programming. The global phase pattern analyzer checks whether an object presents the read-only, single-writer, or multiple-writer pattern in each global synchronization phase. The barrier, as a synchronized Java method, is logged as a special operation at runtime. If the application does not follow the phase parallel paradigm, i.e., no barrier operations are found in the log, the global phase pattern analyzer simply ignores the log. Detecting the read-only, single-writer, and multiple-writer patterns in the log is done straightforwardly by counting the number of writers among all the accesses on the object during each phase, as sketched below.
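The per-phase classification can be sketched as follows (our own illustration):

import java.util.Set;

class PhasePatternClassifier {
    enum Pattern { READ_ONLY, SINGLE_WRITER, MULTIPLE_WRITER }

    // writerNodes: the distinct nodes that wrote to the object during
    // one global synchronization phase.
    static Pattern classify(Set<Integer> writerNodes) {
        if (writerNodes.isEmpty())   return Pattern.READ_ONLY;
        if (writerNodes.size() == 1) return Pattern.SINGLE_WRITER;
        return Pattern.MULTIPLE_WRITER;
    }
}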
6.3 Pattern Visualization Component
The presentation consists of three windows: a time lines window displaying the low-level access operations, a pattern result window revealing the object access patterns, and a source code window displaying the application's source code. The time lines window also reflects the overall access operations incurred in the execution.
The time lines window, as shown in figure 6.4, provides a complete exe-
cution picture on 8 cluster nodes for an application called SOR. The x-axis
represents the time. In the y-axis direction, there are 8 time lines in the fig-
ure, representing 8 threads, one thread on each node in this experiment.

Figure 6.3: Phase parallel paradigm (each thread computes, all threads meet at a barrier synchronization, and the cycle repeats for the next phase)

The rectangles on the time lines show some states, e.g., barrier synchronization
in this case. The arrows show the object access operations: those in green are writes and those in white are reads. Furthermore, an arrow that starts on one thread's time line and ends on another thread's time line represents a remote read (object faulting-in) or a remote write (diff propagation): it is issued by the thread at the arrow's starting time line, and the corresponding home node is the node at the arrow's ending time line. Arrows overlapping with the time lines are home reads or home writes. We can click any arrow to see detailed information about that object access, e.g., the class name, size, and ID of the object. The time lines can be zoomed out to get an overall picture of the access behavior, or zoomed in to examine particular object accesses. We implement the
time lines window by modifying Jumpshot in MPE [34].
Moreover, clicking the “Pattern Analysis” button in the time lines window
will trigger the pop-up of the pattern result window, as shown in figure 6.5.
As SOR is a barrier synchronized application, the global phase pattern ana-
lyzer can provide the pattern analysis result for each object.

Figure 6.4: The time lines window

The objects are first sorted by their allocation sites, i.e., the places in the source code where they are created. Each allocation site may create many objects at runtime. For each
object, its access pattern at each phase is displayed. As observed from the
analysis result, most objects in SOR present the single-writer access pattern.
For example, in figure 6.5, the object being observed presents the read-only
and the single-writer pattern in alternating phases.
The pattern result window is at the center of the visualization. Inside this window, we can choose any object to highlight its accesses in the time lines window. Thus we provide a convenient association between the high-level access pattern knowledge and the low-level access operation details.
Since the objects are sorted by their allocation sites in the pattern analysis
result window, we can map any object to its actual allocation site in the
application’s source code by clicking it, as shown in figure 6.5. Note that
the highlighted line in the source code window is the actual position of the highlighted allocation site in the pattern analysis result window.

Figure 6.5: The window of the object access pattern analysis result (the bigger one) and the window of the application's source code (the smaller one)

Thus we
provide a convenient association between the object access pattern and the
object’s corresponding allocation site in the source code.
In such a design, our visualization tool not only helps us, the GOS de-
signer, to visually evaluate the effectiveness of the adaptive protocol being
applied, but also helps the multi-threaded Java application programmer to
better understand the access behavior inherent in the program.
Chapter 7
Implementation
In this chapter, we discuss several implementation details in the cluster-based
JVM.
7.1 JIT Compiler Enabled Native Instrumentation
In DSM, shared data units have different access states, such as invalid, read (read-only), and write (writable). A faulting access triggers some operation according to the cache coherence protocol; for example, an access to an invalid data unit causes the data to be faulted in, and a write to a read-only data unit causes its twin to be created under the multiple-writer protocol. It is therefore the responsibility of the DSM system to trap all the faulting accesses. Unlike page-based DSMs, which rely on the MMU hardware to trap faulting accesses, object-based DSMs need to insert software checks before memory accesses in order to trap the possible faulting ones. So does our GOS.
The GOS provides transparent object accesses for Java threads distributed
among different nodes in a cluster-based JVM. The GOS needs to insert software checks before all the bytecodes accessing the heap in Java programs,
which include:
• GETFIELD/PUTFIELD: load/store object fields.
• GETSTATIC/PUTSTATIC: load/store static fields.
• XALOAD/XASTORE1: load/store array elements.
In a JVM, the bytecode execution engine is the processor for Java bytecode; it can be an interpreter or a just-in-time (JIT) compiler. An interpreter emulates the behavior of the bytecodes one by one, while a JIT compiler translates a Java method from bytecode to native code the first time the method is invoked. Usually, a JIT compiler improves JVM performance by one order of magnitude compared with an interpreter. Since the cluster-based JVM is targeted at high performance scientific and engineering computing, the JIT compiler is our choice for the execution engine of the cluster-based JVM.
Under the JIT compiler mode, a heap access operation takes only one native machine instruction, so the check code must be designed to be as lightweight as possible.
A straightforward solution is to insert a function call before each heap access, as shown in figure 7.1; the object state is checked and the necessary protocol operation is performed inside the function. Although simple, this approach is very heavyweight, because a function call causes a lot of overhead, such as saving registers before the call, preparing a new stack frame, and restoring registers after the call. The function call should therefore be avoided as much as possible.
A more efficient way is to make a comparison to check the access state,
as illustrated in figure 7.2. If the object has the proper access state, the
function call can be avoided.
1X represents a type indicator, e.g., A (reference), B (byte), C (char), etc.
call gos_check(object1);
access object1;
Figure 7.1: Pseudo code for access check: using a function call
if (object1 does not have the proper state)
call gos_check(object1);
access object1;
Figure 7.2: Pseudo code for access check: by comparison
Since we classify all objects into either DSOs or NLOs, and only DSOs have access states, we can easily come up with a straightforward algorithm for a read operation, as shown in figure 7.3. In this way, two comparisons are required for each read operation.
In order to reduce the comparisons, we let NLOs also carry an access state, namely write (writable). Thus only one comparison is necessary to check the access state of an object. We have patched the JIT compiler engine to perform native instrumentation, inserting the access state check before each heap
access. Figure 7.4 shows the Intel assembly code for a read access after the
native instrumentation by the JIT compiler in our distributed JVM. Register
esi is used for the object reference, register ecx for the object access state,
and register eax for the object field to read. When the object is readable, only three machine instructions are needed to check the access state: one memory read, one comparison, and one jump.
if (object1 is DSO)
if (object1 is invalid)
call gos_check(object1);
read object1;
Figure 7.3: Detailed pseudo code for a read check
0x08eac045: mov 0xc(%esi),%ecx // load access state
0x08eac04b: cmp $0x20000000,%ecx // make a comparison
0x08eac051: jge 0x8eac076 // go to access
0x08eac057: mov %ecx,0xffffffac(%ebp) // save register
0x08eac05d: mov %esi,0xffffffb0(%ebp) // save register
0x08eac063: push %esi // push argument
0x08eac065: call 0x8a3da0 <checkRead> // call gos_check
0x08eac06a: add $0x4,%esp // pop argument
0x08eac070: mov 0xffffffb0(%ebp),%esi // restore register
0x08eac076: mov 0x80(%esi),%eax // read object field
Figure 7.4: IA32 assembly code for a read check
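For illustration only, the same fast path can be written in Java-like form as below. This is a minimal sketch: the state field, the READABLE threshold constant, and the gosCheckRead method are assumed names mirroring the assembly above, not the actual implementation.

    // Java-like sketch of the one-comparison fast path of figure 7.4.
    class GosObject {
        int state;    // the access state word loaded into ecx above
        int field;    // the object field to be read
    }

    class AccessCheck {
        static final int READABLE = 0x20000000;  // threshold used by the cmp

        static int readField(GosObject obj) {
            if (obj.state < READABLE) {   // one load, one compare, one branch
                gosCheckRead(obj);        // slow path: fault the object in
            }
            return obj.field;             // the actual heap read
        }

        static void gosCheckRead(GosObject obj) { /* protocol action */ }
    }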
7.2 Distributed Threading and Synchronization
In the cluster-based JVM, threads within one Java application are automat-
ically distributed among cluster nodes to achieve parallel execution. Thus,
we need to extend the threading subsystem inside the JVM to the cluster
scope, and to virtualize a single thread space across machine boundaries. In
particular, we need to solve the following technical issues:
Thread distribution The threads need to be efficiently distributed among
the nodes of the cluster-based JVM to achieve maximum parallelism.
Thread synchronization Even when running on different nodes, the threads can still interact and coordinate with each other through the methods provided in class java.lang.Thread, and through synchronization operations on any Java object according to the JMM.
JVM termination As we mentioned in the introduction, from the perspective of system architecture, a cluster-based JVM is composed of a group of collaborating daemons, one on each cluster node. Each cluster-based JVM daemon can exit if and only if the multi-threaded Java application terminates.
In a standard JVM, all threads can be classified into either user threads created by the application or daemon threads created by the JVM itself. Any Java application will create at least one user thread, i.e., the main thread. Daemon threads include, e.g., the gc thread for performing garbage collection, and the finalizer thread for performing the finalization work on unreachable objects before their collection. So the JVM is a multi-threaded system even if the running Java application is single-threaded. The whole JVM exits when all user threads have exited. The thread subsystem performs the tasks of thread scheduling and thread synchronization. The thread subsystem also provides non-blocking I/O interfaces.
7.2.1 Thread Distribution
In our cluster-based JVM, we classify all the nodes into two types, the master
node and the slave node. The master node is where the Java application is
started. The slave nodes accept threads distributed from the master node
to share the workload of the application. A daemon thread, called gosd, is
created on each node, which sits in a big loop to handle the cache coherence
protocol requests such as object fault-ins, diff propagations, and synchro-
nization operations.
We follow an initial placement approach to distribute user threads to
slave nodes. Upon the creation of a user thread on the master node, if there is an underloaded slave node, the information of the thread, which includes the thread class name and the thread object, is sent to that slave node.
The gosd thread on the slave node then creates a new user thread based
on the thread information, and invokes the start() method of the thread
object to run the thread. The slave node is made the home of the thread
object to improve the access locality.
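A minimal sketch of this initial placement logic might look as follows. ThreadInfo, SlaveNode, and pickUnderloadedSlave are hypothetical names introduced for illustration, and in the actual system the thread object travels through the GOS rather than plain serialization.

    // Sketch: ship the thread class name and thread object to a slave,
    // whose gosd thread recreates the thread and invokes start() on it.
    class ThreadInfo implements java.io.Serializable {
        final String threadClassName;
        final Object threadObject;    // the java.lang.Thread instance (a DSO)
        ThreadInfo(Thread t) {
            this.threadClassName = t.getClass().getName();
            this.threadObject = t;
        }
    }

    interface SlaveNode { void send(ThreadInfo info); }

    void placeThread(Thread t) {
        SlaveNode slave = pickUnderloadedSlave();   // assumed helper
        if (slave != null) {
            slave.send(new ThreadInfo(t));  // the slave becomes the thread object's home
        } else {
            t.start();                      // no underloaded slave: run on the master
        }
    }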
Our cluster-based JVM does not support dynamic thread distribution
mechanisms such as thread migration, by which a thread can be migrated
from one node to another during its execution.
7.2.2 Thread Synchronization
After the threads are distributed among the cluster nodes in a cluster-based
JVM, they should be able to coordinate with each other during the execu-
tion. This can be achieved through synchronization operations on any Java
object, such as lock, unlock, wait, and notify. Since each object has a home,
all the synchronization operations are executed in the object’s home node.
The object’s home node also acts as the object’s lock manager. If the ob-
ject’s home is not local, synchronization requests are sent to the home node,
which are handled by the gosd thread there. Some synchronization requests
are blocking, such as lock and wait. A lock request will suspend till the
corresponding lock is acquired. Since the gosd thread can not be blocked
anytime, upon receiving a synchronization request, it arranges another kind
of daemon thread, the monitorproxy thread, to actually process and reply
it.
The monitorproxy thread performs the synchronization operation indicated in the request on behalf of the requesting remote thread. Since synchronization is stateful, the gosd thread will always assign the same monitorproxy thread to the requests from the same remote thread. For example, after a monitorproxy thread MP has acquired a lock on an object as requested by a remote thread T, it sends a notice to T. Then T continues its execution as if it had acquired the lock itself. When T requests to release this lock, it must be MP, and no other monitorproxy thread, that processes this request. Only after MP no longer holds any lock state can it process synchronization requests from remote threads other than T.
In the startup phase of the cluster-based JVM, a number of monitorproxy threads are created. When a new synchronization request arrives, the gosd thread tries to pick an available monitorproxy thread; if none is available, a new monitorproxy thread is created to process the request.
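To make this stateful dispatching concrete, the bookkeeping could be sketched as below; the map-based binding and all names are our assumptions, not the actual data structures.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch: the gosd thread binds each remote thread to one monitorproxy
    // for as long as that proxy holds lock state on the thread's behalf.
    class ProxyDispatcher {
        private final Map<Long, MonitorProxy> bound = new HashMap<>();
        private final Deque<MonitorProxy> idle = new ArrayDeque<>();

        MonitorProxy proxyFor(long remoteThreadId) {
            MonitorProxy mp = bound.get(remoteThreadId);
            if (mp == null) {
                // pick an available proxy, or create a new one if none is idle
                mp = idle.isEmpty() ? new MonitorProxy() : idle.pop();
                bound.put(remoteThreadId, mp);
            }
            return mp;
        }

        // called once the proxy no longer holds any lock state for the thread
        void unbind(long remoteThreadId, MonitorProxy mp) {
            bound.remove(remoteThreadId);
            idle.push(mp);
        }
    }

    class MonitorProxy extends Thread { /* performs lock/unlock/wait/notify */ }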
Java threads can also coordinate with each other through the methods of class java.lang.Thread. For example, an invocation of join will block until the callee thread finishes. Like other methods of java.lang.Thread, join is built on the synchronization operations on the thread object, so it is also implemented through the mechanism discussed above.
7.2.3 JVM Termination
A cluster-based JVM is composed of a group of collaborating JVM daemons,
one on each cluster node. Each cluster-based JVM daemon can exit if and
only if the multi-threaded Java application terminates. If a JVM daemon
exits earlier, the home-based cache coherence protocol will be violated for
those DSOs whose homes are there. If a JVM daemon exits later, it becomes
unattended and wastes system resources.
A termination protocol is designed to coordinate all cluster-based JVM
daemons to exit when the Java application terminates.
1. When a slave node is started up, a main thread is also started there,
which will wait on an internal lock, called slavemain.
2. On the master node, a counter is increased by one whenever a user
thread is created in the Java application, and it is decreased by one
whenever a user thread exits.
3. A user thread could be distributed to a slave node. When it exits
there, a notice will be sent back to the master node. The master node
decreases the user thread counter accordingly. Thus the counter reflects
the number of currently live user threads.
4. When the counter reaches zero, which means all user threads have exited, the master node can safely exit. Before exiting, the master node sends notices to all slave nodes, informing them to exit (the counter logic is sketched after this list).
5. Upon receiving the notice to exit, the slave node wakes up the main thread waiting on the slavemain lock. The main thread then exits. Since all user threads have now exited, the slave node will also exit. At this point, the cluster-based JVM terminates and all its JVM daemons have exited.
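A sketch of the master-side counter behind steps 2 to 4 follows; the class and method names are ours, and the broadcast is left abstract.

    // Sketch: counts live user threads cluster-wide on the master node.
    class TerminationCounter {
        private int live = 0;

        synchronized void onUserThreadCreated() {   // step 2
            live++;
        }

        // called for a local exit, or on receipt of an exit notice
        // from a slave node (step 3)
        synchronized void onUserThreadExited() {
            if (--live == 0) {
                broadcastExitNotices();             // step 4
            }
        }

        private void broadcastExitNotices() { /* notify all slave nodes */ }
    }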
7.3 Non-Blocking I/O Support
There are multiple threads on each node of the distributed JVM. Non-
blocking I/O support is a must in the distributed JVM so that a thread
doing I/O will not block the whole node.
We use the remote unlock operation on a DSO as an example to illustrate
non-blocking I/O support in the GOS, as shown in figure 7.5. After the
requesting thread sends out the unlock request, it will be switched off to
give the CPU to other runnable threads. The multi-threading nature of the cluster-based JVM calls for non-blocking I/O processing; otherwise a thread performing I/O would block the whole JVM. Therefore, the receiving thread should never busy-wait for an incoming message. Instead, it gives up the CPU, and the signal SIGIO is later caught to switch the corresponding I/O-waiting thread back on. This introduces significant signal processing overhead. On the requested node, the currently running thread may be some thread other than the GOS daemon thread that takes care of all the GOS request messages, so a thread switch is needed to switch on the GOS daemon thread. The GOS daemon thread will schedule a proper monitorproxy thread to process this unlock request; the proper monitorproxy thread is the one that currently holds the lock. Here another thread switch is incurred. A similar situation happens on the requesting node when it receives the unlock reply message. Along the critical path of a remote unlock, the unlock operation itself, two signal processings, and three thread switches are incurred.
[Figure: message sequence of a remote unlock between the requesting node and the requested node, showing the signal processings and thread switches on both sides]
Figure 7.5: Remote unlock of a DSO
7.4 Distributed Class Loading
A Java class file defines a single class’ static/instance fields, static/instance
methods, and the constant pool that serves a function similar to that of
a symbol table in conventional programming languages. At runtime, the
JVM dynamically loads, links, and initializes classes when they appear in
the application, as illustrated in figure 7.6.
Loading is the process of finding the Java class file and reading it into the
memory. Linking is the process of combining it into the runtime state of the
Java virtual machine. During the linking phase, the bytecode will be verified,
the static fields are allocated and initialized to their default values, and all the symbolic references in the constant pool are resolved. Initialization is the process of executing the class initialization code. The Java class finally stays in the method area of the JVM.

[Figure: Java class file → Loading → Linking → Initialization → class in the method area]

Figure 7.6: The JVM's dynamic loading, linking, and initialization of classes
Our cluster-based JVM provides the dynamic class loading capability defined in the JVM specification. Since Java classes contain readonly definitions of fields and methods, they are allowed to be loaded independently by each node. However, two particular issues need to be addressed to maintain the single system image:
• Although each cluster-based JVM daemon can load Java classes independently, it must be guaranteed that they load Java classes from the same source. In other words, they must load the same Java classes.

• Although each cluster-based JVM daemon can load Java classes independently, a Java class can only be initialized once, and the static variables must be kept consistent according to the JMM during the execution.
To address the first issue, we have configured a Network File System
(NFS) [74] for the cluster-based JVM so that each JVM daemon sees the
same file system hierarchy where the Java class files are stored. Since NFS is a very popular file system in cluster environments, such a configuration does not impair the portability of our cluster-based JVM.
To address the second issue, we let the master node maintain a centralized table recording all the cluster-wide initialized classes. For each initialized class, the table also records where it was initialized, together with a lock to prevent a race condition on the initialization. In the GOS, a class is also considered an object, which contains static fields and has a home; the node initializing the class becomes its home node. Whenever a JVM daemon loads a class, it checks the table to see whether the class has already been initialized. If yes, the JVM daemon skips the initialization and fetches the current content of the static fields from the home node of the class. If not, the class is initialized locally. Before the initialization, the corresponding lock in the table must be acquired; having acquired the lock, the JVM daemon double-checks whether the class has been initialized. The lock is released after the initialization is done. Since the static fields are allowed to be replicated on different nodes, they are also handled by the cache coherence protocol to maintain consistency according to the JMM.
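The centralized table could be sketched as below. All names are assumed; in the real system the table lives on the master node and is consulted through protocol messages rather than direct method calls.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: one entry per cluster-wide initialized class.
    class ClassInitTable {
        static final class Entry {
            final Object lock = new Object();  // serializes initialization
            boolean initialized;
            int homeNode;                      // node that ran the initializer
        }
        private final Map<String, Entry> entries = new HashMap<>();

        synchronized Entry entryFor(String className) {
            Entry e = entries.get(className);
            if (e == null) {
                e = new Entry();
                entries.put(className, e);
            }
            return e;
        }
    }

A loading daemon would then acquire entry.lock, double-check entry.initialized, initialize the class locally if needed (recording itself as the home node), and otherwise fetch the current static fields from entry.homeNode.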
7.5 Garbage Collection
In this section, we will discuss distributed garbage collection in GOS. There
are two GC algorithms in place in GOS, one for local garbage collection
that will be discussed in section 7.5.1, and the other for distributed garbage
collection of DSOs that will be discussed in section 7.5.2.
[Figure: node 0 (home) holds DSO a in its root set; node 1 holds a cached copy of a into which a reference to a new object b has been installed]
Figure 7.7: Tolerating inconsistency in DGC
7.5.1 Local Garbage Collection
An adapted uniprocessor garbage collector, a mark-sweep collector [79] in our case, can function independently on each node in our cluster-based JVM. The challenge here is to put the right objects into the root set to ensure the correctness of GC. The home copy of a DSO should always be put into the root set, since the collector has no idea whether its non-home siblings are still alive. As long as there are non-home siblings, the home copy must be kept because of its special role in the home-based cache coherence protocol.
The inconsistency among the copies of a DSO introduces a new problem in DGC, one which does not exist when there is no consistency issue involved [46]. Figure 7.7 gives an example. Node 0 is the home of DSO a. Node 1 cached a and modified it by installing a reference to object b in a; now the copies of a are inconsistent. If a becomes unreachable on node 1, and node 1 performs a local GC, both a and b will be mistakenly collected. Therefore, when each node performs an independent local GC, all the non-home copies of DSOs that are inconsistent with their home copies, i.e., those in the write access state, should be put into the root set.
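The root-set rule above can be sketched as follows; HeapObject, its predicates, and AccessState are assumed names, not the actual collector interfaces.

    // Sketch: augment the local root set before a mark-sweep pass.
    interface HeapObject {
        boolean isDSO();
        boolean isHomeCopy();
        AccessState accessState();
    }
    enum AccessState { INVALID, READ, WRITE }

    void augmentRootSet(Iterable<HeapObject> localHeap,
                        java.util.Set<HeapObject> rootSet) {
        for (HeapObject o : localHeap) {
            if (o.isDSO()
                    && (o.isHomeCopy()                          // home copies: always roots
                        || o.accessState() == AccessState.WRITE)) {  // dirty non-home copies
                rootSet.add(o);
            }
        }
    }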
[Figure: a diffusion tree rooted at home node 0 with edges 0→1, 0→2, 2→3, and 2→4. Node 0: export list {1, 2}, import list {null}; node 1: export list {null}, import list {0}; node 2: export list {3, 4}, import list {0}; nodes 3 and 4: export list {null}, import list {2}]
Figure 7.8: DSO reference diffusion tree
7.5.2 Distributed Garbage Collection
A DGC algorithm, Indirect Reference Listing [66], is adopted to collect
garbage DSOs.
Essentially, the indirect reference listing (IRL) algorithm maintains a distributed reference diffusion tree for each DSO. In the GOS, a reference to a DSO can be transmitted either from the home node to a non-home node or between two non-home nodes; the former is referred to as reference creation and the latter as reference duplication. With IRL, both the home and non-home copies of a DSO maintain two lists: an import list recording where its reference comes from, and an export list recording where its reference has been sent. In a DSO's reference diffusion tree, every vertex represents a node possessing one of its copies, and the root of the tree is its home node. An edge in the tree represents that the reference has been transmitted from one node to another. The sending node adds the receiving node to its export list, while the receiving node adds the sending node to its import list; if the node to be added is already in the list, the addition has no effect. Fig. 7.8 gives an example, where the figure in each circle is the node number.
When a non-home copy of a DSO is found to meet the following two conditions, it can be reclaimed locally and a garbage notice is sent to its parent in the diffusion tree: (1) its export list is empty; (2) it is not reachable from the local root set, which can be determined by the local collector. When a node receives a garbage notice for a DSO, it removes the sending node from the DSO's export list. When the export list of the home copy of a DSO becomes empty, the DSO is reverted to an NLO. IRL requires that the local collector also put those non-home copies of DSOs with non-empty export lists into the root set.
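For concreteness, the reclamation test could be sketched as below; the field and helper names are assumed, and the notice handling mirrors the description above.

    // Sketch: applied to each non-home copy after local marking.
    void tryReclaim(DsoCopy copy) {
        if (copy.exportList.isEmpty() && !copy.locallyReachable) {
            sendGarbageNotice(copy.parentNode, copy.id);  // to its diffusion-tree parent
            reclaimLocally(copy);
        }
    }
    // On receiving a garbage notice, the parent removes the sender from
    // the DSO's export list; when the home copy's export list becomes
    // empty, the DSO reverts to an NLO.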
The transmission path of a DSO reference may form a cycle among the nodes. In that case the export list on every node in the cycle is non-empty and all the copies will be put into the local root sets, so the DSO can never be reclaimed even if it is not reachable from any local root set. In order to prevent such cycles from polluting the structure of the diffusion tree, we ensure that each node can only have one valid parent in the tree. If a DSO reference arrives from a node different from the current parent, the sender is not added to the import list. Instead, the receiver prepares a pseudo garbage notice for the sender, since the sender has already added the receiver to its export list. Having received the pseudo garbage notice, the sender can remove the receiver from its export list.
IRL inherits the idempotency property from reference listing [67]: the effect of multiple transmissions of a DSO reference between two nodes is the same as that of a single transmission. This property is very helpful in the GOS since DSOs will be transmitted many times by the cache coherence protocol. The indirect nature of IRL avoids the race condition in reference listing when reference deletion and duplication happen at the same time [67]. IRL cannot collect cycles of garbage DSOs whose home nodes are different; however, this is usually not a serious problem.
The major overheads of IRL are maintaining import and export lists for every DSO as well as sending garbage notices. The list maintenance coexists with the reference transmission; compared with the transmission itself, the maintenance overhead, which is simply bitmap setting, is negligible. The garbage notices can be batched and piggybacked on coherence messages. So IRL does not contribute a significant overhead to the GOS.
Chapter 8
Performance Evaluation
8.1 Experiment Environment
We conducted the performance evaluation on the HKU Gideon cluster [14].
Each node has an Intel 2GHz P4 CPU and 512M memory, running Linux
kernel 2.4.22. A Network File System (NFS) is set up and mounted on all the cluster nodes so that the user has the same view of the home directory on all nodes. All the cluster nodes are connected by two Fast Ethernet networks,
one for NFS, the other for high performance communication such as MPI.
Our cluster-based JVM is implemented based on the Kaffe JVM [9] which
is an open-source JVM. A Java application is started on the master node.
When a Java thread is created, it is automatically dispatched to a cluster
node to achieve parallel execution. Unless specified otherwise, the number
of computation threads created is the same as the number of cluster nodes
in all the experiments.
In our implementation, we leverage TCP/IP Socket interface for all the
communications. We use Netperf [40] to evaluate the TCP/IP performance
of Gideon cluster. It takes 114 microseconds to send a one-byte request
message and get a one-byte response message. The network throughput is
94.05Mb/s when the message size is 4096 bytes.
8.2 Application Suite
In this section, the application suite used to evaluate the performance of our
cluster-based JVM will be presented. The application suite contains CPI,
ASP, SOR, NBody, NSquared, and TSP.
8.2.1 CPI
CPI is a multi-threaded Java program to calculate π, which is computed by

    π = ∫₀¹ 4/(1 + x²) dx    (8.1)

The program follows a fork-and-join parallelism style. The integral intervals are equally divided among the threads.
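A minimal single-threaded sketch of this integration (midpoint rule, our variable names) is:

    // Sketch: approximate pi by summing 4/(1+x^2) over n sub-intervals.
    double computePi(long n) {
        double h = 1.0 / n;
        double sum = 0.0;
        for (long i = 0; i < n; i++) {
            double x = h * (i + 0.5);      // midpoint of the ith sub-interval
            sum += 4.0 / (1.0 + x * x);
        }
        return h * sum;
    }

In the multi-threaded version, each thread sums its own share of the sub-intervals, and the partial sums are combined at the join.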
8.2.2 ASP
The All-pairs Shortest-Path (ASP) problem is to find the shortest path be-
tween all pairs of vertices in a graph. ASP is an important problem in
graph theory and has applications in communications, transportation, and
electronics problems [47].
A graph can be represented as a distance matrix D in which each element (i, j) represents the distance between vertex i and vertex j. We assume that for any i and j, Dij exists, so that 0 ≤ Dij < ∞; also, Dij = Dji and Dii = 0. Floyd gives a sequential algorithm for ASP. It solves a graph of N vertices in N steps, constructing an intermediate matrix I(k) containing the best-known shortest distance between each pair of vertices at step k. Initially, I(0) is set to D. The kth step of the algorithm considers each Iij in turn and determines whether the best-known path from i to j is longer than the combined lengths of the best-known paths from i to k and from k to j. If so, the entry Iij is updated to reflect the shorter path.
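For concreteness, the kth step can be sketched in Java as follows (our variable names):

    // Sketch: the kth step of Floyd's algorithm on an n-vertex graph;
    // I is the intermediate matrix described above.
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (I[i][k] + I[k][j] < I[i][j]) {
                I[i][j] = I[i][k] + I[k][j];
            }
        }
    }

In the parallel version described below, each Worker thread runs this step only over its assigned rows i, with a barrier after each k.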
We design a parallel version of Floyd's algorithm by making a row-wise domain decomposition of the distance matrix D and the intermediate matrix I among threads. Appendix A.2 shows the run method of the Worker thread in our ASP. The instances of the Worker thread perform the actual computation. At step k, all threads need the value of the kth row of the distance matrix. There is a barrier at the end of each iteration. The workload is distributed equally among the Worker threads. The rows of D are allocated among cluster nodes in a round-robin manner initially.

    black[j][k] = (red[j-1][k] + red[j+1][k] + red[j][k-1]
                   + red[j][k+1]) / (float)4.0;

Figure 8.1: The typical operation in SOR
8.2.3 SOR
The red-black Successive Over-Relaxation (SOR) method is used to solve partial differential equations of the form

    ∂²f/∂x² + ∂²f/∂y² = 0    (8.2)
A matrix is created with the perimeter elements initialized to the boundary conditions of a given mathematical problem. The interior elements are repeatedly computed as the average of their top, bottom, left, and right neighbors until the computed values are sufficiently close to the values computed in the last iteration.
Two matrices, a red one and a black one, are used in SOR. In each iteration the elements are read from one matrix, and the computed values are written to the other; after the iteration finishes, the roles of the two matrices are swapped. Figure 8.1 shows the typical operation in SOR.
We partition the red and black matrices among threads in a row-wise way. Each thread computes the parts of the matrices it has been assigned; thus, the workload is equally partitioned among the threads. Each thread needs to access not only its own sub-matrices but also the neighboring rows in the matrices, which are computed by other threads. After each iteration, all threads are synchronized through a barrier operation. The rows of the red and black matrices are allocated among cluster nodes in a round-robin manner initially.

(a) Space decomposition (b) Barnes-Hut tree

Figure 8.2: Barnes-Hut tree for 2D space decomposition
8.2.4 NBody
NBody is used to simulate the motion of particles under their mutual gravitational forces. The Barnes-Hut method [29] is a well-known hierarchical NBody algorithm. In the Barnes-Hut method, a physical space is recursively divided into sub-domains until each sub-domain contains at most one body. The space decomposition is based on the spatial distribution of the bodies. Figure 8.2 (a) gives an example of space decomposition in 2D space. Initially, the space is equally divided into four sub-domains. If there is more than one body in a sub-domain, the sub-domain is further decomposed into four smaller sub-domains. A Barnes-Hut tree is built based on the space decomposition, as figure 8.2 (b) shows.
In the Barnes-Hut tree, the bodies reside at the leaves. Inner cells in the tree correspond to the sub-domains, and each represents the center of mass of the bodies beneath it. The force computation is performed by traversing the tree, which is built at the beginning of each iteration. If a body is far enough from a cell, no further traversal is made beneath the cell: the force influence from the bodies below the cell can be computed as the force influence from the cell itself, i.e., from its center of mass. Otherwise, the traversal proceeds to the children of the cell. After the force computation, each body updates its position in the space as the result of the force influences, which ends one simulation loop. The tree is rebuilt at the beginning of the next iteration to reflect the new body distribution in the space.
We parallelize the Barnes-Hut method by dividing the bodies equally among threads. The workload of the threads is not balanced, as the computation load associated with each body differs. The tree construction is not parallelized. In the NBody application, there is a main thread responsible for the tree construction, and a number of worker threads responsible for computing the forces and the resulting body movements. During each iteration, after the main thread has built the tree, it wakes all the waiting worker threads. A barrier operation synchronizes the worker threads after they finish their computation; then the main thread is notified to begin the tree construction of the next iteration. A large number of Java objects, which describe the bodies' positions, velocities, and forces, are created during the tree construction.
8.2.5 NSquared
NSquared solves the NBody problem with O(n²) complexity, just like Water-NSquared in the Splash-2 benchmark suite [80].

All n bodies are stored in an array. The workload is evenly partitioned among threads by assigning an identical number of bodies to each thread. A thread is responsible for calculating the force on each of its assigned bodies, and for updating the bodies' positions accordingly. To calculate the force on one body, we need to combine the interactions between this body and each of the other n − 1 bodies.
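One thread's force phase might be sketched as below (assumed names), directly reflecting the O(n²) pairwise structure:

    // Sketch: each thread handles bodies [from, to) out of n.
    for (int i = from; i < to; i++) {
        for (int j = 0; j < n; j++) {
            if (j != i) {
                accumulateForce(bodies[i], bodies[j]);  // pairwise interaction
            }
        }
        updatePosition(bodies[i]);  // apply the combined force
    }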
8.2.6 TSP
The Traveling Salesman Problem (TSP) is to find the cheapest way of visiting
all the cities and returning to the starting point. Our TSP finds the optimal
solution instead of an approximate one by searching the entire solution space.
Our TSP follows a branch-and-bound approach: it prunes large parts of the solution space by ignoring partial routes that are already longer than the current best solution.

The program divides the whole solution space into many small sub-spaces to build up a job queue in the beginning; a sub-space contains all the routes sharing the same prefix. A number of worker threads are created initially. Every thread repeatedly takes a sub-space from the job queue and searches it for the optimal solution until the queue is empty. The workload of the threads is not balanced, and a large number of objects are created during the search.
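The pruning rule might be sketched as below (assumed names); best holds the shared current optimum.

    // Sketch: depth-first branch-and-bound over a sub-space (route prefix).
    void search(int[] route, int fixed, int lengthSoFar) {
        if (lengthSoFar >= best.length()) {
            return;                        // prune: already worse than the best tour
        }
        if (fixed == route.length) {       // complete tour: close the cycle
            best.update(route, lengthSoFar + dist(route[fixed - 1], route[0]));
            return;
        }
        for (int i = fixed; i < route.length; i++) {
            swap(route, fixed, i);         // try each remaining city next
            search(route, fixed + 1,
                   lengthSoFar + dist(route[fixed - 1], route[fixed]));
            swap(route, fixed, i);         // undo
        }
    }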
8.3 Application Performance
In our experiments, unless stated otherwise, CPI makes the integration on 100,000,000 sub-intervals, ASP solves a graph of 1024 vertices, SOR performs the successive over-relaxation on a 2-D matrix of 2048 by 2048 for 30 iterations, NBody simulates the motion of 2048 particles over 30 steps, NSquared simulates the motion of 2048 particles over 10 steps, and TSP solves a problem of 12 cities.
[Figure: normalized execution time of Kaffe vs. GOS/1 for ASP, SOR, NSquared, NBody, TSP, and CPI]
Figure 8.3: Single node performance
8.3.1 Sequential Performance
Our cluster-based JVM is based on Kaffe JVM. When running our cluster-
based JVM on only one processor, we can measure its sequential performance.
The major GOS overhead incurred in the sequential performance is caused
by software checks inserted before object accesses, which are used to en-
sure the corresponding objects are in the right access states, as discussed in
section 7.1. By comparing the sequential performance of our cluster-based
JVM with the performance of original Kaffe, we can measure the overhead
of software checks.
Figure 8.3 compares the performances of the cluster-based JVM and Kaffe
using our application suite. In the figure, GOS/1 denotes the cluster-based
JVM on one processor. Both the Kaffe JVM and our cluster-based JVM run
in the just-in-time mode.
Among all applications, ASP, SOR, and NSquared incur a heavy check
overhead due to their intensive array object accesses. NBody's and TSP's check overheads are well contained, at less than 10%. In CPI, most of the time is spent on calculation, and the object accesses are very few.

[Figure: speedup against the number of processors (up to 16) for CPI, TSP, NBody, NSquared, SOR, and ASP]

Figure 8.4: Speedup
8.3.2 Parallel Performance
We measure the speedup for all applications on up to 16 processors as an
overall performance evaluation for our cluster-based JVM. Figure 8.4 shows
the speedup curves. In the experiments, n threads will be created when
running on n processors. The sequential time on 1 processor is measured
on the original Kaffe JVM where only one thread is created. All the cache
coherence protocol optimizations are enabled. Both the Kaffe JVM and our
cluster-based JVM run in the just-in-time mode.
The applications’ parallel performances are determined by their computation-
to-communication ratios. Among all the applications, TSP and CPI are com-
putationally intensive programs. Therefore, they are able to achieve speedups
85
of more than 13 on 16 processors. NBody and NSquared also achieves accept-
able speedups on 16 processors. SOR and ASP’s performances are embar-
rassing. They achieve speedups less than 3.5 on 8 processors. Their speedup
curves drop on 16 processors.
In order to further investigate the factors contributing to the applications' performance, we break down the execution time into various parts: Comp denotes the computation time; Obj, the object access time to fault in up-to-date copies of invalid objects; Syn, the time spent on synchronization operations such as lock, unlock, wait, notify, and migrated synchronized methods; and GC, the garbage collection overhead. We instrument internal functions of our cluster-based JVM to measure the accumulated overheads of Obj, Syn, and GC. The Comp time is computed by subtracting all the other parts from the total time.

All the breakdown data are normalized to the total execution time, as displayed in figure 8.5. How we obtain the breakdown data is discussed in appendix A.3 in detail. In spite of a certain imprecision, figure 8.5 helps us gain insight into the executions.
Notice that not every application requires GC. Obj and Syn portions are
the GOS overhead to maintain a global view of a virtual object heap shared
by physically distributed threads. Obj and Syn portions not only include
the necessary local management cost and the time spent on the wire for
moving the protocol-related data, but also the possible waiting time on the
requested node. The percentage of Comp roughly reflects the efficiency of
parallel executions.
ASP requires n iterations to solve an n-node graph problem. There is a barrier at the end of each iteration, which requires the participation of all threads. When ASP runs on more processors, the computation workload of each thread decreases; in contrast, the Syn part increases when more processors join. The Obj part also increases with the number of processors.
On the ith iteration, all threads need to access the ith row of the distance matrix. When the number of processors increases, the home node of the ith row needs to serve more requests; thus the waiting time of each request increases correspondingly. When scaled up to a large number of processors, ASP's performance is hindered by the intensive data communication and synchronization overheads.

[Figure: breakdown of normalized execution time into Comp, Syn, Obj, and GC on 2, 4, 8, and 16 processors for ASP, SOR, NBody, TSP, NSquared, and CPI]

Figure 8.5: Breakdown of normalized execution time against number of processors
The situation of SOR is similar to that of ASP. In SOR, there are two barriers in each iteration, and the Syn part contributes a significant portion of the execution time when scaled to a large number of processors. The absolute time of Obj stays roughly constant because each thread only needs to access the neighboring rows of the rows it manages in the matrices; the data to be accessed do not increase with the number of processors. However, the percentage of Obj in the total time increases because each thread's computation load is reduced when SOR runs on more processors. Similar to ASP, SOR's performance is hindered by the intensive data communication and synchronization overheads when scaled up to a large number of processors.
NBody also involves synchronization in each simulation step. The synchronization overhead becomes a significant part of the overall execution time when we increase the number of processors. The absolute time of Obj decreases as the number of processors increases, but more slowly than the absolute time of Comp does, so the percentage of Obj with respect to the total time increases. NBody is a memory intensive application and therefore triggers garbage collection. With our distributed garbage collection mechanism in place, the GC overhead is highly parallelized: the absolute time of GC is inversely proportional to the number of processors. The breakdown of NSquared is similar to that of NBody.
TSP is a computationally intensive application, and the GOS overhead
accounts for less than 1% of the total execution time. TSP is also a mem-
ory intensive application. The absolute times of GC and Obj are inversely
proportional to the number of processors. Nevertheless, their percentages in the total time stay constant on various numbers of processors. CPI is a computation-intensive application; most of its time is Comp.

              Parameters                    Messages    Traffic (KB)
    CPI       100,000,000 sub-intervals          255              12
    ASP       A graph of 1024 vertices       169,130         347,425
    SOR       A 2048 by 2048 matrix           35,999          93,286
              for 30 iterations
    NBody     2048 particles over 30 steps   752,878         321,505
    NSquared  2048 particles over 10 steps   698,192          74,230
    TSP       12 cities                        4,849             411

Table 8.1: Communication effort on 16 processors
8.4 Effects of Adaptations
In this section, we evaluate the effectiveness of the adaptations discussed in chapter 5: adaptive object home migration, synchronized method migration, and connectivity-based object pushing.
All applications except TSP and CPI incur a lot of communication during
the parallel executions. Table 8.1 shows their communication effort when
running on 16 processors. The measurements are made after all the cache
coherence protocol optimizations are enabled.
Figure 8.6 shows the overall performance improvement due to the adaptations for the four benchmark applications. We do not show the figures for CPI and TSP because they are computationally intensive applications and incur little communication; the adaptations have no obvious effect on them. In the figures, Basic represents the basic cache coherence protocol with the three adaptations disabled, and Adaptive represents the adaptive cache coherence protocol with all adaptations enabled. We display the applications' execution times against the number of processors. The cluster-based JVM runs in the JIT compilation mode.

[Figure: execution time (seconds) against number of processors, Basic vs. Adaptive, for (a) ASP, (b) SOR, (c) NBody, and (d) NSquared]

Figure 8.6: The adaptive protocol vs. the basic protocol
We can observe from the figures that the adaptive cache coherence protocol greatly improves the performance of ASP and SOR. For example, 76% to 89.7% of ASP's execution time is eliminated when the adaptive protocol is enabled. The adaptive protocol also improves the performance of NBody and NSquared considerably; for example, as seen from figure 8.6 (c), 23.8% of NBody's execution time is eliminated on 16 nodes when the adaptive protocol is enabled.
In order to further investigate the effectiveness of the various adaptations, we break down their effects. In the experiments, all adaptations are disabled initially, and we then enable the planned adaptations incrementally. Figure 8.7 shows the effects of the adaptations on the execution time, figure 8.8 shows their effects on the number of messages generated during the execution, and figure 8.9 shows their effects on the network traffic generated during the execution. All data are normalized to those with none of the adaptations enabled, and are presented against different numbers of processors. In the legend, No denotes no adaptation enabled, HM denotes adaptive object home migration, SMM denotes synchronized method migration, and Push denotes connectivity-based object pushing.

We elaborate on the effectiveness of each adaptation in the following sub-sections.
8.4.1 Adaptive Object Home Migration
Among the four applications, adaptive object home migration improves the performance of ASP and SOR substantially, as seen in figure 8.7 (a) and (b). In ASP and SOR, the data reside in 2-D matrices shared by all threads. In Java, a 2-D matrix is implemented as an array object whose elements are also array objects. Many of these array objects exhibit the single-writer access pattern after they are initialized. The shared data are allocated to different cluster nodes in a round-robin manner initially, so their original homes are not the writing nodes. The home migration protocol automatically makes the writing node the home node to eliminate remote accesses. As seen in figure 8.8 (a) and figure 8.9 (a) for ASP, and figure 8.8 (b) and figure 8.9 (b) for SOR, home migration greatly reduces the messages and network traffic generated during the executions of ASP and SOR, which explains the performance improvement.
[Figure: normalized execution time against number of processors under No, HM, HM+SMM, and HM+SMM+Push, for (a) ASP, (b) SOR, (c) NBody, and (d) NSquared]
Figure 8.7: Effects of adaptations w.r.t. execution time
As a further demonstration, figure 8.10 visualizes the effect of object home migration on SOR using the PAT discussed in chapter 6. Figure 8.10 (a) is the time line window without home migration; there are four global phases, each taking approximately the same amount of time. Figure 8.10 (b) is the time line window with home migration enabled. Three global phases are marked in the figure: "Before Home Migration", "Home Migrating", and "After Home Migration". Before home migration takes effect, we observe that a lot of remote reads and writes are sent to their home node, node 0. (The shared objects are intentionally allocated on node 0 to simplify the visualization view.)
[Figure: normalized message number against number of processors under No, HM, HM+SMM, and HM+SMM+Push, for (a) ASP, (b) SOR, (c) NBody, and (d) NSquared]
Figure 8.8: Effects of adaptations w.r.t. message number
During the home migrating phase, we observe that although the reads (white arrows) are still sent to the original home node, the writes (gray arrows) are performed locally, meaning the home has already migrated to the local node by that moment. We can also observe that the phase after home migration takes much less time than the phase before home migration, since most remote reads and writes are eliminated by object home migration. As can be observed, the effect of home migration is to change remote reads/writes into home reads/writes.
[Figure: normalized network traffic against number of processors under No, HM, HM+SMM, and HM+SMM+Push, for (a) ASP, (b) SOR, (c) NBody, and (d) NSquared]
Figure 8.9: Effects of adaptations w.r.t. network traffic
Home migration also improves the performance of NSquared. In NSquared, the data of the particles are stored in an array, and the particles are evenly distributed among threads. Each thread only updates its assigned particles. Thus the particle objects present the single-writer pattern, and the communication is reduced by migrating the homes of the particle objects to their respective updating threads.

Home migration has little impact on the performance of NBody because NBody lacks the single-writer pattern, as seen in figure 8.7 (c). This also indicates that our home migration protocol has little negative side effect because of its lightweight design.

[Figure: time line windows (a) without home migration and (b) with home migration; in (b) three phases are marked: Before Home Migration, Home Migrating, and After Home Migration]

Figure 8.10: The effect of object home migration on SOR
8.4.2 Synchronized Method Migration
Synchronized method migration optimizes the execution of a synchronized
method of a non-home DSO. Although it does not reduce the network traffic,
it reduces the number of messages and the protocol overheads, as we discussed
in section 5.2.
ASP requires n barriers for all the threads in order to solve an n-node graph. SOR requires two barriers in each iteration. NSquared requires one barrier in each simulation step. The barrier operation is implemented as a synchronized method. We see in figure 8.8 (a) and (b) that synchronized method migration reduces the messages generated during the executions of ASP and SOR; for example, on 16 processors, 35% of ASP's messages are eliminated by enabling synchronized method migration. Consequently, ASP's and SOR's overall performance improves to some extent, particularly when running on a large number of processors, as observed in figure 8.7 (a) and (b).
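For reference, a barrier written as a synchronized method, which is what lets synchronized method migration ship the whole call to the barrier object's home node, could look like the following sketch (our names, not the actual GOS code):

    // Sketch: a reusable barrier built on Java monitor operations.
    class Barrier {
        private final int parties;
        private int arrived = 0;
        private int generation = 0;

        Barrier(int parties) { this.parties = parties; }

        synchronized void await() throws InterruptedException {
            int gen = generation;
            if (++arrived == parties) {
                arrived = 0;
                generation++;      // release this round
                notifyAll();
            } else {
                while (gen == generation) {
                    wait();        // block until the round completes
                }
            }
        }
    }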
However, there is an exception: running on 4 processors, ASP's execution time increases by 8.2% after enabling synchronized method migration. A detailed analysis of this phenomenon is given in section 8.6.

Synchronized method migration has a very limited effect on NSquared, because the synchronization-related messages are only a very small percentage of the total messages; most messages are object fault-ins. Synchronized method migration has no effect on NBody because NBody uses synchronized blocks instead of synchronized methods.
8.4.3 Connectivity-based Object Pushing
Connectivity-based object pushing is a prefetching strategy which takes ad-
vantage of the object connectivity information to improve the reference lo-
cality. Particularly, it improves the producer-consumer pattern greatly.
NBody is a typical application with the producer-consumer pattern. In NBody, a quadtree is constructed by one thread and then accessed by all other threads in each iteration. The quadtree consists of a large number of small-sized objects. Object pushing greatly reduces the number of messages for NBody, as seen in figure 8.8 (c). Since object pushing may push unneeded objects as well, the amount of communication increases slightly, as seen in figure 8.9 (c). The improvement in execution time due to object pushing is also significant in NBody, as seen in figure 8.7 (c). When NBody
runs on a large number of processors, the percentage of object fault-in time
in the total execution time increases, as shown in figure 8.5. Thus the effect
of object pushing is amplified correspondingly.
Object pushing improves the reference locality in NSquared, too. In NSquared, the particle object contains multiple sub-objects, describing its coordinates, its velocity, and the integrated forces on it. Object pushing aggregates multiple objects in one message.
Compared with NBody and NSquared, most DSOs in ASP and SOR are
array objects of reference type and primitive type, and object pushing is not
performed on them to reduce the impact of pushing unneeded objects.
8.5 Sensitivity and Robustness Analysis for the HM Protocol
In order to clearly examine the performance difference between home migration protocols with different fixed thresholds and the protocol with an adaptive threshold, we carefully designed some synthetic benchmark programs that predominantly present the single-writer pattern. This lets us rule out any other factors that influence the performance and concentrate on the effectiveness of our home migration protocol on the single-writer pattern.
In our object access pattern space, there are two major basic patterns along the synchronization dimension: accumulator and assignment. To avoid data race conditions, object accesses presenting the single-writer pattern can be coordinated using either accumulator synchronization or assignment synchronization. With accumulator synchronization, the objects of the single-writer pattern are accessed inside the critical section; with assignment synchronization, they are accessed outside the critical section. Proper synchronization guarantees that the reads happen after the writes so that the values can be safely transferred from one thread to another.
We have carefully designed two benchmark programs: RCounter, i.e., repeated counter operations, representing the single-writer pattern under the accumulator synchronization; and DSOR, i.e., dynamic SOR, representing the single-writer pattern under the assignment synchronization. In addition, RCounter demonstrates the behavior of the home migration protocol on small-sized objects, while DSOR demonstrates it on relatively large-sized objects. RCounter and DSOR represent the two most important situations of the single-writer pattern in real applications, and are therefore well suited to evaluating our adaptive home migration protocol. Below we present and analyze their experimental results respectively.

In the experiments, we start with eight working threads, all running on the slave nodes. All synchronization operations are distributed ones that are sent to the master node, so all the performance differences come from the effects of the different home migration protocols.

    while (true) {
        synchronized (lock0) {
            if (counter.internal >= n) {
                break;
            }
            counter.internal++;
            for (int j = 0; j < r - 1; j++) {
                synchronized (lock1) {
                    counter.internal++;
                }
            }
        }
        // Some simple arithmetic
        // computation goes here.
    }

Figure 8.11: RCounter's source code skeleton run by each thread
RCounter
Figure 8.11 shows Rcounter’s source code skeleton run by each thread. In the
benchmark, after a thread acquires the lock of object lock0, it will update a
shared counter for a number of times, which we refer to as the repetition of the
single-writer pattern. It is represented by r in the code. The home migration
protocols try to change the home of this shared counter object to improve the
performance. In order to reflect these updates to the home copy, each update
is enclosed in a synchronized block. Notice after this thread releases lock0, it
may acquire it again, or another thread may get the chance to acquire it. For
example, if the repetition of single-writer pattern is 4, the actual consecutive
writing times could be a multiple of 4, such as 8 and 16. This happens
randomly at runtime. We also embed some computation in the benchmark
to make it more realistic. We measure the performance of different home
migration protocols against different repetitions of the single-writer pattern.
Figure 8.12 shows the normalized execution time against different repe-
titions of the single-writer pattern. NM denotes no home migration. FT1
denotes home migration with a fixed threshold of 1. FT2 denotes home mi-
gration with a fixed threshold of 2. FT1 always performs home migration
more eagerly than FT2. AT denotes the home migration protocol with an
adaptive threshold. For each repetition, the execution times are normalized
to the largest one among them.
Figure 8.13 shows the normalized message number against different repe-
titions of the single-writer pattern. For each repetition, the message numbers
are normalized to the largest one among them. We further break down the
messages into four categories: obj denotes normal object fault-in without
home migration happening at the same time, mig denotes object fault-in
with home migration, diff denotes diff propagation, and redir denotes ob-
ject home redirection. We do not consider synchronization messages because
they are invariable in all cases as mentioned before.
In the message breakdown, the communication overhead without home migration includes obj and diff; these are the overheads that the home migration protocol tries to reduce. With home migration, the total number of object fault-ins equals obj plus mig, and redir is the negative impact of home migration.

[Figure: normalized execution time against repetition of the single-writer pattern (2, 4, 8, 16) under NM, FT1, FT2, and AT]

Figure 8.12: Effects of home migration protocols against repetition of single-writer pattern: normalized execution time (RCounter)

[Figure: normalized message number, broken into redir, diff, mig, and obj, against repetition of the single-writer pattern (2, 4, 8, 16) under NM, FT1, FT2, and AT]

Figure 8.13: Effects of home migration protocols against repetition of single-writer pattern: normalized message number (RCounter)
We have several observations from figures 8.12 and 8.13. First, when the repetition of the single-writer pattern is large enough, e.g., 16, the benefit from home migration is quite obvious: 87.2% of object fault-ins and diff propagations are eliminated by FT1. In other words, remote reads/writes change to home reads/writes. We can expect even better performance improvement from home migration when the repetition is larger.
Second, when the repetition of the single-writer pattern is not large enough, the benefit from home migration may not pay off against the home redirection overhead. In particular, when the object's home and the lock's home are on the same node, as in the situation without home migration, the diff propagation can be piggybacked on synchronization messages. This explains why the home migration protocols incur far fewer messages but still perform roughly the same as the protocol without home migration when the repetition of the single-writer pattern is 8.
Third, in all cases, FT1 is more sensitive than FT2 to the single-writer pattern, in that the numbers of object fault-in and diff propagation messages in FT1 are lower than those in FT2: FT1 changes more remote reads/writes to local reads/writes. When the repetition is relatively large, such as 8 or 16, AT performs as well as FT1 in this respect. This fact confirms our claim that AT presents good sensitivity to the lasting single-writer pattern.
Finally, when the repetition is relatively small, such as 2 or 4, i.e., under the transient single-writer pattern, fixed-threshold home migration protocols incur a lot of redirection overhead. This shows that fixed-threshold protocols are usually not robust against the transient single-writer pattern, except in some individual cases; for example, FT2 prohibits home migration when the repetition is two. As we can see, AT demonstrates better robustness than the fixed-threshold protocols in this respect. AT is able to detect the transient single-writer pattern and strike a good balance between performing home migration to reduce remote accesses and prohibiting home migration to reduce the redirection overhead. When the repetition is relatively small, such as 2 or 4, AT greatly reduces the home redirection messages.

    for (int cur = 0; cur < iteration; cur++) {
        for (int i = from; i < to; i++) {
            for (int j = 0; j < sizeOfMatrix; j++) {
                matrix[i][j] = average(i, j);
            }
        }
        if ((cur + 1) % repetition == 0) {
            from = from + sizeOfMatrix / numOfThreads;
            to = to + sizeOfMatrix / numOfThreads;
            if (from >= sizeOfMatrix) {
                from -= sizeOfMatrix;
                to -= sizeOfMatrix;
            }
        }
    }

Figure 8.14: DSOR's source code skeleton run by each thread
DSOR
DSOR is a variant of the SOR discussed in section 8.2.3. In SOR, the computation workload of the matrix is evenly distributed among threads in a row-wise manner; each thread is the only writer of its assigned rows, which are array objects. These single-writer patterns are fixed throughout the execution. In DSOR, however, after an adjustable number of iterations, we reassign the
array objects to threads in a circular way. The threads still conduct the
single-writer pattern on their newly assigned array objects. Therefore, each
array object presents a changeable single-writer pattern, where the writer
varies from time to time. Figure 8.14 shows DSOR’s source code skeleton
run by each thread. Here we calculate a 1024 by 1024 matrix for 64 iterations.
We can specify the number of iterations during which a particular thread writes a given set of array objects, i.e., the repetition of the single-writer pattern, represented by repetition in figure 8.14. Analogous to figure 8.12, figure 8.15 shows the normalized execution time against different repetitions of the single-writer pattern; analogous to figure 8.13, figure 8.16 shows the normalized message number against different repetitions of the single-writer pattern.
Many observations from RCounter can also be made in DSOR. For example, FT1 is more sensitive than FT2 to the single-writer pattern, so FT1 converts more remote accesses to local accesses than FT2. In this respect, AT is as good as FT1, as shown when the repetition of the single-writer pattern is 4, 8, or 16.
Since the shared objects presenting the single-writer pattern are relatively
large, the benefit from home migration is obvious even when the repetition
is not very large, e.g., 2 and 4. In RCounter, by contrast, the home migration
benefit is convincing only when the repetition is 16, because of the small
size of the shared object. Recall that in our protocol, the object size is
taken into account when calculating the home access coefficient, which is the
overhead ratio of one eliminated pair of object fault-in and diff propagation to
one home redirection. Although not shown in the figure, we also experimented
with a fixed home access coefficient that does not take the object size into
account. When the repetition is 4, the protocol with a fixed home access
coefficient increases the execution time by 10.2% and reduces the number of
home migrations by more than half. Clearly, the fixed home access coefficient
does not correctly account for the benefit from home migrations.
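To make the role of the home access coefficient concrete, the following
sketch expresses the migration decision as code. It is only illustrative:
the class name, cost constants, and counters are hypothetical, and the
actual protocol maintains such statistics inside the JVM runtime.

// Illustrative sketch of an adaptive home migration decision.
// All names and constants are hypothetical.
class HomeMigrationPolicy {
    // Assumed cost model: the fixed and size-dependent parts of one
    // object fault-in plus diff propagation, relative to the cost of
    // one home redirection (all in the same arbitrary unit).
    static final double REDIRECTION_COST = 1.0;
    static final double FIXED_FAULT_COST = 1.0;
    static final double PER_BYTE_COST = 0.01;

    // The home access coefficient: the overhead ratio of one eliminated
    // pair of object fault-in and diff propagation to one home
    // redirection. It grows with the object size.
    static double homeAccessCoefficient(int objectSizeInBytes) {
        return (FIXED_FAULT_COST + PER_BYTE_COST * objectSizeInBytes)
                / REDIRECTION_COST;
    }

    // Migrate the home to the single writer only if the expected saving
    // from eliminated remote accesses outweighs the redirection cost.
    static boolean shouldMigrate(int eliminatedRemoteAccesses,
                                 int expectedRedirections,
                                 int objectSizeInBytes) {
        return eliminatedRemoteAccesses
                * homeAccessCoefficient(objectSizeInBytes)
                > expectedRedirections;
    }
}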
[Figure: normalized execution time (0-100%) against repetition of the
single-writer pattern (2, 4, 8, 16) for NM, FT1, FT2, and AT]
Figure 8.15: Effects of home migration protocols against repetition of single-writer pattern: normalized execution time (DSOR)
[Figure: normalized message number (0-100%) for NoMig, FT1, FT2, and AT
at repetitions 2, 4, 8, and 16 of the single-writer pattern; message
categories: redirection (redir), diff, migration (mig), and object (obj)]
Figure 8.16: Effects of home migration protocols against repetition of single-writer pattern: normalized message number (DSOR)
In DSOR, the most interesting behavior appears when the repetition is 2:
AT performs better than both FT1 and FT2. FT2 prevents most of the home
migrations, so it behaves almost the same as NoMig. FT1 allows home migration
on the remote read coming from the last writer, so it incurs many home
migrations as well as many home redirections. AT strikes a good balance
between FT1 and FT2. By considering the home redirection overhead, AT incurs
only about 10% of the home migrations that FT1 incurs, and thus greatly
reduces home redirections, while still managing to eliminate some remote
accesses through home migrations.
To sum up, RCounter and DSOR, which represent the two most important
situations of the single-writer pattern in real applications, both show that
our adaptive home migration protocol is sensitive to the lasting single-writer
pattern, and at the same time robust against the transient single-writer pat-
tern.
8.6 More on Synchronized Method Migration
As shown in figure 8.7 (a), running on 4 processors, ASP’s execution time
increases by 8.2% after enabling synchronized method migration. Since the
computer cluster is dedicated to the experiments and we run each test many
times to take an average, we have removed most, if not all, unpre-
dictable factors in the execution. So there must be some reason behind this
exception.
As a first step to investigate the reason, we measure the effect of synchro-
nized method migration on a barrier test benchmark. In the benchmark, all
working threads repeatedly perform barrier operations. Figure 8.17 shows
the effect of synchronized method migration on the barrier operation against
the number of processors. The barriers are repeated 10,000 times in the
experiments. As seen from the figure, the effect of synchronized method
migration on the barrier operation is very clear, and it increases with the
number of processors.

[Figure: execution time in seconds (0-100) against the number of processors
(up to 16) for the barrier benchmark, with and without synchronized method
migration (w/ SMM, w/o SMM)]
Figure 8.17: Effect of synchronized method migration on the barrier operation against the number of processors
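The structure of the barrier benchmark can be sketched as a small
stand-alone program. Here java.util.concurrent.CyclicBarrier merely stands
in for the GOS's distributed barrier, and the thread and repetition counts
are parameters of the experiment.

import java.util.concurrent.CyclicBarrier;

// Minimal sketch of the barrier micro-benchmark: every worker repeats
// the barrier operation REPEAT times and the total time is reported.
class BarrierBench {
    static final int THREADS = 4;
    static final int REPEAT = 10000;

    public static void main(String[] args) throws InterruptedException {
        CyclicBarrier bar = new CyclicBarrier(THREADS);
        Thread[] workers = new Thread[THREADS];
        long start = System.currentTimeMillis();
        for (int t = 0; t < THREADS; t++) {
            workers[t] = new Thread(() -> {
                try {
                    for (int i = 0; i < REPEAT; i++) bar.await();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        System.out.println((System.currentTimeMillis() - start) + " ms");
    }
}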
However, when a thread does both computation and synchronization,
synchronized method migration will complicate the situation. The effect of
synchronized method migration is to migrate the load from the requesting
node to the requested node, and at the same time to aggregate multiple
synchronization requests into one. So its positive effect is to reduce the
processing and transmission overhead of the requesting node, while its
negative effect is to possibly overload the requested node so as to increase
the waiting time on the requesting node. Which one becomes the dominant
factor decides the effect of synchronized method migration.
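The aggregation effect can be illustrated with a minimal sketch in which a
single-threaded executor plays the role of the home node of the synchronized
object. The names are ours and do not correspond to the GOS's internal
interfaces; the point is that one shipped request replaces the separate
lock, object fault-in, diff propagation, and unlock messages.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of synchronized method migration with the home node
// simulated by a single-threaded executor. All names are hypothetical.
class SmmSketch {
    static final ExecutorService homeNode = Executors.newSingleThreadExecutor();
    static int sharedCounter = 0;   // state that lives at the home node

    // Ship the body of a synchronized method to the home node; the
    // executor serializes requests, so one submission stands in for the
    // whole lock/fault-in/diff/unlock message exchange.
    static int shipSynchronized(Callable<Integer> body) throws Exception {
        return homeNode.submit(body).get();
    }

    public static void main(String[] args) throws Exception {
        int result = shipSynchronized(() -> ++sharedCounter);
        System.out.println("counter = " + result);
        homeNode.shutdown();
    }
}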
Figure 8.18 shows ASP’s performances on different problem sizes with
various configurations of the cluster-based JVM. We scale the size of the
distance matrix. There are 4 working threads in all cases. All-Slave denotes
that all working threads run on the slave nodes, so that the master node
is dedicated to the synchronization operations.

[Figure: execution time in seconds against problem size (256 to 1536) for
the HM, HM+SMM, HM All-Slave, and HM+SMM All-Slave configurations]
Figure 8.18: ASP's execution times on different problem sizes
The performance comparison between HM and HM+SMM reveals the influence
of computation load on the effect of synchronized method migration. When
the problem sizes are small, the synchronization overhead is relatively large.
The positive effect of synchronized method migration is dominant. When
the problem size increases, for each thread, the relative computation load
compared with the synchronization load increases, too. Synchronized method
migration then moves workload from the slave nodes to the master node,
worsening the load imbalance. Here the negative effect of synchronized
method migration plays the major role. When we run ASP with the same
problem size on a large number of processors, the relative computation
load compared with the synchronization load for each thread decreases. So
the positive effect of synchronized method migration becomes dominant and
is amplified when we increase the number of processors, as seen in figure 8.7
(a).
When we run all working threads on the slave nodes, as shown by HM
All-Slave and HM+SMM All-Slave in figure 8.18, the master node is dedi-
cated to the synchronization workload. The negative effect of synchronized
method migration is gone. As we see in the figure, the effect of synchronized
method migration is always positive.
In conclusion, in situations with relatively heavy synchronization over-
head, synchronized method migration can be effective; in situations with
relatively light synchronization overhead, it may not be helpful and may
even worsen the performance by unbalancing the workload. Dedicating a
processor to synchronization is shown to be effective, particularly when
synchronized method migration is enabled.
Chapter 9
Related Work
9.1 Overview
This chapter presents a survey of the works related to the thesis. Firstly, we
discuss some works on high performance parallel Java computing that do not
follow the cluster-based JVM approach. Then we focus in more detail on the
works following the cluster-based JVM approach.
9.2 Augmenting Java for Parallel Computing
As a network-centric language, Java has already provided some capabilities
facilitating distributed computing, such as sockets and RMI [75]. Similar to re-
mote procedure call (RPC), RMI is used to build distributed applications in
which an object is able to invoke the methods of a remote object.
However, it is the consensus of the parallel computing community that
the official Java distribution, i.e., Sun JDK (Java Development Kit), is still
not capable enough to carry out high performance parallel computing. There-
fore, many works aim to augment Java in different ways to promote Java’s
capability for parallel computing.
In this section, we discuss two approaches to augmenting Java for parallel
computing, language augmentation and class augmentation. We will discuss
three representative systems below. Each of them belongs to a different
programming paradigm: JavaParty follows RMI style, HPJava follows data
parallel paradigm, and mpiJava follows message passing paradigm. Although
they have facilitated parallel programming to some extent, they still impose
some considerable programming complexity.
9.2.1 Language Augmentation
Some researchers take a language augmentation approach by introducing
new keywords or syntax extensions into the Java language. Because new syntax
features are incorporated, a customized Java compiler or preprocessor is re-
quired to translate augmented Java source code to standard bytecode. No
modification to the JVM or the Java bootstrap classes is required.
JavaParty
JavaParty [65] is a case of the language augmentation approach. The motiva-
tion behind JavaParty is to overcome the programming complexity of RMI
while still using RMI as the communication method in a cluster environment.
JavaParty introduces only one new class modifier, remote, to the Java
language. The new modifier indicates that the instances of the modified class
represent some form of parallelism and thus may reside at remote
cluster nodes. The invocations of remote instances' methods are through
RMI. JavaParty provides a preprocessor that translates JavaParty’s source
code to RMI implementations in pure Java, which are further compiled to
Java bytecode by a standard Java compiler.
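For illustration, a minimal JavaParty class might look as follows. The
remote modifier is JavaParty's language extension, so this code is
translated by JavaParty's preprocessor rather than compiled directly by
javac; the class and method names here are our own.

// A JavaParty remote class: its instances may reside on remote cluster
// nodes, and method calls on them go through generated RMI code.
public remote class Worker {
    public int compute(int x) {
        return x * x;
    }
}

// Usage looks like ordinary Java; placement is handled by the runtime:
//   Worker w = new Worker();   // may be created on a remote node
//   int r = w.compute(7);      // transparently a remote invocation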
JavaParty is location transparent, i.e., the programmers do not need to
map instances of remote classes to some specific nodes. JavaParty’s runtime
system will handle the distribution of remote instances. The runtime system
may also schedule object migration to enhance locality, in which the object
is moved to where the invocations take place.
Although JavaParty reduces RMI’s programming complexity and pro-
vides some desirable features, it has some shortcomings. Firstly, JavaParty
may cause memory inconsistency. In method invocations, RMI passes non-
remote objects by copy. If the receiving method further changes the repli-
cated objects, inconsistency arises. The inconsistency can only be resolved
by the programmer, by either declaring the passed objects to be remote or
ensuring that the replicated objects will not be modified. Secondly,
because all the remote interactions are based on RMI, which is heavyweight,
JavaParty is not suitable for fine-grained sharing situations in parallel com-
puting.
HPJava
Like HPF [3], HPJava [4] is a data parallel programming language exploiting
parallelism at the data level. HPJava introduces new constructs into Java for
describing how data is distributed among processors and how each process
runs program segments over different data sets simultaneously. For example,
the overall construct defines a parallel, distributed loop, in which each
process will work on a defined separate data range.
In the data parallel paradigm, some advanced compilation techniques
can be applied to direct the mapping from data to processors and to generate
efficient communication code accessing non-local data [69]. However, the data
parallel paradigm cannot effectively address irregular problems [48].
9.2.2 Class Augmentation
Observing that the requirements of parallel computing cannot be effec-
tively addressed by the bootstrap Java classes that are distributed along with
Sun JDK, some researchers take a class augmentation approach by creating
Java classes specially designed for parallel computing. The Java program-
mers leverage them by creating their instances and invoking their methods.
No modification to the Java compiler or the JVM is required.
mpiJava
Message Passing Interface (MPI) [16] is a widely accepted message pass-
ing standard used in parallel computing. Compared with the socket interface,
MPI defines higher level abstraction and routines for parallel computing.
For example, MPI defines various blocking and non-blocking point-to-point
communication routines, as well as collective communication routines, such
as broadcast and reduction. The collective communication takes place in a
communicator context, which is a collection of communicating processes.
mpiJava [27] is a collection of Java APIs following the MPI standard.
mpiJava does not reinvent the MPI implementation. Instead, through the Java
Native Interface (JNI), mpiJava provides a set of Java wrappers to native
MPI packages, such as MPICH [11], which is one of the most popular open
source MPI implementations. Thus, mpiJava should be portable to any plat-
form that provides compatible Java runtime and native MPI environments.
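For illustration, a minimal mpiJava program might look as follows. It
assumes an installed mpiJava environment providing the mpi package, and
simply reports each process's rank.

import mpi.MPI;
import mpi.MPIException;

// Minimal mpiJava program: each process prints its rank in COMM_WORLD.
class HelloMPI {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();
        System.out.println("Hello from process " + rank + " of " + size);
        MPI.Finalize();
    }
}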
9.3 Cluster-based JVM
In the above approaches to using Java for high performance parallel com-
puting, the Java programmers are supposed to learn and use either new
Java language extensions in the cases of JavaParty and HPJava, or new Java
classes in the case of mpiJava. This is not transparent for Java programmers
and creates a non-trivial entry barrier to the field of parallel programming.
In fact, Java already provides a built-in mechanism that can be
used for parallel computing: Java has inherent multi-threading support.
The state-of-the-art JVMs have already been able to leverage multiple
available processors in symmetric multiprocessing (SMP) computers to sched-
ule Java threads, where Java threads can be distributed to different proces-
sors to achieve speedup. However, the achievable speedup is limited by
the number of processors in SMPs.
The cluster-based JVM approach aims to run unmodified multi-
threaded Java applications on computer clusters, which are considered to be
more scalable and affordable than SMPs. With this approach, the threads
within a Java application can be automatically distributed onto different
cluster nodes to achieve parallelism or leverage cluster-wide resources such
as memory and network bandwidth. Programming a parallel computer,
e.g., a computer cluster, thus becomes almost equivalent to program-
ming a sequential computer, except that parallel programming raises new
performance issues, such as minimizing communication and synchronization
overheads.
Since Java threads within the same application share the object heap,
the cluster-based JVM calls for a global object space, which is a distributed-
shared memory service to provide transparent object accesses and synchro-
nizations for distributed threads.
In this section, we discuss several research works following the cluster-
based JVM approach. The cluster-based JVM is also called distributed JVM.
9.3.1 Jackal
Jackal [78] directly compiles multi-threaded Java programs’ source code or
bytecode into distributed native machine code, which can directly run on a
cluster. Therefore, the JVM, the standard Java runtime system, is not required
in this approach. This means the static compilation approach needs to provide
an alternative runtime system to meet the runtime requirements of Java
programs, such as garbage collection, dynamic class loading and compilation, as well
as exception handling.
Most of its effort to improve performance is done at compile time. Jackal
incorporates some compiler optimizations to remove object access checks,
which are used to guarantee that the objects are in the right access states
according to the cache coherence protocol. In addition, Jackal’s compiler per-
forms two optimizations to reduce the communication caused by distributed
object accesses and synchronizations: object-graph aggregation and automatic
computation migration.
Object-graph aggregation uses a heap approximation algorithm [49] to
identify connected objects. If an object’s field contains a reference
to another object, connectivity exists between the two objects. At
runtime, when the root object of the object graph is faulted in, the whole
object graph could be prefetched together to improve reference locality.
Automatic computation migration generates remote procedure call
(RPC)-like code to move a part of the computation and its state to a remote
cluster node at runtime. This may be more efficient than executing the
computation at the local node, which may involve a lot of communication of
objects and synchronizations. For example, migrating a synchronized block
or method to the node where the synchronized object resides can aggregate
multiple lock/unlock/object request and reply messages into one message
round trip.
Jackal uses a fine-grain DSM to build the GOS. The coherence unit is a
fixed-size region of 256 bytes. Like page-based DSM systems, all the replicas
of an object reside at the same virtual address on different nodes.
Since the virtual address space allocatable to applications is limited in 32-
bit operating systems and all nodes share the same virtual memory address
range, Jackal may not leverage all the available physical memory across the
cluster. In fact, this drawback is inherited from page-based DSMs running
on 32-bit operating systems. Unlike page-based DSM systems, the object
access states are enforced through software checks.
Jackal’s runtime system enables an optimization called lazy flushing. The
home of a shared coherence unit is fixed. However, if a unit is not shared by
any other node and some node requests a copy for write access, the requesting
node becomes the exclusive owner. Later reads and writes are then performed
locally just as if they were at the home. If other nodes want to share the
unit, the current exclusive owner needs to be notified. The drawback of lazy
flushing is that it ignores the application's inherent access patterns. Frequent
transitions to and from exclusive ownership cause a lot of communication
overhead, so the number of transitions is capped at five in Jackal.
9.3.2 Hyperion
Hyperion [62] compiles Java bytecode to C source code, which is then compiled
to native code. To build the GOS, Hyperion introduces a redirection table,
which is replicated in each node. Each object has an entry in the table,
storing the actual memory address of the object. Its index represents the
global identification for the object. Each node manages a portion of the
table. It can create new objects in its own portion without synchronizing
with other nodes.
The introduction of a redirection table not only impairs the performance
by incurring a redirection overhead on object accessing, but also complicates
the memory management. The size of the table bounds the maximum number
of objects in one application. Even if the object heap is not yet fully
occupied, the full occupation of the table entries on some node may trigger a
distributed garbage collection. Without knowing the actual number of objects
in the application, it is difficult to choose a proper redirection table size
and to assign the portions of the table to the nodes.
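The following sketch illustrates object addressing through a redirection
table. The layout and names are hypothetical simplifications: Hyperion's
actual table stores native memory addresses, while the sketch stores Java
references.

// Sketch of a Hyperion-style redirection table. Each node manages its
// own portion [base, limit) and allocates there without global sync.
class RedirectionTable {
    private final Object[] entries;   // global index -> local reference
    private final int limit;          // end of this node's portion
    private int next;                 // next free slot in the portion

    RedirectionTable(int tableSize, int base, int limit) {
        this.entries = new Object[tableSize];
        this.limit = limit;
        this.next = base;
    }

    // Create a new object locally; the returned index is its global ID.
    int allocate(Object obj) {
        if (next >= limit)   // portion exhausted: may trigger distributed GC
            throw new IllegalStateException("table portion full");
        entries[next] = obj;
        return next++;
    }

    // Every object access pays one extra indirection through the table.
    Object dereference(int globalId) {
        return entries[globalId];
    }
}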
9.3.3 JavaSplit
JavaSplit [42] rewrites and instruments the bytecode of a multi-threaded Java
application. The result is a distributed application in pure Java that is linked
with some runtime classes and can run on a cluster of standard JVMs. Each
of them runs on a cluster node.
The execution effect of the Java application after the bytecode instru-
mentation is equivalent to that of the original multi-threaded Java applica-
tion. However, each thread of the original multi-threaded Java application
is transformed to a standalone sub-application that can run on a standard
JVM. With the help of some runtime classes, all the sub-applications coop-
erate with each other to accomplish the job of the original multi-threaded
application.
JavaSplit uses the Byte Code Engineering Library (BCEL) [13] to instru-
ment and rewrite the bytecode. JavaSplit provides some runtime classes to
build the GOS and to facilitate remote thread creation.
In the bytecode of a multi-threaded Java program, JavaSplit will inter-
cept and instrument object field and array element accesses to maintain the
memory consistency among replicated objects by adding software checks and
calls to fault-in routines. In addition, JavaSplit will intercept and instru-
ment thread creations and synchronization operations to implement remote
thread creation and distributed synchronization operations according to Java
memory model semantics.
JavaSplit argues that the bytecode instrumentation approach maintains
Java’s cross-platform portability by making use of the standard JVMs as the
execution platforms. JavaSplit can also rely on the advancement of standard
JVMs for performance improvement.
On the other hand, JavaSplit’s approach has some drawbacks. Firstly,
JavaSplit needs to instrument all the Java bootstrap classes because they
may be dynamically loaded by JavaSplit's applications. However, this process
cannot be fully automated because some bootstrap classes contain native
methods. Currently, JavaSplit versions of those bootstrap classes with native
methods must be created manually.
Secondly, because the instrumentation is done at the bytecode level, the
resulting bytecode size could be considerably enlarged, and the performance
class javasplit.A extends javasplit.somepackage.C {
    // fields
    private int myIntField;
    public char myCharField;

    // Send this object
    public void DSM_serialize(DSM_ByteOutputStream out) {
        super.DSM_serialize(out);
        out.writeInt(myIntField);
        out.writeChar(myCharField);
    }

    // Receive this object
    public void DSM_deserialize(DSM_ByteInputStream in) {
        super.DSM_deserialize(in);
        myIntField = in.readInt();
        myCharField = in.readChar();
    }
}
Figure 9.1: JavaSplit’s code sample to send and receive objects
could be impaired.
For example, figure 9.1 illustrates how JavaSplit sends and receives an
object. To enhance the readability, the instrumented bytecode is described
in Java source code. The method DSM_serialize is used to send this object,
and DSM_deserialize to receive it. To send or receive each field, a
method call is made, which will further call other methods to finally deliver
the value to the output stream or retrieve a value from the input stream.
The total overhead is much more expensive than a simple memory copy.
Consequently, the data communication in JavaSplit involves large overhead.
JavaSplit's performance evaluation [42] confirms our observation.
Note that this performance issue is inherent in the bytecode instrumen-
tation approach. Because Java is a strongly typed language, bytecode cannot
directly operate on memory. Instead, it operates on data items with type
semantics. Thus bytecode operations cannot efficiently perform the memory
copies that take place in the communication.
9.3.4 cJVM
cJVM [24] follows the cluster-based JVM approach. It uses a master-proxy
object model and a method shipping approach to implement the GOS. A
proxy object is created locally on accessing a remote object, and the
remote object becomes the master object. Method invocations on the proxy
object as well as field accesses to the proxy object are shipped to the node
where the master object resides. No consistency issue is involved in this
approach. Moreover, the object is usually not replicated to improve access
locality. Since the method shipping approach forwards the execution flow
to the node where the master object resides, the workload distribution is
determined by the distribution of master objects in cJVM. Load balancing
may be difficult to achieve without an effective strategy enforced by either the
programmer or some runtime mechanism. Method shipping also causes the
thread stack to be scattered across multiple nodes, which complicates exception
handling. cJVM is evaluated on up to 4 nodes only, which cannot fully reflect
the scalability of cJVM's master-proxy object model and method shipping
approach.
Several optimization techniques are applied to reduce the amount of ac-
cess and method shipping [25]. They are a combination of many simple opti-
mizations, such as caching read-only fields or objects, locally executing state-
less methods that do not write on the heap, and single chance object migra-
tion. Some optimizations are possible through exploiting Java’s semantics.
For example, objects of type java.lang.String and java.lang.Integer
are read-only.
9.3.5 JESSICA
The JESSICA system [61], which stands for Java-Enabled Single System Im-
age Computing Architecture, uses some page-based DSM systems, such as
JUMP [35] and TreadMarks [57], to build the GOS. All objects are allocated
in the distributed shared memory. Each node manages a segment of shared
memory and creates new objects in its own segment independently.
Although this approach greatly alleviates the burden of constructing the
GOS because all the cache coherence issues, such as object addressing, fault-
ing, replication, and update propagation, can be managed by the page-based
DSM, it suffers from certain problems. Firstly, the sharing granularity of
Java and that of the page-based DSM are incompatible. Java organizes data
into variable-sized objects, while the page-based DSM enforces coherence at
the granularity of virtual memory page. If two threads are updating differ-
ent objects coincidentally residing on the same memory page, it will cause
communication. This phenomenon is called the false sharing problem. In
addition, if one object happens to reside across the memory page boundary,
faulting in it will incur two memory page fault-ins, which is quite heavy-
weight.
Secondly, as a low-level supportive layer, the page-based DSM is not
aware of the abundant runtime information in the JVM, e.g., the object type
information, which makes it difficult to look for opportunities to improve
the performance of the GOS. Particularly, the accesses to different objects
residing at the same memory page are mingled at the page level. Therefore,
it is difficult to detect access patterns in applications that exhibit fine-grain
sharing.
The detailed analysis of various factors contributing to the efficiency of
using a page-based DSM to build the GOS can be found in [36].
A unique feature of JESSICA is its support for transparent thread mi-
gration. At runtime, a Java thread can be preemptively migrated from one
node to another node, e.g., from an overloaded node to an underloaded node.
Thread migration could be useful to achieve load balance or fault tolerance
in cluster-based JVMs.
JESSICA employs a master-slave thread model. In the very beginning,
all threads reside at the master node. Then some threads will be migrated to
the slave nodes according to the migration policy. For each migrated thread,
a thread still remains at the master node to handle thread synchronization
and I/O redirection. The migrated thread is called the slave thread, and
the corresponding thread staying at the master node is called the master
thread. If the slave thread needs to do some synchronization, it will inform
the corresponding master thread to perform the actual synchronization on
its behalf. Similarly, all the I/O operations issued by the slave thread will be
redirected to the corresponding master thread. Although simple, this master-
slave thread model makes the master node the performance bottleneck.
9.3.6 Java/DSM
Java/DSM [82] has also built its GOS on top of a page-based DSM, i.e.,
TreadMarks [57]. Since Java/DSM is intended to work on a heterogeneous
cluster, data conversion between heterogeneous hardware architectures is re-
quired. However, using TreadMarks as the infrastructure for building the GOS
contradicts this goal, because TreadMarks does not perform data conversion
across machine boundaries. Although the data conversion function could be
added to TreadMarks, it would incur considerable overhead. To the best of
our knowledge, Java/DSM's attempt to provide a heterogeneous DSM has
not been fully achieved, and no performance results have been reported.
Chapter 10
Conclusion
10.1 Discussions
10.1.1 Effectiveness of the Adaptations
We have presented three adaptations, namely, adaptive object home migra-
tion, synchronized method migration, and connectivity-based object pushing.
The adaptive object home migration is definitely useful for a home-based
cache coherence protocol. Without home migration, the fixed home node for
an object could become a performance bottleneck. Two factors contribute to
the effectiveness of our adaptive object home migration protocol. Firstly, the
GOS is object-based and we are able to separate the accesses to different objects
so that the access behavior of a certain object can be precisely observed.
Secondly, we rely on the runtime feedback to continuously adapt to objects’
access behavior. As a result, the experiments show that our adaptive home
migration protocol demonstrates both the sensitivity to the lasting single-
writer pattern and the robustness against the transient single-writer pattern.
In the latter case, the protocol inhibits home migration in order to reduce
the home redirection overhead.
Synchronized method migration can have both positive and negative ef-
fects on the performance. On one hand, it can significantly reduce the
number of messages and the protocol overheads. On the
other hand, it moves the workload to the home node of the synchronized
object, which could cause load imbalance if the computation load becomes
more dominant than the synchronization load. Currently, we have no way
to determine whether a synchronized method migration is profitable or not.
However, we show an arrangement where a particular node is dedicated to
the global synchronization. This arrangement consistently improves perfor-
mance, and synchronized method migration has only a positive effect under
this arrangement. In addition, synchronized methods are not the only meth-
ods worth migrating. By migrating a method to the
home node of the receiver object, it is possible to aggregate multiple object
fault-ins inside the method into one message round trip. Nevertheless, the
GOS is subject to load imbalance in doing this. A smart switching between
object shipping and method shipping needs a thorough investigation.
Connectivity-based object pushing is essentially a prefetching technique
to improve the reference locality. It is impossible in previous cluster-based
JVMs that use a page-based DSM system to build the GOS. It is enabled by
our GOS design that intentionally exploits Java's runtime information. For
applications where reference locality prevails, the effect of object push-
ing is obvious. In particular, object pushing optimizes the producer-consumer
pattern based on connectivity information.
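The essence of connectivity-based pushing can be sketched as a bounded
traversal of the object graph rooted at the requested object. The interface
and depth bound below are our own simplifications, not the GOS's actual
mechanism.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of connectivity-based object pushing: when a root object is
// requested, objects reachable from it (up to a depth bound) are
// collected and pushed in the same reply.
class ObjectPusher {
    interface Node {
        List<Node> references();   // objects this object points to
    }

    static List<Node> collectPushSet(Node root, int maxDepth) {
        List<Node> pushSet = new ArrayList<>();
        Set<Node> seen = new HashSet<>();
        Deque<Node> frontier = new ArrayDeque<>();
        frontier.add(root);
        seen.add(root);
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Deque<Node> next = new ArrayDeque<>();
            for (Node n : frontier) {
                pushSet.add(n);                    // will be pushed
                for (Node ref : n.references())
                    if (seen.add(ref)) next.add(ref);
            }
            frontier = next;
        }
        return pushSet;
    }
}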
All these adaptations work on different dimensions of
the access pattern space. Theoretically, they are orthogonal and thus do not
interact with each other. As shown in our experiments, all these adaptations
share the property that they have little negative impact when they do not
take effect.
10.1.2 Which Existing JVM to Base On
All current works on building a cluster-based JVM are based on modifying an
existing JVM. For example, cJVM is based on Sun JDK 1.2, JESSICA is based
on Kaffe 0.9.1, and our cluster-based JVM is based on Kaffe 1.0.6. cJVM
and JESSICA run in the interpretation mode, while ours runs in
the JIT mode.
We chose Kaffe because it was one of the popular open source JVMs
at that time and we did not have access to Sun JDK’s source code. Kaffe
was designed to work in an embedded environment. Its JIT engine and
garbage collector do not provide the level of performance needed for parallel
computing.
Since our cluster-based JVM is based on Kaffe, its performance suffers
from the unsatisfactory performance of Kaffe's execution engine and GC subsys-
tem. However, the contributions of our research are to demonstrate that a
considerable speedup can be achieved by building a cluster-based JVM and
the design of the GOS is the key to the performance of cluster-based JVMs.
Thus the GOS techniques proved effective in this research can be applied
to any existing high performance JVM in order to build a high performance
cluster-based JVM.
Recently, a JVM called Jikes RVM (research virtual machine) [7] has
drawn many researchers' attention. RVM is open source, high performance,
specially designed for research, and written in Java. RVM could be a good
candidate for the foundation of a high performance cluster-based JVM.
10.1.3 Thread Migration vs. Initial Placement
Some cluster-based JVMs, such as JESSICA, support thread migration. They
can preemptively move a thread from one cluster node to another during the
execution. Thread migration is supposed to provide the functions of load
balance and fault tolerance.
Our cluster-based JVM does not support thread migration. Instead, we
follow an initial placement approach to distribute user threads to different
nodes, i.e., a thread can be remotely created on another node.
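As an illustration of how lightweight initial placement can be, a
round-robin placement policy fits in a few lines. The sketch below is
hypothetical and not our JVM's actual placement code.

// Sketch of initial thread placement: threads are assigned to nodes
// round-robin at creation time and never migrate afterwards.
class InitialPlacement {
    private final int numNodes;
    private int nextNode = 0;

    InitialPlacement(int numNodes) {
        this.numNodes = numNodes;
    }

    // Choose the node on which a newly created thread will run.
    synchronized int placeThread() {
        int node = nextNode;
        nextNode = (nextNode + 1) % numNodes;
        return node;
    }
}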
Compared with thread migration, initial placement is lightweight, easier
to implement, and able to balance the load well for regular, structured
problems. Initial placement lacks the dynamic load balancing ability that
thread migration has. However, thread migration has its own problems.
Firstly, after a thread is migrated, it may need to access the objects at the
source node of the migration. In this way thread migration generates new
remote accesses. It is possible to carry along all the objects reachable from the
migrated thread during migration. But this could cause a large migration
overhead, and the threads staying at the source node may still need to access
those objects. Secondly, thread migration is an intra-application load balance
mechanism, in which a thread is the minimal workload to be moved. The
workload of a thread is difficult to predict. Sometimes a thread carries too
much workload, so that the destination node of the migration becomes the
new performance bottleneck. One possible solution is to create many more
threads than the number of nodes so that each thread carries a relatively
small workload that is suitable to balance the workload differences between
two nodes. But in this way, the thread creation and synchronization overhead
could be significant. The benefits of thread migration in cluster-based JVMs
need to be carefully justified.
10.2 Future Work
10.2.1 Compiler Analysis to Reduce Software Checks
In JIT mode, software checking of object access states will likely be a sig-
nificant overhead, as shown in figure 8.3. The check overhead is particularly
serious when there are array accesses inside loops. Compiler optimization
techniques can be applied during JIT compilation to hoist the array object
check outside the loop if there is no synchronization in the loop. That is, a
check is made before entering the loop, and no more checks are made inside
the loop. For normal object accesses, it is possible to aggregate the checks
before the accesses to different fields of the same object to further improve
the performance. These techniques have already been demonstrated in the
software fine-grain DSM system Shasta [71]. The reduction of software checks
is important for a high performance cluster-based JVM.
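Schematically, the transformation looks as follows; checkWriteAccess is a
hypothetical stand-in for the GOS's software access check.

// Sketch of hoisting a software access check out of a loop.
class CheckHoisting {
    static void checkWriteAccess(int[] obj) {
        // In the real GOS this would fault in the object if necessary
        // and mark it writable.
    }

    // Before: the check runs on every iteration.
    static void unoptimized(int[] a) {
        for (int j = 0; j < a.length; j++) {
            checkWriteAccess(a);
            a[j] = j * j;
        }
    }

    // After: one check before the loop suffices when the loop body
    // contains no synchronization point that could change access states.
    static void hoisted(int[] a) {
        checkWriteAccess(a);
        for (int j = 0; j < a.length; j++) {
            a[j] = j * j;
        }
    }
}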
10.2.2 Automatic Performance Bottleneck Detection
The cluster-based JVM supports a “sequential development parallel execu-
tion” method for parallel programming. That is, a multi-threaded Java pro-
gram can be coded and debugged on any existing JVM, and submitted to
a cluster-based JVM for high performance execution after it is proved func-
tionally correct on one computer node. The cluster-based JVM promises a
correct execution result and takes the responsibility to optimize the parallel
performance through the GOS.
However, the performance on the cluster-based JVM may not satisfy the
programmers' expectations. Programmers may then want to know what
actually happens on the cluster-based JVM, so as to revise the algorithm to
avoid some performance bottlenecks in the program.
Therefore, the cluster-based JVM not only needs to transparently im-
prove the parallel execution performance, e.g., through the adaptive cache
coherence protocol researched in this thesis, but also needs to provide the
programmers with automatic performance reports, e.g., listing serious
performance bottlenecks.
The PAT can be a good starting point for research on automatic per-
formance bottleneck detection in the cluster-based JVM. PAT is lightweight,
and provides some preliminary functions for access pattern analysis. Access
pattern analysis can be a useful approach towards detecting performance
bottlenecks in GOS. Future research should put emphasis on both the quan-
titative analysis of the access behavior and an intuitive visualization of object
access patterns.
10.2.3 High Performance Communication Substrate
Our research shows that the performance of GOS is very important for a
high performance cluster-based JVM. Advances in JIT techniques do not
help reduce the communication and synchronization overheads required by
the parallel execution of multi-threaded Java programs.
Our research focuses on designing an intelligent cache coherence protocol
to improve the performance of the GOS. Another way to improve the perfor-
mance of the GOS is to use a high performance communication substrate,
e.g., lightweight protocols such as Directed Point [39] instead of the TCP/IP
we are using, or high performance network technologies such as InfiniBand [5]
instead of the Fast Ethernet we are using.
mance communication substrate is very important for DSM systems where
communications are triggered by memory accesses that are fine-grained and
frequent.
It is interesting to investigate how to efficiently map the interface exposed
by the lightweight protocol to the GOS operations. It is also interesting to
investigate the usage of new features in advanced network technologies, such
as remote DMA, in the GOS implementation.
Appendix A
Appendix
A.1 Overheads of GOS Primitive Operations
We have measured the overheads of three primitive operations in our GOS:
(1) the remote lock of a DSO, (2) the remote unlock of a DSO, and (3) the
fault-in of a DSO. They help us evaluate the performance of the GOS at a
microscopic level.
Figure A.1 shows the source code segment of the multi-threaded Java
program used to measure the overheads of those primitive GOS operations.
Each Worker thread repeatedly acquires the lock of the object synObj,
updates its only integer field, and then releases the lock, n times in total.
synObj is a small object with only one integer field. The remote lock and
unlock happen at the entry and exit of the synchronized block. Since the
locally cached DSOs are flushed at lock time, the update of synObj
faults in its up-to-date content from its home node. All the overheads
are measured inside our cluster-based JVM by instrumenting the internal
functions performing those tasks.
Table A.1 shows the overheads of primitive operations with respect to
different number of threads. In the experiments, all the Worker threads run
on the slave nodes of the cluster-based JVM, while the object synObj’s home
class Worker extends Thread {
    int n;            // number of operations
    SynObj synObj;

    Worker(int n, SynObj synObj) {
        this.n = n;
        this.synObj = synObj;
    }

    public void run() {
        for (int i = 0; i < n; i++) {
            synchronized (synObj) {
                ++synObj.elapsed;
            }
        }
    }
}
Figure A.1: The source code to measure GOS primitive operations
Number of        Overhead   Overhead    Overhead of
Worker threads   of lock    of unlock   object fault-in
      1           178.77     162.16      166.46
      2           365.34     170.20      176.71
      4           759.39     171.29      191.07
      8          1544.75     175.79      190.13
     16          3119.53     185.49      190.44

Table A.1: Overheads (in microseconds) of primitive operations with respect to different numbers of threads
is at the master node. Thus all the locks, unlocks, and object fault-ins go to
the master node. When the number of threads is larger than 1, we sum up the
overheads of all threads and take an average for each operation respectively.
When the number of threads is 1, there is no synchronization contention
contributing to the operation overheads. Let’s take the unlock operation as
an example. The unlock request message contains 16 bytes, which include a
4-byte requesting node ID, a 4-byte message type ID, a 4-byte payload length,
and the GUID of the object to be unlocked. The successful reply message
also contains 16 bytes. The round-trip time to send and receive a 16-byte
message through a TCP socket, as measured by Netperf, is 122.49 microseconds.
The difference between the overhead of the unlock operation and the Netperf
time is due to the GOS's non-blocking I/O support.
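For illustration, the 16-byte unlock request described above could be
encoded as follows. The message type constant is hypothetical, and a 4-byte
GUID is assumed, as the byte counts in the text imply.

import java.nio.ByteBuffer;

// Sketch of the 16-byte unlock request layout; field order follows the
// text, and the type constant is hypothetical.
class UnlockMessage {
    static final int MSG_UNLOCK = 3;   // hypothetical message type ID

    static ByteBuffer encode(int requestingNodeId, int payloadLength,
                             int objectGuid) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putInt(requestingNodeId);  // 4-byte requesting node ID
        buf.putInt(MSG_UNLOCK);        // 4-byte message type ID
        buf.putInt(payloadLength);     // 4-byte payload length
        buf.putInt(objectGuid);        // GUID of the object to unlock
        buf.flip();                    // ready for sending
        return buf;
    }
}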
The object fault-in incurs fewer thread switches than lock/unlock, since
the GOS daemon thread can reply to it directly. However, the object fault-in
incurs the DSO detection overhead. The overheads of both object fault-in
and unlock increase with the number of Worker threads, due to the
longer waiting time for each request when the number of requests increases.
The lock overhead includes synchronization contention when the num-
ber of threads is larger than 1. Not surprisingly, the average lock overhead
is roughly proportional to the number of threads.
A.2 ASP Code Segment
The following code shows the run method of the Worker thread in ASP.
public void run() {
    int i, j, k;
    float cur[][], next[][];
    // n: the number of vertices; wsize: the number of threads;
    // id: thread id.
    int from = n / wsize * id;
    int to = n / wsize * (id + 1);
    for (k = 0; k < n - 1; k++) {
        if (k % 2 == 0) { // d0, d1: intermediate matrixes.
            cur = d0; next = d1;
        } else {
            cur = d1; next = d0;
        }
        for (i = from; i < to; i++) {
            for (j = 0; j < n-1; j++) {
                if (cur[i][j] <= cur[i][k] + cur[k][j])
                    next[i][j] = cur[i][j];
                else
                    next[i][j] = cur[i][k] + cur[k][j];
            }
        }
        // barrier synchronization among all threads.
        bar.barrier();
    }
}
A.3 The Method for Parallel Performance Break-
down
The multi-threading characteristics of the JVM make it difficult for us to
perform a precise breakdown. For example, when we measure an oper-
ation, it is possible that the current thread is switched off and later switched
on. Therefore what we have measured includes not only the overhead of the
operation itself but also the thread scheduling overhead and the CPU time
spent by another thread.
To carefully control the impreciseness caused by the JVM's multi-threading,
we measure the breakdown only at the slave nodes for the following reasons:
Firstly, there is one working thread on each slave node. Since all bench-
mark applications’ workload is fairly balanced, the breakdown of the working
thread can represent that of the application to a great extent.
Secondly, for all the benchmark applications, there are three threads on
each slave node, i.e., the application’s working thread, the GOS daemon
thread gosd, and the garbage collection thread gc. The measurement impre-
ciseness comes from the CPU time taken by gosd. However, this imprecise-
ness can be tolerated because (1) it is negligible, as in TSP, or (2) it probably
overlaps with the idle time in Obj and Syn for ASP, SOR, and NBody. For
NBody, almost all the time taken by gosd overlaps with the idle time in
Syn, and the measurement impreciseness is negligible. We admit that the
measurement impreciseness for ASP and SOR does exist so that the actual
computation time should be a little smaller than the Comp time since Comp
contains some time taken by gosd.
Thirdly, we abandon the breakdown data on the master node because they
are distorted by the complicated multi-threading situation on the master
node. On the master node, gosd takes more workload than its counterparts
on the slave nodes. Moreover, gosd will schedule the monitor proxy threads
to do the synchronization on behalf of the remote threads.
We average the breakdown data on all slave nodes if the number of slave
nodes is larger than 1.
A.4 JIT Compilation vs. Interpretation
Our GOS can be integrated with both the interpretation mode and the JIT
compilation mode of the cluster-based JVM. By comparing the performances
of our cluster-based JVM under these two execution modes, we can reveal
how the JIT compilation technology improves the performance of cluster-
based JVMs.

[Figure: four panels, (a) ASP, (b) SOR, (c) NBody, (d) TSP, each plotting
execution time in seconds against the number of processors (up to 16) under
the interpretation and JIT compilation modes]
Figure A.2: JIT Compilation vs. Interpretation
Figure A.2 compares the performances under the JIT compilation mode
and the interpretation mode for the four applications. The per-
formances are measured against the number of processors. All the cache
coherence protocol optimizations are enabled.
For all the applications, we observe that the performance improvement
due to the JIT compilation decreases with the number of processors. The
JIT compilation technique aims to improve the computation performance.
When the application is running on a larger number of processors, the com-
putation load on each processor decreases, while the communication and
synchronization loads increase. Since the JIT compilation technique reduces
the computation-to-communication ratio, the applications’ speedups signifi-
cantly drop on a larger number of processors.
JIT compilation has proven to be a key technique for improving the
performance of JVMs. In order to maintain a GOS, cluster-based JVMs
incur communication and synchronization overheads that are absent in non-
distributed JVMs. Based on these observations, we consider the JIT compi-
lation and GOS techniques as two orthogonal key techniques to improve the
performance of cluster-based JVMs.
Bibliography
[1] Distributed Shared Memory Homepage.
http://www.ics.uci.edu/~javid/dsm.html.
[2] Ganglia distributed monitoring and execution system.
http://ganglia.sourceforge.net/.
[3] High Performance Fortran (HPF) Forum.
http://www.crpc.rice.edu/HPFF/.
[4] HPJava Home Page. http://www.npac.syr.edu/projects/pcrc/HPJava/.
[5] InfiniBand Trade Association. http://www.infinibandta.org/home.
[6] Java Grande Forum. http://www.javagrande.org/.
[7] Jikes RVM. http://www-124.ibm.com/developerworks/oss/jikesrvm/.
[8] JSR-000133 Java Memory Model and Thread Specification Revision.
http://www.jcp.org/aboutJava/communityprocess/review/jsr133/.
[9] Kaffe Java Virtual Machine. http://www.kaffe.org.
[10] Maui Scheduler. http://www.supercluster.org/maui/.
[11] MPICH-A Portable Implementation of MPI. http://www-
unix.mcs.anl.gov/mpi/mpich/.
[12] Rocks Cluster Distribution. http://rocks.npaci.edu/Rocks/.
[13] The Byte Code Engineering Library. http://jakarta.apache.org/bcel/.
[14] The HKU Gideon 300 Cluster. http://www.csis.hku.hk/~clwang/gideon300-
main.html.
[15] The Java Memory Model. http://www.cs.umd.edu/users/pugh/java/memoryModel/.
[16] The Message Passing Interface (MPI) standard. http://www-
unix.mcs.anl.gov/mpi/.
[17] TOP500 Supercomputer Sites. http://www.top500.org/.
[18] Torque Resource Manager. http://www.supercluster.org/projects/torque/.
[19] Joint ACM Java Grande - ISCOPE 2001 Conference, Stanford Univer-
sity, California, USA, June 2001.
[20] Joint ACM Java Grande - ISCOPE 2002 Conference, Seattle, Washing-
ton, USA, November 2002.
[21] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models:
A Tutorial. IEEE Computer, 29(12):66–76, 1996.
[22] S. V. Adve and M. D. Hill. A Unified Formalization of Four Shared-
Memory Models. IEEE Trans. on Parallel and Distributed Systems,
4(6):613–624, 1993.
[23] C. Amza, A.L. Cox, S. Dwarkadas, L.-J. Jin, K. Rajamani, and
W. Zwaenepoel. Adaptive Protocols for Software Distributed Shared
Memory. In Proceedings of IEEE, Special Issue on Distributed Shared
Memory, volume 87, pages 467–475, March 1999.
[24] Y. Aridor, M. Factor, and A. Teperman. cJVM: a Single System Image
of a JVM on a Cluster. In Proc. of International Conference on Parallel
Processing, 1999.
[25] Yariv Aridor, Michael Factor, Avi Teperman, Tamar Eilam, and Assaf
Schuster. Transparently Obtaining Scalability for Java Applications on
a Cluster. Journal of Parallel and Distributed Computing, 60, Oct. 2000.
[26] Mark Baker. Cluster Computing White Paper. Technical report, IEEE
Task Force on Cluster Computing, December 2000.
[27] Mark Baker, Bryan Carpenter, Geoffrey Fox, Sung Hoon Ko, and Sang
Lim. mpiJava: An Object-Oriented Java Interface to MPI. In In-
ternational Workshop on Java for Parallel and Distributed Computing,
IPPS/SPDP 1999, April 1999.
[28] H. E. Bal, R. Bhoedjang, R. Hofman, C. Jacobs, K. Langendoen,
T. Ruhl, and M. F. Kaashoek. Performance Evaluation of the Orca
Shared Object System. ACM Transactions on Computer Systems, 16(1),
February 1998.
[29] J. Barnes and P. Hut. A Hierarchical O (N log N) Force-Calculation
Algorithm. Nature, 324(4):446–449, 1986.
[30] G. Bell and J. Gray. High Performance Computing: Crays, Clusters and
Centers. What Next?, 2001.
[31] Gilad Bracha, James Gosling, Bill Joy, and Guy Steele. The Java Lan-
guage Specification, Second Edition. Addison Wesley, 2000.
[32] Rajkumar Buyya, editor. High Performance Cluster Computing: Archi-
tecture and System, volume 1. Prentics Hall PTR, 1999.
[33] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Techniques for
Reducing Consistency-Related Communication in Distributed Shared-
Memory Systems. ACM Transactions on Computer Systems, 13(3):205–
243, 1995.
[34] Anthony Chan, William Gropp, and Ewing Lusk. User’s Guide for MPE:
Extensions for MPI Programs.
[35] B. Cheung, C.L. Wang, and Kai Hwang. A Migrating-Home Protocol for
Implementing Scope Consistency Model on a Cluster of Workstations. In
International Conference on Parallel and Distributed Processing Tech-
niques and Applications, pages 821–827, 1999.
[36] W.L. Cheung, C.L. Wang, and F.C.M. Lau. Annual Review of Scal-
able Computing, volume 4, chapter Building a Global Object Space for
Supporting Single System Image on a Cluster. World Scientific, 2002.
[37] Trishul M. Chilimbi, Thomas Ball, Stephen G. Eick, and James R.
Larus. StormWatch: A Tool for Visualizing Memory System Protocols.
In Supercomputing ’95, December 1995.
[38] Jong-Deok Choi, Manish Gupta, Mauricio J. Serrano, Vugranam C.
Sreedhar, and Samuel P. Midkiff. Escape Analysis for Java. In Pro-
ceedings of the Conference on Object-Oriented Programming Systems,
Languages, and Applications (OOPSLA), pages 1–19, 1999.
[39] C.L. Wang and A. Tam and B. Cheung and W. Zhu and D. Lee. Directed
Point: High Performance Communication Subsystem for Gigabit Net-
working in Clusters. Journal of Future Generation Computer Systems,
pages 401–420, 2002.
[40] Information Networks Division. Netperf: A Network Performance
Benchmark. Hewlett-Packard Company, revision 2.1 edition, February
1996.
[41] Jack Dongarra, Thomas Sterling, Horst Simon, and Erich Strohmaier.
High Performance Computing: Clusters, Constellations, MPPs, and Fu-
ture Directions. submitted to CACM for publication, June 2003.
[42] Michael Factor, Assaf Schuster, and Konstantin Shagin. JavaSplit: A
Runtime for Execution of Monolithic Java Programs on Heterogeneous
Collections of Commodity Workstations. In IEEE International Con-
ference on Cluster Computing, page 110, December 2003.
[43] Weijian Fang, Cho-Li Wang, and Francis Lau. Efficient Global Object
Space Support for Distributed JVM on Cluster. In the 2002 Interna-
tional Conference on Parallel Processing, August 2002.
[44] Weijian Fang, Cho-Li Wang, and Francis C.M. Lau. On the Design of
Global Object Space for Efficient Multi-threading Java Computing on
Clusters. Parallel Computing, 29:1563–1587, November 2003.
[45] Weijian Fang, Cho-Li Wang, Wenzhang Zhu, and Francis C. M. Lau.
A Novel Adaptive Home Migration Protocol in Home-based DSM. In
the 2004 IEEE International Conference on Cluster Computing (Cluster
2004), September 2004.
[46] Paulo Ferreira and Marc Shapiro. Garbage Collection and DSM Con-
sistency. In First Symposium on Operating Systems Design and Imple-
mentation, pages 229–241, Monterey, CA, 1994. ACM Press.
[47] Ian Foster. Designing and Building Parallel Programs: Concepts and
Tools for Parallel Software Engineering. Addison-Wesley, 1995.
[48] Thierry Gautier, Jean-Louis Roch, and Gilles Villard. Regular versus
Irregular Problems and Algorithms. In Parallel Algorithms for Irreg-
ularly Structured Problems, Second International Workshop, IRREGU-
LAR ’95, 1995.
[49] Rakesh Ghiya and Laurie J. Hendren. Putting Pointer Analysis to Work.
In 25th Annual ACM SIGACT-SIGPLAN Symposium on the Principles
of Programming Languages, pages 121–133, January 1998.
[50] R.W. Hockney. A Framework for Benchmark Performance Analysis.
Supercomputer, IX-2(48):9–22, 1992.
[51] Weiwu Hu, Weisong Shi, and Zhimin Tang. Home Migration in Home-
based Software DSMs. In Proc. of the 1st Workshop on Software Dis-
tributed Shared Memory (WSDSM’99), 1999.
[52] Richard L. Hudson and J. Eliot B. Moss. Incremental Collection of
Mature Objects. In International Workshop on Memory Management,
1992.
[53] K. Hwang, H. Jin, E. Chow, C.L. Wang, and Z. Xu. Designing SSI
Clusters with Hierarchical Checkpointing and Single I/O Space. IEEE
Concurrency Magazine, 7(1):60–69, Jan-Mar 1999.
[54] Kai Hwang and Zhiwei Xu. Scalable Parallel Computing. McGraw-Hill,
1998.
[55] L. Iftode. Home-based Shared Virtual Memory. PhD thesis, Princeton
University, August 1998.
[56] P. Keleher. Lazy Release Consistency for Distributed Shared Memory.
PhD thesis, 1994.
[57] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks:
Distributed Shared Memory on Standard Workstations and Operating
Systems. In Proc. of the Winter 1994 USENIX Conference, pages 115–
131, 1994.
[58] Leslie Lamport. How to Make a Multiprocessor Computer That Cor-
rectly Executes Multiprocess Programs. IEEE Transactions on Com-
puters, September 1979.
[59] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Memory
Systems. ACM Transactions on Computer Systems, 7(4), November
1989.
[60] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specifica-
tion, Second Edition. Addison Wesley, 1999.
[61] Matchy J. M. Ma, Cho-Li Wang, and Francis C. M. Lau. JESSICA:
Java-Enabled Single-System-Image Computing Architecture. Journal
of Parallel and Distributed Computing, 60(10):1194–1222, Oct. 2000.
[62] M. MacBeth, K. McGuigan, and P. Hatcher. Executing Java Threads in
Parallel in a Distributed-Memory Environment. In Proc. of IBM Center
for Advanced Studies Conference, 1998.
[63] David L. Mills. RFC 1305 - Network Time Protocol (Version 3) Speci-
fication, Implementation, March 1992.
[64] L. R. Monnerat and R. Bianchini. Efficiently Adapting to Sharing Pat-
terns in Software DSMs. In the 4th IEEE International Symposium on
High-Performance Computer Architecture, Feb 1998.
[65] Michael Philippsen and Matthias Zenger. JavaParty — Transpar-
ent Remote Objects in Java. Concurrency: Practice and Experience,
9(11):1225–1242, 1997.
[66] Jose M. Piquer and Ivana Visconti. Indirect Reference Listing: A Robust
Distributed GC. In Euro-Par ’98 Parallel Processing, September 1998.
[67] David Plainfossé and Marc Shapiro. A Survey of Distributed Garbage
Collection Techniques. In Proc. of Interational Workshop on Memory
Management, 1995.
[68] William Pugh. The Java Memory Model is Fatally Flawed. Concurrency:
Practice and Experience, 12(1):1–11, 2000.
[69] J. Ramanujam, S. Dutta, A. Venkatachar, and A. Thirumalai. Advanced
Compilation Techniques for HPF. In Proc. 7th International Workshop
on Compilers for Parallel Computers, Linkoping, Sweden, June 1998.
[70] M. C. Rinard, D. J. Scales, and M. S. Lam. Jade: A High Level Machine-
Independent Language for Parallel Programming. Computer, 26(6):28–
38, 1993.
[71] D. J. Scales, K. Gharachorloo, and C. A. Thekkath. Shasta: A Low
Overhead, Software-Only Approach for Supporting Fine-Grain Shared
Memory. In Proc. of the 7th Symp. on Architectural Support for Pro-
gramming Languages and Operating Systems (ASPLOSVII), pages 174–
185, 1996.
[72] Daniel J. Scales and Monica S. Lam. The Design and Evaluation of a
Shared Object System for Distributed Memory Machines. In Operating
Systems Design and Implementation, pages 101–114, 1994.
[73] Kazuyuki Shudo. Performance comparison of JITs.
http://www.shudo.net/jit/perf/, 2002.
[74] Sun Microsystems, Inc. RFC 1094 - NFS: Network File System Protocol
specification, March 1989.
[75] Sun Microsystems, Inc. Java Remote Method Invocation Specification.
1999.
[76] Sun Microsystems, Inc. The Java Hotspot Performance Engine Archi-
tecture, Oct. 1999.
[77] Sun Microsystems, Inc. Java Object Serialization Specification. 2001.
[78] Ronald Veldema, Rutger F. H. Hofman, Raoul Bhoedjang, and Henri E.
Bal. Runtime Optimizations for a Java DSM Implementation. In Java
Grande, pages 153–162, 2001.
[79] Paul R. Wilson. Uniprocessor Garbage Collection Techniques. In
Proc. Int. Workshop on Memory Management, number 637, Saint-Malo
(France), 1992. Springer-Verlag.
[80] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal
Singh, and Anoop Gupta. The SPLASH-2 programs: Characteriza-
tion and methodological considerations. In Proceedings of the 22th In-
ternational Symposium on Computer Architecture, pages 24–36, Santa
Margherita Ligure, Italy, 1995.
[81] Zhichen Xu, James R. Larus, and Barton P. Miller. Shared Memory
Performance Profiling. In Principles Practice of Parallel Programming,
pages 240–251, 1997.
[82] W. Yu and A. Cox. Java/DSM: A Platform for Heterogeneous Comput-
ing. In Proc. of ACM 1997 Workshop on Java for Science and Engi-
neering Computation, 1997.
[83] Matthew J. Zekauskas, Wayne A. Sawdon, and Brian N. Bershad. Soft-
ware Write Detection for a Distributed Shared Memory. In Proceedings
of the First Symposium on Operating Systems Design and Implementa-
tion (OSDI), 1994.