Java at Scale - Azul Systems, Inc. · 2020. 4. 24.
Java at Scale
TECHNOLOGY
WHITE PAPER
Introduction
Java is everywhere. Starting with its appearance in
1995 as part of the HotJava web browser, Java has
captured the attention of the software industry like no
programming language before, and has held that
attention for nearly two decades and counting. In
this paper we’ll talk a little about all the ways Java
gets used, why it has such wide appeal, and where it
has limitations that either prevent its use in specific
domains or require a great deal of effort to make it
fit. We will also talk about the Zing Virtual Machine,
Azul Systems’ answer to Java’s limitations.
Table of Contents
Introduction
Java in Use
Why Java?
Moore’s Law and Parkinson’s
The Memory Management Problem
A Classic Look at Application Response
A Few Realities about GC
How Does GC Work?
New Generation GC – Live Fast, Die Young
GC Terminology: This vs. That
Oracle HotSpot JVM: GC Options
Where Pauses Matter
Characterizing Pauses
Measuring Pauses
What Can You Do?
The Challenge of the Pauseless JVM
The Zing Virtual Machine
In Conclusion
About Azul Systems
Java in Use
Java’s greatest success is in back end server
systems implementing business rules. Many
development frameworks were written specifically to
tie front end systems (web based user interfaces
and command and control) to database systems for
transactional activities. The first of these to take
hold was Sun Microsystems’ own Java 2 Enterprise
Edition, with its concept of Enterprise Java Beans.
J2EE was all about building much of the low level
behavior into a set of standard software objects,
which the developer wired together to build applica-
tions. Application servers managed the lifecycle of
these objects, leaving the programmer (mostly) free
to implement what was special about his or her
application and ignore the rest. The benefit was
productivity at the expense of system resources.
Later frameworks like Spring improved the devel-
oper’s experience and allowed for easier mainte-
nance and enhancement of business applications by
making explicit actions that J2EE often buried in code.
But not all of Java’s use on servers was about
business logic. Many developers took advantage of
greater productivity with Java (we’ll get to the source
of that productivity shortly) to use it for a wide range
of applications, including high performance computing
(HPC) tasks, solving any performance limitations by
throwing more hardware at the problem. By using
many CPUs to run parts of their applications in
parallel, they could reduce execution time, work on
more complex data sets or reduce any performance
differential from Java vs. closer-to-the-hardware
languages like C, and sometimes all three at once.
A single Java application running on a single CPU
may have been slower than the same application
written in C, but by making it easier to leverage more
processors in the same computer and even more
processors in separate computers, Java applications
could be made to scale in a way that wasn’t practical
for other languages.
Java has taken hold less well on the client. This is
a little surprising, since at its introduction so much
attention was paid to Java as a client-side language.
HotJava was a web browser written entirely in Java,
but what got people’s attention was the idea of little
bits of Java code running on the user’s desktop,
extending the experience web developers could deliver.
HTML was severely limited; Java was not.
The reality of client-side Java is mixed at best. Early
network bandwidth limitations made downloading
Java applets on demand slow and unreliable.
Security concerns appeared early, and have only
become worse over time. And other solutions got
ahead of Java in the web browser, first Adobe Flash
with its orientation toward graphic designers instead
of programmers, and now JavaScript. As JavaScript
became more capable, as reusable component frameworks
were developed, and as Ajax techniques for
creating dynamic web content became standard, Java
became less of a factor on the client side of the
web experience. We have reached the stage where a
major security issue with browser-based Java
(US-CERT Alert TA13-010A) can cause large numbers
of users to disable Java in their browsers without
noticing any loss of capabilities in the applications
and web services they use.
Note the distinction between thin client applications
(Java in a web browser, with the heavy activity
running somewhere on a server) and fat client
computing. Java has significant value as a straight
programming language on a user’s desktop or laptop,
especially where developer productivity matters.
Non-browser Java applications are also far less
prone to the security issues reported for Java in the
browser, making them both safe and effective.
A final area of Java use is in embedded applications.
The greatest and most recent success here has
been in mobile phones. Google’s Android platform
is based heavily on Java, with applications running on
Google’s own virtual machine, known as
Dalvik. Android applications are all Java-based, and
Java provides a reliable, stable and efficient platform
for extending the capabilities of modern mobile
devices.
Why Java?
Java is portable
What is Java’s appeal, both to developers and to the
users of the applications they create? The first benefit
cited for Java is its portability, the idea of Write
Once, Run Anywhere. Java is compiled to a standard
instruction set, the byte code that is processed by
the Java Virtual Machine. Any kind of computer that
has a JVM installed can run applications written for
it, so there is no need for separate versions for every kind of
computer running every different operating system.
In practice, the situation isn’t so perfect, and Write
Once, Run Anywhere still means Test Everywhere. But
Java really is portable in ways C and C++ and other
languages never were and could never be.
Java is productive
Another benefit of Java comes from the design of
the language itself, the statements programmers
write and the concepts they support. Java learned
from its predecessors, using much that is good from
other languages but avoiding features that are overly
complicated or prone to abuse. C++ offers Multiple
Inheritance, the ability for an object to gain behavior
from multiple unrelated (or, in the most challenging
cases, partially related) base classes. Multiple Inheri-
tance complicates code maintenance, since it makes
it harder to determine just where specific capabilities
came from. And it can get even worse, with Virtual
Base Classes permitting inheritance relationships
back to a common ancestor. Incest is a bad idea in
life, and it’s no better in programming. Java provides
the benefits of Multiple Inheritance without the pain
through its Interface specification. Interfaces allow
an object to stand in for many different types
by declaring that it implements each type’s behavior.
Each class has exactly one line of inheritance, which
keeps things simpler, while its objects can still act as other types
where needed.
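
To make the point concrete, here is a minimal sketch of a class with a single line of inheritance that still acts as several types through interfaces. All of the names (Printable, Persistable, Document, Invoice) are hypothetical and chosen only for illustration.

```java
// Hypothetical types for illustration only.
interface Printable {
    String print();
}

interface Persistable {
    String save();
}

// A class has exactly one superclass...
class Document {
    protected final String title;

    Document(String title) {
        this.title = title;
    }

    String describe() {
        return "Document: " + title;
    }
}

// ...but it may implement any number of interfaces.
class Invoice extends Document implements Printable, Persistable {
    Invoice(String title) {
        super(title);
    }

    @Override
    public String print() {
        return "printing " + title;
    }

    @Override
    public String save() {
        return "saving " + title;
    }
}

class InterfaceDemo {
    public static void main(String[] args) {
        Invoice invoice = new Invoice("March");
        Printable p = invoice;   // the same object acts as a Printable
        Persistable s = invoice; // ...and as a Persistable
        Document d = invoice;    // ...and as its single base class
        System.out.println(p.print());
        System.out.println(s.save());
        System.out.println(d.describe());
    }
}
```

There is never any question about where Invoice inherits its implementation from: only Document contributes inherited code, while the interfaces contribute obligations.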
Another feature of C++ which seems a good idea but
leads to problems is called Operator Overloading.
Overloading allows a developer to redefine standard
operations like + and – for his or her own objects.
This sounds like a convenience and a time-saver,
allowing us to write “A = B + C” for character string
objects as a way of concatenating them, but it also
allows for relationships that are not nearly so obvious
or intuitive. C++ permits programmers to write a +
operation that modifies its left hand argument and
returns no result (e.g. if B is 5 and C is 3, “A = B +
C” would set B to 8 and leave A unchanged). This is a
bad idea; no reader of this code will expect such
behavior. Worse, most readers will assume a different
behavior and take that assumption with them as they
read on. Java does not provide Operator Overloading,
a decision that generated many complaints early on
but was in retrospect entirely justified.
Java also takes a Do the Right Thing philosophy,
making the decision to give up some performance to
provide consistency of behavior. C++ has a Do the
Efficient Thing approach, where no feature of the lan-
guage can reduce performance or increase resource
usage over C unless a particular developer decides
to use that feature. C++ makes it the programmer’s
decision to choose run-time method selection based
on object type vs. the more efficient compile-time
selection. That leads to mistakes, both by the
original developer who expected run-time behavior
different than what he coded, and by those who
come later and need to make sense of his code.
Java decided it was worth a little extra overhead to
be consistent and to make developers’ jobs easier.
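
As a small illustration of that consistency, the sketch below shows that an ordinary Java instance method is selected by the run-time type of the object, with no keyword needed to ask for it. The Shape and Circle classes are hypothetical.

```java
// Hypothetical classes for illustration only.
class Shape {
    String name() {
        return "shape";
    }
}

class Circle extends Shape {
    @Override
    String name() {
        return "circle";
    }
}

class DispatchDemo {
    // The parameter is declared as Shape, but the method that runs
    // is chosen by the actual type of the object passed in.
    static String describe(Shape s) {
        return s.name();
    }

    public static void main(String[] args) {
        Shape s = new Circle();
        System.out.println(describe(s));           // uses Circle's name()
        System.out.println(describe(new Shape())); // uses Shape's name()
    }
}
```

In C++ the equivalent call would use the base class version unless the method had been declared virtual, which is exactly the choice Java takes away from the programmer.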
But perhaps the greatest difference between Java
and its competitors is in its approach to memory
management. Languages like C and C++ make allocating
space for objects the programmer’s problem,
and especially reclaiming the space for those objects
when they are no longer needed. For every new
object a C++ programmer creates, there must be a
corresponding delete operation to release the space for
reuse. Java replaces this explicit release operation
with a Garbage Collection mechanism that operates
automatically and efficiently. Garbage Collection
eliminates a huge number of application problems,
many of which can go undetected in other languages
for a long time and be difficult to identify when they
do appear.
Java is efficient
Early Java implementations were slow, relying on
interpreters to process Java byte code instruction
by instruction. Byte code interpreters were quickly
augmented by Just In Time (JIT) compilers. JIT compilers
convert byte code to the native instructions of
the target computer as the program begins to run. The
programs on disk are still the original and portable
byte code, but once the program is loaded,
it is converted to the vastly more efficient machine
code, either class by class or method by method
depending on the implementation. JIT compilers put
Java programs in the same ballpark as C and C++,
still slower because of Java’s greater flexibility but fast
enough for many tasks.
Over time, and as computers became faster and their
resources grew, JIT compilers turned into dynamic
recompilation. Modern Java implementations monitor
the way an application uses its code, changing that
code and its references to other code to make it more
efficient for the reality of its execution. Methods can
be inlined (brought into the method that uses them to
avoid the overhead of invoking them), or pushed back
out if they are used infrequently and their size bloats
the calling code. Knowing the cache and memory sys-
tem behavior of the target computer can enable very
specific decisions that would not be practical in the
more general case. With dynamic recompilation, Java
now has the ability to be faster than other languages
for specific application domains.
Java is generic
Although the Java Virtual Machine was written to
support the execution (and translation to machine
instructions) of Java byte code, and although Java byte
code was written to support the Java language, Java
is not the only use JVMs see. Many other languages
have been moved to the JVM, new languages like
Scala and Clojure, as well as Java variants of existing
languages like JRuby. By writing compilers to convert
their source to Java byte code and tools to support
the resulting applications in terms that are meaningful
to their developers, maintainers of these languages get
the portability, performance, and tooling benefits of a
widely used platform.
Back in 1989 an industry body called the Open
Software Foundation put out a proposal for ANDF, the
Architecture Neutral Distribution Format. Their idea
was that compilers would take source code and produce
an intermediate representation; this code would then
be turned into real object code when it reached the
destination computer. The result would be more effi-
cient execution based on the knowledge of the precise
configuration of the target. ANDF never went anywhere,
but perhaps we can see some of ANDF’s promise in
the way other languages take advantage of Java’s
infrastructure.
Java is scalable
As we have already discussed, Java shows up every-
where and in everything from the smallest devices to
the largest servers. Java scales to address problems
up and down the line. But as we will see, there are
limits to just how big Java can get.
Moore’s Law and Parkinson’s
Hardware keeps getting bigger, and the applications
that run on that hardware get bigger just as fast.
Moore’s Law taught us to expect transistor counts
on chips to double roughly every eighteen months.
Memory sizes have grown about 100 times every ten
years. Although we will eventually hit limits, we can
expect these trends to continue a while longer.
Parkinson’s Law says that work expands to fill the
time allotted. It isn’t a great stretch to apply that to
software, which is inevitably late and takes up every
bit of resource it can have, if not more. As we look
at computing history, we can see application growth
that keeps right in line with our hardware.
Think about in-memory computing over time, the
size of an application’s data that we can work on in
memory rather than bringing in from disk piecemeal.
In 1980 a typical application might keep 100 kilobytes
in memory; servers then had between ¼ and ½ a
megabyte of main memory. Move ahead to 1990
and you might find 10 MB of data on a 16 to 32 MB
server. Another jump to 2000 (ignoring Y2K issues)
and applications were using 1 gigabyte of data on
a 2 to 4 GB server. By 2010 it wasn’t uncommon
for server applications to use 100 GB of memory
on servers with 256 GB. In each case, applications
were taking roughly the same proportion of available
server resources. And these are commodity servers;
at this writing a server with 24 processor cores and
256 GB of memory can be had for $8000 US.
Java has not kept up with this trend of growth in application
size. Most developers report that the Java
heap for their applications (the chunk of memory
Java manages for creating objects) runs between 2
and 4 GB. We see a few applications, a very few, with
larger heaps. Why aren’t there many more big Java
applications? Why do they stop at a few percent of
available resources?
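
The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted above (100 GB applications and 2 to 4 GB Java heaps on a 256 GB server, and memory growing 100 times per decade); the class and method names are our own.

```java
class GrowthArithmetic {
    // Share of server memory in use, as a percentage.
    static double percentOf(double usedGb, double serverGb) {
        return 100.0 * usedGb / serverGb;
    }

    // Months per doubling implied by a growth factor per ten years.
    static double doublingMonths(double factorPerDecade) {
        double doublings = Math.log(factorPerDecade) / Math.log(2.0);
        return 120.0 / doublings;
    }

    public static void main(String[] args) {
        // A 2010-era application using 100 GB of a 256 GB server.
        System.out.printf("typical application: %.1f%%%n", percentOf(100, 256));
        // A Java heap of 2 to 4 GB on the same server.
        System.out.printf("Java heap: %.1f%% to %.1f%%%n",
                percentOf(2, 256), percentOf(4, 256));
        // 100x per decade expressed as a doubling period.
        System.out.printf("doubling every %.1f months%n", doublingMonths(100));
    }
}
```

A 2 to 4 GB heap is under 2% of a 256 GB server, against roughly 40% for the typical application, and growth of 100 times per decade works out to a doubling about every eighteen months, in step with the Moore’s Law figure quoted earlier.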
The Memory Management Problem
Java has a problem, one that comes from one of its
greatest features. Java’s memory manager makes
developers’ lives easier and reduces the worst kinds
of bugs. But it also introduces a lack of consistency.
Periodically Java’s Garbage Collector has to run to
identify no-longer-used objects and reclaim their
space. When the GC runs, the application stops.
Sometimes it stops long enough for people to notice.
The graph you see represents the behavior of an
application, in this case a benchmark of an in-memory
cache. This graph was produced by an Open Source
tool called jHiccup. jHiccup runs with a target
application and measures hiccups or stalls, times
when neither it nor the target application was able
to get any work done. The Y axis of the graph
represents the duration of those stalls. The horizon-
tal lines show percentiles, how often stalls were
observed.
In this example, we see a worst case stall of 8½
seconds. 0.01% of the time we experienced stalls of
that size (a second line barely visible below the first).
0.1% of the time we saw stalls of 8 seconds (the next
line down). 1% of the time we had stalls of roughly
5½ seconds. These stalls would delay the results of
our application, and are in addition to the actual work
the application had to do.
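
The idea behind this kind of measurement can be sketched in a few lines. What follows is not jHiccup’s implementation; it is a simplified illustration, under our own assumptions, of timing a short sleep repeatedly and treating any extra delay as a stall that the application would also have experienced. The class name, method names, and the nearest-rank percentile rule are ours.

```java
import java.util.Arrays;

class HiccupSketch {
    // Sleep for a fixed interval over and over, recording how much longer
    // than requested each wake-up took, in milliseconds.
    static long[] sample(int count, long intervalMillis) throws InterruptedException {
        long[] hiccups = new long[count];
        for (int i = 0; i < count; i++) {
            long start = System.nanoTime();
            Thread.sleep(intervalMillis);
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            hiccups[i] = Math.max(0, elapsedMillis - intervalMillis);
        }
        return hiccups;
    }

    // Nearest-rank percentile: the smallest sample that is greater than or
    // equal to the requested share of all samples.
    static long percentile(long[] samples, double pct) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) throws InterruptedException {
        long[] hiccups = sample(1000, 1);
        System.out.println("99%    : " + percentile(hiccups, 99) + " msec");
        System.out.println("99.9%  : " + percentile(hiccups, 99.9) + " msec");
        System.out.println("max    : " + percentile(hiccups, 100) + " msec");
    }
}
```

Run inside the same JVM as a busy application, a loop like this stalls whenever the whole JVM stalls, which is why the percentile lines on the graph describe what users would see.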
The graph shows unacceptable performance for a
user-interactive application. A request that takes an
extra 8 or 9 seconds likely means a frustrated user
and perhaps a lost customer. Unfortunately this is
not an uncommon situation for Java applications
with large memory heaps. This particular heap was
29 GB, which helps to explain why we don’t see heaps
of this size all that often. The length of application
stalls varies linearly with the size of memory, while
frequency varies with the application workload.
[Figure: Hiccups by Time Interval. 10 GB cache, 29 GB heap, HotSpot, Parallel GC. X axis: Elapsed Time (sec), 0 to 700. Y axis: Hiccup Duration (msec), 0 to 10000. Series: Max per Interval, 99%, 99.90%, 99.99%, Max.]
This second graph presents the data from the same
application in terms of service levels. Here we can
map SLA requirements against application behavior.
What is acceptable as a worst case? What do we
require 99% of the time, or 90%? As you can see
by this graph, any attempt to measure and evaluate
performance based on mean times and standard
deviations is not going to describe Java performance
accurately. GC-related stalls do not fit a bell curve.
Worst case results will be many standard deviations
away from the norm, yet occur far more frequently
than a normal distribution would suggest.
[Figure: Hiccups by Percentile Distribution. X axis: Percentile (0%, 90%, 99%, 99.9%, 99.99%). Y axis: Hiccup Duration (msec), 0 to 10000. Max = 8634.368 msec.]
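
A small calculation shows why means and standard deviations mislead here. The data in this sketch is synthetic and invented for illustration (990 responses with a 1 msec stall and 10 with an 8000 msec stall); it is not the benchmark data behind the graphs.

```java
import java.util.Arrays;

class TailVersusMean {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Population standard deviation.
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0;
        for (double x : xs) sumSq += (x - m) * (x - m);
        return Math.sqrt(sumSq / xs.length);
    }

    // Synthetic sample: 1% of the stalls are long, the rest are negligible.
    static double[] syntheticStalls() {
        double[] xs = new double[1000];
        Arrays.fill(xs, 1.0);
        for (int i = 0; i < 10; i++) xs[i] = 8000.0;
        return xs;
    }

    // How many standard deviations a value lies above the mean.
    static double sigmasAboveMean(double value, double[] xs) {
        return (value - mean(xs)) / stdDev(xs);
    }

    public static void main(String[] args) {
        double[] stalls = syntheticStalls();
        System.out.printf("mean   = %.2f msec%n", mean(stalls));
        System.out.printf("stddev = %.2f msec%n", stdDev(stalls));
        System.out.printf("worst case is %.1f standard deviations above the mean%n",
                sigmasAboveMean(8000.0, stalls));
    }
}
```

In this made-up sample the worst case sits roughly ten standard deviations above the mean, yet it happens one time in a hundred. A bell curve would treat such an event as essentially impossible, which is why percentiles are the more honest description.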
A Classic Look at Application Response
The first graph below from IBM’s CICS server
documentation illustrates our expectation about
application performance and workload. If the load is
moderate, we expect good performance as measured
by the response time to the user. As the load
increases, the response time grows. Eventually load
can grow enough to move response from acceptable
to poor. The assumptions are that response is
related directly to load, and that response times
experienced by individual users are close to the
mean (that bell curve again).
Java applications exhibit a very different behavior. In
the next graph we see application load, represented
as Active User Count, and response time for three
different tasks. Note the large spikes for the
update_user_details task. These spikes occur with regularity
when load is both low and high. Frequency is not
associated with load in any obvious way, and neither
is severity. We see spikes for slow transactions both
early in the run when load is low and later as it
increases. Fixing these spikes is not a simple matter
of either reducing workload or increasing capacity to
get acceptable performance as it was in the day of
IBM CICS.
A Few Realities about GC
Java’s reliance on garbage collection provides a large
set of benefits, in terms of ease of development,
reduced opportunities for errors, and performance.
But all is not perfect, as we have already discussed.
Let’s look at some of the benefits and drawbacks of
automatic garbage collection.
On the plus side, GC is very efficient, often more
so than the malloc() and free() functions used by C and C++.
A tracing collector spends its time on live objects, so
there is little or no overhead associated with collecting
dead objects, making GC particularly advantageous
for dynamic languages that create and discard large
numbers of objects. And GC is able to find all the
dead objects without help from the developer. Even
cyclic graphs that link back to themselves will be
collected quickly and efficiently.
One downside to GC is that it does take time to
process its way through live objects, typically about 1
second for every GB it encounters. GCs do their best
to reduce the visibility of this processing; they can’t do
much to eliminate it. We’ll talk more about that in a bit.
The inevitability of GC processing means that tuning
to avoid GC stalls generally just delays them past the
test interval. Tune a Java application to avoid a stall
for 20 minutes, and you’re likely to find that it will
show up in the 21st minute, or the 47th, or some-
where else down the line. Delaying them is easy;
getting rid of them is another matter.
Finally, as good as GC is, it is not a panacea. Applica-
tions can still have memory leaks, simply by holding
on to references to objects that they no longer need.
The GC only reclaims objects no longer referenced.
It can do nothing about objects that are referenced
unnecessarily. Still, this is far better than the C and
C++ situation where keeping track of objects is
entirely the programmer’s responsibility.
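
Here is a minimal sketch of that kind of leak, together with one common remedy. The names (LeakSketch, HISTORY, handleLeaky, boundedCache) are hypothetical; the bounded cache relies on the standard LinkedHashMap eviction hook and is only one of several ways to cap retained objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LeakSketch {
    // Leaky: every request stays referenced from a static list, so the GC
    // can never reclaim it, even though nothing will read it again.
    static final List<byte[]> HISTORY = new ArrayList<>();

    static void handleLeaky(byte[] request) {
        HISTORY.add(request);
    }

    // Bounded: the map drops its least recently used entry once it holds
    // more than maxEntries, making the dropped object collectable again.
    static <K, V> Map<K, V> boundedCache(final int maxEntries) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        Map<Integer, byte[]> cache = boundedCache(3);
        for (int i = 0; i < 10; i++) {
            handleLeaky(new byte[1024]);
            cache.put(i, new byte[1024]);
        }
        System.out.println("leaky list holds " + HISTORY.size() + " requests");
        System.out.println("bounded cache holds " + cache.size() + " entries");
    }
}
```

The collector behaves correctly in both cases. The difference is entirely in what the application chooses to keep referenced.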
How Does GC Work?
Garbage collection has three phases, often called
Mark, Sweep, and Compact. The Mark phase starts
with any static variables and any pointers on the
stack. It follows each pointer, marking that object as
live, and then follows each of that object’s pointers.
Eventually we reach objects we’ve already marked or
we run out of pointers to check. When we’re done,
everything live is marked, and anything not marked
is not live.
Now we go on to the Sweep phase. Sweep captures
all of the unmarked objects into a free pool. The
free pool will likely be fragmented, with areas of free
space being broken up by live objects. But as long
as enough free space exists for the new objects we
want to create, we can hold off on the next, most
expensive phase.
Compact is the last phase, and the one that generally
involves the most work. Periodically, the GC will
push all the live objects together, squeezing out
the dead objects, leaving behind the largest
possible free space, and updating every pointer to a
moved object so that it points to the new location.
Compaction has two benefits: first, it eliminates
fragmentation and permits the creation of objects of
whatever size an application needs, up to the size
of the free space itself; and second, it improves
memory locality by letting new objects be created on
the same page or nearby pages. Compaction can be deferred,
but at some point it will likely be necessary just to permit
large objects to be created.
Modern GCs use a generational scheme, which we
will discuss shortly. Full GCs, also referred to as Old
Generation GCs, perform Mark, Sweep, and Compact
as three separate passes. This has the benefit of
requiring little extra heap space while the GC runs,
so it can be delayed until the last moment.
[Figure: The Java heap, containing objects of different sizes]
[Figure: The heap after the Mark phase, with live objects marked in green]
[Figure: After the Sweep phase, dead objects in blue are collected into a linked list of free space]
[Figure: After the Compact phase, live objects are contiguous and there is one large free space for new objects]
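
The three phases can be sketched with a toy heap. This is a teaching model under our own assumptions, not how any real JVM lays out memory: sizes and addresses are abstract units, the names ToyObject and ToyHeap are invented, and because the toy uses Java references rather than raw addresses, renumbering the address field stands in for pointer fix-up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class ToyObject {
    final String name;
    final int size;
    int address;
    boolean marked;
    final List<ToyObject> refs = new ArrayList<>();

    ToyObject(String name, int size, int address) {
        this.name = name;
        this.size = size;
        this.address = address;
    }
}

class ToyHeap {
    final int capacity;
    final List<ToyObject> objects = new ArrayList<>(); // kept in address order
    final List<ToyObject> roots = new ArrayList<>();   // stands in for statics and stack
    int top = 0;                                       // next free address

    ToyHeap(int capacity) {
        this.capacity = capacity;
    }

    ToyObject allocate(String name, int size) {
        if (top + size > capacity) throw new IllegalStateException("toy heap full");
        ToyObject o = new ToyObject(name, size, top);
        top += size;
        objects.add(o);
        return o;
    }

    // Mark: follow pointers from the roots, stopping at objects already marked.
    void mark() {
        for (ToyObject o : objects) o.marked = false;
        Deque<ToyObject> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            ToyObject o = pending.pop();
            if (o.marked) continue;
            o.marked = true;
            pending.addAll(o.refs);
        }
    }

    // Sweep: release every unmarked object and report the space freed.
    int sweep() {
        int freed = 0;
        for (Iterator<ToyObject> it = objects.iterator(); it.hasNext();) {
            ToyObject o = it.next();
            if (!o.marked) {
                freed += o.size;
                it.remove();
            }
        }
        return freed;
    }

    // Compact: slide the live objects together at the low end of the heap.
    void compact() {
        int next = 0;
        for (ToyObject o : objects) {
            o.address = next;
            next += o.size;
        }
        top = next;
    }

    // Largest contiguous run of free space, counting gaps and the tail.
    int largestFreeBlock() {
        int largest = 0;
        int cursor = 0;
        for (ToyObject o : objects) {
            largest = Math.max(largest, o.address - cursor);
            cursor = o.address + o.size;
        }
        return Math.max(largest, capacity - cursor);
    }

    static ToyHeap example() {
        ToyHeap heap = new ToyHeap(100);
        ToyObject a = heap.allocate("a", 10);
        ToyObject b = heap.allocate("b", 20);
        ToyObject c = heap.allocate("c", 10);
        ToyObject d = heap.allocate("d", 20);
        ToyObject e = heap.allocate("e", 10);
        heap.roots.add(a);
        a.refs.add(c); c.refs.add(e); e.refs.add(a); // live cycle
        b.refs.add(d); d.refs.add(b);                // dead cycle, unreachable from roots
        return heap;
    }

    public static void main(String[] args) {
        ToyHeap heap = example();
        heap.mark();
        System.out.println("freed by sweep: " + heap.sweep());
        System.out.println("largest free block before compact: " + heap.largestFreeBlock());
        heap.compact();
        System.out.println("largest free block after compact: " + heap.largestFreeBlock());
    }
}
```

In the example, the dead objects b and d point at each other and are still reclaimed, because marking starts from the roots rather than counting references. Sweeping frees space but leaves it in pieces; only compaction turns it into one large block.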
New Generation GC – Live Fast, Die Young
Generational GC is an important performance
optimization for Java. First developed for Lisp, it
starts with an assumption that most objects have
short lifespans. By scanning just newly created
objects, the GC can reclaim the largest amount
of free space with the smallest effort. Eventually
objects that survive these New Generation GCs are
promoted to Old Generation status. Old Generation
objects endure GC less frequently, since the cost is
higher (bigger space to scan) and the benefit is lower
(reduced chance to reclaim anything there).
New generation GCs use a copying scheme, where
live objects are moved to a copy space on the heap
as they are encountered. After everything has been
found and moved to the copy space, the old heap
space becomes free space. Copying can require up to twice the
space of the area being collected, since all the objects in the area may still be
live, but all the processing can be done in one pass.
New Generation GC requires something called a
Remembered Set. Where a Full GC starts with static
variables and the stack to begin its scan, a New Gen
GC wants to reduce the pointers it has to follow to
find live objects. A Remembered Set keeps track
of every pointer to a New Generation object from
outside the New Gen area. The GC then only has
to follow those pointers, and only if they still point
to something in New Gen. (The pointer may now
point to something in Old Gen, making it irrelevant
for New Gen GC.) This means that Java’s runtime
system must track every pointer stored and record
those that point to New Gen. This requires additional
memory and additional processing not required by
non-GC languages, but the costs are paid back in
simplicity and reduced programmer errors.
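
The bookkeeping described here can be sketched as a toy model. Everything in it is a simplification of our own: the class names are invented, the remembered set records the Old Gen objects that hold such pointers rather than the pointers themselves, and every survivor is promoted immediately, where a real collector would let objects age first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class GenObject {
    final String name;
    boolean old; // false: New Gen, true: Old Gen
    final List<GenObject> refs = new ArrayList<>();

    GenObject(String name) {
        this.name = name;
    }
}

class ToyGenerationalHeap {
    final Set<GenObject> newGen = new LinkedHashSet<>();
    final Set<GenObject> oldGen = new LinkedHashSet<>();
    final Set<GenObject> roots = new LinkedHashSet<>();
    // Old Gen objects known to hold a pointer into New Gen.
    final Set<GenObject> rememberedSet = new LinkedHashSet<>();

    GenObject allocate(String name) {
        GenObject o = new GenObject(name);
        newGen.add(o);
        return o;
    }

    // Every pointer store goes through this barrier, which records
    // stores that create an Old Gen to New Gen pointer.
    void store(GenObject holder, GenObject target) {
        holder.refs.add(target);
        if (holder.old && !target.old) rememberedSet.add(holder);
    }

    // New Gen collection: trace only New Gen objects, starting from the
    // roots and the remembered set. Returns the number of objects reclaimed.
    int collectNewGen() {
        Deque<GenObject> pending = new ArrayDeque<>();
        for (GenObject r : roots) {
            if (!r.old) pending.add(r);
        }
        for (GenObject holder : rememberedSet) {
            for (GenObject t : holder.refs) {
                if (!t.old) pending.add(t); // skip pointers that now lead to Old Gen
            }
        }
        Set<GenObject> survivors = new LinkedHashSet<>();
        while (!pending.isEmpty()) {
            GenObject o = pending.pop();
            if (o.old || !survivors.add(o)) continue;
            pending.addAll(o.refs);
        }
        int reclaimed = newGen.size() - survivors.size();
        for (GenObject s : survivors) {
            s.old = true;
            oldGen.add(s);
        }
        newGen.clear();
        rememberedSet.clear(); // nothing is left in New Gen to point at
        return reclaimed;
    }

    public static void main(String[] args) {
        ToyGenerationalHeap heap = new ToyGenerationalHeap();
        GenObject cache = heap.allocate("cache");
        heap.roots.add(cache);
        heap.collectNewGen();     // cache survives and is promoted
        GenObject x = heap.allocate("x");
        heap.allocate("y");
        heap.allocate("z");
        heap.store(cache, x);     // Old Gen now points into New Gen
        System.out.println("reclaimed: " + heap.collectNewGen());
        System.out.println("x promoted: " + x.old);
    }
}
```

In the example, x is reachable only through an Old Gen object. The New Gen collection never scans Old Gen, so without the entry made by the store barrier x would have been reclaimed by mistake.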
GC Terminology: This vs. That
Before we can discuss the finer points of Garbage
Collection implementations, we need to define some
terms. Perhaps the biggest fallacy in the computer
industry is the idea that we have a common termi-
nology. We don’t, and much confusion results from
treating different words as if they are the same or
applying different meanings to the same word.
Concurrent vs. Parallel: A Concurrent GC is one that
can run at the same time as the application. A Paral-
lel GC has multiple parallel tasks which can run at
the same time on separate CPU cores. A GC may be
Concurrent but not Parallel (lets the application run
but only does one task at a time itself), Parallel but
not Concurrent (performs GC work in multiple cores
simultaneously but does not permit the application
to work while it does), both Concurrent and Parallel,
or neither.
Concurrent vs. Stop-the-World: The opposite of a
Concurrent GC is one that stops all the application’s
work while it runs. A Stop-the-World collector may be
Parallel, which will reduce the duration of the stop.
Incremental: An Incremental GC breaks its work up
into discrete chunks. By performing a little work at
a time, it permits the application to run with more
short stalls instead of one long one. The total time
an application is stalled will be the same, or perhaps
a bit greater, but the perceived effect will be reduced.
Precise vs. Conservative: A Conservative GC does
not know the location of every pointer in memory,
and cannot distinguish between pointers and bit
patterns that might be pointers. Conservative GCs
can’t relocate objects (the Compact phase), since
they can’t find and update every pointer to those
objects. Precise GCs do know every reference, thanks
to information provided by the compiler. Java relies
on Precise GC; by contrast, languages like C and C++
can only be provided with Conservative GCs.
Safepoints: GC relies on safepoints, locations or
ranges of locations in code where the GC can identify
every reference and process it. GC has to wait for
each thread to reach a safepoint before the GC can
proceed. That generally means pausing the thread,
although code executed via the Java Native Interface
(JNI) is safe to keep running as long as the thread does not
touch Java space. Safepoints need to be reached
frequently, so the GC has a chance to run when
needed. There are also global safepoints, where
every thread needs to reach a safepoint and remain
there. These create Stop-the-World pauses.
New Generation GCs are typically copying collectors,
and are Monolithic (not Incremental) and Stop-the-
World. New Gen areas are small, so they don’t stop
the application for long. Old Generation GCs are
usually Mark/Sweep/Compact. They may be Stop-
the-World, or Concurrent, or mostly Concurrent, or
Incremental Stop-the-World, or mostly Incremental
Stop-the-World.
That word mostly is important. It means that the GC
will sometimes be forced into a Monolithic Stop-the-
World GC, and that means long stalls.
Oracle HotSpot JVM: GC Options
The current version of Oracle HotSpot offers three
different garbage collectors which are selectable with
command switches. Each has strengths and weak-
nesses, and each will be better or worse for different
applications.
Parallel GC: Parallel GC is the default GC. It combines
a Monolithic Stop-the-World New Generation GC with
a Parallel Monolithic Stop-the-World Old Generation
GC. As we discussed earlier, that means that Old
Gen GC will stall the application while it runs, although
the length of the stall will be reduced by the parallel-
ism of the GC itself.
CMS GC: Concurrent Mark Sweep GC has been avail-
able in HotSpot for a long time. It uses the same
Monolithic Stop-the-World New Gen GC as Parallel
GC. For Old Gen, CMS offers a mostly Concurrent
non-compacting collector. The CMS Mark phase is
mostly Concurrent, requiring multiple passes through
the heap. The CMS Sweep phase is fully Concurrent.
It maintains free space as a linked list of dead
objects. When CMS runs out of memory chunks big
enough for new objects due to fragmentation, it will
be forced to perform a long and expensive Monolithic
Stop-the-World Mark/Sweep/Compact. CMS is likely
to work best where new objects tend toward a similar
size, since their requirements will match available
spaces in the free list without pushing toward a
Compaction event.
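
A toy free list shows how fragmentation can defeat an allocation even when plenty of memory is free in total. This is not the CMS allocator; the first-fit policy, the class name, and the chunk sizes are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class FreeListSketch {
    // Sizes of the free chunks left behind by a non-compacting sweep.
    final List<Integer> freeChunks = new ArrayList<>();

    FreeListSketch(int... chunkSizes) {
        for (int size : chunkSizes) freeChunks.add(size);
    }

    int totalFree() {
        int total = 0;
        for (int size : freeChunks) total += size;
        return total;
    }

    // First-fit allocation: use the first chunk that is large enough and
    // keep any remainder on the list. Returns false when no single chunk
    // fits, which is the point where compaction becomes unavoidable.
    boolean allocate(int size) {
        for (int i = 0; i < freeChunks.size(); i++) {
            int chunk = freeChunks.get(i);
            if (chunk >= size) {
                if (chunk == size) {
                    freeChunks.remove(i);
                } else {
                    freeChunks.set(i, chunk - size);
                }
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        FreeListSketch heap = new FreeListSketch(4, 4, 4);
        System.out.println("total free: " + heap.totalFree());
        System.out.println("allocate 8: " + heap.allocate(8));
        System.out.println("allocate 4: " + heap.allocate(4));
    }
}
```

With twelve units free in three chunks of four, a request for eight fails while a request for four succeeds. That is why a workload of similarly sized objects suits a free list so well.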
G1GC: The Garbage First GC was offered as an
experimental option prior to the release of Java 7,
but is now a fully supported alternative to Parallel GC
and CMS. G1GC has the same Monolithic Stop-the-World
New Gen GC as the others. G1’s Old Gen GC uses a
mostly Concurrent Mark phase, with Stop-the-World
pauses to deal with mutations (objects changing the
targets of pointers). It breaks Old Gen into regions,
and has to keep track of references between regions
using Remembered Sets. G1’s Compact phase
identifies regions that can be compacted in a limited
time. It holds off Compacting popular objects and
regions. The stated goal of G1 was to “avoid, as
much as possible, having a full GC.” Full GCs still
happen, and are supported by a Monolithic Stop-the-
World Mark/Sweep/Compact GC. Full GC is required
to Compact popular objects and regions.
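
To see which collectors a running JVM is actually using, the standard management API can be queried. The sketch below assumes a HotSpot JVM started with one of the switches -XX:+UseParallelGC, -XX:+UseConcMarkSweepGC or -XX:+UseG1GC; which switches are accepted, and which collector is the default, depends on the JDK version, and the collector names reported vary accordingly. The class name ListCollectors is our own.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class ListCollectors {
    // One line per collector the JVM reports, with its collection count so far.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName() + " (collections so far: " + gc.getCollectionCount() + ")");
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : collectorNames()) {
            System.out.println(name);
        }
    }
}
```

Running the same program under each switch is a quick way to confirm that the intended New Gen and Old Gen collectors are in effect before any tuning begins.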
Where Pauses Matter
Now that we have looked at the technology, let’s step
back and consider how much GC behavior matters to
different applications. Not all applications will be affected
by GC stalls equally, and not all will see stalls
as a problem.
Stalls are a big problem for interactive applications
like ecommerce. Online customers and users have
little tolerance for slow systems, and may abandon
a transaction or even a business after a few bad
experiences. A system like Java that can randomly
add many seconds to transaction time seems a poor
choice for interactive applications. Keeping stalls
under control is vital.
On the other extreme are batch applications. Here
the user is concerned with the start-to-finish time of
an application, not how long each operation takes.
Stalls of even many minutes may not be an issue if
the time to complete is acceptable.
Big Data applications are more likely to be affected
by stalls; the more memory that applications work
with the longer the stalls are likely to be. Using large
memory offers significant benefits in many cases. Examples
include travel sites that want to keep hotel inventory
in memory for fast access, interactive business
intelligence applications that perform drill-downs and sorts, and
search applications that want quick access to their
indexes of search terms.
Other applications realize efficiency and management
benefits from using more memory. A smaller number
of larger JVMs may be able to do more work with the
same resources and will be easier to configure and
monitor. But again, larger JVM heaps mean more
severe stalls. The typical application server model
uses many smaller instances for just this reason.
Even some Low Latency applications use Java. They
require consistent performance with variability far
below what other developers would find acceptable.
Low Latency applications often have constant data
feeds they must respond to; to stall would be to drop
some data. Low Latency applications use Unsafe
Java APIs to do as much work as possible off-heap.
They concern themselves entirely with New Gen GC,
as an Old Gen GC would quickly violate their performance
requirements. Stalls beyond a few milliseconds
or in some cases hundreds of microseconds are
unacceptable. Achieving that level of performance
requires Low Latency support at the operating system
level, as well as considerable tuning.
Characterizing Pauses
Two factors affect GC behavior: the way the
application creates and manipulates objects and
the amount of memory the GC must process. The
frequency of GC stalls is most affected by application
characteristics: at what rate does it create new objects,
how big are they, how often are their contents modified,
and how long do they live? The severity comes
down to heap size, particularly the amount of heap
taken by live objects. Pause length is generally an
Old Generation issue rather than New Gen, which is
usually too small to trigger a long pause. Further, the
problem may not be the amount of overhead
associated with GC but where it happens. A 30
second pause for GC will certainly be noticeable,
while 300 pauses of 100 milliseconds each may
pass undetected.
Consider GC overhead for a moment. The worst case
is that the live set takes up the entire heap. The GC
runs, but is unable to reclaim more than a few bytes.
So it runs again, and again fails to return anything.
We have 100% GC overhead and no application
execution at all.
The best case is the opposite. With infinite memory
an application can create objects forever. Memory
never runs out, so GC never runs. 100% application
execution, 0% GC overhead, no stalls, no problem.
Reality is somewhere in the middle. GC overhead
follows a 1/x curve: given a fixed live set size, as the
heap size increases, GC overhead decreases as a
percentage of the total.
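The 1/x shape can be illustrated with a deliberately simple model. The sketch below assumes that one GC cycle costs work proportional to the live set and frees the rest of the heap for new allocation; that cost model, the class name GcOverheadModel, and the sample sizes are assumptions made for illustration, not measurements of any collector.

```java
public class GcOverheadModel {
    /**
     * Relative GC work per unit of allocation, assuming one cycle costs
     * work proportional to the live set and frees (heap - live) for reuse.
     */
    static double gcWorkPerAllocatedUnit(double heapGb, double liveGb) {
        if (heapGb <= liveGb) {
            // Worst case from the text: the live set fills the heap,
            // so a cycle reclaims nothing and GC never stops running.
            return Double.POSITIVE_INFINITY;
        }
        return liveGb / (heapGb - liveGb);
    }

    public static void main(String[] args) {
        double liveGb = 10.0; // fixed live set (hypothetical value)
        for (double heapGb : new double[] {10, 11, 15, 20, 40, 110}) {
            System.out.printf("heap=%5.0f GB  relative GC work=%s%n",
                    heapGb, gcWorkPerAllocatedUnit(heapGb, liveGb));
        }
    }
}
```

Under these assumptions, doubling the free space above the live set halves the relative GC work, which is the behavior the curve describes.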
Measuring Pauses
We must understand the magnitude and source of
a problem before attempting to solve it. The jHiccup
tool we mentioned earlier is useful to quantify stalls
before going on to identify their source. jHiccup runs
as a small agent within the JVM being used by an
application. Every millisecond jHiccup wakes up and
records how long that took. The wakeup should take
no time at all. If a delay exists, it’s because something
stopped the thread from running. That delay
might be due to GC, or it might be caused by something
further down the hardware and software stack.
To determine which one, you can run jHiccup on an
idle JVM. If that shows stalls, GC isn’t the immediate
problem.
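The measurement idea can be sketched in a few lines of Java. This is a minimal illustration of the sleep-and-measure technique described above, not jHiccup's actual implementation; the class and method names are hypothetical, and the real tool does considerably more.

```java
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

public class HiccupSketch {
    /** Delay beyond the requested sleep time; never negative. */
    static long hiccupNanos(long requestedNanos, long observedNanos) {
        return Math.max(0L, observedNanos - requestedNanos);
    }

    /** Sleeps intervalMillis per sample and records how late each wakeup was. */
    static long[] measure(int samples, long intervalMillis) {
        long requested = TimeUnit.MILLISECONDS.toNanos(intervalMillis);
        long[] hiccups = new long[samples];
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Arrays.copyOf(hiccups, i); // keep what was measured so far
            }
            hiccups[i] = hiccupNanos(requested, System.nanoTime() - start);
        }
        return hiccups;
    }

    public static void main(String[] args) {
        long[] hiccups = measure(1000, 1);
        long max = 0;
        for (long h : hiccups) {
            max = Math.max(max, h);
        }
        System.out.printf("max hiccup over %d samples: %.3f ms%n",
                hiccups.length, max / 1e6);
    }
}
```

Timer granularity in the operating system also shows up in these numbers, which is one more reason to compare against an idle JVM before blaming GC.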
Note that jHiccup won’t show performance issues
due to application inefficiencies or any other activity
that may cause poor performance (e.g. network,
database). jHiccup strictly shows stalls caused by
the JVM and below. When diagnosing a performance
issue developers can start by determining what
areas of the stack are likely at fault.
What Can You Do?
Now that you can use jHiccup to show performance
problems related to GC you can quantify the issues
and evaluate potential solutions. The first solution
is to try to tune JVM behavior. HotSpot offers a large
number of parameters you can manipulate to reduce
the impact of GC. This kind of tuning is generally
a stopgap; you can make the problem a little less
visible but you can’t make it go away. In addition, any
change in the application’s use of memory can undo
all that tuning effort. Be prepared to tune again and
again for each application.
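Because tuning has to be repeated, it helps to have a cheap way to compare runs before and after a parameter change. The sketch below reads the standard java.lang.management counters from inside the running JVM; the class name GcTimeReport is hypothetical. These counters report accumulated collection time, which is not the same as pause time for collectors that do part of their work concurrently, so treat the result as a rough indicator only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeReport {
    /** Sum of accumulated collection time (ms) over all collectors in this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Accumulated GC time divided by JVM uptime. */
    static double gcFractionOfUptime() {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis <= 0 ? 0.0 : (double) totalGcMillis() / uptimeMillis;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("GC fraction of uptime: %.4f%n", gcFractionOfUptime());
    }
}
```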
A second approach is to keep the heap small. Make
more instances with less data in each. Move some
of your data out of the heap into an external cache.
Create pools of reusable objects for threads, database
connections, etc. In essence, replace Java’s
generic GC with specific memory managers for your
own objects.
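As an illustration of the pooling idea, here is a minimal sketch of a reusable-object pool. The class SimplePool and its methods are hypothetical, it assumes Java 8 or later for Supplier, and it omits the validation, timeouts and sizing policy a production pool would need.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

public class SimplePool<T> {
    private final ArrayDeque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;
    private final int maxIdle;

    public SimplePool(Supplier<T> factory, int maxIdle) {
        this.factory = factory;
        this.maxIdle = maxIdle;
    }

    /** Hands out an idle instance if one exists, otherwise creates a new one. */
    public synchronized T acquire() {
        T item = idle.pollFirst();
        return item != null ? item : factory.get();
    }

    /** Keeps the instance for reuse; extras beyond maxIdle are left for the GC. */
    public synchronized void release(T item) {
        if (idle.size() < maxIdle) {
            idle.addFirst(item);
        }
    }

    public synchronized int idleCount() {
        return idle.size();
    }

    public static void main(String[] args) {
        SimplePool<StringBuilder> pool = new SimplePool<>(StringBuilder::new, 2);
        StringBuilder a = pool.acquire();
        pool.release(a);
        StringBuilder b = pool.acquire();
        System.out.println(a == b); // prints true: the released instance was reused
    }
}
```

The trade-off is the one the text describes: pooled objects stay alive for a long time, so the application has taken over a job the GC would otherwise do.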
Another alternative is to defer expensive Old Gen GC
out into the future, and then kill and restart your Java
instances before that time comes. This is a common
part of a Low Latency developer’s modus operandi:
never let Old Gen GC happen by using multiple
instances, very large heaps, and a scheduled termination
and restart.
We have another solution, of course. That is to
replace a GC that rarely stalls with one that never
does. What would that take?
The Challenge of the Pauseless JVM
A JVM can eliminate GC stalls by avoiding long Stop-
the-World processing in every situation. First, the
Mark phase must be completely Concurrent, allowing
the application to continue to modify references and
objects as it runs. Multiple Marking passes are a
problem; the more the application modifies objects,
the more passes may be required to get everything.
In addition, Weak, Soft, and Final References pose
challenges: a Concurrent Marker must process them
correctly while the application keeps running.
Concurrent Compaction is the bigger obstacle to a
pauseless JVM. Moving objects isn’t the issue; the hard part is
changing all the references to the objects you’ve
just moved. The GC needs a way to handle an application
that tries to use a stale reference. If it can’t,
it’s stuck with a Monolithic Stop-the-World remapping
operation and a long pause.
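One generic way to picture stale-reference handling is a forwarding table that is consulted whenever the application loads a reference. The toy simulation below illustrates that general idea only; it is not a description of how Zing or any other JVM implements it, and every name in it (ForwardingSketch, relocate, load) is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class ForwardingSketch {
    /** Toy heap: address -> payload. */
    private final Map<Integer, String> heap = new HashMap<>();
    /** Old address -> new address for objects the collector has moved. */
    private final Map<Integer, Integer> forwarding = new HashMap<>();

    void allocate(int address, String payload) {
        heap.put(address, payload);
    }

    /** Collector side: move the object and record where it went. */
    void relocate(int from, int to) {
        heap.put(to, heap.remove(from));
        forwarding.remove(to); // the target address is live again, not stale
        forwarding.put(from, to);
    }

    /** Follows forwarding entries until the current address is reached. */
    int resolve(int ref) {
        Integer next;
        while ((next = forwarding.get(ref)) != null) {
            ref = next;
        }
        return ref;
    }

    /** Application side: check the reference on load and repair the field if stale. */
    String load(int[] field) {
        field[0] = resolve(field[0]);
        return heap.get(field[0]);
    }

    public static void main(String[] args) {
        ForwardingSketch gc = new ForwardingSketch();
        gc.allocate(100, "hotel inventory");
        int[] field = {100};      // the application holds a reference to address 100
        gc.relocate(100, 200);    // the collector moves the object while the app runs
        System.out.println(gc.load(field) + " @ " + field[0]); // prints "hotel inventory @ 200"
    }
}
```

Because the field is repaired at the moment it is used, no global pass over every reference is needed before the application can continue.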
New Generation GC will become more of a concern
as applications’ demand for memory grows. Today’s
Monolithic Stop-the-World New Gen GCs are acceptable
only because New Gen is so small. But grow the
heap to 100 GB or more and suddenly New Gen has
a lot of work to do. Small stalls of a few milliseconds
will soon become noticeable and then problematic.
The Zing Virtual Machine
Zing is a high performance Java Virtual Machine for
applications that require predictable performance,
low latency, large memory or all of the above. Zing
runs on 64-bit Linux on Intel and AMD processors.
Zing supports hundreds of gigabytes of heap memory
in a single JVM instance and includes overdraft protection
capabilities that can assign a Java instance
extra memory as needed to avoid crashes.
Zing implements a Concurrent guaranteed-single-
pass Marker that is unaffected by the mutation rate
of objects. Zing processes Weak, Soft, and Final
References concurrently as well. Its Concurrent
Compactor moves objects and remaps references to
them without stalling applications. Zing is not Incremental;
it relocates an entire New or Old Generation
in a single GC cycle.
Zing implements Concurrent Compacting GC for both
Old and New Generations. This means that even the
largest New Gen will not stall applications. Zing does
not implement any Stop-the-World behavior. Ever. It is
fully Concurrent all the time. In the following graph,
you can see the result – consistent performance, all
the time.
By way of comparison, let’s return to the jHiccup
graph we saw earlier.
Here we saw significant stall behavior under HotSpot
using Parallel GC.
[Figure: Hiccups by Time Interval. 10 GB cache, 29 GB heap, Azul Zing. Hiccup Duration (msec) vs. Elapsed Time (sec); series: Max per Interval, 99%, 99.90%, 99.99%, Max.]
Comparisons to HotSpot’s other GC options may also
be instructive. As we discussed, CMS (Concurrent
Mark Sweep) is very efficient in its Mark and Sweep
phases, avoiding the Compact as long as it can.
In the case of this caching application, those Compactions
have a dramatic impact on performance.
In the Zing graph, the stalls the application
experienced with HotSpot have been reduced to
insignificance.
Small but measurable delays are still visible
on the SLA report:
[Figure: Hiccups by Time Interval. 10 GB cache, 29 GB heap, HotSpot, Parallel GC. Hiccup Duration (msec) vs. Elapsed Time (sec); series: Max per Interval, 99%, 99.90%, 99.99%, Max.]
[Figure: Hiccups by Percentile Distribution. Hiccup Duration (msec) vs. Percentile; Max = 13.520.]
Note the difference in scale for this graph. Where
Parallel GC produced regular stalls of between four
and nine seconds, CMS produces less frequent
stalls of 28 and 35 seconds.
G1GC produces stalls even more often but makes each one
less severe; its stalls are in the range of 1 to 1.7
seconds.
In summary, each of HotSpot’s GC implementations
fits a specific application behavior. CMS trades
cheap and efficient Sweep for long and painful but
infrequent Compact events. G1GC offers frequent
stalls of relatively short duration.
Parallel GC is somewhere in the middle, with stalls more
frequent but shorter than those of CMS. Only Zing eliminates
stalls entirely, putting GC in the background while the
application continues to run.
[Figure: Hiccups by Time Interval. 10 GB cache, 29 GB heap, HotSpot, CMS GC. Hiccup Duration (msec, scale 0 to 40000) vs. Elapsed Time (sec); series: Max per Interval, 99%, 99.90%, 99.99%, Max.]
[Figure: Hiccups by Time Interval. 10 GB cache, 29 GB heap, HotSpot, G1GC. Hiccup Duration (msec, scale 0 to 2000) vs. Elapsed Time (sec); series: Max per Interval, 99%, 99.90%, 99.99%, Max.]
The 29 GB heap of the preceding examples is bigger
than most Java applications use, precisely because
of the performance problems we see here. What
about applications that need even more memory?
From what we have seen, we would expect a bigger
heap to make stalls longer or more frequent or both.
Mike McCandless, a contributor to the Open Source
project Apache Lucene™, decided to find out by
building an in-memory implementation of Wikipedia’s
English language index. The experiment required 132
GB of data in a 240 GB heap. This next graph shows
the performance of both HotSpot CMS and Zing with
a load of 200 queries per second.
Here we can see query times by percentile.
CMS and Zing performed equivalently for about
75% of queries, which took at most 100 milliseconds
each. From the 75th percentile to the 99th,
Zing’s performance curve remained relatively flat
while the average response time for CMS queries
about doubled. Beyond the 99th percentile, CMS
began to bog down. 0.1% of its queries took one
second or longer to complete, 0.01% took 1.5
seconds or more, and the longest query took about
2.5 seconds. Zing was both faster and much more
consistent, with its longest query well below 300
milliseconds. A second run with double the load was
even more dramatic, with Zing providing the same
level of consistent performance and CMS worst case
reaching eight seconds.
Interestingly, these experiments did not cause CMS
to compact its heap. Had they done so, we would expect to
have seen an even more extreme worst case performance
result.
In Conclusion
Java works very well and its performance is fine
for many applications, as long as their memory
requirements are small and users can live with the
occasional (or more than occasional) blip. However,
now that even low-cost servers come with hundreds
of GB of memory, more application types expect to
be able to take advantage of it. Low latency trading,
ad serving, complex event processing, analytics
and Big Data are just a few examples. Existing Java
technologies haven’t kept up. Mostly Concurrent
means sometimes not, which isn’t acceptable for a
business-critical application.
Up to now, developers have burned a lot of midnight
oil tuning away application pauses and coding around
the JVM memory manager to avoid GC. Unfortunately,
this made production applications fragile and difficult
to maintain, delayed upgrades and reduced the benefits
of using Java in the first place.
But now another option is available. Azul Zing
eliminates application pauses due to garbage collection
– entirely, forever. Just set the memory high and
go. For any application that requires low latency, big
memory, consistent performance or all three, Zing is
the answer.
About Azul Systems
Azul Systems delivers high-performance and elastic
Java Virtual Machines (JVMs) with unsurpassed
scalability, manageability and production-time visibility.
Designed and optimized for x86 servers and
enterprise-class workloads, Azul’s Zing JVM is the
only Java runtime that supports highly consistent
and pauseless execution for throughput-intensive
and QoS-sensitive Java applications. Azul’s products
enable organizations to dramatically simplify Java
deployments with fewer instances, greater response
time consistency, and dramatically better operating
costs.
To discover how the Zing JVM can help your
applications reduce latency, contact:
Azul Systems
1.650.230.6500
www.azulsystems.com
Copyright © 2013 Azul Systems, Inc. 1173 Borregas Ave, Sunnyvale, California, USA