Java at Scale - Azul Systems, Inc. · 2020. 4. 24.

Java at Scale

TECHNOLOGY

WHITE PAPER

Introduction

Java is everywhere. Starting with its appearance in 1995 as part of the HotJava web browser, Java has captured the attention of the software industry like no programming language before it, and has held that attention for nearly two decades and counting. In this paper we’ll talk a little about all the ways Java gets used, why it has such wide appeal, and where it has limitations that either prevent its use in specific domains or require a great deal of effort to make it fit. We will also talk about the Zing Virtual Machine, Azul Systems’ answer to Java’s limitations.


Table of Contents

Introduction
Java in Use
Why Java?
Moore’s Law and Parkinson’s
The Memory Management Problem
A Classic Look at Application Response
A Few Realities about GC
How Does GC Work?
New Generation GC – Live Fast, Die Young
GC Terminology: This vs. That
Oracle HotSpot JVM: GC Options
Where Pauses Matter
Characterizing Pauses
Measuring Pauses
What Can You Do?
The Challenge of the Pauseless JVM
The Zing Virtual Machine
In Conclusion
About Azul Systems


Java in Use

Java’s greatest success is in back end server systems implementing business rules. Many development frameworks were written specifically to tie front end systems (web based user interfaces and command and control) to database systems for transactional activities. The first of these to take hold was Sun Microsystems’ own Java 2 Enterprise Edition, with its concept of Enterprise Java Beans. J2EE was all about building much of the low level behavior into a set of standard software objects, which the developer wired together to build applications. Application servers managed the lifecycle of these objects, leaving the programmer (mostly) free to implement what was special about his or her application and ignore the rest. The benefit was productivity at the expense of system resources. Later frameworks like Spring improved the developer’s experience and allowed for easier maintenance and enhancement of business applications by making explicit actions that J2EE often buried in code.

But not all of Java’s use on servers was about business logic. Many developers took advantage of greater productivity with Java (we’ll get to the source of that productivity shortly) to use it for a wide range of applications, including high performance computing (HPC) tasks, solving any performance limitations by throwing more hardware at the problem. By using many CPUs to run parts of their applications in parallel, they could reduce execution time, work on more complex data sets, or reduce any performance differential between Java and closer-to-the-hardware languages like C, and sometimes all three at once. A single Java application running on a single CPU may have been slower than the same application written in C, but by making it easier to leverage more processors in the same computer and even more processors in separate computers, Java applications could be made to scale in a way that wasn’t practical for other languages.

Java has taken hold less well on the client. This is

a little surprising, since at its introduction so much

attention was paid to Java as a client-side language.

HotJava was a web browser written entirely in Java,

but what got people’s attention was the idea of little

bits of Java code running on the user’s desktop,

extending the experience web developers could deliver.

HTML was severely limited; Java was not.

The reality of client-side Java is mixed at best. Early network bandwidth limitations made downloading Java applets on demand slow and unreliable. Security concerns appeared early, and have only become worse over time. And other solutions got ahead of Java in the web browser, first Adobe Flash with its orientation toward graphic designers instead of programmers, and now JavaScript. As JavaScript became more capable, as reusable component frameworks were developed, and as Ajax techniques for creating dynamic web content became standard, Java became less of a factor on the client side of the web experience. We have reached the stage where a major security issue with browser-based Java (US-CERT Alert TA13-010A) can cause large numbers of users to disable Java in their browsers without noticing any loss of capabilities in the applications and web services they use.

Note the distinction between thin client applications

(Java in a web browser, with the heavy activity

running somewhere on a server) and fat client

computing. Java has significant value as a straight

programming language on a user’s desktop or laptop,

especially where developer productivity matters.

Non-browser Java applications are also far less

prone to the security issues reported for Java in the

browser, making them both safe and effective.

A final area of Java use is in embedded applications. The greatest and most recent success here has been in mobile phones. Google’s Android platform is based heavily on their own version of the Java platform, built around a virtual machine known as Dalvik. Android applications are all Java-based, and Java provides a reliable, stable and efficient platform for extending the capabilities of modern mobile devices.


Why Java?

Java is portable

What is Java’s appeal, both to developers and to the

users of the applications they create? The first benefit

cited for Java is its portability, the idea of Write

Once, Run Anywhere. Java is compiled to a standard

instruction set, the byte code that is processed by

the Java Virtual Machine. Any kind of computer that

has a JVM installed can run applications written for

it, so no need for separate versions for every kind of

computer running every different operating system.

In practice, the situation isn’t so perfect, and Write

Once, Run Anywhere still means Test Everywhere. But

Java really is portable in ways C and C++ and other

languages never were and could never be.

Java is productive

Another benefit of Java comes from the design of

the language itself, the statements programmers

write and the concepts they support. Java learned

from its predecessors, using much that is good from

other languages but avoiding features that are overly

complicated or prone to abuse. C++ offers Multiple Inheritance, the ability for an object to gain behavior from multiple unrelated (or, in the most challenging cases, partially related) base classes. Multiple Inheritance complicates code maintenance, since it makes it harder to determine just where specific capabilities came from. And it can get even worse, with Virtual Base Classes permitting inheritance relationships back to a common ancestor. Incest is a bad idea in life, and it’s no better in programming. Java provides the benefits of Multiple Inheritance without the pain through its Interface specification. An object can stand in for many different types by declaring that it implements each type’s interface and supplying that behavior.

Each class has exactly one line of inheritance, which

keeps things simpler, while acting as other types

where needed.
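As a minimal sketch of the interface approach described above (the type names Document, Printable, Persistable, and Invoice are hypothetical, chosen only for illustration), a class can keep a single line of inheritance while still being usable wherever either interface is expected:

```java
class InterfaceDemo {
    public static void main(String[] args) {
        Invoice invoice = new Invoice("invoice-42");
        Printable p = invoice;     // the same object seen as a Printable
        Persistable s = invoice;   // and as a Persistable
        System.out.println(p.print());
        System.out.println(s.save());
    }
}

// Hypothetical roles an object may play.
interface Printable { String print(); }
interface Persistable { String save(); }

// The single base class: the one line of inheritance.
class Document {
    protected final String title;
    Document(String title) { this.title = title; }
}

// Inherits state from Document only; takes on two more roles via interfaces.
class Invoice extends Document implements Printable, Persistable {
    Invoice(String title) { super(title); }
    @Override public String print() { return "printing " + title; }
    @Override public String save() { return "saving " + title; }
}
```

Because Invoice inherits implementation from Document alone, there is never a question about which base class a field or method came from.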

Another feature of C++ which seems a good idea but

leads to problems is called Operator Overloading.

Overloading allows a developer to redefine standard

operations like + and – for his or her own objects.

This sounds like a convenience and a time-saver,

allowing us to write “A = B + C” for character string

objects as a way of concatenating them, but it also

allows for relationships that are not nearly so obvious

or intuitive. C++ permits programmers to write a + operation that modifies its left hand argument and returns no result (e.g. if B is 5 and C is 3, “A = B + C” would set B to 8 and leave A unchanged). This is a bad idea; no reader of this code will expect such

behavior. Worse, most readers will assume a different

behavior and take that assumption with them as they

read on. Java does not provide Operator Overloading,

a decision that generated many complaints early on

but was in retrospect entirely justified.

Java also takes a Do the Right Thing philosophy,

making the decision to give up some performance to

provide consistency of behavior. C++ has a Do the Efficient Thing approach, where no feature of the language can reduce performance or increase resource usage over C unless a particular developer decides

to use that feature. C++ makes it the programmer’s

decision to choose run-time method selection based

on object type vs. the more efficient compile-time

selection. That leads to mistakes, both by the

original developer who expected run-time behavior

different than what he coded, and by those who

come later and need to make sense of his code.

Java decided it was worth a little extra overhead to

be consistent and to make developers’ jobs easier.
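To illustrate the run-time method selection discussed above, here is a small sketch with hypothetical Shape and Circle classes. In Java the method that runs is chosen by the object's actual type, even when the variable is declared as the base type; in C++ the equivalent code would pick the base-class method unless the programmer had marked it virtual.

```java
class DispatchDemo {
    static class Shape {
        String name() { return "shape"; }
    }

    static class Circle extends Shape {
        @Override String name() { return "circle"; }
    }

    // The parameter is declared as Shape, but the call is resolved
    // against the run-time type of the object passed in.
    static String describe(Shape s) {
        return s.name();
    }

    public static void main(String[] args) {
        Shape s = new Circle();
        System.out.println(describe(s)); // prints "circle"
    }
}
```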

But perhaps the greatest difference between Java and its competitors is in its approach to memory management. Languages like C and C++ make allocating space for objects the programmer’s problem, and especially reclaiming the space for those objects when they are no longer needed. For every object a C++ programmer creates with new, there must be a corresponding delete (in C, a free for every malloc) to release the space for reuse. Java replaces this explicit release operation with a Garbage Collection mechanism that operates automatically and efficiently. Garbage Collection eliminates a huge number of application problems, many of which can go undetected in other languages for a long time and be difficult to identify when they do appear.


Java is efficient

Early Java implementations were slow, relying on interpreters to process Java byte code instruction by instruction. Byte code interpreters were quickly augmented by Just In Time (JIT) compilers. JIT compilers converted byte code to the native instructions of the target computer as the program began to run. The programs on disc were still the original and portable byte code, but as soon as a program was loaded, it was converted to the vastly more efficient machine code, either class by class or method by method depending on the implementation. JIT compilers put Java programs in the same ballpark as C and C++, still slower because of Java’s greater flexibility but fast enough for many tasks.

Over time, and as computers became faster and their

resources grew, JIT compilers turned into dynamic

recompilation. Modern Java implementations monitor

the way an application uses its code, changing that

code and its reference to other code to make it more

efficient for the reality of its execution. Methods can

be inlined (brought into the method that uses them to

avoid the overhead of invoking them), or pushed back

out if they are used infrequently and their size bloats

the calling code. Knowing the cache and memory system

behavior of the target computer can enable very

specific decisions that would not be practical in the

more general case. With dynamic recompilation, Java

now has the ability to be faster than other languages

for specific application domains.

Java is generic

Although the Java Virtual Machine was written to

support the execution (and translation to machine

instructions) of Java byte code, and although Java byte

code was written to support the Java language, Java

is not the only use JVMs see. Many other languages have been moved to the JVM: new languages like Scala and Clojure, as well as JVM implementations of existing languages, like JRuby for Ruby. By writing compilers to convert

their source to Java byte code and tools to support

the resulting applications in terms that are meaningful

to their developers, managers of these languages get

the portability, performance, and tooling benefits of a

widely used platform.

Back in 1989 an industry body called the Open

Software Foundation put out a proposal for ANDF, the

Architecture Neutral Distribution Format. Their idea

was that compilers would take source code and produce

an intermediate representation; this code would then

be turned into real object code when it reached the

destination computer. The result would be more efficient

execution based on the knowledge of the precise

configuration of the target. ANDF never went anywhere,

but perhaps we can see some of ANDF’s promise in

the way other languages take advantage of Java’s

infrastructure.

Java is scalable

As we have already discussed, Java shows up everywhere

and in everything from the smallest devices to

the largest servers. Java scales to address problems

up and down the line. But as we will see, there are

limits to just how big Java can get.


Moore’s Law and Parkinson’s

Hardware keeps getting bigger, and the applications

that run on that hardware get bigger just as fast.

Moore’s Law taught us to expect transistor counts

on chips to double roughly every eighteen months.

Memory sizes have grown about 100 times every ten

years. Although we will eventually hit limits, we can

expect these trends to continue a while longer.

Parkinson’s Law says that work expands to fill the

time allotted. It isn’t a great stretch to apply that to

software, which is inevitably late and takes up every

bit of resource it can have, if not more. As we look

at computing history, we can see application growth

that keeps right in line with our hardware.

Think about in-memory computing over time, the

size of an application’s data that we can work on in

memory rather than bringing in from disc piecemeal.

In 1980 a typical application might keep 100 kilobytes

in memory; servers then had between ¼ and ½ a

megabyte of main memory. Move ahead to 1990

and you might find 10 MB of data on a 16 to 32 MB

server. Another jump to 2000 (ignoring Y2K issues)

and applications were using 1 gigabyte of data on

a 2 to 4 GB server. By 2010 it wasn’t uncommon

for server applications to use 100 GB of memory

on servers with 256 GB. In each case, applications

were taking roughly the same proportion of available

server resources. And these are commodity servers;

at this writing a server with 24 processor cores and

256 GB of memory can be had for $8000 US.

Java has not kept up with this trend of growth in application

size. Most developers report that the Java

heap for their applications (the chunk of memory

Java manages for creating objects) runs between 2

and 4 GB. We see a few applications, a very few, with

larger heaps. Why aren’t there many more big Java

applications? Why do they stop at a few percent of

available resources?


The Memory Management Problem

Java has a problem, one that comes from one of its

greatest features. Java’s memory manager makes

developers’ lives easier and reduces the worst kinds

of bugs. But it also introduces a lack of consistency.

Periodically Java’s Garbage Collector has to run to

identify no-longer-used objects and reclaim their

space. When the GC runs, the application stops.

Sometimes it stops long enough for people to notice.

The graph you see represents the behavior of an

application, in this case a benchmark of an in-memory

cache. This graph was produced by an Open Source

tool called jHiccup. jHiccup runs with a target

application and measures hiccups or stalls, times

when neither it nor the target application were able

to get any work done. The Y axis of the graph represents the duration of those stalls. The horizontal lines show percentiles: the stall duration that the given percentage of observations did not exceed.
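The idea behind this kind of measurement can be sketched in a few lines. The code below is not jHiccup's implementation; it is a simplified illustration under the assumption that a thread which asks to sleep for a fixed interval and wakes up late was stalled along with everything else in the JVM. The class and method names (HiccupSketch, measure) are hypothetical.

```java
import java.util.Arrays;

class HiccupSketch {
    // Sleeps for intervalMillis, samples times, and records how many
    // microseconds longer than requested each sleep actually took.
    static long[] measure(int samples, long intervalMillis) {
        long[] extraMicros = new long[samples];
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Arrays.copyOf(extraMicros, i); // stop early, keep what we have
            }
            long elapsedNanos = System.nanoTime() - start;
            long extraNanos = elapsedNanos - intervalMillis * 1_000_000L;
            extraMicros[i] = Math.max(0L, extraNanos / 1_000L);
        }
        return extraMicros;
    }

    public static void main(String[] args) {
        long[] stalls = measure(200, 1);
        long worst = 0;
        for (long s : stalls) worst = Math.max(worst, s);
        System.out.println("worst observed stall: " + worst + " microseconds");
    }
}
```

On an idle machine the recorded values reflect little more than scheduler jitter; a long GC pause in the same JVM would show up as a sample far larger than the rest.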

In this example, we see a worst case stall of 8½

seconds. .01% of the time we experienced stalls of

that size (a second line barely visible below the first).

.1% of the time we saw stalls of 8 seconds (the next

line down). 1% of the time we had stalls of roughly

5½ seconds. These stalls would delay the results of

our application, and are in addition to the actual work

the application had to do.

The graph shows unacceptable performance for a

user-interactive application. A request that takes an

extra 8 or 9 seconds likely means a frustrated user

and perhaps a lost customer. Unfortunately this is

not an uncommon situation for Java applications

with large memory heaps. This particular heap was

29 GB, and helps to explain why we don’t see heaps

of this size all that often. The length of application

stalls varies linearly with the size of memory, while

frequency varies with the application workload.

[Figure: Hiccups by Time Interval (10 GB cache, 29 GB heap, HotSpot, Parallel GC). X axis: Elapsed Time (sec), 0 to 700. Y axis: Hiccup Duration (msec), 0 to 10000. Series: Max per Interval, with 99%, 99.90%, 99.99% and Max percentile lines.]

This second graph presents the data from the same

application in terms of service levels. Here we can

map SLA requirements against application behavior.

What is acceptable as a worst case? What do we

require 99% of the time, or 90%? As you can see

by this graph, any attempt to measure and evaluate

performance based on mean times and standard

deviations is not going to describe Java performance

accurately. GC-related stalls do not fit a bell curve.

Worst case results will be many standard deviations

away from the norm, yet occur far more frequently

than a normal distribution would suggest.
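A small worked example shows why the mean and standard deviation mislead here. The sample below is made up for illustration (it is not the benchmark data from the graph): 99% of observations are 1 msec, with a few multi-second outliers. The names TailSketch, syntheticSample, and percentile are hypothetical.

```java
import java.util.Arrays;

class TailSketch {
    // Made-up stall samples in msec: mostly tiny, a few multi-second outliers.
    static long[] syntheticSample() {
        long[] s = new long[10_000];
        Arrays.fill(s, 0, 9_900, 1L);
        Arrays.fill(s, 9_900, 9_990, 5_500L);
        Arrays.fill(s, 9_990, 10_000, 8_500L);
        return s;
    }

    static double mean(long[] s) {
        double sum = 0;
        for (long v : s) sum += v;
        return sum / s.length;
    }

    static double stdDev(long[] s) {
        double m = mean(s);
        double squares = 0;
        for (long v : s) squares += (v - m) * (v - m);
        return Math.sqrt(squares / s.length);
    }

    // Nearest-rank percentile. The percentile is given in hundredths of a
    // percent (9_990 means 99.90%) so the rank uses integer arithmetic only.
    static long percentile(long[] sortedAscending, int hundredthsOfPercent) {
        long rank = ((long) sortedAscending.length * hundredthsOfPercent + 9_999) / 10_000;
        return sortedAscending[(int) Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] s = syntheticSample();
        Arrays.sort(s);
        System.out.println("mean     = " + mean(s) + " msec");
        System.out.println("std dev  = " + stdDev(s) + " msec");
        System.out.println("99%      = " + percentile(s, 9_900) + " msec");
        System.out.println("99.9%    = " + percentile(s, 9_990) + " msec");
        System.out.println("99.99%   = " + percentile(s, 9_999) + " msec");
        System.out.println("max      = " + s[s.length - 1] + " msec");
    }
}
```

With this sample the mean is under 60 msec, which describes neither the typical 1 msec observation nor the multi-second stalls that a user would actually notice; the percentile figures make both visible.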

[Figure: Hiccups by Percentile Distribution. X axis: Percentile (0%, 90%, 99%, 99.9%, 99.99%). Y axis: Hiccup Duration (msec), 0 to 10000. Max = 8634.368 msec.]

A Classic Look at Application Response

The first graph below from IBM’s CICS server

documentation illustrates our expectation about

application performance and workload. If the load is

moderate, we expect good performance as measured

by the response time to the user. As the load

increases, the response time grows. Eventually load

can grow enough to move response from acceptable

to poor. The assumptions are that response is

related directly to load, and that response times

experienced by individual users are close to the

mean (that bell curve again).

Java applications exhibit a very different behavior. In the next graph we see application load, represented as Active User Count, and response time for three different tasks. Note the large spikes for the update_user_details task. These spikes occur with regularity when load is both low and high. Frequency is not associated with load in any obvious way, and neither is severity. We see spikes for slow transactions both early in the run when load is low and later as it increases. Fixing these spikes is not a simple matter of either reducing workload or increasing capacity to get acceptable performance, as it was in the day of IBM CICS.


A Few Realities about GC

Java’s reliance on garbage collection provides a large

set of benefits, in terms of ease of development,

reduced opportunities for errors, and performance.

But all is not perfect, as we have already discussed.

Let’s look at some of the benefits and drawbacks of

automatic garbage collection.

On the plus side, GC is very efficient, much more

so than the malloc() function used by C and C++.

There is no overhead associated with collecting

dead objects, making GC particularly advantageous

for dynamic languages that create and discard large

numbers of objects. And GC is able to find all the

dead objects without help from the developer. Even

cyclic graphs that link back to themselves will be

collected quickly and efficiently.

One downside to GC is that it does take time to

process its way through live objects, typically about 1

second for every GB it encounters. GCs do their best

to reduce the visibility of this processing; they can’t do

much to eliminate it. We’ll talk more about that in a bit.

The inevitability of GC processing means that tuning

to avoid GC stalls generally just delays them past the

test interval. Tune a Java application to avoid a stall

for 20 minutes, and you’re likely to find that it will

show up in the 21st minute, or the 47th, or somewhere

else down the line. Delaying them is easy;

getting rid of them is another matter.

Finally, as good as GC is, it is not a panacea. Applications

can still have memory leaks, simply by holding

on to references to objects that they no longer need.

The GC only reclaims objects no longer referenced.

It can do nothing about objects that are referenced

unnecessarily. Still, this is far better than the C and

C++ situation where keeping track of objects is

entirely the programmer’s responsibility.
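The kind of leak described above can be sketched as follows. This is an illustrative example with hypothetical names (LeakSketch, leakyRetained, boundedRetained): the first method keeps a reference to every payload it has ever seen, so none can be reclaimed; the second drops old references, which is what makes those objects eligible for collection.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class LeakSketch {
    // Leaky pattern: every payload stays referenced by the list, so the GC
    // can never reclaim any of them. Returns how many are still retained.
    static int leakyRetained(int requests) {
        List<byte[]> history = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            history.add(new byte[1024]);
        }
        return history.size();
    }

    // Bounded pattern: only the most recent `limit` payloads stay referenced;
    // older ones become unreachable and are eligible for collection.
    static int boundedRetained(int requests, int limit) {
        Deque<byte[]> recent = new ArrayDeque<>();
        for (int i = 0; i < requests; i++) {
            recent.addLast(new byte[1024]);
            while (recent.size() > limit) {
                recent.removeFirst();
            }
        }
        return recent.size();
    }

    public static void main(String[] args) {
        System.out.println("leaky keeps   " + leakyRetained(1_000) + " payloads");
        System.out.println("bounded keeps " + boundedRetained(1_000, 10) + " payloads");
    }
}
```

In a real application the leaky collection is usually a long-lived field such as a cache or listener list rather than a local variable, which is why the retained objects outlive their usefulness.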


How Does GC Work?

Garbage collection has three phases, often called

Mark, Sweep, and Compact. The Mark phase starts

with any static variables and any pointers on the

stack. It follows each pointer, marking that object as

live, and then follows each of that object’s pointers.

Eventually we reach objects we’ve already marked or

we run out of pointers to check. When we’re done,

everything live is marked, and anything not marked

is not live.

Now we go on to the Sweep phase. Sweep captures

all of the unmarked objects into a free pool. The

free pool will likely be fragmented, with areas of free

space being broken up by live objects. But as long

as enough free space exists for the new objects we

want to create, we can hold off on the next, most

expensive phase.

Compact is the last phase, and the one that generally involves the most work. Periodically, the GC will push all the live objects together, squeezing out all the dead objects, leaving behind the largest possible free space, and updating every pointer to the moved objects to point to their new locations. Compaction has two benefits: first, it eliminates fragmentation and permits the creation of objects of whatever size an application needs, up to the size of the free space itself; and second, it improves memory locality by letting new objects be created on the same page or nearby pages. Compaction can be postponed, but at some point it will likely be necessary just to permit large objects to be created.

Modern GCs use a generational scheme, which we

will discuss shortly. Full GCs, also referred to as Old

Generation GCs, perform Mark, Sweep, and Compact

as three separate passes. This has the benefit of

requiring little extra heap space while the GC runs,

so it can be delayed until the last moment.

[Figure, four panels: (1) The Java heap, containing objects of different sizes. (2) The heap after the Mark phase, with live objects marked in green. (3) After the Sweep phase, dead objects in blue are collected into a linked list of free space. (4) After the Compact phase, live objects are contiguous and there is one large free space for new objects.]
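The three phases can be sketched on a toy heap. This is a teaching model under simplifying assumptions, not how any JVM is implemented: the heap is an array of equally sized slots, a pointer is a slot index, and all names (MarkSweepCompactSketch, Obj, demoHeap) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

class MarkSweepCompactSketch {
    static final class Obj {
        final String name;
        final int[] refs;   // slot indexes of the objects this one points to
        boolean marked;
        Obj(String name, int... refs) { this.name = name; this.refs = refs; }
    }

    // Mark: follow every pointer reachable from the roots.
    static void mark(Obj[] heap, int[] roots) {
        Deque<Integer> pending = new ArrayDeque<>();
        for (int r : roots) pending.push(r);
        while (!pending.isEmpty()) {
            Obj o = heap[pending.pop()];
            if (o == null || o.marked) continue; // already visited
            o.marked = true;
            for (int ref : o.refs) pending.push(ref);
        }
    }

    // Sweep: release every unmarked slot; returns the number of objects freed.
    static int sweep(Obj[] heap) {
        int freed = 0;
        for (int i = 0; i < heap.length; i++) {
            if (heap[i] != null && !heap[i].marked) { heap[i] = null; freed++; }
        }
        return freed;
    }

    // Compact: slide live objects to the low end, then fix pointers and roots.
    static void compact(Obj[] heap, int[] roots) {
        int[] forward = new int[heap.length];  // old slot -> new slot
        Arrays.fill(forward, -1);
        int next = 0;
        for (int i = 0; i < heap.length; i++) {
            if (heap[i] != null) forward[i] = next++;
        }
        for (int i = 0; i < heap.length; i++) {
            Obj o = heap[i];
            if (o == null) continue;
            for (int k = 0; k < o.refs.length; k++) o.refs[k] = forward[o.refs[k]];
            o.marked = false;                  // reset for the next cycle
            heap[i] = null;
            heap[forward[i]] = o;
        }
        for (int k = 0; k < roots.length; k++) roots[k] = forward[roots[k]];
    }

    static String layout(Obj[] heap) {
        StringBuilder sb = new StringBuilder();
        for (Obj o : heap) sb.append(o == null ? "." : o.name).append(' ');
        return sb.toString().trim();
    }

    // A -> C -> E is reachable from the root; B is garbage; D and F form an
    // unreachable cycle that reference counting alone would never free.
    static Obj[] demoHeap() {
        return new Obj[] {
            new Obj("A", 2), new Obj("B"), new Obj("C", 4),
            new Obj("D", 5), new Obj("E"), new Obj("F", 3)
        };
    }

    public static void main(String[] args) {
        Obj[] heap = demoHeap();
        int[] roots = {0};
        System.out.println("before:  " + layout(heap));
        mark(heap, roots);
        sweep(heap);
        System.out.println("swept:   " + layout(heap));
        compact(heap, roots);
        System.out.println("compact: " + layout(heap));
    }
}
```

Note that the sweep leaves holes between the survivors, while compaction has to touch every live object and rewrite every pointer to it, which is why it is the most expensive of the three phases.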

New Generation GC – Live Fast, Die Young

Generational GC is an important performance

optimization for Java. First developed for Lisp, it

starts with an assumption that most objects have

short lifespans. By scanning just newly created

objects, the GC can reclaim the largest amount

of free space with the smallest effort. Eventually

objects that survive these New Generation GCs are

promoted to Old Generation status. Old Generation

objects endure GC less frequently, since the cost is

higher (bigger space to scan) and the benefit is lower

(reduced chance to reclaim anything there).

New generation GCs use a copying scheme, where

live objects are moved to a copy space on the heap

as they are encountered. After everything has been

found and moved to the copy space, the old heap

space becomes free space. Copying needs up to twice the space, since all the objects in the area may still be live, but all the processing can be done in one pass.

New Generation GC requires something called a

Remembered Set. Where a Full GC starts with static

variables and the stack to begin its scan, a New Gen

GC wants to reduce the pointers it has to follow to

find live objects. A Remembered Set keeps track

of every pointer to a New Generation object from

outside the New Gen area. The GC then only has

to follow those pointers, and only if they still point

to something in New Gen. (The pointer may now

point to something in Old Gen, making it irrelevant

for New Gen GC.) This means that Java’s runtime

system must track every pointer stored and record

those that point to New Gen. This requires additional

memory and additional processing not required by

non-GC languages, but the costs are paid back in

simplicity and reduced programmer errors.
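The bookkeeping described above can be sketched as a write barrier: a small check that runs on every pointer store. The model below is a simplified illustration with hypothetical names (RememberedSetSketch, Node, store); it assumes each object simply knows which generation it is in, and it records individual holder objects, whereas production collectors typically track coarser regions of memory.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class RememberedSetSketch {
    static final class Node {
        final String name;
        final boolean old;   // true if the object lives in Old Gen
        Node field;          // the single pointer this toy object holds
        Node(String name, boolean old) { this.name = name; this.old = old; }
    }

    // Old Gen objects that have, at some point, stored a pointer into New Gen.
    private final Set<Node> remembered = new LinkedHashSet<>();

    // Write barrier: every pointer store goes through here.
    void store(Node holder, Node value) {
        holder.field = value;
        if (holder.old && value != null && !value.old) {
            remembered.add(holder);
        }
    }

    // Extra roots for a New Gen collection. Entries whose pointer no longer
    // leads into New Gen are skipped, as the text describes.
    List<Node> newGenRootsFromOldGen() {
        List<Node> roots = new ArrayList<>();
        for (Node holder : remembered) {
            if (holder.field != null && !holder.field.old) {
                roots.add(holder.field);
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        RememberedSetSketch heap = new RememberedSetSketch();
        Node oldA = new Node("oldA", true);
        Node oldB = new Node("oldB", true);
        Node young1 = new Node("young1", false);
        Node young2 = new Node("young2", false);

        heap.store(oldA, young1);  // remembered: Old -> New
        heap.store(oldB, young2);  // remembered: Old -> New
        heap.store(oldB, oldA);    // oldB now points into Old Gen only

        for (Node root : heap.newGenRootsFromOldGen()) {
            System.out.println("New Gen root via Old Gen: " + root.name);
        }
    }
}
```

In this run only young1 is reported: young2 is no longer referenced from Old Gen, so unless something on the stack points to it, a New Gen collection is free to reclaim it.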

GC Terminology: This vs. That

Before we can discuss the finer points of Garbage

Collection implementations, we need to define some

terms. Perhaps the biggest fallacy in the computer

industry is the idea that we have a common terminology.

We don’t, and much confusion results from

treating different words as if they are the same or

applying different meanings to the same word.

Concurrent vs. Parallel: A Concurrent GC is one that

can run at the same time as the application. A Parallel

GC has multiple parallel tasks which can run at

the same time on separate CPU cores. A GC may be

Concurrent but not Parallel (lets the application run

but only does one task at a time itself), Parallel but

not Concurrent (performs GC work in multiple cores

simultaneously but does not permit the application

to work while it does), both Concurrent and Parallel,

or neither.

Concurrent vs. Stop-the-World: The opposite of a

Concurrent GC is one that stops all the application’s

work while it runs. A Stop-the-World collector may be

Parallel, which will reduce the duration of the stop.

Incremental: An Incremental GC breaks its work up

into discrete chunks. By performing a little work at

a time, it permits the application to run with more

short stalls instead of one long one. The total time

an application is stalled will be the same, or perhaps

a bit greater, but the perceived effect will be reduced.

Precise vs. Conservative: A Conservative GC does

not know the location of every pointer in memory,

and cannot distinguish between pointers and bit

patterns that might be pointers. Conservative GCs

can’t relocate objects (the Compact phase), since

they can’t find and update every pointer to those

objects. Precise GCs do know every reference, thanks

to information provided by the compiler. Java relies

on Precise GC; by contrast, languages like C and C++

can only be provided with Conservative GCs.

Safepoints: GC relies on safepoints, locations or

ranges of locations in code where the GC can identify

every reference and process it. GC has to wait for

each thread to reach a safepoint before the GC can

proceed. That generally means pausing the thread,

although code executed via the Java Native Interface

(JNI) is all safe to run as long as the thread does not

touch Java space. Safepoints need to be reached

frequently, so the GC has a chance to run when

needed. There are also global safepoints, where every thread needs to reach a safepoint and remain there. These create Stop-the-World pauses.

New Generation GCs are typically copying collectors,

and are Monolithic (not Incremental) and Stop-the-

World. New Gen areas are small, so they don’t stop

the application for long. Old Generation GCs are

usually Mark/Sweep/Compact. They may be Stop-

the-World, or Concurrent, or mostly Concurrent, or

Incremental Stop-the-World, or mostly Incremental

Stop-the-World.

That word mostly is important. It means that the GC

will sometimes be forced into a Monolithic Stop-the-

World GC, and that means long stalls.


Oracle HotSpot JVM: GC Options

The current version of Oracle HotSpot offers three

different garbage collectors which are selectable with

command switches. Each has strengths and weaknesses,

and each will be better or worse for different

applications.

Parallel GC: Parallel GC is the default GC. It combines

a Monolithic Stop-the-World New Generation GC with

a Parallel Monolithic Stop-the-World Old Generation

GC. As we discussed earlier, that means that Old

Gen GC will stall the application while it runs, although

the length of the stall will be reduced by the parallelism

of the GC itself.

CMS GC: Concurrent Mark Sweep GC has been available

in HotSpot for a long time. It uses the same

Monolithic Stop-the-World New Gen GC as Parallel

GC. For Old Gen, CMS offers a mostly Concurrent

non-compacting collector. The CMS Mark phase is

mostly Concurrent, requiring multiple passes through

the heap. The CMS Sweep phase is fully Concurrent.

It maintains free space as a linked list of dead

objects. When CMS runs out of memory chunks big

enough for new objects due to fragmentation, it will

be forced to perform a long and expensive Monolithic

Stop-the-World Mark/Sweep/Compact. CMS is likely

to work best where new objects tend toward a similar

size, since their requirements will match available

spaces in the free list without pushing toward a

Compaction event.

G1GC: The Garbage First GC was offered as an

experimental option prior to the release of Java 7,

but is now a fully supported alternative to Parallel GC

and CMS. G1GC has the same Monolithic Stop-the-World

New Gen GC as the others. G1’s Old Gen GC uses a

mostly Concurrent Mark phase, with Stop-the-World

pauses to deal with mutations (objects changing the

targets of pointers). It breaks Old Gen into regions,

and has to keep track of references between regions

using Remembered Sets. G1’s Compact phase

identifies regions that can be compacted in a limited

time. It holds off Compacting popular objects and

regions. The stated goal of G1 was to “avoid, as

much as possible, having a full GC.” Full GCs still

happen, and are supported by a Monolithic Stop-the-

World Mark/Sweep/Compact GC. Full GC is required

to Compact popular objects and regions.
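The command switches that select these collectors in HotSpot are -XX:+UseParallelGC, -XX:+UseConcMarkSweepGC, and -XX:+UseG1GC. Defaults and availability differ between JDK releases, so it is worth confirming which collectors a given JVM is actually running. The sketch below uses the standard java.lang.management API to list them; the class name ActiveCollectors is hypothetical, and the collector names printed depend on the JVM and the switches used.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class ActiveCollectors {
    // One line per collector: its name, how many collections it has run,
    // and the approximate accumulated collection time in milliseconds.
    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms total");
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe()) {
            System.out.println(line);
        }
    }
}
```

After compiling, running it as `java -XX:+UseG1GC ActiveCollectors` and again with `-XX:+UseParallelGC` should show different New Gen and Old Gen collector names on JVMs that support both switches.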


Where Pauses Matter

Now that we have looked at the technology, let’s step

back and consider how much GC behavior matters to

different applications. Not all applications will be affected

by GC stalls equally, and not all will see stalls

as a problem.

Stalls are a big problem for interactive applications

like ecommerce. Online customers and users have

little tolerance for slow systems, and may abandon

a transaction or even a business after a few bad

experiences. A system like Java that can randomly

add many seconds to transaction time seems a poor

choice for interactive applications. Keeping stalls

under control is vital.

On the other extreme are batch applications. Here

the user is concerned with the start-to-finish time of

an application, not how long each operation takes.

Stalls of even many minutes may not be an issue if

the time to complete is acceptable.

Big Data applications are more likely to be affected

by stalls; the more memory that applications work

with the longer the stalls are likely to be. Using large

memory offers significant benefits in many cases.

Examples include travel sites that keep hotel inventory

in memory for fast access, interactive business

intelligence applications that drill down and sort, and

search applications that need quick access to their

indexes of search terms.

Other applications realize an efficiency and management

benefit from using more memory. A smaller number

of larger JVMs may be able to do more work with the

same resources and will be easier to configure and

monitor. But again, larger JVM heaps mean more

severe stalls. The typical application server model

uses many smaller instances for just this reason.

Even some Low Latency applications use Java. They

require consistent performance with variability far

below what other developers would find acceptable.

Low Latency applications often have constant data

feeds they must respond to; to stall would be to drop

some data. Low Latency applications use Unsafe

Java APIs to do as much work as possible off-heap.

They concern themselves entirely with New Gen GC,

as an Old Gen GC would quickly violate their performance

requirements. Stalls beyond a few milliseconds

or in some cases hundreds of microseconds are

unacceptable. Achieving that level of performance

requires Low Latency support at the operating system

level, as well as considerable tuning.
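
As a rough illustration of off-heap storage, the sketch below keeps fixed-size records in a direct ByteBuffer rather than in Java objects. Note the swap: it uses the standard java.nio API instead of the Unsafe APIs mentioned above, and the class name OffHeapPrices and the record layout (an id plus a price) are hypothetical.

```java
import java.nio.ByteBuffer;

// Stores fixed-size (id, price) records outside the Java heap in a direct buffer.
class OffHeapPrices {
    private static final int RECORD_BYTES = Long.BYTES + Double.BYTES;
    private final ByteBuffer buffer;

    OffHeapPrices(int capacity) {
        // allocateDirect asks for memory outside the garbage-collected heap.
        buffer = ByteBuffer.allocateDirect(capacity * RECORD_BYTES);
    }

    void put(int slot, long id, double price) {
        int base = slot * RECORD_BYTES;
        buffer.putLong(base, id);                   // absolute write, no position change
        buffer.putDouble(base + Long.BYTES, price);
    }

    long idAt(int slot) {
        return buffer.getLong(slot * RECORD_BYTES);
    }

    double priceAt(int slot) {
        return buffer.getDouble(slot * RECORD_BYTES + Long.BYTES);
    }

    public static void main(String[] args) {
        OffHeapPrices prices = new OffHeapPrices(1024);
        prices.put(3, 42L, 101.25);
        System.out.println(prices.idAt(3) + " @ " + prices.priceAt(3));
    }
}
```

Because the records are plain bytes rather than objects, updating them creates no garbage and gives the collector nothing to trace.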


Characterizing Pauses

Two factors affect GC behavior: the way the

application creates and manipulates objects and

the amount of memory the GC must process. The

frequency of GC stalls is most affected by application

characteristics: at what rate does it create new objects,

how big are they, how often are their contents modified,

and how long do they live? The severity comes

down to heap size, particularly the amount of heap

taken by live objects. Pause length is generally an

Old Generation issue rather than New Gen, which is

usually too small to trigger a long pause. Further, the

problem may not be the amount of overhead

associated with GC but where it happens. A 30

second pause for GC will certainly be noticeable,

while 300 pauses of 100 milliseconds each may

pass undetected.

Consider GC overhead for a moment. The worst case

is that the live set takes up the entire heap. The GC

runs, but is unable to reclaim more than a few bytes.

So it runs again, and again fails to return anything.

We have 100% GC overhead and no application

execution at all.

The best case is the opposite. With infinite memory

an application can create objects forever. Memory

never runs out, so GC never runs. 100% application

execution, 0% GC overhead, no stalls, no problem.

Reality is somewhere in the middle. GC overhead

follows a 1/x curve: given a fixed live set size, as the

heap size increases, GC overhead decreases as a

percentage of the total.
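
The 1/x shape can be made concrete with a deliberately simplified model, assumed here purely for illustration: each GC cycle does work proportional to the live set, and a cycle is needed once per (heap minus live set) of allocation. The class and method names are hypothetical, and real collectors have costs this model ignores.

```java
// Simplified model: GC work per unit of allocation = live / (heap - live).
class GcOverheadModel {
    // Returns relative GC work per unit of allocation for a fixed live set.
    static double relativeGcWork(double liveGb, double heapGb) {
        if (heapGb <= liveGb) {
            // Worst case from the text: the live set fills the heap and GC reclaims nothing.
            return Double.POSITIVE_INFINITY;
        }
        return liveGb / (heapGb - liveGb);
    }

    public static void main(String[] args) {
        double liveGb = 10.0; // hypothetical fixed live set
        for (double heapGb : new double[] {11, 15, 20, 30, 50, 110}) {
            System.out.printf("heap %5.0f GB -> relative GC work %.2f%n",
                    heapGb, relativeGcWork(liveGb, heapGb));
        }
    }
}
```

Under this model, doubling the empty part of the heap halves the GC work, which is why adding memory is such an effective lever as long as pause length is not the constraint.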

Measuring Pauses

We must understand the magnitude and source of

a problem before attempting to solve it. The jHiccup

tool we mentioned earlier is useful to quantify stalls

before going on to identify their source. jHiccup runs

as a small agent within the JVM being used by an

application. Every millisecond jHiccup wakes up and

records how long that took. The wakeup should take

no time at all. If a delay exists it’s because something

stopped the thread from running. That delay

might be due to GC, or it might be caused by something

further down the hardware and software stack.

To determine which one, you can run jHiccup on an

idle JVM. If that shows stalls, GC isn’t the immediate

problem.
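
The measurement idea described above can be sketched in a few lines. This is not jHiccup's actual code, only a minimal illustration of the same principle: sleep for a fixed interval, then record how much longer than that interval the wakeup took. The class name HiccupSketch and the choice to track only the worst value are assumptions.

```java
import java.util.concurrent.TimeUnit;

class HiccupSketch {
    // Sleeps for intervalNanos repeatedly and returns the largest overshoot seen (nanoseconds).
    static long maxHiccupNanos(long intervalNanos, int samples) {
        long worst = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            try {
                TimeUnit.NANOSECONDS.sleep(intervalNanos);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
            // Anything beyond the requested interval is time the thread could not run.
            long overshoot = System.nanoTime() - start - intervalNanos;
            if (overshoot > worst) {
                worst = overshoot;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        long worst = maxHiccupNanos(TimeUnit.MILLISECONDS.toNanos(1), 1000);
        System.out.println("worst hiccup (ms): " + worst / 1_000_000.0);
    }
}
```

A single worst-case number hides the distribution; a real tool would record every sample so that percentiles can be reported.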

Note that jHiccup won’t show performance issues

due to application inefficiencies or any other activity

that may cause poor performance (e.g. network,

database). jHiccup strictly shows stalls caused by

the JVM and below. When diagnosing a performance

issue developers can start by determining what

areas of the stack are likely at fault.


What Can You Do?

Now that you can use jHiccup to show performance

problems related to GC you can quantify the issues

and evaluate potential solutions. The first solution

is to try to tune JVM behavior. HotSpot offers a large

number of parameters you can manipulate to reduce

the impact of GC. This kind of tuning is generally

a stopgap; you can make the problem a little less

visible but you can’t make it go away. In addition, any

change in the application’s use of memory can undo

all that tuning effort. Be prepared to tune again and

again for each application.
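
When tuning, it helps to confirm which parameters the JVM is actually running with (for example -Xmx or -XX:+UseG1GC) and how much collection activity they produce. The following sketch uses the standard java.lang.management API to print both; the class name GcReport is hypothetical, and the reported time is the accumulated collection time per collector, not a pause distribution.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class GcReport {
    // One line per collector: name, number of collections, accumulated collection time.
    static List<String> collectorLines() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Arguments the JVM was started with, such as heap size or collector selection.
        System.out.println("JVM arguments: "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());
        for (String line : collectorLines()) {
            System.out.println(line);
        }
    }
}
```

Running this before and after a parameter change gives a quick sanity check, though stall measurements remain the better guide to what users experience.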

A second approach is to keep the heap small. Make

more instances with less data in each. Move some

of your data out of the heap into an external cache.

Create pools of reusable objects for threads, database

connections, etc. In essence, replace Java’s

generic GC with specific memory managers for your

own objects.
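
A pool of reusable objects can be as simple as the sketch below. It is a minimal single-threaded illustration with hypothetical names (SimplePool, createdCount), and it pools StringBuilder instances as a stand-in for more expensive resources such as threads or database connections.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Minimal single-threaded pool; a production pool also needs thread safety and size limits.
class SimplePool<T> {
    private final ArrayDeque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;
    private int created = 0;

    SimplePool(Supplier<T> factory) {
        this.factory = factory;
    }

    // Reuse an idle object if one exists; otherwise create a new one.
    T acquire() {
        T item = idle.poll();
        if (item == null) {
            created++;
            item = factory.get();
        }
        return item;
    }

    // Hand the object back so a later acquire() can reuse it.
    void release(T item) {
        idle.push(item);
    }

    int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        SimplePool<StringBuilder> pool = new SimplePool<>(StringBuilder::new);
        for (int i = 0; i < 1000; i++) {
            StringBuilder sb = pool.acquire();
            sb.setLength(0); // reset state left by the previous user
            sb.append("request ").append(i);
            pool.release(sb);
        }
        System.out.println("objects created: " + pool.createdCount()); // 1
    }
}
```

The trade-off is the one named above: the application takes over lifetime management for these objects, including resetting state and avoiding use after release.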

Another alternative is to defer expensive Old Gen GC

out into the future, and then kill and restart your Java

instances before that time comes. This is a common

part of a Low Latency developer’s modus operandi:

never let Old Gen GC happen by using multiple

instances, very large heaps, and a scheduled termination

and restart.

We have another solution, of course. That is to

replace a GC that rarely stalls with one that never

does. What would that take?

The Challenge of the Pauseless JVM

A JVM can eliminate GC stalls by avoiding long Stop-

the-World processing in every situation. First, the

Mark phase must be completely Concurrent, allowing

the application to continue to modify references and

objects as it runs. Multiple Marking passes are a

problem; the more the application modifies objects,

the more passes may be required to get everything.

In addition, Weak, Soft, and Final References present

challenges: they must be correctly processed in

a Concurrent Marker.

Concurrent Compaction is the bigger obstacle to a

pauseless JVM. Moving objects isn’t the issue; the issue is

updating all the references to the objects you’ve

just moved. The GC needs a way to handle an application

that tries to use a stale reference. If it can’t,

it’s stuck with a Monolithic Stop-the-World remapping

operation and a long pause.
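
One general way to picture handling a stale reference is a check on every reference load that consults forwarding information left by the collector. The toy below models that idea only; it is not a description of how any particular JVM implements it. Addresses are plain ints, the heap is a map, and all names (ForwardingSketch, relocate, resolve) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: "addresses" are ints and the "heap" maps an address to a payload.
class ForwardingSketch {
    private final Map<Integer, String> heap = new HashMap<>();
    private final Map<Integer, Integer> forwarding = new HashMap<>(); // old address -> new address

    void store(int address, String payload) {
        heap.put(address, payload);
    }

    // Collector side: move the object and record where it went.
    void relocate(int from, int to) {
        heap.put(to, heap.remove(from));
        forwarding.remove(to);      // 'to' is now a current address, not a stale one
        forwarding.put(from, to);
    }

    // Application side: every reference load passes through this check,
    // so a stale address is translated to the current one before use.
    int resolve(int address) {
        Integer moved = forwarding.get(address);
        while (moved != null) {
            address = moved;
            moved = forwarding.get(address);
        }
        return address;
    }

    String read(int address) {
        return heap.get(resolve(address));
    }

    public static void main(String[] args) {
        ForwardingSketch gc = new ForwardingSketch();
        gc.store(100, "order-1");
        gc.relocate(100, 200);              // object moves while the application still holds 100
        System.out.println(gc.read(100));   // order-1
    }
}
```

With a check like this in place, the collector can move an object first and let references be corrected as they are encountered, instead of stopping the application to fix them all at once.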

New Generation GC will become more of a concern

as applications’ demand for memory grows. Today’s

Monolithic Stop-the-World New Gen GCs are acceptable

only because New Gen is so small. But grow the

heap to 100 GB or more and suddenly New Gen has

a lot of work to do. Small stalls of a few milliseconds

will soon become noticeable and then problematic.


The Zing Virtual Machine

Zing is a high performance Java Virtual Machine for

applications that require predictable performance,

low latency, large memory or all of the above. Zing

runs on 64-bit Linux on Intel and AMD processors.

Zing supports hundreds of gigabytes of heap memory

in a single JVM instance and includes overdraft protection

capabilities that can assign a Java instance

extra memory as needed to avoid crashes.

Zing implements a Concurrent guaranteed-single-

pass Marker that is unaffected by the mutation rate

of objects. Zing processes Weak, Soft, and Final

References concurrently as well. Its Concurrent

Compactor moves objects and remaps references to

them without stalling applications. Zing is not Incremental;

it relocates an entire New or Old Generation

in a single GC cycle.

Zing implements Concurrent Compacting GC for both

Old and New Generations. This means that even the

largest New Gen will not stall applications. Zing does

not implement any Stop-the-World behavior. Ever. It is

fully Concurrent all the time. In the following graph,

you can see the result – consistent performance, all

the time.

By way of comparison, let’s return to the jHiccup

graph we saw earlier.

Here we saw significant stall behavior under HotSpot

using Parallel GC.

[Figure: 10 GB cache, 29 GB heap, Azul Zing. Hiccup Duration (msec, 0 to 10000) versus Elapsed Time (sec, 0 to 700).]




Comparisons to HotSpot’s other GC options may also

be instructive. As we discussed, CMS (Concurrent

Mark Sweep) is very efficient in its Mark and Sweep

phases, avoiding the Compact as long as it can.

In the case of this caching application, those Compactions

have a dramatic impact on performance.

In the Zing graph, the stalls the application

experienced with HotSpot have been reduced to

insignificance.

Small but measurable delays are still visible

on the SLA report:

[Figure: 10 GB cache, 29 GB heap, HotSpot, Parallel GC. Hiccup Duration (msec, 0 to 10000) versus Elapsed Time (sec, 0 to 700).]



[Figure: Hiccups by Percentile Distribution. Hiccup Duration (msec, 0 to 10000) versus Percentile (0% to 99.999%). Max=13.520.]


Note the difference in scale for this graph. Where

Parallel GC produced regular stalls of between four

and nine seconds, CMS produces less frequent

stalls of 28 and 35 seconds.

G1GC produces more frequent stalls still, accepting

higher frequency in exchange for lower severity; these

stalls are in the range of 1 to 1.7 seconds.

In summary, each of HotSpot’s GC implementations

fits a specific application behavior. CMS trades

cheap and efficient Sweep for long and painful but

infrequent Compact events. G1GC offers frequent

stalls of relatively short duration.

Parallel GC is somewhere in the middle, with more

frequent stalls of shorter length. Only Zing eliminates

stalls entirely, putting GC in the background while the

application continues to run.

[Figure: 10 GB cache, 29 GB heap, HotSpot, CMS GC. Hiccup Duration (msec, 0 to 40000) versus Elapsed Time (sec, 0 to 800).]

[Figure: 10 GB cache, 29 GB heap, HotSpot, G1GC. Hiccup Duration (msec, 0 to 2000) versus Elapsed Time (sec, 0 to 800).]




The 29 GB heap of the preceding examples is bigger

than most Java applications use, precisely because

of the performance problems we see here. What

about applications that need even more memory?

From what we have seen, we would expect a bigger

heap to make stalls longer or more frequent or both.

Mike McCandless, a contributor to the Open Source

project Apache Lucene™, decided to find out by

building an in-memory implementation of Wikipedia’s

English language index. The experiment required 132

GB of data in a 240 GB heap. This next graph shows

the performance of both HotSpot CMS and Zing with

a load of 200 queries per second.

Here we can see query times as percentages of the

total. CMS and Zing performed equivalently about

75% of the time, to a maximum of 100 milliseconds

per query. From the 75th percentile to the 99th,

Zing’s performance curve remained relatively flat

while the average response time for CMS queries

about doubled. Beyond the 99th percentile, CMS

began to bog down. 0.1% of its queries took one

second or longer to complete, 0.01% took 1.5

seconds or more, and the longest query took about

2.5 seconds. Zing was both faster and much more

consistent, with its longest query well below 300

milliseconds. A second run with double the load was

even more dramatic, with Zing providing the same

level of consistent performance and CMS worst case

reaching eight seconds.

Interestingly, these experiments did not cause CMS

to compact its heap. Had they done so, we would expect

to see an even more extreme worst-case performance

result.





In Conclusion

Java works very well and its performance is fine

for many applications, as long as their memory

requirements are small and users can live with the

occasional (or more than occasional) blip. However,

now that even low-cost servers come with hundreds

of GB of memory, more application types expect to

be able to take advantage of it. Low latency trading,

ad serving, complex event processing, analytics

and Big Data are just a few examples. Existing Java

technologies haven’t kept up. Mostly Concurrent

means sometimes not, which isn’t acceptable for a

business-critical application.

Up to now, developers have burned a lot of midnight

oil tuning away application pauses and coding around

the JVM memory manager to avoid GC. Unfortunately,

this made production applications fragile and difficult

to maintain, delayed upgrades and reduced the benefits

of using Java in the first place.

But now another option is available. Azul Zing

eliminates application pauses due to garbage collection

– entirely, forever. Just set the memory high and

go. For any application that requires low latency, big

memory, consistent performance or all three, Zing is

the answer.

About Azul Systems

Azul Systems delivers high-performance and elastic

Java Virtual Machines (JVMs) with unsurpassed

scalability, manageability and production-time visibility.

Designed and optimized for x86 servers and

enterprise-class workloads, Azul’s Zing JVM is the

only Java runtime that supports highly consistent

and pauseless execution for throughput-intensive

and QoS-sensitive Java applications. Azul’s products

enable organizations to dramatically simplify Java

deployments with fewer instances, greater response

time consistency, and dramatically better operating

costs.

To discover how the Zing JVM can help your

applications reduce latency, contact:

Azul Systems

1.650.230.6500

[email protected]

www.azulsystems.com


Copyright © 2013 Azul Systems, Inc. 1173 Borregas Ave, Sunnyvale, California, USA
