building and tuning high performance java platforms

SPRINGONE2GX WASHINGTON, DC

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

Building and Tuning High Performance Java Platforms

By Emad Benjamin, Principal Engineer at VMware @vmjavabook

Speaker Bio: Emad Benjamin, [email protected]


1993: Graduated with BE, published undergraduate thesis

1994-2005: Independent consultant on C++ and Java, open source contributions

2005-2010: VMware IT, virtualized all Java systems

2010-2012: Tech lead for vFabric Reference Architecture http://tinyurl.com/mvtyoq7

2013-2015: EA2 LiVefire Trainer; VMworld, UberConf, Spring1, PEX, Architecture Conf presenter

Blog: vmjava.com @vmjavabook

2014

Java Platforms Customer Experience from Around the World

Munich, Milan, Paris, Prague, Beijing, Moscow, Warsaw, Atlanta, Dubai, Shanghai, Tianjin, Kuala Lumpur, Sydney, Riyadh, Barcelona, New York, Chicago, San Francisco



How We Got Here? What is the Problem?

Will Third Platforms, Microservices and 12Factor Apps save us?


It's not that CIOs want to be out of the infrastructure business… They just want someone to show them how to break the silos

(1) 2015 PwC CEO Survey; (2) 2013 Bain & Company - The Value of Big Data; (3) 2014 IT Infrastructure Conversation - IBM; (4) Ernst & Young - 2014 Enterprise IT Trends and Investments; (5) 2014 Riverbed Technologies - The Transformers; (6) 2014 ElasticHosts CIO Study

44% of new applications failed to meet performance expectations (5)

90% of companies allocate at least 2X more cloud capacity than needed to ensure performance (6)

Information silos between business, application teams, and infrastructure teams lead to excessive over-provisioning of hardware in order to meet SLAs

How We Got Here? What is Really Broken?

Infrastructure Business Apps/Data

Comms Gap

Comms Gap

Big Data, Massively Scalable Apps, Internet of Things: a new transformation is upon us

New Requirements New Capabilities Giving Rise to New Data-Driven Competitors

Massive-scale apps and Mobile

New data types and sources

Internet of Things

Predictive analytics

In-memory processing

Hadoop storage

Enterprise Architecture Defined

•  Enterprise architecture is a comprehensive framework used to manage and align an organization's Information Technology (IT) assets, people, operations, and projects with its operational characteristics.

•  Many standards exist today (http://www.opengroup.org/standards/ea); TOGAF and Zachman, to name a couple.

Business Process IT Services

Apps

Data

Ops/Infrastructure

Governance, Architecture, Program and Project Mgmt.

CIO

Business Sponsors

VP Apps

VP Ops

Enterprise Architect

Achieve Contracted SLA to Business Sponsor

Enterprise Application Architecture with Platform Engineering Focus

[Diagram: the same enterprise architecture layers and roles as the previous slide (Business Process, IT Services, Apps, Data, Ops/Infrastructure; CIO, Business Sponsors, VP Apps, VP Ops, Enterprise Architect), with Platform Engineering as the focus. Goal: achieve contracted SLA to business sponsor.]

Platform Engineering: An Intersection of Three Disciplines

Most misunderstood discipline: developers size (or wrongly size) this, but Ops own it. A battle is brewing over who seeks control.



Did You Say Cloud Native? You Mean Platform Engineered, Right!?

Or does "native" imply all your platform problems are natively and magically fixed? :)

Platform Engineering Focuses on Top Down Approach

•  Top-Down Design
•  Focuses on starting at the application layer
•  Understanding the application use cases and how to best map them onto appropriate infrastructure
•  Top-down application platform design, but with bottom-up infrastructure build-out, once you know what the app needs

External Application Platforms

Data warehousing and BI

EDW ODS Datamart Dashboards – Business Intelligence

Internal Application Platforms

External Application Platforms External Apps

Internal Apps

Data Warehousing

Web Portals

Middleware Services

Databases Batch/Data Movement

Storage Network OS Virtualization Infrastructure Operations

Top Down Analysis and Design

•  People + Process + Technology (Apps + Infra) = Robust Platform Engineering

Understanding SLAs – Is All About Robust Platform Engineering

Platform Engineer

EA2

Developer

Deployment (JVM/App Runtime)

Infrastructure

People

Process

Technology

Isn’t this DevOps?

§  The DevOps playbook shouldn't be just about the Continuous Integration (CI) use case
§  But about how to understand the best way to build application platforms
•  Need to understand sizing
•  Wiring of app components
•  Performance, what gets deployed where
•  Is it scale-out or scale-up or a mixture


Key Fundamentals of Java Platforms: Platform Engineering Rules Defined


Rule #1: Java Platform JVM Sizing - Understanding Memory Sizing

HotSpot JVMs on VMware vSphere

[Diagram: nested memory regions of a VM]

•  VM Memory = Guest OS Memory + JVM Memory
•  JVM Memory = JVM Max Heap (-Xmx, with Initial Heap -Xms) + Perm Gen (-XX:MaxPermSize) + Java Stack (-Xss per thread) + Other mem
•  Non-direct memory: "heap"
•  Direct native memory: "off-the-heap"

HotSpot JVMs on vSphere

VM Memory = Guest OS Memory + JVM Memory

JVM Memory = JVM Max Heap (-Xmx value) + JVM Perm Size (-XX:MaxPermSize) + NumberOfConcurrentThreads * (-Xss) + "other mem"

•  Guest OS Memory is approximately 1GB (depends on OS and other processes)
•  Perm Size is an area additional to the -Xmx (Max Heap) value and is not GC-ed because it contains class-level information.
•  "other mem" is the additional memory required for NIO buffers, JIT code cache, classloaders, socket buffers (receive/send), JNI, and GC internal info
•  If you have multiple JVMs (N JVMs) on a VM then:
–  VM Memory = Guest OS Memory + N * JVM Memory

JVMs Sizing Example

•  JVM Max Heap -Xmx (4096m), Initial Heap -Xms (4096m)
•  Perm Gen -XX:MaxPermSize (256m)
•  Java Stack -Xss per thread (256k*100)
•  Other mem (=217m)
•  JVM Memory (4588m)
•  Guest OS Memory: 500m used by OS
•  VM Memory (5088m); set memory reservation to 5088m
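The memory formulas above can be sketched as a small calculator. This is only an illustration in Python; the function names are made up for this sketch, and the inputs are the figures printed on the sizing example slide. With the inputs as printed, the sum comes to about 4594m for the JVM and 5094m for the VM, within a few MB of the 4588m and 5088m shown on the slide.

```python
def jvm_memory_mb(max_heap_mb, perm_mb, threads, stack_kb, other_mb):
    """JVM Memory = -Xmx + -XX:MaxPermSize + threads * -Xss + other mem."""
    stack_mb = threads * stack_kb / 1024  # per-thread stacks, converted to MB
    return max_heap_mb + perm_mb + stack_mb + other_mb


def vm_memory_mb(guest_os_mb, jvm_mb, jvm_count=1):
    """VM Memory = Guest OS memory + N * JVM Memory."""
    return guest_os_mb + jvm_count * jvm_mb


# Inputs taken from the sizing example slide (illustrative only).
jvm = jvm_memory_mb(max_heap_mb=4096, perm_mb=256, threads=100,
                    stack_kb=256, other_mb=217)
vm = vm_memory_mb(guest_os_mb=500, jvm_mb=jvm)
print(f"JVM memory ~{jvm:.0f}m, VM memory / reservation ~{vm:.0f}m")
```

Passing `jvm_count=N` covers the case of N JVMs stacked on one VM.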

Java Platform JVM Sizing Rule #1 - Understanding Memory Sizing

•  If JVM heap (-Xmx) is N, then JVM Memory is (1.1 to 1.25) * N. For example, if heap is 4GB then JVM Memory = (1.1 to 1.25) * 4 = 4.4 to 5GB
•  Always use memory-based sizing, so if a system needs 400GB made from 100 JVMs of 4GB each, then this is the capacity you are working to. CPU consumption will be secondary

VM Memory = Guest OS Memory + JVM Memory

JVM Memory = JVM Max Heap (-Xmx value) + JVM Perm Size (-XX:MaxPermSize) + NumberOfConcurrentThreads * (-Xss) + "other mem"
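As a rough illustration of the rule of thumb, the following Python sketch multiplies a heap size by the 1.1 to 1.25 factors and totals the heap capacity of a fleet of JVMs. The function names are hypothetical and exist only for this example.

```python
def jvm_memory_range_gb(heap_gb, low=1.1, high=1.25):
    """Rule of thumb: JVM process memory is 1.1x to 1.25x the -Xmx heap."""
    return heap_gb * low, heap_gb * high


def platform_heap_capacity_gb(jvm_count, heap_gb):
    """Memory-based sizing: total heap the platform must provide."""
    return jvm_count * heap_gb


low_gb, high_gb = jvm_memory_range_gb(4)
capacity_gb = platform_heap_capacity_gb(100, 4)
print(f"4GB heap -> {low_gb:.1f} to {high_gb:.1f}GB per JVM process")
print(f"100 JVMs of 4GB -> {capacity_gb}GB of heap capacity")
```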


Rule #2: Java Platform Multi-tier Sizing - Decisions in One Tier Impact the Next

Application Platforms Are Multi-Tier

•  Java Platforms are multitier and multi org

DB Servers Java Applications

Load Balancer Tier

Load Balancers Web Servers

IT Operations Network Team

IT Operations Server Team

IT Apps – Java Dev Team

IT Ops & Apps Dev Team

Organizational Key Stakeholder Departments

Web Server Tier Java App Tier DB Server Tier

DB Servers

Load Balancer Tier Web Server Tier

Java App Tier

DB Server Tier

Web Server Pool

App Server Pool

DB Connection Pool

HTML static lookup requests (load on web servers)

Dynamic requests to DB, create Java threads (load on Java app server and DB)

Rule #2: Java Platform Multi-tier Sizing - Decisions in One Tier Impact the Next

§  Sizing of each tier has an impact on the next tier; scaling up and/or out of one tier has a downstream impact on the next tier. You need to expand each tier proportionally to avoid bottlenecks or throttling
§  Use separate load balancer pools to isolate functional application groups
•  A public portal's order management app would have its own load balancer pool, e.g. order-mgmt-pool
•  Microservices and/or middleware services would have their own load balancer pool. This would mean the public portal order-mgmt-pool would load balance the microservices calls via microservices-pool
•  The separate pools will help you measure traffic on each functional application pool group, and hence determine the inter-tier traffic volume between the multiple tiers.
§  Count the total number of threads of each application container and determine what the healthy thread to DB connection ratio is
•  If your application exhibits deadlocks, try to increase DB connections to >= nThreads+1, until such time as you can resolve the code issue. Note: nThreads is like maxThreads in the Tomcat config.
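The thread to DB connection check above could be expressed along these lines. This is a hedged Python sketch: the helper names and the example figure of 200 container threads are assumptions made for illustration, not values from the slides.

```python
def min_db_connections(n_threads):
    """Stop-gap from the rule above: DB connections >= nThreads + 1."""
    return n_threads + 1


def thread_to_connection_ratio(total_threads, db_connections):
    """Ratio of application container threads to DB connections."""
    return total_threads / db_connections


def pool_is_undersized(container_threads, db_pool_size):
    """True when the DB pool is below the nThreads + 1 stop-gap."""
    return db_pool_size < min_db_connections(container_threads)


# Hypothetical container with 200 worker threads and a 100-connection pool.
needed = min_db_connections(200)
ratio = thread_to_connection_ratio(200, 100)
print(f"need at least {needed} connections; current ratio is {ratio:.1f}:1")
print("undersized" if pool_is_undersized(200, 100) else "ok")
```

The stop-gap only applies while a deadlock in the code is being investigated; a healthy ratio for normal operation has to come from measurement.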


Rule #3: Java Platforms and Understanding Various Workload Categories

Java Platform Categories – Category 1 (many smaller JVMs)

•  Smaller JVMs: < 4GB heap, 4.5GB Java process, and 5GB for the VM
•  vSphere hosts with < 96GB RAM are more suitable, as by the time you stack the many JVM instances you are likely to reach the CPU boundary before you can consume all of the RAM. For example, if instead you chose a vSphere host with 256GB RAM, then 256/4.5GB => 57 JVMs, which would clearly reach the CPU boundary
•  Multiple JVMs per VM
•  Use resource pools to manage different LOBs
•  Consider using 4 socket servers to get more cores

Category 1: 100s to 1000s of JVMs

Java Platform Categories – Category 1

•  Consider using 4 socket servers instead of 2 sockets to get more cores

Use 4 socket servers to get more cores

Category 1: 100s to 1000s of JVMs

External Application

Public Portal App

Middleware Services

Category-1

Java Platform Categories – Category 2 (fewer, larger JVMs)

•  Fewer JVMs, < 20
•  Very large JVMs, 32GB to 128GB
•  Always deploy 1 VM per NUMA node and size to fit perfectly
•  1 JVM per VM
•  Choose 2 socket vSphere hosts, and install ample memory, 128GB to 512GB
•  Examples are in-memory databases, like SQLFire and GemFire
•  Apply latency-sensitive best practices: disable interrupt coalescing on pNIC and vNIC
•  Dedicated vSphere cluster

Category 2: a dozen or so very large JVMs

Use 2 socket servers to get larger NUMA nodes

External Application

Public Portal App

Middleware Services

Category-2

Java Platform Categories – Category 3

•  Many smaller JVMs accessing information from fewer large JVMs

Category 3: Category-1 accessing data from Category-2

Resource Pool 1: Gold LOB 1

Resource Pool 2: Silver LOB 2

Category-3


Rule #3: Java Platforms and Understanding Various Workload Categories

•  If categories are ignored, then either SLAs are missed and/or you are excessively provisioning infrastructure
•  Category 3 is a golden category; every enterprise has one, or will eventually have one
–  Category 3 is when second-gen web applications are accessing a third-platform, microservices type of platform
–  Third-platform-only systems are not practical without a certain level of second-gen interaction
•  Category 1 is the most common in the world presently, and suffers from fragmentation of JVMs
–  To avoid too many JVMs, consider a minimum of 4GB heap sizes: "4GB is the new 1GB"
–  Scale up before scale-out; need adequate scale-out of minimum n+1, where n=2
–  Scale-out JVMs are expensive on administration, CPU, and performance
–  CPU-bound systems, because of the large number of JVMs and hence the excessive GC cycles (4 socket systems will work better for these)
•  Category 2
–  In-memory DBs: avoid too many JVMs, prefer large JVMs, but don’t exceed the NUMA boundary
–  Messaging systems, microservices, and other back-office data providers can be treated this way
–  Memory bound (2 socket systems will work better)


Rule #4 Establish a Building Block VM and JVM


Design and Sizing of Application Platforms

Step 1 - Establish Load Profile

•  From production logs/monitoring reports measure:
–  Concurrent users
–  Requests per second
–  Peak response time
–  Average response time
–  Establish your response time SLA

Step 2 - Establish Benchmark

•  Iterate through the benchmark test until you are satisfied with the load profile metrics and your intended SLA
–  After each benchmark iteration you may have to adjust the application configuration
–  Adjust the vSphere environment to scale out/up in order to achieve your desired number of VMs, vCPU and RAM configurations

Step 3 - Size Production Env.

•  The size of the production environment would have been established in Step 2, hence either you roll out the environment from Step 2 or build a new one based on the numbers established

Step 2 – Establish Benchmark

DETERMINE HOW MANY VMs: Establish Horizontal Scalability (Scale Out Test)
•  How many VMs do you need to meet your response time SLAs without reaching 70%-80% saturation of CPU?
•  Establish your horizontal scalability factor before bottlenecks appear in your application

ESTABLISH BUILDING BLOCK VM: Establish Vertical Scalability (Scale Up Test)
•  Establish how many JVMs on a VM
•  Establish how large a VM would be in terms of vCPU and memory

[Flowchart: Scale Up Test produces the Building Block VM; Scale Out Test replicates Building Block VMs; then "SLA OK?"]
•  Yes: test complete
•  No: investigate the bottlenecked layer (network, storage, application configuration, and vSphere)
–  If the scale-out bottlenecked layer is removed, iterate the scale-out test
–  If it is a building block app/VM configuration problem, adjust and iterate
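The scale-out test loop can be sketched as follows. This is a toy Python illustration, not a benchmarking tool: `run_benchmark` stands in for whatever load test is actually run, and `fake_benchmark` with its numbers is entirely made up so the sketch can execute.

```python
def find_vm_count(run_benchmark, sla_ms, cpu_ceiling=0.70, max_vms=16):
    """Add building-block VMs until the SLA is met below the CPU ceiling.

    run_benchmark(vm_count) is a hypothetical callable returning
    (response_time_ms, cpu_utilization) for that many building-block VMs.
    Returns the VM count, or None if a bottleneck must be investigated.
    """
    for vm_count in range(1, max_vms + 1):
        response_ms, cpu = run_benchmark(vm_count)
        if response_ms <= sla_ms and cpu < cpu_ceiling:
            return vm_count
    return None


def fake_benchmark(vm_count):
    # Toy model (assumption): load spreads evenly across identical VMs.
    return 800 / vm_count, min(1.0, 2.4 / vm_count)


vm_count = find_vm_count(fake_benchmark, sla_ms=200)
print(f"building-block VMs needed: {vm_count}")
```

A `None` result corresponds to the "No" branch of the flowchart: the SLA was not reached by adding VMs alone, so a layer other than the building block needs attention.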


•  JVM Max Heap -Xmx (30g), Initial Heap -Xms (30g)
•  Perm Gen -XX:MaxPermSize (0.5g)
•  Java Stack -Xss per thread (1M*500)
•  Other mem (=1g)
•  JVM Memory for SQLFire (32g)
•  Guest OS Memory: 0.5-1g used by OS
•  VM Memory for SQLFire (34g); set memory reservation to 34g


Larger JVMs for In-Memory Data Grids

•  96GB RAM on server
•  Each NUMA node has about 45GB (94/2, less overhead)
•  8 vCPU VMs with less than 45GB RAM on each VM
•  If a VM is sized greater than 45GB or 8 CPUs, then NUMA interleaving occurs and can cause a 30% drop in memory throughput performance

Most Common Sizing and Configuration Question

[Diagram: JVM-1 and JVM-2 running on the ESXi scheduler, with three ways to scale; labels show 2GB JVMs on 2vCPU VMs and 4GB JVMs on 2vCPU VMs]

•  Option-1: Scale out VM and JVM
•  Option-2: Scale up JVM heap size
•  Option-3: Scale up VM with multiple JVM instances

Comparing Scenario-1 of 4 JVMs vs. Scenario-2 of 2 JVMs

•  Scenario-1: 4 JVMs of 1GB heap each, average R/T 166ms

•  Scenario-2: 2 JVMs of 2GB heap each, average R/T 123ms

[Chart: response time (axis 0 to 600) over 25 samples, scenario-1 RT vs. scenario-2 RT]

Scenario-2 has 26% better response time

Comparing 4 JVMs vs. 2 JVMs

Scenario-2 (2 JVMs, 2GB heap each) has 60% less CPU utilization than Scenario-1 (4 JVMs, 1GB heap each)

What else to consider when sizing?

•  Mixed workloads: job scheduler vs. web app require different GC tuning
•  Job schedulers care about throughput
•  Web apps care about minimizing latency and response time
•  You can’t have both reduced response time and increased throughput without compromise
•  Separate the concerns for optimal tuning

[Diagram: Job and Web workloads across JVM-1 to JVM-4, with Vertical and Horizontal separation]

What is the practical limit for JVM Memory sizing?

•  1st limit: 64-bit Java theoretical limit, 16 exabytes
•  2nd limit: Guest OS limit, 1 to 16TB
•  3rd limit: ESXi6 limit, 128 vCPU, 4TB
•  4th limit: physical server practical limit, ~256GB to 1TB RAM
•  5th limit: per NUMA node RAM

The most limiting practical sizing factor is the per NUMA node RAM

vSphere maximums: https://www.vmware.com/pdf/vsphere6/r60/vsphere-60-configuration-maximums.pdf


Rule #4: Establish a Building Block VM and JVM

•  It's important to establish a well-advertised building block VM and JVM. You may have to have one building block per application group; some have 3 building blocks: small, medium, large, etc.
–  Build new systems based on the building block VM/JVM
•  Don’t mix application types within the same building block, but if you do then you have to test it and understand the lifecycle impacts. One app will cause downtime to the next app if the JVM is restarted
•  Minimize the number of JVMs and avoid the costly administration of JVM sprawl
•  Understand what is the largest JVM heap you can afford, and understand the NUMA boundary
•  Don’t mix workload types, like a portal app with a batch system. Portal apps care about response time, hence are latency sensitive, while batch jobs care about memory throughput. The two workload behaviors are at odds with each other and will force you to compromise the tuning of the platform.
–  You can specialize the JVM per workload type too.



“It’s not that I’m so smart, it’s just that I stay with problems longer.”

GC Tuning Deep Dive

Which GC?

•  VMware doesn’t care which GC you select, because of the degree of independence of Java to OS and OS to Hypervisor

Tuning GC – Art Meets Science!

•  Either you tune for Throughput or reduction of Latency, one at the cost of the other

Tuning Decisions

•  Increase throughput (Job): improved throughput, longer R/T, increased latency impact

•  Reduce latency (Web): improved R/T, reduced latency impact, slightly reduced throughput

Sizing The Java Heap

•  JVM Max Heap -Xmx (4096m)
•  YoungGen -Xmn (1350m): Eden Space, Survivor Space 1, Survivor Space 2; quick minor GC
•  OldGen (2746m): Old Generation; slower full GC

Inside the Java Heap

Parallel Young Gen and CMS Old Gen

[Diagram: timeline of application threads, minor GC threads, and concurrent mark and sweep GC]

•  Young Generation (size -Xmn, with survivor spaces S0 and S1): minor GC, parallel GC in YoungGen using -XX:+UseParNewGC and -XX:ParallelGCThreads

•  Old Generation (size -Xmx minus -Xmn): major GC, concurrent in OldGen using -XX:+UseConcMarkSweepGC

High Level GC Tuning Recipe

•  Step A - Young Gen Tuning: measure minor GC duration and frequency; adjust -Xmn (Young Gen size) and/or ParallelGCThreads

•  Step B - Old Gen Tuning: measure major GC duration and frequency; adjust heap space -Xmx

•  Step C - Survivor Spaces Tuning: adjust -Xmn and/or SurvivorSpaces

Applies to Category-1 and 2 Platforms

Applies to Category-2 Platforms
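The "measure duration and frequency" part of the recipe can be illustrated with a small Python sketch. It assumes the GC events have already been extracted from a GC log into `(start_time_s, pause_ms)` tuples; the parsing step, the function name, and the sample numbers are all assumptions for illustration.

```python
def gc_stats(events):
    """Summarize GC events of one type (minor or major).

    events: list of (start_time_s, pause_ms) tuples, sorted by start time.
    Returns (mean pause in ms, mean interval between GC starts in s).
    The interval is None when fewer than two events are given.
    """
    pauses = [pause for _, pause in events]
    starts = [start for start, _ in events]
    mean_pause = sum(pauses) / len(pauses)
    intervals = [b - a for a, b in zip(starts, starts[1:])]
    mean_interval = sum(intervals) / len(intervals) if intervals else None
    return mean_pause, mean_interval


# Made-up minor GC samples: one every 10 seconds, pausing 40 to 60 ms.
minor_events = [(0, 40), (10, 50), (20, 60)]
mean_pause_ms, mean_interval_s = gc_stats(minor_events)
print(f"minor GC: {mean_pause_ms:.0f}ms pause every {mean_interval_s:.0f}s")
```

Running the same summary before and after each -Xmn or -Xmx change gives the two numbers the recipe asks you to compare.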

Why is Duration and Frequency of GC Important?

[Diagram: timeline of Young Gen minor GCs and Old Gen major GCs, marking the duration of each GC and the frequency (interval) between GCs]

We want to ensure regular application user threads get a chance to execute in between GC activity


Impact of Increasing Young Generation (-Xmn)

•  Young Gen minor GC: less frequent minor GC but longer duration

•  Old Gen major GC: potentially increased major GC frequency

•  You can mitigate the increase in major GC frequency by increasing -Xmx

•  You can mitigate the increase in minor GC duration by increasing ParallelGCThreads


Impact of Reducing Young Generation (-Xmn)

•  Young Gen minor GC: more frequent minor GC but shorter duration

•  Old Gen major GC: potentially increased major GC duration

•  You can mitigate the increase in major GC duration by decreasing -Xmx

Survivor Spaces

•  Survivor Space Size = -Xmn / (-XX:SurvivorRatio + 2)
–  Decreasing the survivor ratio causes an increase in survivor space size
–  An increase in survivor space size causes Eden space to be reduced, hence:
•  Minor GC frequency will increase
•  More frequent minor GC causes objects to age quicker
•  Use -XX:+PrintTenuringDistribution to measure how effectively objects age in survivor spaces.
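To make the survivor space formula concrete, here is a short Python sketch. The function name is hypothetical, and pairing the 1350m young generation from the heap sizing slide with SurvivorRatio values of 8 and 6 is an assumption made purely for illustration.

```python
def young_gen_layout_mb(xmn_mb, survivor_ratio):
    """Split the young generation into Eden and two survivor spaces.

    Survivor Space Size = -Xmn / (-XX:SurvivorRatio + 2);
    Eden is what remains after the two survivor spaces.
    """
    survivor = xmn_mb / (survivor_ratio + 2)
    eden = xmn_mb - 2 * survivor
    return eden, survivor


eden_8, survivor_8 = young_gen_layout_mb(1350, 8)
eden_6, survivor_6 = young_gen_layout_mb(1350, 6)
print(f"SurvivorRatio=8: Eden {eden_8}m, each survivor {survivor_8}m")
print(f"SurvivorRatio=6: Eden {eden_6}m, each survivor {survivor_6}m")
```

Lowering the ratio from 8 to 6 grows each survivor space and shrinks Eden, which is why minor GC frequency goes up.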


Decrease Survivor Spaces by Increasing Survivor Ratio

[Diagram: Young Gen with survivor spaces S0 and S1 reduced, alongside Old Gen]

•  Before: more frequent minor GC but shorter duration

•  After reducing survivor space: minor GC frequency is reduced, with a slight increase in minor GC duration


Increasing Survivor Ratio Impact on Old Generation

Young Gen Minor GC

Old Gen Major GC

Smaller survivor spaces (S0, S1): increased tenuring/promotion to Old Gen, hence increased major GC

CMS Collector Example

•  Full GC every 2hrs and overall heap utilization down by 30%

java -Xms50g -Xmx50g -Xmn16g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8 -XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=6 -XX:+OptimizeStringConcat -XX:+UseCompressedStrings -XX:+UseStringCache

Parallel Young Gen and CMS Old Gen

[Diagram: timeline of application user threads, minor GC threads, and concurrent mark and sweep]

•  Young Gen minor GC: parallel/throughput GC in YoungGen using -XX:+UseParNewGC and -XX:ParallelGCThreads

•  Old Gen major GC: concurrent, using -XX:+UseConcMarkSweepGC


•  Customer chose not to use LargePages:
–  They were content with the performance they had already achieved and did not want to make OS-level changes that may impact the amount of total memory available to other processes that may or may not be using LargePages.

•  The -XX:+UseNUMA JVM option also does not work with -XX:+UseConcMarkSweepGC

•  An alternative would be to experiment with:
•  numactl --cpunodebind=0 --membind=0 myapp

•  However, we found ESX NUMA locality algorithms were doing great at localizing and did not need further NUMA tuning.


java -Xms30g -Xmx30g -Xmn10g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8 -XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=4 -XX:+UseCompressedOops -XX:+OptimizeStringConcat -XX:+UseCompressedStrings -XX:+UseStringCache

§  This JVM configuration scales up and down effectively

§  Set -Xmx equal to -Xms, and -Xmn to 33% of -Xmx

§  -XX:ParallelGCThreads: minimum 2, but less than 50% of the vCPUs available to the JVM. NOTE: Ideally use it for VMs with 4 vCPUs or more; if used on 2 vCPU VMs, drop the -XX:ParallelGCThreads option and let Java select it
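The scaling guidance above could be captured in a small helper that derives the size-dependent flags from a heap size and vCPU count. This is a Python sketch under stated assumptions: the function name is invented, and the thread count is taken as half the vCPUs (the value the examples in these slides actually use), even though the text says "less than 50%".

```python
def cms_flags(heap_gb, vcpus):
    """Derive the size-dependent CMS flags from heap size and vCPU count.

    Assumptions for this sketch: -Xmn is 33% of the heap rounded to whole
    GB, and ParallelGCThreads is half the vCPUs with a floor of 2.
    """
    xmn_gb = max(1, round(heap_gb * 0.33))
    flags = [
        f"-Xms{heap_gb}g",  # -Xms equal to -Xmx
        f"-Xmx{heap_gb}g",
        f"-Xmn{xmn_gb}g",
        "-XX:+UseConcMarkSweepGC",
        "-XX:+UseParNewGC",
        "-XX:CMSInitiatingOccupancyFraction=75",
        "-XX:+UseCMSInitiatingOccupancyOnly",
    ]
    if vcpus >= 4:
        # On 2 vCPU VMs the option is dropped so the JVM picks its own value.
        flags.append(f"-XX:ParallelGCThreads={max(2, vcpus // 2)}")
    return flags


print("java " + " ".join(cms_flags(30, 8)))
```

The remaining flags from the command line above do not depend on heap size or vCPU count, so they are left out of the sketch.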

Another Example (360GB JVM)

•  A monitoring system that does not scale out runs in a single large JVM of -Xmx360g, i.e. 360GB
•  The server has 512GB and 2 sockets of 10 cores each
•  360GB + 1GB for OS + 25% * 360GB for off-the-heap overhead
–  => 360GB + 1GB + 90GB => 451GB is the VM's memory reservation
•  The VM has 20 vCPUs

java -Xms360g -Xmx360g -Xmn10g -Xss1024k -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8 -XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=10 -XX:+OptimizeStringConcat -XX:+UseCompressedStrings -XX:+UseStringCache -XX:+DisableExplicitGC -XX:+AlwaysPreTouch

IBM JVM - GC Choice

Columns: -Xgcpolicy mode | Description | Usage example

-Xgcpolicy:optthruput (default in WAS 6 and 7) [Job]
•  Description: Performs the mark and sweep operations during garbage collection when the application is paused, to maximize application throughput. Mostly not suitable for multi-CPU machines.
•  Usage example: Apps that demand high throughput but are not very sensitive to the occasional long garbage collection pause

-Xgcpolicy:optavgpause [Web]
•  Description: Performs the mark and sweep concurrently while the application is running to minimize pause times; this provides the best application response times. There is still a stop-the-world GC, but the pause is significantly shorter. After GC, the app threads help out and sweep objects (concurrent sweep).
•  Usage example: Apps sensitive to long latencies; transaction-based systems where response times are expected to be stable

-Xgcpolicy:gencon (default in WAS 8) [Web]
•  Description: Treats short-lived and long-lived objects differently to provide a combination of lower pause times and high application throughput. Before the heap is filled up, each app thread helps out and marks objects (concurrent mark).
•  Usage example: Latency-sensitive apps, where objects in the transaction don't survive beyond the transaction commit


JVM Option: Description

•  -Xmn10g: Fixed size Young Generation

•  -XX:+UseConcMarkSweepGC: The concurrent collector is used to collect the tenured generation and does most of the collection concurrently with the execution of the application. The application is paused for short periods during the collection. A parallel version of the young generation copying collector is used with the concurrent collector.

•  -XX:+UseParNewGC: This sets whether to use multiple threads in the young generation (with CMS only!). By default, this is enabled in Java 6u13, probably any Java 6, when the machine has multiple processor cores.

•  -XX:CMSInitiatingOccupancyFraction=75: This sets the percentage of the heap that must be full before the JVM starts a concurrent collection in the tenured generation. The default is somewhere around 92 in Java 6, but that can lead to significant problems. Setting this lower allows CMS to run more often (all the time sometimes), but it often clears more quickly to avoid fragmentation.

•  -XX:+UseCMSInitiatingOccupancyOnly: Indicates all concurrent CMS cycles should start based on -XX:CMSInitiatingOccupancyFraction=75

•  -XX:+ScavengeBeforeFullGC: Do young generation GC prior to a full GC.

•  -XX:TargetSurvivorRatio=80: Desired percentage of survivor space used after scavenge.

•  -XX:SurvivorRatio=8: Ratio of eden/survivor space size

•  -XX:+UseBiasedLocking: Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.

•  -XX:MaxTenuringThreshold=15: Sets the maximum tenuring threshold for use in adaptive GC sizing. The current largest value is 15. The default value is 15 for the parallel collector and is 4 for CMS.

•  -XX:ParallelGCThreads=4: Sets the number of garbage collection threads in the young/minor garbage collectors. The default value varies with the platform on which the JVM is running.

•  -XX:+UseCompressedOops: Enables the use of compressed pointers (object references represented as 32-bit offsets instead of 64-bit pointers) for optimized 64-bit performance with Java heap sizes less than 32GB.

•  -XX:+OptimizeStringConcat: Optimize String concatenation operations where possible. (Introduced in Java 6 Update 20)

•  -XX:+UseCompressedStrings: Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

•  -XX:+UseStringCache: Enables caching of commonly allocated strings

External Apps – Public Portal To-Be Architecture: External Application and Data Platforms

[Diagram: Public Portal App, Middleware Services, Data Platform]

Public Portal App (20 hosts, 4 sockets per host, 40 cores per host, 256GB RAM)

•  2000 JVMs, made of 100 unique apps
•  1GB heap on each JVM
•  2000/100 => 20 JVM instances for each unique app
•  20 JVMs per app for scale-out
•  CPU utilization is 10% at peak

Middleware Services

•  400 JVMs hosting REST services
•  1GB heap on each JVM
•  1 app made of 25 unique REST services, which collectively form the same service
•  1 REST service per JVM
•  CPU utilization is 15% at peak

Data Platform

•  4 node RAC for Portal App DB
–  20,000 DB connections
–  SGA 64GB on each RAC node
–  CPU utilization is 30% at peak
•  1 single instance for Middleware Services
–  6000 DB connections
–  64GB SGA
–  CPU utilization is 30% at peak
•  Batch process runs on one of the RAC nodes, 12GB, 3% CPU at peak

Let's focus on the middleware services of the external application platform

XYZCars.com – Current Middleware Services Platform

–  25 unique REST services
–  Xyzcars.com deployed each REST service on a dedicated JVM
–  The 25 JVMs are deployed on a physical box of 12 cores total (2 sockets, 6 cores each socket) and 96GB RAM
–  There are a total of 16 hosts/physical boxes, hence a total of 400 JVMs servicing peak transactions for their business
–  The current peak CPU utilization across all is at 15%
–  Each JVM has a heap size of -Xmx 1024MB
–  The majority of transactions performed on xyzcars.com traverse ALL of the REST services, and hence all of the 25 JVMs

XYZCars.com – Current Middleware Services Platform

[Diagram: a load balancer layer in front of hosts, each running JVMs R1 to R25, one REST service per JVM. R1 to R25 denote REST services 1…25; each box denotes a JVM.]

Solution 1 – Virtualize 1 REST : 1 JVM with 25 JVMs Per VM, 2 VMS Per Host

25 JVMs, 1 REST per JVM

On 1 VM

Solution 1 (400GB) – Virtualize 1 REST : 1 JVM with 25 JVMs Per VM, 2 VMs Per Host

–  Sized for current workload, 400GB heap space
–  Deployed 25 JVMs on each VM, each JVM is 1GB
–  Accounting for JVM off-the-heap overhead
•  25GB*1.25=31.25GB
–  Add Guest OS 1GB
•  31.25+1=32.25GB
–  8 hosts
–  25 unique REST services
•  Each REST service deployed in its own JVM
•  Original call paradigm has not changed

50 JVMs to 12 cores: this may be an issue. While the CPU utilization is originally at 15%, you can assume 30%+ CPU utilization is the new level. However, in actual fact response time may suffer significantly due to coinciding GC cycles that can cause CPU contention.

XYZCars.com – Current Java Platform

[Diagram: a load balancer layer in front of JVMs R1 to R25, one REST service per JVM. R1 to R25 denote REST services 1…25; each box denotes a JVM.]

Solution 1 (800GB) – Virtualize 1 REST : 1 JVM with 25 JVMs Per VM, 2 VMs Per Host

–  Sized for current workload, 800GB heap space
–  Deployed 25 JVMs on each VM, each JVM is 1GB
–  Accounting for JVM off-the-heap overhead
•  25GB*1.25=31.25GB
–  Add Guest OS 1GB
•  31.25+1=32.25GB
–  16 hosts
–  25 unique REST services
•  Each REST service deployed in its own JVM
•  Original call paradigm has not changed

50 JVMs to 12 cores: this may be an issue. While the CPU utilization is originally at 15%, you can assume 30%+ CPU utilization is the new level. However, in actual fact response time may suffer significantly due to coinciding GC cycles that can cause CPU contention.

THIS SOLUTION IS NOT GREAT BUT IT'S THE LEAST INTRUSIVE

NOTE: We had to use 16 hosts, as the 8 hosts in the 400GB case already had 50 JVMs per host, which is significant

Solution 2 – Virtualize 25 REST : 1 JVM with 1 JVM Per VM, 2 VMs Per Host

25 JVMs, 1 REST per JVM

On 1 VM


[Diagram: before, a load balancer layer in front of JVMs R1 to R25, one REST service per JVM; after, a load balancer layer in front of JVMs that each host all of R1 to R25]

3x better response time using this approach

Solution 2 – Virtualize 25 REST : 1 JVM with 1 JVM Per VM, 2 VMs Per Host

[Diagram: a load balancer layer in front of two VMs; each VM holds Guest OS memory and one JVM (Initial Heap, Perm Gen, Java Stack) hosting R1 to R25]

All the REST transactions across the 25 services run within one JVM instance

Solution 2 – Virtualize 25 REST : 1 JVM with 1 JVM Per VM, 2 VMs Per Host

Columns: Description | Today’s Traffic Load | Future Traffic Load (2x current load) | Comment

VM Size (theoretical ceiling, NUMA optimized)
•  Today: [96-{(96*0.02)+1}]/2 = 46.5GB
•  Future: [96-{(96*0.02)+1}]/2 = 46.5GB
•  Comment: Using the NUMA overhead equation, this VM of 46.5GB and 6 vCPU will be NUMA local

VM Size for Prod
•  Today: 46.5*0.95=44.2GB
•  Future: 46.5*0.95=44.2GB

JVM Heap Allowable
•  Today: (44.2-1)/1.25=34.56GB
•  Future: (44.2-1)/1.25=34.56GB
•  Comment: Working backwards from the NUMA size of 44.2, minus 1GB for Guest OS, and then accounting for 25% JVM overhead by dividing by 1.25

Number of JVMs needed
•  Today: 400/34.56=11.57 => 12 JVMs
•  Future: 800/34.56=23.15 => 24 JVMs
•  Comment: Total heap needed divided by how much heap can be placed in each NUMA node

Number of Hosts
•  Today: 6
•  Future: 12
•  Comment: 1 JVM per VM, 1 VM per NUMA node

Solution Highlights
•  Today: 6 hosts used instead of 16, 62.5% less hardware and hence reduced licensing cost (3x better response time)
•  Future: 12 hosts vs. what would have been 32 hosts (16*2), a 62.5% saving, or 25% if you take 16 hosts as the base
•  Comment: The 12 host solution handles 2x the current traffic with 25% less hardware than the existing 16 hosts that are handling x load
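The chain of calculations in the table can be reproduced with a few lines of Python. This is a sketch only: the function names are invented here, and the 2% + 1GB overhead, 0.95 production factor, 1GB guest OS, and 1.25 JVM overhead are simply the constants used in the table.

```python
import math


def numa_vm_ceiling_gb(host_ram_gb, sockets):
    """NUMA overhead equation from the table: [RAM - (RAM*0.02 + 1)] / sockets."""
    overhead_gb = host_ram_gb * 0.02 + 1
    return (host_ram_gb - overhead_gb) / sockets


def allowable_heap_gb(vm_gb, guest_os_gb=1, jvm_overhead=1.25):
    """Work backwards from VM size: subtract guest OS, divide by JVM overhead."""
    return (vm_gb - guest_os_gb) / jvm_overhead


ceiling = numa_vm_ceiling_gb(96, 2)   # theoretical NUMA-local VM size
prod_vm = ceiling * 0.95              # production VM size with 5% headroom
heap = allowable_heap_gb(prod_vm)     # largest heap that fits the VM

jvms_today = math.ceil(400 / heap)    # 400GB of heap needed today
jvms_future = math.ceil(800 / heap)   # 800GB for 2x the load
hosts_today = math.ceil(jvms_today / 2)    # 1 JVM per VM, 2 VMs per host
hosts_future = math.ceil(jvms_future / 2)

print(f"VM ceiling {ceiling:.1f}GB, prod VM {prod_vm:.1f}GB, heap {heap:.1f}GB")
print(f"today: {jvms_today} JVMs on {hosts_today} hosts")
print(f"future: {jvms_future} JVMs on {hosts_future} hosts")
```

Because the sketch does not round intermediate values, the heap comes out a few hundredths of a GB above the table's 34.56GB; the JVM and host counts are unaffected.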

Cluster Layout for Solution 2

Current arch uses 16 hosts for servicing 400GB heap

Improved to-be arch uses 6 hosts for servicing 400GB heap

•  This solution uses 62.5% less hardware
•  3x better response time
•  Substantial software license saving
•  Huge potential for further scalability

Solution 2 – Virtualize 25 REST : 1 JVM with 1 JVM Per VM, 2 VMs Per Host

•  JVM Max Heap -Xmx (34.5GB), Initial Heap -Xms (34.5GB); all REST services in one heap
•  Perm Gen -XX:MaxPermSize (1GB)
•  Java Stack -Xss per thread (256k*1000); increase the thread pool to 1000 to take on more load, since the heap is much larger
•  Other mem (=1GB)
•  Total JVM Memory (max=42.5GB, min=37.5GB)
•  Guest OS Memory: 1GB used by OS
•  VM Memory (43.5GB); set memory reservation to 43.5GB

XYZCars.com – External Apps Platform's Middleware Services is an Example of Microservices Architecture

Current arch uses 16 hosts for servicing 400GB heap

–  25 unique REST services
–  Xyzcars.com deployed each REST service on a dedicated JVM
–  Microservices approach
•  Micro defined
•  Micro deployed
•  Costly architecture and poor performance
•  Offers ultimate flexibility, but this is not practical

XYZCars.com – External Apps Platform's Middleware Services is an Example of Microservices Architecture

Improved to-be arch uses 6 hosts for servicing 400GB heap

•  25 unique REST services
•  1 JVM has 25 REST service instances
•  Microservices approach
•  Micro defined
•  Macro deployed
•  Cost efficient
•  High performing
•  Good enough flexibility that is practical

Approach 1 - Micro Defined & Micro Deployed

[Diagram: Rest1, Rest2, Rest3 … Rest25, each in its own container]

Fragmented scale-out consumes more resources, more VMs, and has poor response time

25 container types, 400 container instances

Approach 2 - Micro Defined Microservices BUT MACRO Deployed

[Diagram: containers that each hold Rest1, Rest2 … Rest25]

Non-fragmented scale-out consumes fewer resources, fewer VMs, and has GREAT response time

1 container type, 12 container instances

3rd Platform?

§  2nd platforms are stateful, need lots of care, and are supposedly difficult to change
§  3rd platforms are all around microservices/macroservices concepts: independent software services, called in a sequence to formulate overall application logic
•  Touted as stateless: if you lose one, it doesn’t matter, you always have another copy somewhere else
•  Supposedly easy to re-deploy/flexible, but this is not always the case

[Diagram: 2nd Platform with Web tier, App tier, and DB tier; 3rd Platform with Load Balancer, Authentication, Session Store, Licensing, Monitoring, Provisioning, DNS, Content, Database x3, Web Server x3, …]