building and tuning high performance java platforms
TRANSCRIPT
SPRINGONE2GX WASHINGTON, DC
Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
Building and Tuning High Performance Java Platforms
By Emad Benjamin, Principal Engineer at VMware @vmjavabook
Speaker Bio: Emad Benjamin, [email protected]
Graduated with BE, Published undergraduate thesis
1993
Independent consultant On C++ and Java, Open source contributions
1994 -2005
VMware IT - virtualized all Java systems
2005-2010
2010-2012
Tech lead for vFabric Reference Architecture http://tinyurl.com/mvtyoq7
2013 2015
EA2 LiVefire Trainer, VMworld, UberConf, Spring1, PEX, Architecture Conf Presenter
Blog: vmjava.com @vmjavabook
2014
Java Platforms Customer Experience from Around the world
Munich Milan Paris
Prague Beijing Moscow Warsaw
Atlanta Dubai Shanghai Tianjin
Java Platforms Customer Experience from Around the world
Kuala Lumpur
Sydney
Riyadh Barcelona
New York Chicago San Francisco
How We Got Here? What is the Problem?
Will Third Platforms, Microservices and 12Factor Apps save us?
It's not that CIOs want to be out of the Infrastructure business… They just want someone to show them how to break the silos
(1) 2015 PwC CEO Survey; (2) 2013 Bain & Company - The Value of Big Data; (3) 2014 IT Infrastructure Conversation - IBM; (4) Ernst & Young - 2014 Enterprise IT Trends and Investments; (5) 2014 Riverbed Technologies - The Transformers; (6) 2014 ElasticHosts CIO Study
44% of new applications failed to meet performance expectations (5)
90% of companies allocate at least 2X more cloud capacity than needed to ensure performance (6)
Information silos between Business, Application Teams, and Infrastructure Teams lead to excessive over-provisioning of hardware in order to meet SLAs
How We Got Here? What is Really Broken?
Infrastructure Business Apps/Data
Comms Gap
Comms Gap
Big Data, Massively Scalable Apps, Internet Of Things – a new transformation is upon us
New Requirements New Capabilities Giving Rise to New Data-Driven Competitors
Massive-scale apps and Mobile
New data types and sources
Internet of Things
Predictive analytics
In-memory processing
Hadoop storage
Enterprise Architecture Defined
• Enterprise architecture is a comprehensive framework used to manage and align an organization's Information Technology (IT) assets, people, operations, and projects with its operational characteristics.
• Many standards exist today (http://www.opengroup.org/standards/ea); TOGAF and Zachman, to name a couple.
Business Process IT Services
Apps
Data
Ops/Infrastructure
Governance, Architecture, Program and Project Mgmt.
CIO
Business Sponsors
VP Apps
VP Ops
Enterprise Architect
Achieve Contracted SLA to Business Sponsor
Enterprise Application Architecture with Platform
Engineering Focus
Platform Engineering
Business Process IT Services
Apps
Data
Ops/Infrastructure
Governance, Architecture, Program and Project Mgmt.
CIO
Business Sponsors
VP Apps
VP Ops
Enterprise Architect
Achieve Contracted SLA to Business Sponsor
Platform Engineering Focus
Platform Engineering: an intersection of three disciplines
Most misunderstood discipline. Developers size (or wrongly size) this, but Ops own it. A battle is brewing over who seeks control
Did you Say Cloud Native? You mean Platform Engineered, right!?
Or Does Native Imply All Your Platform Problems are Natively and Magically Fixed :)
Platform Engineering Focuses on Top Down Approach
• Top-Down Design
• Focuses on starting at the application layer
• Understanding the application use cases and how to best map them onto appropriate infrastructure
• Top down application platform design, but with bottom-up infrastructure build-out, once you know what the app needs
External Application Platforms
Data warehousing and BI
EDW ODS Datamart Dashboards – Business Intelligence
Internal Application Platforms
External Application Platforms External Apps
Internal Apps
Data Warehousing
Web Portals
Middleware Services
Databases Batch/Data Movement
Storage Network OS Virtualization Infrastructure Operations
Top Down Analysis and Design
• People + Process + Technology(Apps + Infra) = Robust
Platform Engineering
Understanding SLAs – Is All About Robust Platform Engineering
Platform Engineer
EA2
Developer
Deployment (JVM/App Runtime)
Infrastructure
People
Process
Technology
Isn’t this DevOps?
§ The DevOps playbook shouldn't be just about the Continuous Integration (CI) use case
§ But about how to understand the best way to build application platforms
• Need to understand sizing
• Wiring of app components
• Performance, what gets deployed where
• Is it scale-out or scale-up or a mixture
Key Fundamentals of Java Platforms Platform Engineering Rules Defined
Rule #1 Java Platform JVM Sizing Rule #1 –
Understanding Memory Sizing
HotSpot JVMs on VMware vSphere
JVM Max Heap -Xmx
JVM Memory
Perm Gen
Initial Heap
Guest OS Memory
VM Memory
-Xms
Java Stack -Xss per thread
-XX:MaxPermSize
Other mem
Direct native Memory “off-the-heap”
Non Direct Memory “Heap”
HotSpot JVMs on vSphere
• Guest OS Memory approx 1G (depends on OS/other processes)
• Perm Size is an area additional to the -Xmx (Max Heap) value and is not GC-ed because it contains class-level information.
• “other mem” is additional memory required for NIO buffers, JIT code cache, classloaders, socket buffers (receive/send), JNI, GC internal info
• If you have multiple JVMs (N JVMs) on a VM then:
– VM Memory = Guest OS memory + N * JVM Memory
VM Memory = Guest OS Memory + JVM Memory
JVM Memory = JVM Max Heap (-Xmx value) + JVM Perm Size (-XX:MaxPermSize) + NumberOfConcurrentThreads * (-Xss) + “other Mem”
JVMs Sizing Example
JVM Max Heap -Xmx (4096m)
JVM Memory (4588m)
Perm Gen
Initial Heap
Guest OS Memory
VM Memory (5088m)
-Xms (4096m)
Java Stack -Xss per thread (256k*100)
-XX:MaxPermSize (256m)
Other mem (=217m)
500m used by OS
Set memory reservation to 5088m
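The sizing formula above can be sketched as a small calculator. This is only an illustration under stated assumptions: the class and method names are hypothetical, all sizes are in megabytes, and thread stacks are rounded down to whole megabytes. With the slide's inputs the sum lands within a few megabytes of the 4588m and 5088m shown; the slide does not say how it rounded.

```java
/** Sketch of the slide's memory-sizing formula; names are hypothetical, sizes in MB. */
final class JvmMemorySizing {

    // JVM Memory = max heap (-Xmx) + perm gen (-XX:MaxPermSize)
    //            + concurrent threads * stack size (-Xss) + "other mem"
    static long jvmMemoryMb(long maxHeapMb, long permGenMb,
                            int threads, long stackKb, long otherMb) {
        long stacksMb = (threads * stackKb) / 1024; // thread stacks, KB -> MB (rounded down)
        return maxHeapMb + permGenMb + stacksMb + otherMb;
    }

    // VM Memory = guest OS memory + N * JVM memory (N JVMs on one VM)
    static long vmMemoryMb(long guestOsMb, int jvmCount, long jvmMemoryMb) {
        return guestOsMb + jvmCount * jvmMemoryMb;
    }

    public static void main(String[] args) {
        // Inputs from the sizing example: 4096m heap, 256m perm gen,
        // 100 threads at 256k each, 217m other memory, 500m guest OS.
        long jvm = jvmMemoryMb(4096, 256, 100, 256, 217);
        long vm = vmMemoryMb(500, 1, jvm);
        System.out.println("JVM memory: " + jvm + "m");
        System.out.println("VM memory (reservation): " + vm + "m");
    }
}
```

The VM figure is what the slide suggests setting as the memory reservation.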
Java Platform JVM Sizing Rule #1 - Understanding Memory Sizing
• If JVM Heap (-Xmx) is N then JVM Memory is (1.1 to 1.25) * N. For example, if heap is 4GB then JVM Memory = (1.1 to 1.25) * 4 = 4.4 to 5GB
• Always use memory-based sizing, so if a system needs 400GB made from 100 JVMs of 4GB each, then this is the capacity you are working to. CPU consumption will be secondary
VM Memory = Guest OS Memory + JVM Memory
JVM Memory = JVM Max Heap (-Xmx value) + JVM Perm Size (-XX:MaxPermSize) + NumberOfConcurrentThreads * (-Xss) + “other Mem”
Rule #2 Java Platform Multi-tier Sizing Rule #2 - Decisions In One Tier Impacts the Next
Application Platforms Are Multi-Tier
• Java Platforms are multitier and multi org
DB Servers Java Applications
Load Balancer Tier
Load Balancers Web Servers
IT Operations Network Team
IT Operations Server Team
IT Apps – Java Dev Team
IT Ops & Apps Dev Team
Organizational Key Stakeholder Departments
Web Server Tier Java App Tier DB Server Tier
DB Servers
Load Balancer Tier Web Server Tier
Java App Tier
DB Server Tier
Web Server Pool
App Server Pool
DB Connection Pool
Html static lookup requests (load on webservers)
Dynamic Request to DB, create Java Threads (load on Java App server and DB)
Rule #2 Java Platform Multi-tier Sizing - Decisions In One Tier Impacts the Next
§ Sizing of each tier has an impact on the next tier; scaling up and/or out of one tier has a downstream impact on the next tier. You need to expand each tier proportionally to avoid bottlenecks or throttling
§ Use separate load balancer pools to isolate functional application groups
• A public portal's order management app would have its own load balancer pool, e.g. order-mgmt-pool
• Microservices and/or middleware services would have their own load balancer pool. This would mean the public portal order-mgmt-pool would load balance the microservices calls via microservices-pool
• The separate pools will help you measure traffic on each functional application pool group, and hence determine the inter-tier traffic volume between the multiple tiers.
§ Count the total number of threads of each application container and determine what the healthy thread to DB connection ratio is
• If your application exhibits deadlocks, try increasing DB connections to at least nThreads+1 until you can resolve the code issue. Note: nThreads corresponds to maxThreads in the Tomcat config.
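The thread-to-connection bookkeeping in the last point can be sketched as follows. The container counts and pool sizes are hypothetical values chosen for illustration, and the class is not part of any library; only the nThreads+1 floor comes from the slide.

```java
/** Sketch: thread vs. DB connection bookkeeping for one app tier (hypothetical values). */
final class DbConnectionSizing {

    // Total request threads across all application containers in the tier.
    static int totalThreads(int[] maxThreadsPerContainer) {
        int sum = 0;
        for (int t : maxThreadsPerContainer) {
            sum += t;
        }
        return sum;
    }

    // Threads per DB connection; track this while load testing to find a healthy ratio.
    static double threadToConnectionRatio(int totalThreads, int dbConnections) {
        return (double) totalThreads / dbConnections;
    }

    // Interim floor from the slide while a deadlock is being fixed in code:
    // at least nThreads + 1 connections for a container with nThreads request threads.
    static int interimPoolSize(int nThreads) {
        return nThreads + 1;
    }

    public static void main(String[] args) {
        int[] containers = {200, 200, 200, 200}; // assumed maxThreads per container
        int threads = totalThreads(containers);
        System.out.println("Total threads: " + threads);
        System.out.println("Ratio with 400 connections: " + threadToConnectionRatio(threads, 400));
        System.out.println("Interim pool size per container: " + interimPoolSize(200));
    }
}
```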
Rule #3 Java Platforms and Understanding Various
Workload Categories
Java Platform Categories – Category 1
• Smaller JVMs < 4GB heap, 4.5GB Java process, and 5GB for VM
• vSphere hosts with <96GB RAM are more suitable, as by the time you stack the many JVM instances, you are likely to reach the CPU boundary before you can consume all of the RAM. For example, if instead you chose a vSphere host with 256GB RAM, then 256/4.5GB => 57 JVMs; this would clearly reach the CPU boundary
• Multiple JVMs per VM
• Use Resource pools to manage different LOBs
• Consider using 4 socket servers to get more cores
• (many smaller JVMs)
Category 1: 100s to 1000s of JVMs
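A rough sketch of the stacking arithmetic for Category 1 follows. It assumes a hypothetical 16-core host and ignores guest OS and hypervisor overhead; the slide rounds 256/4.5 up to 57, while this sketch rounds down to whole JVMs.

```java
/** Sketch: how many small JVMs fit on a host by RAM, and the resulting JVM-to-core ratio. */
final class JvmStacking {

    // Whole JVM processes that fit in host RAM (guest OS and hypervisor overhead ignored).
    static int jvmsThatFit(double hostRamGb, double jvmProcessGb) {
        return (int) Math.floor(hostRamGb / jvmProcessGb);
    }

    // JVMs competing for each physical core; a high value suggests the CPU boundary comes first.
    static double jvmsPerCore(int jvms, int cores) {
        return (double) jvms / cores;
    }

    public static void main(String[] args) {
        int cores = 16; // assumed core count, not taken from the slides
        for (double ramGb : new double[] {96, 256}) {
            int jvms = jvmsThatFit(ramGb, 4.5);
            System.out.println(ramGb + "GB host: " + jvms + " JVMs, "
                    + jvmsPerCore(jvms, cores) + " JVMs per core");
        }
    }
}
```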
Java Platform Categories – Category 1
• Consider using 4 socket servers instead of 2 sockets to get more cores
Use 4 socket servers to get more cores
Category 1: 100s to 1000s of JVMs
External Application
Public Portal App
Middleware Services
Category-1
Java Platform Categories – Category 2
• Fewer JVMs < 20
• Very large JVMs, 32GB to 128GB
• Always deploy 1 VM per NUMA node and size to fit perfectly
• 1 JVM per VM
• Choose 2 socket vSphere hosts, and install ample memory, 128GB to 512GB
• Examples are in-memory databases, like SQLFire and GemFire
• Apply latency-sensitive best practices: disable interrupt coalescing on pNIC and vNIC
• Dedicated vSphere cluster
• fewer larger JVMs
Category 2: a dozen of very large JVMs
Use 2 socket servers to get larger NUMA nodes
External Application
Public Portal App
Middleware Services
Category-2
Java Platform Categories – Category 3
• Many Smaller JVMs Accessing Information From Fewer Large JVMs
Category 3: Category-1 accessing data from Category-2
Resource Pool 1 Gold LOB 1
Resource Pool 2 Silver LOB 2
Category-3
Rule #3 Java Platforms and Understanding Various Workload Categories
• If categories are ignored then either SLAs are missed and/or you are excessively provisioning infrastructure
• Category 3 is a golden category; every enterprise has one, or will eventually have one
– Category 3 is when second-gen web applications access a third-platform, microservices type of platform
– Third-platform-only systems are not practical without a certain level of second-gen interaction
• Category 1 is the most common in the world presently, and suffers from fragmentation of JVMs
– To avoid too many JVMs, consider a minimum of 4GB heap sizes, “4GB is the new 1GB”
– Scale up before scale-out; need adequate scale-out of minimum n+1, where n=2
– Scale-out JVMs are expensive on administration, CPU, and performance
– CPU bound system because of the large number of JVMs and hence the excessive GC cycles (4 socket systems will work better for these)
• Category 2
– In-memory DBs: avoid too many JVMs, prefer large JVMs, but don’t exceed the NUMA boundary
– Messaging systems, microservices, and other back office data providers can be treated this way
– Memory bound (2 socket systems will work better)
Rule #4 Establish a Building Block VM and JVM
Design and Sizing of Application Platforms
Step 1 – Establish Load Profile
• From production logs/monitoring reports measure:
– Concurrent Users
– Requests Per Second
– Peak Response Time
– Average Response Time
– Establish your response time SLA
Step 2 – Establish Benchmark
• Iterate through the benchmark test until you are satisfied with the load profile metrics and your intended SLA
– After each benchmark iteration you may have to adjust the application configuration
– Adjust the vSphere environment to scale out/up in order to achieve your desired number of VMs, vCPU and RAM configurations
Step 3 – Size Production Env.
• The size of the production environment would have been established in Step 2; hence you either roll out the environment from Step 2 or build a new one based on the numbers established
Step 2 – Establish Benchmark
DETERMINE HOW MANY VMs
Establish Horizontal Scalability
Scale Out Test
• How many VMs do you need to meet your Response Time SLAs without reaching 70%-80% saturation of CPU?
• Establish your horizontal scalability factor before bottlenecks appear in your application
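One way to read the scale-out test is as a search over benchmark iterations for the smallest VM count that meets the response time SLA while staying under the CPU ceiling. The sketch below assumes that reading; the result values, the 200ms SLA, and the class names are hypothetical and are not measurements from the talk.

```java
/** Sketch: pick the smallest building-block VM count that passes the scale-out test. */
final class ScaleOutTest {

    /** One benchmark iteration (hypothetical measurements). */
    static final class Result {
        final int vms;
        final double responseMs;
        final double peakCpuPct;

        Result(int vms, double responseMs, double peakCpuPct) {
            this.vms = vms;
            this.responseMs = responseMs;
            this.peakCpuPct = peakCpuPct;
        }
    }

    // Returns the smallest VM count that meets the SLA below the CPU ceiling, or -1 if none does.
    static int smallestPassing(Result[] results, double slaMs, double cpuCeilingPct) {
        int best = -1;
        for (Result r : results) {
            boolean passes = r.responseMs <= slaMs && r.peakCpuPct < cpuCeilingPct;
            if (passes && (best == -1 || r.vms < best)) {
                best = r.vms;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Result[] runs = {
            new Result(2, 410, 92), new Result(4, 240, 78),
            new Result(6, 180, 66), new Result(8, 170, 51)
        };
        // Assumed SLA of 200ms and the lower end of the slide's 70%-80% CPU ceiling.
        System.out.println("VMs needed: " + smallestPassing(runs, 200, 70));
    }
}
```

A result of -1 corresponds to the "investigate the bottlenecked layer" branch of the flow: no amount of scale-out tested so far meets the SLA.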
Scale Out Test
Building Block VM Building Block VM
SLA OK?
Test complete
Investigate bottlenecked layer: Network, Storage, Application Configuration, & vSphere
If scale out bottlenecked layer is removed, iterate scale out test
If building block app/VM config problem, adjust & iterate No
Building Block VM
Building Block VM
ESTABLISH BUILDING BLOCK VM
Establish Vertical Scalability
Scale Up Test
• Establish how many JVMs on a VM?
• Establish how large a VM would be in terms of vCPU and memory
Building Block VM
JVM Max Heap -Xmx (30g)
Perm Gen
Initial Heap
Guest OS Memory
-Xms (30g)
Java Stack -Xss per thread (1M*500)
-XX:MaxPermSize (0.5g)
Other mem (=1g)
0.5-1g used by OS
Set memory reservation to 34g
JVM Memory for SQLFire (32g)
VM Memory for SQLFire (34g)
Larger JVMs for In-Memory Data Grids
96 GB RAM on Server
Each NUMA Node has 94/2 45GB
8 vCPU VMs less than 45GB RAM on each VM
If VM is sized greater than 45GB or 8 CPUs, Then NUMA interleaving Occurs and can cause 30% drop in memory throughput performance
ESXi Scheduler
Most Common Sizing and Configuration Question
JVM-1
JVM-2
JVM-1A
JVM-1
JVM-2
JVM-1
JVM-2
JVM-2A
JVM-3
JVM-4 Option-1 Scale out VM and JVM
Option-2 Scale Up JVM heap size
JVM-2
JVM-1
Option-3 Scale up VM with multiple JVM instances
2GB 2GB 2GB 2GB
2vCPU 2vCPU 2vCPU 2vCPU
2vCPU 2vCPU
4GB 4GB
Comparing Scenario-1 of 4 JVMs vs. Scenario-2 of 2 JVMs
• Scenario-1 4 JVMs of 1GB Heap on each, Average R/T 166ms
• Scenario-2 2 JVMs of 2GB Heap on each, Average R/T 123ms
[Chart: scenario-1 RT vs. scenario-2 RT across samples 1 to 25, y-axis 0 to 600]
Scenario-2 has 26% better response time
Comparing 4 JVMs vs. 2 JVMs
Scenario-2 (2 JVMs, 2GB heap each) has 60% less CPU utilization than scenario-1
Scenario-1 (4JVMs, 1GB Heap Each)
What else to consider when sizing?
• Mixed workloads: Job Scheduler vs. Web app require different GC tuning
• Job Schedulers care about throughput
• Web apps care about minimizing latency and response time
• You can’t have both reduced response time and increased throughput without compromise
• Separate the concerns for optimal tuning
Job
Web
JVM-1
Job
Web
JVM-2
Job
Web
Job
Web
JVM-3
Job
Web
JVM-4
Vertical
Horizontal
What is the practical limit for JVM Memory sizing?
64 bit Java theoretical limit
Guest OS limit ESXi6 limit
Physical Server practical limit
Per NUMA RAM
Most limiting practical sizing factor is the per NUMA node RAM
16 Exa Bytes
1 to 16 TB 128vCPU, 4TB
~256GB to 1TB RAM
1st limit 2nd limit 3rd limit 4th limit 5th limit
vSphere maximums: https://www.vmware.com/pdf/vsphere6/r60/vsphere-60-configuration-maximums.pdf
Rule #4 Establish a Building Block VM and JVM
• It's important to establish a well-advertised building block VM and JVM. You may have to have one building block per application group; some have 3 building blocks: small, medium, large, etc.
– Build new systems based on the building block VM/JVM
• Don’t mix application types within the same building block, but if you do then you have to test it and understand lifecycle impacts. One app will cause downtime to the next app if the JVM is restarted
• Minimize the number of JVMs and avoid the costly administration of JVM sprawl
• Understand what the largest JVM heap you can afford is; understand the NUMA boundary
• Don’t mix workload types, like a portal app with a batch system. Portal apps care about response time, hence are latency sensitive, while batch jobs care about memory throughput. The 2 workload behaviors are perpendicular in nature and will cause you to compromise the tuning of the platform.
– You can specialize the JVM per workload type too.
“It’s not that I’m so smart, it’s just that I stay with problems longer.”
GC Tuning Deep Dive
Which GC?
• VMware doesn’t care which GC you select, because of the degree of independence of Java to OS and OS to Hypervisor
Tuning GC – Art Meets Science!
• Either you tune for Throughput or reduction of Latency, one at the cost of the other
Increase Throughput
Reduce Latency Tuning
Decisions
• improved R/T • reduce latency impact • slightly reduced throughput
• improved throughput • longer R/T • increased latency impact
Job
Web
Sizing The Java Heap
JVM Max Heap -Xmx
(4096m)
Eden Space
Survivor Space 2
Old Generation
Survivor Space 1
Slower Full GC
Quick Minor GC
YoungGen -Xmn
(1350m)
OldGen 2746m
Inside the Java Heap
Parallel Young Gen and CMS Old Gen
application threads minor GC threads concurrent mark and sweep GC
Young Generation Minor GC Parallel GC in YoungGen using -XX:+UseParNewGC & -XX:ParallelGCThreads
-Xmn
Old Generation Major GC Concurrent GC in OldGen using -XX:+UseConcMarkSweepGC
Xmx minus Xmn
S0
S1
High Level GC Tuning Recipe
Measure Minor GC Duration and Frequency
Adjust -Xmn Young Gen size and/or ParallelGCThreads
Measure Major GC Duration And Frequency
Adjust Heap space -Xmx
Adjust -Xmn And/or SurvivorSpaces
Step A-Young Gen Tuning
Step B-Old Gen Tuning
Step C- Survivor Spaces Tuning
Applies to Category-1 and 2 Platforms
Applies to Category-2 Platforms
Why is Duration and Frequency of GC Important?
Young Gen Minor GC
Old Gen Major GC
Young Gen minor GC duration
frequency frequency
Old Gen GC duration
We want to ensure regular application user threads get a chance to execute in between GC activity
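Duration and frequency can be read from inside a running JVM with the standard java.lang.management API. The sketch below is one possible way to do that; the class name and output format are illustrative. Note that getCollectionTime() reports accumulated elapsed collection time, which for a mostly concurrent collector is not the same as application pause time, so GC logs remain the more precise source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Sketch: report how often each collector ran and how long a collection took on average. */
final class GcStats {

    // Average time of one collection in ms; 0 when nothing has been collected yet.
    static double averageMs(long collections, long totalMs) {
        return collections > 0 ? (double) totalMs / collections : 0.0;
    }

    // Collections per minute over the JVM's uptime; 0 when either value is unknown.
    static double perMinute(long collections, long uptimeMs) {
        return (collections > 0 && uptimeMs > 0) ? collections * 60_000.0 / uptimeMs : 0.0;
    }

    static String summarize() {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();  // -1 if the collector does not report it
            long totalMs = gc.getCollectionTime(); // accumulated elapsed time, -1 if unknown
            sb.append(String.format("%s: count=%d, total=%dms, avg=%.1fms, freq=%.2f/min%n",
                    gc.getName(), count, totalMs,
                    averageMs(count, totalMs), perMinute(count, uptimeMs)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(summarize());
    }
}
```

Sampling this before and after a change to -Xmn or -Xmx gives a quick read on whether minor or major GC moved in the expected direction.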
Impact of Increasing Young Generation (-Xmn)
Young Gen Minor GC
Old Gen Major GC
less frequent Minor GC but longer duration
potentially increased Major GC frequency
You can mitigate the increase in GC frequency by increasing -Xmx
You can mitigate the increase in Minor GC duration by increasing ParallelGCThreads
Impact of Reducing Young Generation (-Xmn)
Young Gen Minor GC
Old Gen Major GC
more frequent Minor GC but shorter duration
Potentially increased Major GC duration
You can mitigate the increase in Major GC duration by decreasing -Xmx
Survivor Spaces
• Survivor Space Size = -Xmn / (-XX:SurvivorRatio + 2)
– Decrease in Survivor Ratio causes an increase in Survivor Space Size
– Increase in Survivor Space Size causes Eden space to be reduced, hence
• Minor GC frequency will increase
• More frequent Minor GC causes objects to age quicker
• Use -XX:+PrintTenuringDistribution to measure how effectively objects age in survivor spaces.
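The survivor space formula can be checked with a few lines of arithmetic. This is a sketch with hypothetical class and method names; sizes are in megabytes and integer division is used, so the JVM's own alignment of generation sizes may give slightly different numbers.

```java
/** Sketch of the young generation layout arithmetic from the slides (sizes in MB). */
final class YoungGenLayout {

    // Survivor Space Size = -Xmn / (-XX:SurvivorRatio + 2)
    static long survivorMb(long xmnMb, int survivorRatio) {
        return xmnMb / (survivorRatio + 2);
    }

    // Eden is what remains of the young generation after the two survivor spaces (S0 and S1).
    static long edenMb(long xmnMb, int survivorRatio) {
        return xmnMb - 2 * survivorMb(xmnMb, survivorRatio);
    }

    // Old generation = -Xmx minus -Xmn
    static long oldGenMb(long xmxMb, long xmnMb) {
        return xmxMb - xmnMb;
    }

    public static void main(String[] args) {
        long xmnMb = 10 * 1024; // -Xmn10g, as in the CMS examples
        for (int ratio : new int[] {8, 6}) {
            System.out.println("SurvivorRatio=" + ratio
                    + ": survivor=" + survivorMb(xmnMb, ratio) + "m"
                    + ", eden=" + edenMb(xmnMb, ratio) + "m");
        }
        System.out.println("OldGen for -Xmx4096m -Xmn1350m: " + oldGenMb(4096, 1350) + "m");
    }
}
```

Lowering the ratio from 8 to 6 grows each survivor space and shrinks Eden, which is the trade-off the bullets describe.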
Decrease Survivor Spaces by Increasing Survivor Ratio
Young Gen Minor GC
Old Gen Major GC
more frequent Minor GC but shorter duration
Hence Minor GC frequency is reduced with slight increase in minor GC duration
S0 S1 S0
S1
Reduce Survivor Space
Increasing Survivor Ratio Impact on Old Generation
Young Gen Minor GC
Old Gen Major GC
S0
S1
Increased tenuring/promotion to Old Gen
hence increased Major GC
CMS Collector Example
• Full GC every 2hrs and overall heap utilization down by 30%
java -Xms50g -Xmx50g -Xmn16g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8 -XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=6 -XX:+OptimizeStringConcat -XX:+UseCompressedStrings -XX:+UseStringCache
Parallel Young Gen and CMS Old Gen
Young Gen Minor GC
Old Gen Major GC
Parallel/Throughput GC in YoungGen using -XX:+UseParNewGC and -XX:ParallelGCThreads
Concurrent GC in OldGen using -XX:+UseConcMarkSweepGC
Application user threads
Minor GC threads Concurrent Mark and Sweep
CMS Collector Example
• Customer chose not to use LargePages:
– They were content with the performance they had already achieved and did not want to make OS level changes that may impact the amount of total memory available to other processes that may or may not be using LargePages.
• The -XX:+UseNUMA JVM option also does not work with -XX:+UseConcMarkSweepGC
• An alternative would be to experiment with
• numactl --cpunodebind=0 --membind=0 myapp
• However, we found ESX NUMA locality algorithms were doing great at localizing and did not need further NUMA tuning.
CMS Collector Example
java -Xms30g -Xmx30g -Xmn10g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8 -XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=4 -XX:+UseCompressedOops -XX:+OptimizeStringConcat -XX:+UseCompressedStrings -XX:+UseStringCache
§ This JVM configuration scales up and down effectively
§ -Xmx=-Xms, and -Xmn 33% of -Xmx
§ -XX:ParallelGCThreads= minimum 2 but less than 50% of available vCPU to the JVM. NOTE: Ideally use it for VMs with 4 vCPUs or more; if used on 2vCPU VMs, drop the -XX:ParallelGCThreads option and let Java select it
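The scaling rules above can be sketched as a small flag builder. Two assumptions are made here and should be treated as such: "less than 50% of available vCPU" is read as "at most half", which matches the 8 vCPU and 20 vCPU examples in the deck, and the class itself is hypothetical. The flags apply to the older HotSpot releases the slides target; CMS has since been removed from newer JDKs.

```java
/** Sketch: derive the heap and GC-thread flags from heap size and vCPU count. */
final class CmsFlagSketch {

    static String flags(int heapGb, int vcpus) {
        int youngGb = Math.max(1, heapGb / 3); // -Xmn at about 33% of -Xmx
        StringBuilder sb = new StringBuilder();
        sb.append("-Xms").append(heapGb).append("g ")   // -Xms equal to -Xmx
          .append("-Xmx").append(heapGb).append("g ")
          .append("-Xmn").append(youngGb).append("g ")
          .append("-XX:+UseConcMarkSweepGC -XX:+UseParNewGC");
        if (vcpus >= 4) {
            // Assumption: minimum 2, at most half of the vCPUs available to the JVM.
            sb.append(" -XX:ParallelGCThreads=").append(Math.max(2, vcpus / 2));
        }
        // Below 4 vCPUs the option is dropped so the JVM selects the thread count itself.
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(flags(30, 8));
        System.out.println(flags(4, 2));
    }
}
```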
Another Example (360GB JVM)
• A monitoring system that does not scale out, runs in a large single JVM of -Xmx360g, i.e. 360GB
• The server has 512GB and 2 sockets of 10 cores each
• 360GB + 1GB for OS + 25% * 360GB for off-the-heap overhead
– => 360GB + 1GB + 90GB => 451GB is the VM's memory reservation
• The VM has 20 vCPUs
java -Xms360g -Xmx360g -Xmn10g -Xss1024k -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8 -XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=10 -XX:+OptimizeStringConcat -XX:+UseCompressedStrings -XX:+UseStringCache -XX:+DisableExplicitGC -XX:+AlwaysPreTouch
IBM JVM - GC Choice
-Xgcpolicy mode | Usage | Example
-Xgcpolicy:optthruput (Default, WAS 6 and 7) | Performs the mark and sweep operations during garbage collection when the application is paused to maximize application throughput. Mostly not suitable for multi CPU machines. | Apps that demand a high throughput but are not very sensitive to the occasional long garbage collection pause
-Xgcpolicy:optavgpause | Performs the mark and sweep concurrently while the application is running to minimize pause times; this provides best application response times. There is still a stop-the-world GC, but the pause is significantly shorter. After GC, the app threads help out and sweep objects (concurrent sweep). | Apps sensitive to long latencies, transaction-based systems where response times are expected to be stable
-Xgcpolicy:gencon (default in WAS 8) | Treats short-lived and long-lived objects differently to provide a combination of lower pause times and high application throughput. Before the heap is filled up, each app thread helps out and marks objects (concurrent mark). | Latency sensitive apps, objects in the transaction don't survive beyond the transaction commit
Job
Web
Web
CMS Collector Example
JVM Option Description
-Xmn10g Fixed size Young Generation
-XX:+UseConcMarkSweepGC The concurrent collector is used to collect the tenured generation and does most of the collection concurrently with the execution of the application. The application is paused for short periods during the collection. A parallel version of the young generation copying collector is used with the concurrent collector.
-XX:+UseParNewGC This sets whether to use multiple threads in the young generation (with CMS only!). By default, this is enabled in Java 6u13, probably any Java 6, when the machine has multiple processor cores.
-XX:CMSInitiatingOccupancyFraction=75 This sets the percentage of the heap that must be full before the JVM starts a concurrent collection in the tenured generation. The default is somewhere around 92 in Java 6, but that can lead to significant problems. Setting this lower allows CMS to run more often (all the time sometimes), but it often clears more quickly to avoid fragmentation.
CMS Collector Example
JVM Option Description
-XX:+UseCMSInitiatingOccupancyOnly Indicates all concurrent CMS cycles should start based on -XX:CMSInitiatingOccupancyFraction=75
-XX:+ScavengeBeforeFullGC Do young generation GC prior to a full GC.
-XX:TargetSurvivorRatio=80 Desired percentage of survivor space used after scavenge.
-XX:SurvivorRatio=8 Ratio of eden/survivor space size
CMS Collector Example
JVM Option Description
-XX:+UseBiasedLocking Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.
-XX:MaxTenuringThreshold=15 Sets the maximum tenuring threshold for use in adaptive GC sizing. The current largest value is 15. The default value is 15 for the parallel collector and is 4 for CMS.
CMS Collector Example
JVM Option Description
-XX:ParallelGCThreads=4 Sets the number of garbage collection threads in the young/minor garbage collectors. The default value varies with the platform on which the JVM is running.
-XX:+UseCompressedOops Enables the use of compressed pointers (object references represented as 32 bit offsets instead of 64-bit pointers) for optimized 64-bit performance with Java heap sizes less than 32gb.
-XX:+OptimizeStringConcat Optimize String concatenation operations where possible. (Introduced in Java 6 Update 20)
-XX:+UseCompressedStrings Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)
-XX:+UseStringCache Enables caching of commonly allocated strings
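Since several of these options exist only in particular Java releases, it can help to confirm what a running JVM was actually started with. The sketch below uses the standard RuntimeMXBean for that; the class name and the list of checked prefixes are illustrative choices, not part of the talk.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Sketch: check which of the tuning flags the current JVM was started with. */
final class JvmFlagCheck {

    // True when any startup argument begins with the given prefix, e.g. "-Xmn".
    static boolean hasFlag(List<String> jvmArgs, String prefix) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not to the application's main method).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String[] expected = {"-Xms", "-Xmx", "-Xmn", "-XX:+UseConcMarkSweepGC",
                             "-XX:ParallelGCThreads"};
        for (String prefix : expected) {
            System.out.println(prefix + " set: " + hasFlag(jvmArgs, prefix));
        }
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap reported by the runtime: " + maxHeapMb + "m");
    }
}
```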
External Apps – Public Portal To-Be Architecture External Application and Data Platforms
Public Portal App
Middleware Services
Data Platform
Public Portal App
20 hosts, 4 sockets per host, 40 cores per host, 256GB RAM
• 2000 JVMs, made of 100 unique apps
• 1GB heap on each JVM
• 2000/100 => 20 JVM instances for each unique app
• 20 JVMs per app for scale-out
• CPU utilization is 10% at peak
Middleware Services
16 hosts, 2 sockets per host, 16 cores per host, 96GB RAM
• 400 JVMs hosting REST services
• 1GB heap on each JVM
• 1 app made of 25 unique REST services, collectively are same service
• 1 REST Service per JVM
• CPU utilization 15% at peak
Data Platform
5 hosts, 4 sockets per host, 40 cores per host, 512GB RAM
• 4 node RAC for Portal App DB
– 20,000 DB connections
– SGA 64GB on each RAC node
– CPU utilization 30% at peak
• 1 single instance for Middleware Services
– 6000 DB connections
– 64GB SGA
– CPU utilization is 30% at peak
• Batch process runs on one of the RAC nodes, 12GB, 3% CPU at peak
Let's focus on middleware services of the external application platform
XYZCars.com – Current Middleware Services Platform
– 25 unique REST Services
– Xyzcars.com deployed each REST service on a dedicated JVM
– The 25 JVMs are deployed on a physical box of 12 cores total (2 sockets, 6 cores each socket) and 96GB RAM
– There are a total of 16 hosts/physical boxes, hence a total of 400 JVMs servicing peak transactions for their business
– The current peak CPU utilization across all is at 15%
– Each JVM has a heap size of -Xmx 1024MB
– Majority of transactions performed on xyzcars.com traverse ALL of the REST services, and hence all of the 25 JVMs
XYZCars.com – Current Middleware Services Platform
R1 R1- denotes REST 1…25 Denotes a JVM
R1 One REST Service Per JVM
R1
R25
R1
R25
Load Balancer Layer
Solution 1 – Virtualize 1 REST : 1 JVM with 25 JVMs Per VM, 2 VMS Per Host
25 JVMs, 1 REST per JVM
On 1 VM
Solution 1 (400GB) – Virtualize 1 REST : 1 JVM with 25 JVMs Per VM, 2 VMs Per Host
– Sized for current workload, 400GB Heap space
– Deployed 25 JVMs on each VM, each JVM is 1GB
– Accounting for JVM off-the-heap overhead
• 25GB*1.25=31.25GB
– Add Guest OS 1GB
• 31.25+1=32.25GB
– 8 Hosts
– 25 unique REST Services
• Each REST Service deployed in its own JVM
• Original call paradigm has not changed
50 JVMs to 12 cores, this may be an issue. While the CPU utilization is originally at 15%, you can assume 30%+ CPU utilization is the new level. However, in actual fact response time may suffer significantly due to coinciding GC cycles that can cause CPU contention
XYZCars.com – Current Java Platform
R1 R1- denotes REST 1…25 Denotes a JVM
R1 One REST Service Per JVM
Load Balancer Layer R1
R25
R1
R25
Solution 1 (800GB) – Virtualize 1 REST : 1 JVM with 25 JVMs Per VM, 2 VMs Per Host
– Sized for current workload, 800GB Heap space
– Deployed 25 JVMs on each VM, each JVM is 1GB
– Accounting for JVM off-the-heap overhead
• 25GB*1.25=31.25GB
– Add Guest OS 1GB
• 31.25+1=32.25GB
– 16 Hosts
– 25 unique REST Services
• Each REST Service deployed in its own JVM
• Original call paradigm has not changed
50 JVMs to 12 cores, this may be an issue. While the CPU utilization is originally at 15%, you can assume 30%+ CPU utilization is the new level. However, in actual fact response time may suffer significantly due to coinciding GC cycles that can cause CPU contention
THIS SOLUTION IS NOT GREAT BUT IT'S THE LEAST INTRUSIVE
NOTE: We had to use 16 hosts, as the 8 hosts in the 400GB case already had 50 JVMs per host, which is significant
Solution 2 – Virtualize 25 REST : 1 JVM with 1 JVMs Per VM, 2 VMs Per Host
25 JVMs, 1 REST per JVM
On 1 VM
Solution 2 – Virtualize 25 REST : 1 JVM with 1 JVM Per VM, 2 VMs Per Host
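[Diagram: Load Balancer Layer in front of two JVMs, each hosting REST services R1, R2, R3, R4, R5 through R25; shown next to the earlier layout of a Load Balancer Layer in front of separate JVMs R1 through R25.]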
3x better response time using this approach
Solution 2 – Virtualize 25 REST : 1 JVM with 1 JVM Per VM, 2 VMs Per Host
[Diagram: Load Balancer Layer in front of two JVMs, each hosting REST services R1, R2, R3, R4, R5 through R25. Memory segments labeled Perm Gen, Initial Heap, Java Stack, and Guest OS Memory.]
All REST transactions across the 25 services run within one JVM instance
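To illustrate what running all 25 services in one process means, here is a minimal sketch in which every service is registered in a single in-process routing table, so a request that touches several services is served by plain function calls. The registry, handler shape, and `/restN` paths are all hypothetical and are not taken from the XYZCars.com platform.

```python
from typing import Callable, Dict, List

# Hypothetical in-process registry: URL path -> handler function.
Handler = Callable[[dict], dict]
REGISTRY: Dict[str, Handler] = {}


def make_handler(service_id: int) -> Handler:
    """Build a stand-in handler for REST service number service_id."""
    def handle(request: dict) -> dict:
        return {"service": f"rest{service_id}", "request": request}
    return handle


# Register all 25 services in the same process (the "macro deployed" unit).
for i in range(1, 26):
    REGISTRY[f"/rest{i}"] = make_handler(i)


def dispatch(path: str, request: dict) -> dict:
    """Route a request to a co-located service with a local call."""
    return REGISTRY[path](request)


def call_chain(paths: List[str], request: dict) -> List[dict]:
    """Call several services in sequence without leaving the process."""
    return [dispatch(path, request) for path in paths]


if __name__ == "__main__":
    print(call_chain(["/rest1", "/rest2", "/rest25"], {"id": 42}))
```

The services stay separately defined (one handler each), but they share one heap and one deployment unit.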
Solution 2 – Virtualize 25 REST : 1 JVM with 1 JVM Per VM, 2 VMs Per Host

| Description | Today's Traffic Load | Future Traffic Load (2.x current load) | Comment |
| --- | --- | --- | --- |
| VM Size (theoretical ceiling, NUMA optimized) | [96-{(96*0.02)+1}]/2 = 46.5GB | [96-{(96*0.02)+1}]/2 = 46.5GB | Using the NUMA overhead equation, this VM of 46.5GB and 6 vCPU will be NUMA local |
| VM Size for Prod | 46.5*0.95=44.2GB | 46.5*0.95=44.2GB | |
| JVM Heap Allowable | (44.2-1)/1.25=34.56GB | (44.2-1)/1.25=34.56GB | Working backwards from the NUMA size of 44.2GB, minus 1GB for Guest OS, and then accounting for 25% JVM overhead by dividing by 1.25 |
| Number of JVMs needed | 400/34.56=11.57 => 12 JVMs | 800/34.56=23.15 => 24 JVMs | Total heap needed divided by how much heap can be placed in each NUMA node |
| Number of Hosts | 6 | 12 | 1 JVM per VM, 1 VM per NUMA node |
| Solution Highlights | 6 hosts used instead of 16: 62.5% less hardware and hence reduced licensing cost (3.x better response time) | 12 hosts vs. what would have been 32 hosts (16*2): 62.5% saving, or 25% if you take 16 hosts as the base | The 12-host solution handles 2.x the current traffic with 25% less hardware than the existing 16 hosts that handle x load |
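The sizing steps in the table can be chained into one calculation. This is a sketch under stated assumptions: the parameter names are illustrative, the 0.02 factor and the fixed 1GB are treated simply as the overhead terms of the NUMA equation as written on the slide, and intermediate values are not rounded the way the slide rounds them, so the decimals differ slightly while the JVM and host counts match.

```python
import math


def numa_vm_size_gb(host_ram_gb=96, overhead_fraction=0.02,
                    fixed_overhead_gb=1, numa_nodes=2):
    """Theoretical VM memory ceiling that still fits in one NUMA node."""
    usable_gb = host_ram_gb - (host_ram_gb * overhead_fraction + fixed_overhead_gb)
    return usable_gb / numa_nodes


def size_platform(total_heap_gb, prod_factor=0.95, guest_os_gb=1,
                  off_heap_factor=1.25, vms_per_host=2):
    """Work from the NUMA-local VM size down to JVM and host counts."""
    vm_prod_gb = numa_vm_size_gb() * prod_factor
    heap_per_jvm_gb = (vm_prod_gb - guest_os_gb) / off_heap_factor
    jvms = math.ceil(total_heap_gb / heap_per_jvm_gb)
    hosts = math.ceil(jvms / vms_per_host)  # 1 JVM per VM, 1 VM per NUMA node
    return {"vm_prod_gb": vm_prod_gb, "heap_per_jvm_gb": heap_per_jvm_gb,
            "jvms": jvms, "hosts": hosts}


if __name__ == "__main__":
    for heap_gb in (400, 800):
        print(heap_gb, size_platform(heap_gb))
```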
Cluster Layout for Solution 2
Current Arch uses 16 hosts for servicing 400GB heap
Improved to-be Arch uses 6 hosts for servicing 400GB heap
Improved to-be Arch uses 12 hosts for servicing 800GB heap
• This solution uses 62.5% less hardware
• 3.x better response time
• Substantial software license saving
• Huge potential for further scalability
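The hardware-saving percentages quoted for the cluster layout follow from a simple ratio. A short sketch, with an illustrative function name:

```python
def hardware_saving_pct(baseline_hosts, new_hosts):
    """Percentage of hosts saved relative to the baseline host count."""
    return (baseline_hosts - new_hosts) / baseline_hosts * 100


if __name__ == "__main__":
    print(hardware_saving_pct(16, 6))    # 400GB heap: 16 hosts today vs 6
    print(hardware_saving_pct(32, 12))   # 800GB heap: 16*2 hosts vs 12
    print(hardware_saving_pct(16, 12))   # 800GB heap on 12 hosts vs today's 16
```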
Solution 2 – Virtualize 25 REST : 1 JVM with 1 JVM Per VM, 2 VMs Per Host
[Diagram: VM memory layout]
– VM Memory (43.5GB); set memory reservation to 43.5GB
• Guest OS Memory: 1GB used by OS
• Total JVM Memory (max=42.5GB, min=37.5GB)
– JVM Max Heap -Xmx (34.5GB)
– Initial Heap -Xms (34.5GB)
– Perm Gen -XX:MaxPermSize (1GB)
– Java Stack -Xss per thread (256k*1000)
– Other mem (=1GB)
– All REST Services in one heap
– Increase thread pool to 1000 to take on more load since heap is much larger
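As a sketch of how the labeled values map onto JVM options, the helper below builds the corresponding HotSpot flags and sums the labeled components. The function names are illustrative assumptions. `-XX:MaxPermSize` applies only to HotSpot JVMs that still have a permanent generation (before Java 8). The sum covers only the four labeled components and does not attempt to reproduce the total figures shown on the slide.

```python
def jvm_flags(heap_gb=34.5, perm_gb=1, stack_kb=256):
    """Build HotSpot options with initial heap equal to max heap."""
    heap_mb = int(heap_gb * 1024)  # -Xms/-Xmx take whole numbers, so use MB
    return [
        f"-Xms{heap_mb}m",
        f"-Xmx{heap_mb}m",
        f"-Xss{stack_kb}k",
        f"-XX:MaxPermSize={perm_gb}g",
    ]


def labeled_components_gb(heap_gb=34.5, perm_gb=1.0, stack_kb=256,
                          threads=1000, other_gb=1.0):
    """Heap + perm gen + thread stacks + other memory, in GB."""
    stacks_gb = stack_kb * threads / (1024 * 1024)
    return heap_gb + perm_gb + stacks_gb + other_gb


if __name__ == "__main__":
    print("java", " ".join(jvm_flags()), "...")
    print(round(labeled_components_gb(), 2), "GB")
```

Setting `-Xms` equal to `-Xmx`, together with a full memory reservation on the VM, matches the slide's intent of committing the heap up front.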
XYZCars.com – External Apps Platform's Middleware Services is an Example of Microservices Architecture
– 25 unique REST Services
– Xyzcars.com deployed each REST service on a dedicated JVM
– Microservices approach
• Micro defined
• Micro deployed
• Costly architecture and poor performance
• Offers ultimate flexibility, but this is not practical
Current Arch uses 16 hosts for servicing 400GB heap
Improved to-be Arch uses 6 hosts for servicing 400GB heap
– 25 unique REST Services
– 1 JVM has 25 REST service instances
– Microservices approach
• Micro defined
• Macro deployed
• Cost efficient
• High performing
• Good enough flexibility that is practical
XYZCars.com – External Apps Platform's Middleware Services is an Example of Microservices Architecture
Improved to-be Arch uses 6 hosts for servicing 400GB heap
– 25 unique REST Services
– 1 JVM has 25 REST service instances
– Microservices approach
• Micro defined
• Macro deployed
• Cost efficient
• High performing
Approach 1 - Micro Defined & Micro Deployed: fragmented scale-out consumes more resources, more VMs, has poor response time
[Diagram: Rest1, Rest2, Rest3 through Rest25, each in its own container]
25 container types, 400 container instances
Approach 2 - Micro Defined BUT MACRO Deployed: non-fragmented scale-out consumes fewer resources, fewer VMs, has GREAT response time
[Diagram: containers that each hold Rest1, Rest2 through Rest25]
1 container type, 12 container instances
3rd Platform?
– 2nd Platforms are stateful, need lots of care, supposedly difficult to change
– 3rd Platforms are all about microservices/macroservices concepts: independent software services, called in a sequence to formulate overall application logic
• Touted as stateless: if you lose one, it doesn't matter, you always have another copy somewhere else
• Supposedly easy to re-deploy/flexible, but this is not always the case
[Diagram: 2nd Platform shown as three tiers (Web tier, App tier, DB tier). 3rd Platform shown as independent services: Load Balancer, Authentication, Session Store, Licensing, Monitoring, Provisioning, DNS, Content, Database x3, Web Server x3, …]