Real Life Java EE Performance Tuning
Presentation by Matt Brasier, Head of Consultancy, C2B2 Consulting
Matt Brasier, Principal Consultant, C2B2 Consulting
About Me
Professional Services Consultant
Customers include
• Red Hat (JBoss)
• BEA
• Cape Clear
• Government/Finance/Telecoms
C2B2 Consulting
• SOA and Java EE consultancy
• Fast, Reliable, Manageable, Secure
What we will cover
Philosophy
• How I approach a performance problem situation
Enterprise Java Performance
• What kind of things affect performance of Enterprise Systems
Case Study 1
• A new version of the application runs slowly
Case Study 2
• Logging in takes a long time in the live environment
Case Study 3
• The application does not scale
What we will learn
Philosophy
• Suggestions to keep in mind when looking at a performance problem
Tools
• Suggested tools for looking at a performance problem
Techniques
• How to use the tools, knowledge and skills to solve your performance problem
Philosophy
‘A good understanding’ is the best
performance tuning tool
Prefer common and open source tools
Observe, Hypothesize, Tweak, Test
‘Trust no-one’
Classic Java performance problems
Memory leaks
• Increased GC Time
Poor GC or JVM Memory configuration
CPU bound code
IO bound code
Memory bound code
• Increased GC time
Enterprise Java Performance
CAVEAT: Consultancy Selection Bias
80/20: 80% of time finding, 20% fixing
Many ‘Enterprise’ Java performance problems turn
out not to be ‘classic’ performance bottlenecks
• Infrastructure/Middleware performance
There are many factors that can affect the
performance of an enterprise system
• Not just code
Enterprise Java Performance
Not all Java EE performance problems are classical ‘Java performance problems’
Common types of Java EE performance problem
• Resource starvation
• Threading problems
• ‘Suboptimal configuration’
• Network related problems
• Scalability problems
A Good Understanding
Consider the system as a whole
Know how infrastructure components work
• Not just what they do, but how they do it
How do the Java EE specifications say they
should work?
Approach
Understand the system
Understand the environment
Understand the situation
Talk to people who know
• But trust no-one
Take a look for myself
Observe, Hypothesize, Tweak, Test
• Rinse and repeat
Case Study 1
Existing customer calls
• “We deployed a new version of the application, and it is running a lot slower”
The Environment
• Sun Java 5
• WebLogic Server 9.2 Cluster (3 nodes)
• WebLogic Integration 9.2 Cluster (3 nodes)
• Documentum Document Management
• Oracle Database
• Solaris OS
Case Study 1
The System
• Web Application
• WLI based workflow system
The situation
• New version deployed into the performance
testing environment
• Automated performance tests indicate the
application is approximately 30% slower
Case Study 1
Observe
• No monitoring in place
• Some alerting, but no historical data
Hypothesize
• If we had more monitoring, we would stand a better
chance
Tweak
• Put some monitoring in place
• Hyperic HQ from SpringSource
Case Study 1
Test
• Re-run tests
Observe
• Monitoring indicates that one server is slower: it handles fewer requests per second, has lots of transaction timeouts, higher CPU usage and less network traffic
Tweak
• Add more monitoring to the slow server
• Examine log files
• Thread dumps!
Case Study 1
Hypothesize
• Thread dumps show lots of threads in logging code waiting to write to the log file
• Log files for the slow server have DEBUG messages in them; the other servers' log files do not
“The logging configurations are identical, the servers are configured with Maven”
• Trust no one
Test
• Log in to the server and manually check the logging configuration
Case Study 1
Solution
• Debug logging was enabled on one server
• Turned debug logging off - the system was then
about the same speed as the old release
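Checking the logging configuration on the running server, rather than trusting the deployed files, can be sketched as below. The slides do not say which logging framework the application used, so java.util.logging is assumed here as a stand-in (its FINE level is roughly what other frameworks call DEBUG), and the logger names are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class EffectiveLogLevel {
    /**
     * Walks up the logger hierarchy until a logger with an explicit level is found.
     * A logger whose own level is null inherits the level of its nearest configured parent.
     */
    static Level effectiveLevel(Logger logger) {
        for (Logger l = logger; l != null; l = l.getParent()) {
            if (l.getLevel() != null) {
                return l.getLevel();
            }
        }
        return Level.INFO; // fallback if nothing in the chain is configured
    }

    public static void main(String[] args) {
        // Hypothetical logger names, used only for illustration.
        Logger app = Logger.getLogger("demo.app");
        Logger web = Logger.getLogger("demo.app.web");
        app.setLevel(Level.FINE); // simulates debug logging left on for a parent logger
        System.out.println(web.getName() + " -> " + effectiveLevel(web));
    }
}
```

Running the same check on every node and comparing the results is one way to confirm that servers which are supposed to be identical really are.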
Hyperic HQ
Monitoring tool
• Not a profiling tool
Historical data
• Trends
• Abnormal behaviour
• ‘Hot’ spots
Wide variety of data
• JVM level statistics
• JMX statistics
• OS statistics
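As a rough illustration of the JVM-level statistics a monitoring tool collects over JMX, the sketch below reads a few of them from the platform MXBeans of the local JVM. It is not how Hyperic HQ itself is implemented; the class name and the choice of metrics are assumptions made for this example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class JvmSnapshot {
    final long heapUsedBytes;
    final int liveThreads;
    final long gcTimeMillis;

    JvmSnapshot(long heapUsedBytes, int liveThreads, long gcTimeMillis) {
        this.heapUsedBytes = heapUsedBytes;
        this.liveThreads = liveThreads;
        this.gcTimeMillis = gcTimeMillis;
    }

    /** Reads heap usage, live thread count and cumulative GC time from the platform MXBeans. */
    static JvmSnapshot take() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                gcMillis += t;
            }
        }
        return new JvmSnapshot(heap.getUsed(), threads, gcMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        // Taking snapshots at intervals gives the historical view that a single reading lacks.
        for (int i = 0; i < 3; i++) {
            JvmSnapshot s = take();
            System.out.println("heapUsed=" + s.heapUsedBytes
                    + " threads=" + s.liveThreads
                    + " gcMillis=" + s.gcTimeMillis);
            Thread.sleep(1000);
        }
    }
}
```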
Thread Dumps
My Number 2 tool for finding performance problems
• Ctrl-Break on Windows
• kill -3 <pid> on Unix/Linux
• jstack tool
• Available from consoles of many application servers
All threads in the VM and what they are doing at that moment
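A thread dump can also be taken from inside the JVM. The following is a minimal sketch using the standard ThreadMXBean; the class name and the output layout are choices made for this example and only loosely resemble what jstack prints.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class ThreadDumper {
    /** Returns the name, state and stack of every live thread as text. */
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip lock details so the call works on any JVM
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```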
Thread Dumps
A number of thread dumps over time gives a
good picture
• Any operation that appears a lot is a suspect
• Understand what ‘normal’ thread dumps look like
http://java.sun.com/developer/technicalArticles/Programming/Stacktrace/
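The idea that any operation appearing often across several dumps is a suspect can be sketched as a simple tally of the top stack frame of each thread. This example samples the local JVM with Thread.getAllStackTraces(); for an application server one would more likely parse saved jstack output, which is not shown. The class name, sample count and interval are arbitrary assumptions.

```java
import java.util.HashMap;
import java.util.Map;

final class TopFrameCounter {
    /** Adds the top frame ("class.method") of every live thread to the running tally. */
    static void sample(Map<String, Integer> counts) {
        for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
            if (stack.length == 0) {
                continue; // some threads report no frames
            }
            String top = stack[0].getClassName() + "." + stack[0].getMethodName();
            counts.merge(top, 1, Integer::sum);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < 5; i++) {   // five samples, 200 ms apart
            sample(counts);
            Thread.sleep(200);
        }
        // Print the ten most frequent top frames, most frequent first.
        counts.entrySet().stream()
              .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
              .limit(10)
              .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

Frames that dominate the tally on one server but not on its peers are the ones worth reading in full.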
Thread Dump
Thread Dumps
Look near the top of each stack
Look for stacks with your code in them
Look for long stacks
Look for deadlocks and other threading
issues
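Looking for deadlocks does not have to be done by eye. As a minimal sketch, the standard ThreadMXBean can report threads that are deadlocked on monitors or ownable synchronizers; the class name is hypothetical, and the check only covers the JVM it runs in.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class DeadlockCheck {
    /** Returns info for deadlocked threads, or an empty array when there are none. */
    static ThreadInfo[] deadlocked() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is found
        return ids == null ? new ThreadInfo[0] : mx.getThreadInfo(ids);
    }

    public static void main(String[] args) {
        ThreadInfo[] infos = deadlocked();
        if (infos.length == 0) {
            System.out.println("No deadlocked threads found");
        }
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            System.out.println(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
    }
}
```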
The Understanding
What does a normal WebLogic thread dump look like?
It is not normal to see logging code frequently in a
thread dump
Lots of threads all waiting on a single lock object is a
Bad Thing™
If three servers are supposed to do the same thing,
their thread dumps should look similar
• Over time
Lessons
Thread dumps hold a lot of information
Infrastructure configuration faults are more
common than infrastructure bugs
Automated/continuous build and deploy
solutions are no silver bullet
• Check the results yourself
Believe your ‘instincts’
Case Study 2
Customer Call
• “We deployed our application into the live environment
and it takes several minutes for users to log in”
Environment
• Apache web servers
• WebLogic Portal 8.1 Cluster (2 nodes)
• Oracle Database
• Windows Server 2003
• Bespoke Single Sign On server
Case Study 2
The System
• Web application based on WSRP portlets
• Oracle database storing user data
The Situation
• The first users to log-in in the morning find that it
takes several minutes
• After the first few log-ins, the application runs fine
Case Study 2
Hypothesize
• The bespoke Single Sign On server makes me suspicious
• Bespoke code is tested less
Test
• Turn on debug logging for the SSO implementation
• Observe timings of log messages
Case Study 2
Observe
• The logs indicate that the SSO log-in is proceeding
as expected
• It appears that loading the user's profile data from
the database is taking a long time
Hypothesize
• TCP timeouts when connecting to the database
due to a firewall
Case Study 2
Test
• Observe the connection pool statistics in the
WebLogic console
• The console indicates that a large number of
connections have been opened during the time the
application has been running
• Connections are not normally closed and re-opened
• See how long you need to leave the system before
the problem occurs
Case Study 2
Solution
• Discussions with the networking team indicated
that there was a firewall, configured to silently
terminate network connections that were idle for
60 minutes
• Set WebLogic to test connections after they have
been idle for 50 minutes.
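The idea behind the fix can be sketched as a periodic check that exercises any connection idle for longer than a threshold set below the firewall timeout, replacing it if the check fails. This is a hypothetical illustration, not WebLogic's implementation or configuration; the class, its methods and the use of strings as stand-in connections are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical sketch of idle-connection testing; not WebLogic's implementation. */
final class IdleConnectionTester<C> {
    private static final class Entry<C> {
        C connection;
        long lastActivityMillis;
        Entry(C connection, long lastActivityMillis) {
            this.connection = connection;
            this.lastActivityMillis = lastActivityMillis;
        }
    }

    private final List<Entry<C>> entries = new ArrayList<>();
    private final Supplier<C> factory;   // opens a new connection
    private final Predicate<C> isAlive;  // e.g. runs a cheap test query
    private final long maxIdleMillis;
    private int opened;

    IdleConnectionTester(int size, Supplier<C> factory, Predicate<C> isAlive,
                         long maxIdleMillis, long nowMillis) {
        this.factory = factory;
        this.isAlive = isAlive;
        this.maxIdleMillis = maxIdleMillis;
        for (int i = 0; i < size; i++) {
            entries.add(new Entry<>(factory.get(), nowMillis));
            opened++;
        }
    }

    /**
     * Meant to be called periodically. Tests every connection idle for at least
     * maxIdleMillis, replaces the ones that fail, and returns how many were tested.
     */
    synchronized int testIdle(long nowMillis) {
        int tested = 0;
        for (Entry<C> e : entries) {
            if (nowMillis - e.lastActivityMillis >= maxIdleMillis) {
                if (!isAlive.test(e.connection)) {
                    e.connection = factory.get();
                    opened++;
                }
                e.lastActivityMillis = nowMillis; // the test itself counts as traffic
                tested++;
            }
        }
        return tested;
    }

    /** Total connections opened since creation; equals the pool size while healthy. */
    synchronized int openedCount() {
        return opened;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // Strings stand in for real connections; the liveness check always passes here.
        IdleConnectionTester<String> pool =
                new IdleConnectionTester<String>(2, () -> "connection", c -> true, 50 * minute, 0L);
        System.out.println("tested at 50 min: " + pool.testIdle(50 * minute));
        System.out.println("opened so far: " + pool.openedCount());
    }
}
```

With a 50 minute threshold the test traffic reaches the firewall before its 60 minute idle limit, so connections are kept open rather than silently dropped. With real JDBC connections the liveness check could be a call to Connection.isValid or a cheap test query.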
Lessons
Consider the system as a whole
• Hardware
• Networking
• OS
• Middleware
• Application
The Understanding
Firewalls are often configured to silently terminate
idle TCP connections
The TCP protocol requires that a connection is closed
by both sides, or times out
• The time out is several minutes
In a healthy WebLogic connection pool, the number
of connections opened since the server started = the
maximum number in the pool
Case Study 3
Customer call
• “It takes about 20 seconds to render a page, and
the performance does not scale”
Environment
• WebLogic Portal 9.1 Cluster (2 nodes)
• Oracle 10g Database
• Red Hat Enterprise Linux
Case Study 3
The System
• Online content delivery system
• WebLogic Portal with a commercial set of portlets
The Situation
• Two problems:
• Running the performance tests with 20 threads in JMeter is twice as slow as running the tests with 10 threads
• Viewing a content item takes around 20 seconds
Case Study 3
Handle the two problems separately
• They may be related, they may not be
Case Study 3
Observe
• Viewing a content item takes around 16 seconds on my laptop
Test
• Is the rendering speed dependent on the browser used?
• Is the rendering speed dependent on the client machine?
• What does the page source look like?
Case Study 3
Observe
• In Opera the page renders quickly except for the table of contents on the left
• In Firefox, the whole page renders at the same time
• The page renders faster in IE and Opera than in Firefox
• The page renders faster on faster machines
• There is a lot of JavaScript, and AJAX is used to load the table of contents
Case Study 3
Hypothesize
• The AJAX rendering of the TOC is taking a long
time, and slowing down the whole page load
Tweak
• Remove the TOC from the page
• Disable JavaScript in the browser
Test
• The page renders in less than 2 seconds
Case Study 3
Hypothesize
• JMeter does not execute the JavaScript, so the poor
performance of JMeter is not related to the poor
page load speed
Case Study 3
Solution 1
• The portlet developers used AJAX to render the table of contents for a content item; this is much slower than constructing the table of contents on the server side
• Rewrite the portlet to construct the table of contents on the server side
• Developers sometimes select a technology to enhance their CVs, not to implement a business requirement
Case Study 3
Problem 2 – Scalability
Observe
• Running the tests on JMeter with 10 users, each
page response takes 5s
• Running the test with 20 users each page
response takes 12s
• JMeter is being run on an old laptop, which is at
100% CPU in both cases
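The two observations can be turned into throughput figures with a small calculation. The sketch below assumes a closed-loop test with no think time between requests (the slides do not say whether the JMeter plan used timers), in which case throughput is simply the number of virtual users divided by the response time. The class and method names are made up for this example.

```java
final class ClosedLoopThroughput {
    /** Requests per second for virtual users that send the next request as soon as one returns. */
    static double requestsPerSecond(int users, double responseSeconds) {
        return users / responseSeconds;
    }

    public static void main(String[] args) {
        // Figures taken from the observations above.
        System.out.println("10 users: " + requestsPerSecond(10, 5.0) + " requests/s");
        System.out.println("20 users: " + requestsPerSecond(20, 12.0) + " requests/s");
    }
}
```

Under that assumption, 10 users give 2 requests per second and 20 users give about 1.67, so doubling the number of users did not increase the number of requests completed. Something in the chain is already saturated at 10 users.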
Case Study 3
Hypothesize
• As the test machine is at 100% CPU, it is the
performance of JMeter that is being measured, not
the performance of WebLogic
Observe
• WebLogic is running at around 2% CPU usage, with
many idle threads
Case Study 3
Tweak
• Run the test from a number of more modern
machines, and make sure each one does not
exceed 70% CPU
Observe
• Four machines can each run 20 threads and get
responses in 1.5 seconds, and WebLogic is still
running at around 5% CPU and not struggling
Case Study 3
Solution
• The problem was that the test client was not able
to generate the loads requested, resulting in the
performance of the test client being measured
• Use a larger test client
Useful tools
Ethereal/Wireshark
• Network traffic sniffer
• See when requests/responses were sent/received
Firebug + YSlow
• Firefox plugin for performance analysis
Lessons
Separate problems should initially be
prioritised and investigated separately
• Keep in mind that they may be related
Ensure the test system can generate the
required load
• It should have plenty of free resources available
Lessons
The consultant effect
• Take a step back
• Get a fresh perspective
The Understanding
A slow test client will give slow results
Client side rendering is usually less efficient
than server side
WebLogic is normally fast!
What did we learn?
Simple tools can provide a lot of information
Understanding how the system should
behave will help highlight possible causes
Experience is vital
• Write a log of what you find
Take a step back from the problem
• Use a second pair of eyes
What did we learn?
Philosophy
• Understand the system as a whole
• A deep understanding of how it should work
Tools
• Thread dumps
• Monitoring tools
• Packet sniffing
Techniques
• Observe, Hypothesize, Tweak, Test
Questions
Session Evaluation
Please complete a session evaluation and
turn it in to any conference staff member or
at the registration desk. Thank you.