System Performance: Build, Fuel, Tune – Simar Singh ([email protected])


Post on 15-Jan-2015


TRANSCRIPT

Page 2: Performance Concurrency Troubleshooting   Final

Learn and Apply

Topics • Performance

• Concurrency (Threads)

• Troubleshooting

• Processing (CPU/Cores)

• Memory (System / Process)

• Thread Dumps

• Garbage Collection

• Heap Dumps

• Core Dumps & Postmortem

• Java (jstack, jmap, jstat, VisualVM)

• Solaris (prstat vmstat mpstat pstack)

Index (Click Links in Slide Show)

• Concepts

• Processing

• Memory

Page 3: Performance Concurrency Troubleshooting   Final

Concepts Concurrency and Performance

(Part 1)

Page 4: Performance Concurrency Troubleshooting   Final

What will we Discuss?

• LEARN
– There are laws and principles that govern concurrency and performance.
– Performance can be built, fueled and/or tuned.
– How do we measure performance and capacity in abstract terms?
– Capacity (throughput) and Load are often used interchangeably, but incorrectly.
– What is the difference between resource utilization and saturation?
– How are performance & capacity measured on a live system (CPU & Memory)?

• APPLY
– Find out how your system is being used or abused.
– Find out how your system is performing as a whole.
– Find out how a particular process in the system is performing.
– Find out how a particular thread in the process is performing.
– Find out the bottlenecks: what is scarce or missing?

Page 5: Performance Concurrency Troubleshooting   Final

Performance – Built, Fueled or Tuned

• Built (Implementation and Techniques)
– Binary Search O(log n) is more efficient than Linear Search O(n)
– Caching can improve Disk I/O, significantly boosting performance.

• Fueled (More Resources)
– Simply get a machine with more CPU(s) and Memory if constrained.
– Implement RAID to improve Disk I/O

• Tuned (Settings and Configurations)
– Tune Garbage Collection to optimize Java processes
– Tune Oracle parameters to get optimum database performance
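The "Built" point can be made concrete. A minimal Java sketch (class name and data values are illustrative) contrasting a hand-written O(n) linear search with the standard library's O(log n) binary search:

```java
import java.util.Arrays;

public class SearchDemo {
    // Linear search: scans every element, O(n) comparisons in the worst case.
    static int linearSearch(int[] a, int key) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == key) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] sorted = {2, 5, 8, 12, 16, 23, 38, 56, 72, 91};
        // Binary search: O(log n) comparisons, but requires sorted input.
        System.out.println(Arrays.binarySearch(sorted, 23)); // prints 5
        System.out.println(linearSearch(sorted, 23));        // prints 5
    }
}
```

Both calls find the same index; the difference is that binary search touches about log2(n) elements where linear search may touch all n.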

Page 6: Performance Concurrency Troubleshooting   Final

Capacity and Load

• Load is an expectation of the system
– It is the rate of work that we put on the system.
– It is a factor external to the system.
– Load may vary with time and events.
– It has no upper cap; it can increase without bound.

• Capacity is the potential of the system
– It is the maximum rate of work the system supports efficiently, effectively & indefinitely.
– It is a factor internal to the system. The maximum capacity of a system is finite and stays fairly constant.
– We often call throughput the system's capacity for load.

• Chemistry between Load & Capacity
– LOAD = CAPACITY? Good. Expectation matches the potential. Hired.
– LOAD > CAPACITY? Bad. Expectation is more than the potential. Fired.
– LOAD < CAPACITY? Ugly. Expectation is less than the potential. Find another one.
– If not good, better to be ugly than bad.

Page 7: Performance Concurrency Troubleshooting   Final

Performance Measurement of a System

Measures of a System's Capacity

• Response Time or Latency
– Measures time spent executing a request
• Round-trip time (RTT) for a transaction
– Good for understanding user experience
– Least scalable measure; developers focus on how much time each transaction takes

• Throughput
– Measures the number of transactions executed over a period of time
• Output transactions per second (TPS)
– A measure of the system's capacity for load
– Depending upon the resource type, it could be hit rate (for a cache)

• Resource Utilization
– Measures the use of a resource
• Memory, disk space, CPU, network bandwidth
– Helpful for system sizing; generally the easiest measurement to understand
– Throughput and response time can conflict, because resources are limited
• Locking, resource contention, container activity

Page 8: Performance Concurrency Troubleshooting   Final

It is time for System Capacity to be Loaded with work (Throttling & Buffering Techniques)

• Nothing stops us from loading a system beyond its capacity (max throughput).

• "Transactions per second" can mislead; real traffic may arrive in bursts
– Received 3,600 transactions in an hour; not sure if exactly 60 were pumped in every second
– Perhaps we received them in bursts: all in the first 10 minutes and nothing for the last 50
– So we really can't say at what tps. We can regulate bursts with throttling and buffering.

• Throttling (implemented by the producer to smooth output)
– Spreads bursts over time to smooth output from a process
– We may add throttles to control the output rate from threads to each external interface
– A throttle of 10 tps ensures max output is 10 tps regardless of the load & capacity.
– Throttling is a scheme for producers (match production to the rate the consumer can accept)

• Buffering (implemented by the consumer to smooth input)
– Spreads bursts over time to smooth input from an external interface
– We add buffering to control the input rate to threads from each external interface
– If the application processes input at 10 tps, load above that is buffered & processed later
– Buffering is a scheme for consumers (take whatever is produced, consume at our own pace)
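The throttling and buffering schemes above can be sketched with standard java.util.concurrent types. This is a minimal illustration, not a production rate limiter: the 10 tps figure and queue sizes are from the slide, and the throttle is approximated by spacing sends 100 ms apart.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ThrottleBufferDemo {
    public static void main(String[] args) throws InterruptedException {
        // BUFFERING (consumer side): a bounded queue absorbs a burst of 30
        // requests; the consumer drains it at its own pace.
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(100);
        for (int i = 0; i < 30; i++) {
            buffer.put(i);               // burst arrives "instantly"
        }

        // THROTTLING (producer side): emit at most 10 items per second by
        // spacing sends 100 ms apart, regardless of how full the buffer is.
        long intervalMs = 1000 / 10;     // 10 tps -> one item every 100 ms
        int sent = 0;
        while (!buffer.isEmpty() && sent < 10) {
            buffer.take();               // consume one unit of work
            sent++;
            Thread.sleep(intervalMs);    // pace the output
        }
        System.out.println(sent + " items sent, " + buffer.size() + " buffered");
    }
}
```

The burst is fully accepted (buffered), while the output side never exceeds the throttle rate.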

Page 9: Performance Concurrency Troubleshooting   Final

Supply Chain Principle (Apply it to define an optimum Thread Pool Size)

• The more throughput you want, the more resources you will consume.

• You may apply this principle to define the optimum thread-pool size for a system/application.
– To support a throughput of (t) transactions per second: (t) = 20 tps
– Where each transaction takes (d) seconds to complete: (d) = 5 seconds
– We need at least (d*t) threads (the minimum size of the thread pool): (d*t) = 100 threads

• Thread is an abstract CPU unit resource here.
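The arithmetic on this slide can be checked directly. A small sketch (the method name is hypothetical) that computes the minimum pool size from the slide's formula and creates a pool of that size:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    // Minimum threads needed to sustain t tps when each task takes d seconds:
    // at any instant, t * d transactions are in flight, each holding a thread.
    static int minPoolSize(int throughputTps, double taskSeconds) {
        return (int) Math.ceil(throughputTps * taskSeconds);
    }

    public static void main(String[] args) {
        int t = 20;       // target throughput, transactions/sec
        double d = 5.0;   // duration of one transaction, seconds
        int threads = minPoolSize(t, d);
        System.out.println("min pool size = " + threads); // prints 100
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        pool.shutdown();
    }
}
```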

Page 10: Performance Concurrency Troubleshooting   Final

To support a throughput (t) of 20 tps, where each transaction takes (d) 5 seconds, we need at least 100 (d*t) threads.

[Diagram: each second a new batch of 20 transactions starts; each runs for 5 seconds, so from the 5th second onward five batches of 20 overlap, i.e. 100 threads are busy concurrently.]

Page 11: Performance Concurrency Troubleshooting   Final

Quantify Resource Consumption Utilization & Saturation

• Resource Utilization
– Utilization measures how busy a resource is.
– It is usually represented as a percentage averaged over a time interval.

• Resource Saturation
– Saturation is often a measure of work that has queued waiting for the resource.
– It can be measured both as an average over time and at a particular point in time.
– For some resources that do not queue, saturation may be synthesized from error counts. Example: page faults reveal memory saturation.

• Load (the input rate of requests) is an independent/external variable.

• Resource consumption and throughput (the output rate of responses) are dependent/internal variables, a function of load.

Page 12: Performance Concurrency Troubleshooting   Final

How are Load, Resource Consumption and Throughput related?

• As load increases, throughput increases, until maximum resource utilization on the bottleneck device is reached. At this point, maximum possible throughput is reached, Saturation occurs.

• Then, queuing (waiting for saturated resources) starts to occur.

• Queuing typically manifests itself by degradation in response times.

• This phenomenon is described by Little’s Law:

L = X * R

L (LOAD), X (THROUGHPUT) and R (RESPONSE TIME)

• As L increases, X increases (R also increases slightly, because there is always some level of contention at the component level).

• At some point, X reaches Xmax, the maximum throughput of the system. At this point, as L continues to increase, the response time R increases in proportion and throughput may then start to decrease, both due to resource contention.
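A worked instance of Little's Law as stated above, with illustrative numbers (the 50 tps and 0.2 s figures are hypothetical):

```java
public class LittlesLaw {
    public static void main(String[] args) {
        // L = X * R: the number of requests in the system equals throughput
        // times response time. With X = 50 tps and R = 0.2 s, about 10
        // requests are in flight at any instant.
        double x = 50.0;   // throughput, transactions/sec
        double r = 0.2;    // response time, seconds
        double l = x * r;
        System.out.println("in-flight requests L = " + l);
    }
}
```

Read the other way: if monitoring shows 10 requests in flight at 50 tps, the average response time must be 10 / 50 = 0.2 s.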

Page 13: Performance Concurrency Troubleshooting   Final

Performance pattern of a Concurrent Process

Page 14: Performance Concurrency Troubleshooting   Final

Example: How are Throughput and Resource Consumption related?

• Throughput & latency can have an inverse or direct relationship

– Concurrent tasks (threads) often contend for resources (locking & contention)

• Single-Threaded – Higher Throughput = Lower Latency

– Consistent throughput; does not increase with incoming load & resources

– Processes serially; good for batch jobs

– Response time varies linearly with request order.

• Multi-Threaded – Higher Throughput = Higher Latency (most of the time)

– Throughput may increase linearly with load, but starts to drop after a threshold

– Processes concurrently; good for interactive modules (web apps)

– Near-consistent response time; varies not with order but with load.

Single Threaded – 10 CPU(s):
– Threads = 1, Latency = 0.1 s, Throughput = 1/0.1 = 10 tx/sec
– Threads = 1, Latency = 0.001 s, Throughput = 1/0.001 = 1000 tx/sec

Multi Threaded – 10 CPU(s):
– Threads = 10, Latency = 0.1 s, Throughput = (1/0.1) * 10 = 100 tx/sec
– Threads = 100, Latency = 0.2 s, Throughput = (1/0.2) * 100 = 500 tx/sec

Page 15: Performance Concurrency Troubleshooting   Final

Producer Consumer Principle Predicting Maximum Throughput

Identify Bottleneck Device/Resource

• The Utilization Law: Ui = T * Di

• Where Ui is the percentage of utilization of a device in the application, T is the application throughput, and Di is the service demand of the application device.

• The maximum throughput of an application Tmax is limited by the maximum service demand of all of the devices in the application.

• EXAMPLE - A load test reports 200 kb/sec average throughput:

CPUavg = 80% Dcpu = 0.8 / 200 kb/sec = 0.004 sec/kb

Memoryavg = 30% Dmemory = 0.3 / 200 kb/sec = 0.0015 sec/kb

Diskavg = 8% Ddisk = 0.08 / 200 kb/sec = 0.0004 sec/kb

Network I/Oavg = 40% Dnetwork I/O = 0.4 / 200 kb/sec = 0.002 sec/kb

• In this case, Dmax corresponds to the CPU. So, the CPU is the bottleneck device.

• We can use this to predict the maximum throughput of the application by setting the CPU utilization to 100% and dividing by Dcpu. In other words, for this example:

Tmax = 1 / Dcpu = 250 kb/sec

• In order to increase the capacity of this application, it would first be necessary to increase CPU capacity. Increasing memory, network capacity or disk capacity would have little or no effect on performance until after CPU capacity has been increased sufficiently.
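The bottleneck calculation on this slide can be reproduced in a few lines, using the slide's own numbers (the array layout and variable names are illustrative):

```java
public class UtilizationLaw {
    // Ui = T * Di  =>  Di = Ui / T;  Tmax = 1 / Dmax  (bottleneck device)
    public static void main(String[] args) {
        double t = 200.0;                          // measured throughput, kb/sec
        double[] util = {0.80, 0.30, 0.08, 0.40};  // cpu, memory, disk, network
        String[] name = {"cpu", "memory", "disk", "network"};

        double dmax = 0;
        String bottleneck = "";
        for (int i = 0; i < util.length; i++) {
            double d = util[i] / t;                // service demand, sec/kb
            if (d > dmax) { dmax = d; bottleneck = name[i]; }
        }
        System.out.println("bottleneck = " + bottleneck);       // cpu
        System.out.println("Tmax = " + (1 / dmax) + " kb/sec"); // 250 kb/sec
    }
}
```

The CPU's 0.004 sec/kb is the largest demand, so Tmax = 1 / 0.004 = 250 kb/sec, matching the slide.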

Page 16: Performance Concurrency Troubleshooting   Final

Work Pools & Thread Pools Working Together

• Work Pools are queues of work to be performed by a software application or component.

– If all threads in thread pool are busy, incoming work can be queued in work pool

– Threads from thread pool, when freed can execute them later

• Work pools absorb congestion & smooth bursts
– A queue consisting of units of work to be performed
– CONGESTION: allows the current (client) threads to submit work and return
– BURST: over-capacity transactions can be buffered in the work pool and executed later
– Allows batching of units of work to reduce system-intensive calls
• Can perform a bulk fetch from a database instead of fetching one record at a time
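In Java, ThreadPoolExecutor pairs a thread pool with a work pool directly: the BlockingQueue passed to its constructor is the work pool. A minimal sketch (pool size and task count are illustrative):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkPoolDemo {
    public static void main(String[] args) throws InterruptedException {
        // 2 worker threads; the LinkedBlockingQueue is the "work pool".
        // Submissions beyond the busy workers queue up instead of blocking
        // the submitting (client) thread.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());

        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 10; i++) {
            pool.execute(done::incrementAndGet);   // returns immediately
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("completed = " + done.get()); // prints 10
    }
}
```

All ten units of work complete even though only two threads exist; the other eight simply waited in the work pool.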

Page 17: Performance Concurrency Troubleshooting   Final

Queuing Tasks may be risky

• One task could lock up another that would be able to continue if the queued task were to run.

• Queuing can smooth incoming traffic bursts that are limited in time (depending upon the rate of traffic and queue size).

• It fails if traffic arrives, on average, faster than it can be processed.

• In general, work pools are in memory, so it is important to understand the impact of restarting a system, as in-memory elements will be lost.

– Is it acceptable to lose the queued work?

– Is the queue backed up on disk?

Page 18: Performance Concurrency Troubleshooting   Final

Bounded & Unbounded Pools (Load Shedding)

• If not bounded, pools can grow freely but can cause the system to exhaust resources.
– Work Pool / Queue unbounded (may overload Memory / Heap & crash)
• Each work object in the queue holds its space until consumed
– Thread Pool unbounded (may overload CPU / native space and crash)
• Each thread asks to be scheduled on a CPU and consumes native stack space

• If the queue size is bounded, incoming execute requests block when it is full. We can apply different policies to handle it, for example:
– Reject if there is no space (can have side effects)
– Remove based on priority (e.g. priority may be a function of time: timeouts)

• Thread pools can have different policies when the work pool is full:
– Block till there is available space
– Starve (VERY BAD – sometimes needed)
– Run in the current thread (very dangerous!)
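Java's ThreadPoolExecutor exposes these policies as RejectedExecutionHandler implementations. A sketch of the "reject if there is no space" policy with a bounded work pool (pool size, queue size and sleep times are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPoolDemo {
    public static void main(String[] args) {
        // 1 worker, work pool bounded at 2. AbortPolicy rejects work when
        // the queue is full ("reject if there is no space").
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(2),
                new ThreadPoolExecutor.AbortPolicy());
        // Alternatives: CallerRunsPolicy ("run in the current thread"),
        // DiscardPolicy / DiscardOldestPolicy (shedding variants).

        int rejected = 0;
        for (int i = 0; i < 5; i++) {
            try {
                // first task goes straight to the worker, two more queue up
                pool.execute(() -> sleep(200));
            } catch (RejectedExecutionException e) {
                rejected++;            // no space left: load is shed
            }
        }
        System.out.println("rejected = " + rejected); // prints 2
        pool.shutdownNow();
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
    }
}
```

Of five submissions, one runs, two wait in the bounded work pool, and the last two are shed.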

Page 19: Performance Concurrency Troubleshooting   Final

Work pool & thread pool sizes can often be traded off for each other

Large work pools and small thread pools

– Minimize CPU usage, OS resources, and context-switching overhead.

– Can lead to artificially low throughput, especially if tasks frequently block (e.g. I/O bound)

Small work pools generally require larger thread pool sizes

– Keep CPUs busier

– May cause scheduling overhead (context switching) and may lessen throughput, especially if the number of CPUs is small.

Page 20: Performance Concurrency Troubleshooting   Final

Processing (CPU) Performance & Troubleshooting

(Part 2)

Page 21: Performance Concurrency Troubleshooting   Final

CPU

• Many modern systems from Sun boast numerous CPUs or virtual CPUs (which may be cores or hardware threads).

• The CPUs are shared by applications on the system, according to a policy prescribed by the operating system and scheduler

• If the system becomes CPU resource limited, then application or kernel threads have to wait on a queue to be scheduled on a processor, potentially degrading system performance.

• The time spent on these queues, the length of these queues and the utilization of the system processor are important metrics for quantifying CPU-related performance bottlenecks.

Page 22: Performance Concurrency Troubleshooting   Final

Process – User and Kernel Level Threads

• A process includes the set of executable programs, address space, stack, and process control block. One or more threads may execute the program(s).

• User-level threads (threads library)
– Invisible to the OS; maintained by a thread library.
– Are the interface for application parallelism

• Kernel threads
– The unit that can be dispatched on a processor; its structures are maintained by the kernel

• Lightweight processes (LWP)
– Each LWP supports one or more user-level threads and maps to exactly one kernel-level thread. Maintains the state of a thread.

Page 23: Performance Concurrency Troubleshooting   Final

CPU Consumption Model

By default, Solaris 10 uses the fourth model shown (the 1:1 model); the rest are obsolete.

Page 24: Performance Concurrency Troubleshooting   Final

Dispatcher and Run Queue at CPU

Page 25: Performance Concurrency Troubleshooting   Final

User Thread over a Solaris LWP State of User Thread and LWP may be different

Page 26: Performance Concurrency Troubleshooting   Final

Solaris Threading Model

If your code runs in a user thread, the thread library must schedule it on an LWP.

Each LWP has a kernel thread, which the kernel schedules on a CPU.

Threading models define the mapping between Solaris threads & LWPs.

Page 27: Performance Concurrency Troubleshooting   Final

JVM Organization

Page 28: Performance Concurrency Troubleshooting   Final

JVM Memory Organization & Threads

• Method Area
– The JVM loads class files, their type info and binary data into this area
– This memory area is shared by all threads

• Heap Area
– The JVM places all objects the program instantiates onto the heap
– This memory area is shared by all threads
– This memory can be adjusted with the VM options -Xmx & -Xms as required

• Java Stack and Program Counter (PC) Register
– Each new thread that executes gets its own PC register & Java stack.
– The value of the PC register indicates the next instruction to execute.
– A thread's Java stack stores the state of Java method invocations for the thread. The state of a Java method invocation includes:
• its local variables & the parameters with which it was invoked,
• its return value (if any), and intermediate calculations.
– This memory may be adjusted with the VM option -Xss, typically 1m for RK apps
– The state of native method (JVM method) invocations is stored in an implementation-dependent way in native method stacks, as well as possibly in registers or other implementation-dependent memory areas.
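A small sketch of how the heap settings above surface at runtime. The Runtime methods are standard; the printed values depend on the -Xmx/-Xms flags the JVM was started with:

```java
public class JvmMemory {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects -Xmx; totalMemory() is the currently
        // committed heap (initially near -Xms); freeMemory() is the
        // unused portion of the committed heap.
        System.out.printf("max   heap: %d MB%n", rt.maxMemory()   >> 20);
        System.out.printf("total heap: %d MB%n", rt.totalMemory() >> 20);
        System.out.printf("free  heap: %d MB%n", rt.freeMemory()  >> 20);
        // Per-thread stack size is set with -Xss (e.g. java -Xss1m JvmMemory);
        // it is not queryable from within the program.
    }
}
```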

Page 29: Performance Concurrency Troubleshooting   Final

A Java thread’s Stack Memory

• The Java stack is composed of stack frames (or frames).

• A stack frame contains the state of one Java method invocation.

– When a thread invokes a method, the Java virtual machine pushes a new frame onto that thread's Java stack.

– When the method completes, the virtual machine pops and discards the frame for that method.

Page 30: Performance Concurrency Troubleshooting   Final

Thread Modes Kernel & User Mode Privilege

• An LWP may execute in either kernel (sys) or user (usr) privilege mode.

• Operations like processing data in local memory, and communication between threads of the same process, do not require kernel-mode privilege for the thread executing the user program.

• However, inter-process communication and hardware access are done by kernel programs, so the executing thread requires kernel-mode privilege.

• User programs call kernel programs by making system calls.

• An LWP runs in user mode until it makes a system call that requires kernel-mode privilege. The mode switch then happens, which is costly.

Page 31: Performance Concurrency Troubleshooting   Final

LWP/Thread Modes User Mode and Kernel Mode

Don't confuse the modes with the thread types (kernel-level and user-level).

Page 32: Performance Concurrency Troubleshooting   Final

Complete Process State Diagram: the state of a process is a superset of the thread states; a process's state is defined by its threads.

Page 33: Performance Concurrency Troubleshooting   Final

VMSTAT - Glimpse of CPU Behavior

The vmstat tool provides a glimpse of the system's behavior: each line indicates both CPU utilization and saturation. The first line is the summary since boot, followed by samples every five seconds.

On the far right is cpu:id, percent idle, which lets us determine how utilized the CPUs are. In this example, the idle time for the 5-second samples was always 0, indicating 100% utilization.

On the far left is kthr:r, the total number of threads on the ready-to-run queues. If the value is more than the number of CPUs, it indicates CPU saturation. Here, kthr:r was mostly 2 and sustained, indicating modest saturation for this single-CPU server. A value of 4 would indicate high saturation.

Page 34: Performance Concurrency Troubleshooting   Final

More about VMSTAT

Count – Description

kthr
– r: Total number of runnable threads on the dispatcher queues

faults
– in: Number of interrupts per second
– sy: Number of system calls per second
– cs: Number of context switches per second, both voluntary and involuntary

cpu
– us: Percent user time; time the CPUs spent processing user-mode threads
– sy: Percent system time; time the CPUs spent processing system calls on behalf of user-mode threads, plus the time spent processing kernel threads
– id: Percent idle; time the CPUs are waiting for runnable threads. This value can be used to determine CPU utilization

Page 35: Performance Concurrency Troubleshooting   Final

CPU Utilization

• You can calculate CPU utilization from vmstat by subtracting id from 100, or by adding us and sy.

• 100% utilized may be fine; it can be the price of doing business.

• When a Solaris system hits 100% CPU utilization, there is no sudden dip in performance; the performance degradation is gradual. Because of this, CPU saturation is often a better indicator of performance issues than is CPU utilization.

• The measurement interval is important: 5% utilization sounds close to idle; however, for a 60-minute sample it may mean 100% utilization for 3 minutes and 0% utilization for 57 minutes. It is useful to have both short- and long-duration measurements.

• A server running at 10% CPU utilization sounds as if 90% of the CPU is available for "free," that is, usable without affecting the existing application. This isn't quite true. When an application on a server with 10% CPU utilization wants the CPUs, they will almost always be available immediately. On a server with 100% CPU utilization, the same application will find the CPUs already busy, and will need to preempt the currently running thread or wait to be scheduled. This can increase latency.

Page 36: Performance Concurrency Troubleshooting   Final

CPU Saturation

• The kthr:r metric from vmstat is useful as a measure for CPU saturation. However, since this is the total across all the CPU run queues, divide kthr:r by the CPU count for a value that can be compared with other servers.

• Any sustained non-zero value is likely to degrade performance. The performance degradation is gradual (unlike the case with memory saturation, where it is rapid).

• Interval time is still quite important. It is possible to see CPU saturation (kthr:r) while a CPU is idle (cpu:id). You may find that the run queue is quite long for a short period of time, followed by idle time. Averaging over the interval gives both a non-zero run queue length and idle time.

Page 37: Performance Concurrency Troubleshooting   Final

Solaris Performance Tools

Tool – Uses – Description

vmstat – kstat – For an initial view of overall CPU behavior

psrinfo – kstat – For physical CPU properties

uptime – getloadavg() – For the load averages, to gauge recent CPU activity

sar – kstat, sadc – For overall CPU behavior and dispatcher queue statistics; sar also allows historical data collection

mpstat – kstat – For per-CPU statistics

prstat – procfs – To identify process CPU consumption

dtrace – DTrace – For detailed analysis of CPU activity, including scheduling events and dispatcher analysis

Page 38: Performance Concurrency Troubleshooting   Final

uptime Command

Prints the up time with CPU load averages; they represent both utilization and saturation of the CPUs.

• The numbers are the 1-, 5-, and 15-minute load averages.

• The load average is often approximated as the average number of runnable and running threads, which is a reasonable description.

• A value equal to your CPU count usually means 100% utilization; less than your CPU count is proportionally less than 100% utilization; and greater than your CPU count is a measure of saturation.

• A consistent load average higher than your CPU count may cause degraded performance. Solaris handles CPU saturation very well, so load averages should not be used for anything more than an initial approximation of CPU load.

Page 39: Performance Concurrency Troubleshooting   Final

sar - The system activity reporter

Provides live statistics or can be activated to record historical CPU statistics. It prints the user (%usr), system (%sys), wait I/O (%wio), and idle (%idle) times.

It identifies long-term patterns that may be missed when taking a quick look at the system. Historical data also provides a reference for what is "normal" for your system.

The following example shows the default output of sar, which is also the -u option to sar. An interval of 1 second and a count of 5 were specified.

Page 40: Performance Concurrency Troubleshooting   Final

sar -q - Statistics on the run queues

runq-sz (run queue size). Equivalent to the kthr:r field from vmstat; can be used as a measure of CPU saturation.

swpq-sz (swapped-out queue size). Number of swapped-out threads. Swapping out threads is a last resort for relieving memory pressure, so this field will be zero unless there was a dire memory shortage.

%runocc (run queue occupancy). Helps prevent a danger when intervals are used: short bursts of activity can be averaged down to unnoticeable values. The run queue occupancy can identify whether short bursts of run queue activity occurred.

%swpocc (swapped-out occupancy). Percentage of time there were swapped-out threads. If one thread of a process is swapped out, all its other threads are too.

Page 41: Performance Concurrency Troubleshooting   Final

About the Individual Processors

mpstat columns include: syscl (system calls), csw (context switches), icsw (involuntary context switches), migr (migrations of threads between processors), intr (interrupts), ithr (interrupts as threads), smtx (kernel mutexes), and srw (kernel reader/writer mutexes).

The psrinfo -v command determines the number of processors in the system and their speed. In Solaris 10, -vp prints additional information.

The mpstat command summarizes the utilization statistics for each CPU. Following is an example of a four-CPU machine, sampled every 1 second.

Is my system performing well?

Page 42: Performance Concurrency Troubleshooting   Final

What are sampling and Clock tick woes?

• While most counters you see in Solaris are highly accurate, sampling issues remain in a few minor places. In particular, the run queue length as seen from vmstat (kthr:r) is based on a sample that is taken every second. In one example, a problem was caused by a program that deliberately created numerous short-lived threads every second, such that the one-second run queue sample usually missed the activity.

• The runq-sz from sar -q suffers from the same problem, as does %runocc (which, for short-interval measurements, defeats the purpose of %runocc).

• These are all minor issues, and a valid workaround is to use DTrace, with which statistics can be created at any accuracy desired.

Page 43: Performance Concurrency Troubleshooting   Final

Who Is Using the CPU?

The default output from the prstat command shows one line per process, with the CPU utilization value sampled before the prstat command was executed.

The system load average indicates the demand and queuing for CPU resources averaged over 1-, 5-, and 15-minute periods; if it exceeds the number of CPUs, the system is overloaded.

Page 44: Performance Concurrency Troubleshooting   Final

How is the CPU being consumed?

• Use the options -m (show microstates) & -L (show per-thread) to observe per-thread microstates.
• Microstates represent a time-based summary broken into percentages for each thread.
• USR through LAT sum to 100% of the time spent by each thread during the prstat sample.
• USR (user time) and SYS (system time): time the thread spent running on the CPU.
• LAT (latency): the amount of time the thread spent waiting for a CPU. A non-zero number means there was some queuing/saturation for CPU resources.
• SLP: the time the thread spent blocked waiting for events such as disk I/O.
• TFL & DTL: determine if, and how much, the thread is waiting for memory paging.
• TRP: the time spent on software traps.

If each thread is waiting for CPU about 80% of the time, CPU resources are constrained.
If each thread is waiting for CPU about 0.2% of the time, CPU resources are not constrained.

Page 45: Performance Concurrency Troubleshooting   Final

How are threads inside the process performing?

The example shows that thread number two in the target process is using the most CPU and spending 83% of its time waiting for CPU. We can look further at thread number two with the pstack <pid>/<lwpid> command; pstack <pid> alone shows all threads.

Take a Java thread dump and identify the thread with native thread id = 2. That is the one. This way we can relate the Java code to the native system call or library method it invoked on the system.

Page 46: Performance Concurrency Troubleshooting   Final

Process Stack on a Java Virtual Machine: pstack

• Use the "C++ stack unmangler" with Java virtual machine (JVM) targets to see the native C-stack frames behind the Java function calls.

Page 47: Performance Concurrency Troubleshooting   Final

Tracing Processes truss

truss traces system calls made on behalf of a process. It includes the user LWP (thread) number, system call name, arguments, and return codes for each system call.

The truss -c option counts system calls.

Page 48: Performance Concurrency Troubleshooting   Final

Why Memory Saturation Degrades Performance More Rapidly than CPU Saturation

• Memory saturation may cause rapid degradation in performance. To relieve saturation, the OS resorts to page-in/page-out and swapping, which are themselves heavy tasks; with processes competing for memory, a vicious cycle (thrashing) may occur.

• The available memory on a server may be artificially constrained, either through pre-allocation of memory or through the use of a garbage collection mechanism that doesn’t free up memory until some threshold is reached.

Page 49: Performance Concurrency Troubleshooting   Final

Thread Dumps

• What exactly is a "thread dump"?

– A thread dump gives you information on what each thread in the VM is doing at a given point in time.

• If an application seems stuck, or is running out of resources, a thread dump will reveal the state of the server. Java's thread dumps are a vital tool for server debugging, in scenarios like:

– PERFORMANCE-RELATED ISSUES
– DEADLOCK (SYSTEM LOCKS UP)
– TIMEOUT ISSUES
– SYSTEM STOPS PROCESSING TRAFFIC

Page 50: Performance Concurrency Troubleshooting   Final

Thread dumps in Redknee Applications

• Java thread dumps are obtained by:

– Sending kill -3 <pid> on Unix: see the thread dump in the ctl logs

– Pressing Ctrl + Break on Windows: see the thread dump on the xbuild console

– Running $JAVA_HOME/bin/jstack <pid>: see the thread dump on the shell console

• Java thread dumps list all of the threads in an application.

• Threads are output in the order they were created, with the newest thread at the top.

• Threads should be named with a useful name describing what they do or what they are responsible for (Open Tickets)
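Besides kill -3 and jstack, a thread dump can also be taken from inside the JVM with the standard ThreadMXBean API. A minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpDemo {
    public static void main(String[] args) {
        // Equivalent in spirit to `jstack <pid>` or `kill -3`, but from
        // within the JVM: dump every live thread with its state and stack.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            System.out.printf("\"%s\" id=%d state=%s%n",
                    info.getThreadName(), info.getThreadId(),
                    info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

This is handy for building self-diagnosing servers that log a dump when a watchdog detects a stall.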

Page 51: Performance Concurrency Troubleshooting   Final

Common Threads in Redknee

• "Idle" – CORBA threads to handle incoming requests that are currently not doing any work

• "RMI TCP Connection(<port>)-<IP>" – Outbound connection over RMI to a specific host and port

• "FileLogger" – Framework thread for logging

• "JavaIDL Reader for <host>:<port>" – CORBA thread reading requests from a server

• "TP-Processor8" – Tomcat web thread

• "Thread-<#>" – Thread that has not been named (BAD)

• "ChannelHome ForwardingThread" – Thread used to cluster transactions over to a peer

– One of these threads per Home that is clustered (DB table)

• "Worker#1" – Worker threads doing work

Page 52: Performance Concurrency Troubleshooting   Final

Thread Dump May Give You Clues

C:\learn\classes>java Test

Full thread dump Java HotSpot(TM) Client VM (1.4.2_04-b05 mixed mode):

"Signal Dispatcher" daemon prio=10 tid=0x0091db28 nid=0x744 waiting on condition [0..0]

"Finalizer" daemon prio=9 tid=0x0091ab78 nid=0x73c in Object.wait() [1816f000..1816fd88]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(Unknown Source)
    - locked <0x10010498> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(Unknown Source)
    at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

"Reference Handler" daemon prio=10 tid=0x009196f0 nid=0x738 in Object.wait() [1812f000..1812fd88]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x10010388> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Unknown Source)
    at java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)
    - locked <0x10010388> (a java.lang.ref.Reference$Lock)

"main" prio=5 tid=0x00234998 nid=0x4c8 runnable [6f000..6fc3c]
    at Test.findNewLine(Test.java:13)
    at Test.<init>(Test.java:4)
    at Test.main(Test.java:20)

"VM Thread" prio=5 tid=0x00959370 nid=0x6e8 runnable

"VM Periodic Task Thread" prio=10 tid=0x0023e718 nid=0x74c waiting on condition

"Suspend Checker Thread" prio=10 tid=0x0091cd58 nid=0x740 runnable

Page 53: Performance Concurrency Troubleshooting   Final

What is there in the Thread Dump?

• In this case we can see that, at the time we took the thread dump, there were seven threads: – Signal Dispatcher

– Finalizer

– Reference Handler

– main

– VM Thread

– VM Periodic Task Thread

– Suspend Checker Thread

• Each thread name is followed by whether the thread is a daemon thread or not.

• Then comes prio, the priority of the thread [ex: prio=5].

• tid and nid are the Java thread id and the native thread id.

• Then follows the state of the thread. It is one of:

– Runnable [marked as R in some VMs]: the thread is either running currently or is ready to run the next time the OS thread scheduler schedules it.

– Suspended [marked as S in some VMs]: the thread is not in a runnable state (for example, suspended via the debugger interface).

– Object.wait() [marked as CW in some VMs]: the thread is waiting on an object using Object.wait().

– Waiting for monitor entry [marked as MW in some VMs]: the thread is waiting to enter a synchronized block.

• What follows the thread description line is a regular stack trace.
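The dump states above map onto java.lang.Thread.State, which a program can inspect directly. A small sketch: WAITING corresponds to Object.wait() (CW in some VMs), and BLOCKED corresponds to "waiting for monitor entry" (MW):

```java
public class ThreadStates {
    public static void main(String[] args) throws Exception {
        final Object lock = new Object();

        // This thread parks in Object.wait() -> state WAITING (CW in a dump)
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                try { lock.wait(); } catch (InterruptedException e) { /* exit */ }
            }
        }, "waiter");
        waiter.start();
        while (waiter.getState() != Thread.State.WAITING) Thread.sleep(10);
        System.out.println("waiter: " + waiter.getState());

        final Object monitor = new Object();
        Thread blocked;
        synchronized (monitor) {            // main holds the monitor...
            blocked = new Thread(() -> {
                synchronized (monitor) { }  // ...so this thread blocks on entry (MW)
            }, "blocked");
            blocked.start();
            while (blocked.getState() != Thread.State.BLOCKED) Thread.sleep(10);
            System.out.println("blocked: " + blocked.getState());
        }
        blocked.join();
        synchronized (lock) { lock.notify(); } // release the waiter
        waiter.join();
    }
}
```

Taking a jstack dump while this runs would show "waiter" in Object.wait() and "blocked" waiting for monitor entry.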

Page 54: Performance Concurrency Troubleshooting   Final

Threads in a Dead-Lock

• A set of threads is said to be in a deadlock when there is a cyclic wait condition, i.e. each thread in the deadlock is waiting on a resource locked by some other thread in the set of deadlocked threads. Newer JDKs detect them automatically in the thread dump:

Found one Java-level deadlock:
=============================
"Thread-1":
  waiting to lock monitor 0x0091a27c (object 0x140fa790, a java.lang.Class),
  which is held by "Thread-0"
"Thread-0":
  waiting to lock monitor 0x0091a25c (object 0x14026800, a java.lang.Class),
  which is held by "Thread-1"

Java stack information for the threads listed above:
===================================================
"Thread-1":
        at Deadlock$2.run(Deadlock.java:48)
        - waiting to lock <0x140fa790> (a java.lang.Class)
        - locked <0x14026800> (a java.lang.Class)
"Thread-0":
        at Deadlock$1.run(Deadlock.java:33)
        - waiting to lock <0x14026800> (a java.lang.Class)
        - locked <0x140fa790> (a java.lang.Class)

Found 1 deadlock.
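The cyclic wait described above can be reproduced and detected programmatically via ThreadMXBean, the same facility the JVM's dump-time detector uses. A sketch (class and thread names are illustrative; the latch just guarantees both threads hold their first lock before trying the second, so the deadlock forms deterministically):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockDemo {
    static final Object lockA = new Object();
    static final Object lockB = new Object();

    public static void main(String[] args) throws Exception {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        spawn("Thread-A", lockA, lockB, bothHoldFirst); // locks A, then wants B
        spawn("Thread-B", lockB, lockA, bothHoldFirst); // locks B, then wants A

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = null;
        for (int i = 0; i < 200 && ids == null; i++) {
            Thread.sleep(50);
            ids = mx.findDeadlockedThreads(); // null until the cycle exists
        }
        System.out.println("Deadlocked threads: " + (ids == null ? 0 : ids.length));
    }

    static void spawn(String name, Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try { latch.await(); } catch (InterruptedException e) { return; }
                synchronized (second) { /* never reached */ }
            }
        }, name);
        t.setDaemon(true); // daemons, so the JVM can still exit after detection
        t.start();
    }
}
```

Running jstack against this process would print exactly the kind of "Found one Java-level deadlock" report shown above.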

Page 55: Performance Concurrency Troubleshooting   Final

Memory Performance & Troubleshooting

(Part 3)

Page 56: Performance Concurrency Troubleshooting   Final

Memory

• Memory includes

physical memory (RAM)

swap space

• Swap space is a portion of disk storage acting as memory.

• Memory is a more complicated subject than CPU.

• Memory saturation triggers CPU saturation (page faults / GC)

Page 57: Performance Concurrency Troubleshooting   Final

Memory Utilization and Saturation

• To sustain a higher throughput, an application spawns more threads and holds more request data

• Each thread occupies memory for the data it operates on and for its own stack.

• When the memory demanded by a process can no longer be met from available memory, saturation occurs.

• Sudden increases in utilization without accompanying increases in throughput can also be used to detect degraded performance modes caused by software ‘aging’ issues, such as memory leaks

Page 58: Performance Concurrency Troubleshooting   Final

VMSTAT – Glimpse of Memory Utilization

Counter Description

swap Available swap space in Kbytes.

free Combined size of the cache list and free list.

re Page reclaims—The number of pages reclaimed from the cache list.

mf Minor faults—The number of pages attached to an address space.

fr Page-frees—Kilobytes that have been freed

pi and po Kilobytes Paged in and Paged out respectively

de Anticipated short-term memory shortfall, in kilobytes, to free ahead of demand.

sr The number of pages scanned by the page scanner per second.

If the scan rate (sr) is continuously over 200 pages per second, there is a memory shortage on the system.

Page 59: Performance Concurrency Troubleshooting   Final

Memory Consumption Model

Page 60: Performance Concurrency Troubleshooting   Final

Relieving Memory Pressure

After free memory is exhausted, pages are first reclaimed from the cache list (file system, I/O and other caches). Next the swapper swaps out entire threads, seriously degrading the performance of swapped-out applications. The page scanner selects pages to free, and is characterized by the scan rate (sr) from vmstat. Both use some form of the Not Recently Used algorithm.

The swapper and the page scanner are only used when appropriate. Since Solaris 8, the cyclic page cache, which maintains lists for Least Recently Used selection, is preferred.

Page 61: Performance Concurrency Troubleshooting   Final

Heap and Non-Heap Memory

• Heap Memory – Storage for Java objects

-Xmx<size> & -Xms<size>

• Non-Heap Memory – Per-class structures such as the runtime constant pool, field and method data, code for methods and constructors, as well as interned Strings.

Stores loaded classes and other meta-data

JVM code itself, JVM internal structures, loaded profiler agent code and data, etc.

-XX:MaxPermSize=<size>

• Other space the system/OS takes for the process

Thread stacks (-Xss & -Xoss)

System & native space
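The heap portion bounded by -Xms/-Xmx can be observed from inside the JVM itself. A minimal sketch using the standard Runtime API (the printed values will of course vary with the flags the process was started with):

```java
public class HeapSizes {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        // maxMemory() is capped by -Xmx; totalMemory() is the currently
        // committed heap, which starts near -Xms and may grow toward -Xmx.
        System.out.println("Max heap (-Xmx bound): " + rt.maxMemory() / mb + " MB");
        System.out.println("Committed heap: " + rt.totalMemory() / mb + " MB");
        System.out.println("Free in committed: " + rt.freeMemory() / mb + " MB");
        long used = (rt.totalMemory() - rt.freeMemory()) / mb;
        System.out.println("Used: " + used + " MB");
    }
}
```

Note this covers only the Java object heap; the non-heap and native areas listed above are not visible through Runtime.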

Page 62: Performance Concurrency Troubleshooting   Final

What is Garbage Collection?

Reclaims memory from inaccessible objects

Page 63: Performance Concurrency Troubleshooting   Final

Stack Overflow or Out of Memory

• If you see OutOfMemoryError: unable to create native thread

– This means your application is falling short of native memory space (C space)

– Either there is insufficient memory to allocate a stack and program counter for the new thread

– Or the application has crossed the JVM’s memory limit (around 3.2 GB in a 32-bit environment)

– The JVM/application hangs with this error; we need to restart.

• See if you can reduce the active threads that ate away the system’s memory

• Or decrease the stack size to lower memory use per thread

• If you can’t bring memory consumption down, you need more system memory

• If you see StackOverflowError

– It means the thread that threw this error fell short of stack memory space

– A thread stacks the states of the methods it invokes on to the stack memory

– For the number of nested invocations the thread makes, the memory is insufficient

– Only the thread dies with this error; the application doesn’t hang.

• See if you can bring down the number of nested invocations by the thread

• Or else, increase the stack size with the VM option -Xss (by default it is 1m)
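The "only the thread dies" point is easy to demonstrate: StackOverflowError can be caught, and the application keeps running. A sketch (the depth reached depends on the -Xss stack size, so no fixed number is assumed):

```java
public class Overflow {
    static int depth = 0;

    // Unbounded recursion: each call adds a frame until the thread's
    // stack (sized by -Xss) is exhausted.
    static void recurse() {
        depth++;
        recurse();
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            // Only this call chain died; the stack has unwound and we continue.
            System.out.println("StackOverflowError caught after " + depth + " frames");
        }
        System.out.println("application still running");
    }
}
```

Running with a smaller -Xss (e.g. java -Xss256k Overflow) shows the frame count drop accordingly.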

Page 64: Performance Concurrency Troubleshooting   Final

Pros and Cons of Garbage Collection?

Disadvantages

• Unpredictable application pauses

• Increased CPU/memory utilization

• Brutally complex

Advantages

• Increased reliability

• Easier to write complex apps

• No memory leaks or invalid pointers

Page 65: Performance Concurrency Troubleshooting   Final

GC Logging

• Java garbage collection activity may be recorded in a log file. VM options:

– -verbose:gc (enable GC logging; output goes to stdout)

– -Xloggc:<file> (GC logging to a file)

– -XX:+PrintGCDetails (detailed GC records)

– -XX:+PrintGCDateStamps (absolute instead of relative timestamps)

– Note: from relative timestamps in a GC log we can find absolute times either by tracing forward from application/GC start or backwards from application/GC stop

• Asynchronous garbage collection occurs whenever available memory is low.

• System.gc() does not force a synchronous garbage collection; it just gives a hint to the VM. VM option:

– -XX:+DisableExplicitGC (disable explicit GC)

Page 66: Performance Concurrency Troubleshooting   Final

What to look for in GC Logs?

• Important information from GC logs – The size of the heap after garbage collection

– The time taken to run the garbage collection

– The number of bytes reclaimed by garbage collection

• Heap Size after GC may give us a good idea of memory requirement. – 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed)

• The other two help us assess the cost of GC to your application.

• All of them together help us tune GC.
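The three numbers the slide highlights can be pulled out of a GC log record mechanically. A sketch that parses the sample line from this deck with a regular expression (the pattern covers only this simple record format, not every -XX:+PrintGCDetails variant):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcRecord {
    public static void main(String[] args) {
        // Sample record from the slides: before->after(total), pause time
        String line = "36690K->35325K(458752K), 4.3713348 secs";
        Pattern p = Pattern.compile("(\\d+)K->(\\d+)K\\((\\d+)K\\), ([\\d.]+) secs");
        Matcher m = p.matcher(line);
        if (m.find()) {
            long before = Long.parseLong(m.group(1)); // heap before GC
            long after  = Long.parseLong(m.group(2)); // heap after GC
            long total  = Long.parseLong(m.group(3)); // committed heap
            double secs = Double.parseDouble(m.group(4));
            System.out.println("heap after GC: " + after + "K of " + total + "K");
            System.out.println("reclaimed: " + (before - after) + "K in " + secs + " secs");
        }
    }
}
```

For the record above this reports 1365K reclaimed, matching the "(1365K reclaimed)" annotation on the slide.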

Page 67: Performance Concurrency Troubleshooting   Final

How to Calculate Impact of GC on your Application?

• Run test (60sec, Collect GC logs) – 36690K->35325K(458752K), 4.3713348 secs – (1365K reclaimed)

– 42406K->41504K(458752K), 4.4044878 secs – (902K reclaimed)

– 48617K->47874K(458752K), 4.5652409 secs – (770K reclaimed)

• Measure – Out of 60 sec, GC ran for 17.2 sec, i.e. 29% of the time.

– Considering relative CPU utilization, the cost of GC may be even higher.

– 3037K of memory was recycled in 60 secs, i.e. 51831 bytes/second

• Analyze – 29% of time consumed by GC is too high (should be between 5-15%)

– Is 51831 bytes/sec of memory recycled justifiable for the operation?

– For average 50-byte objects, it churned around 1036 objects/sec
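The arithmetic in this slide can be scripted. A sketch that sums only the three records shown above (they total about 13.3 s, roughly 22% of the window; the slide's 17.2 s / 29% figure presumably includes collections not listed on the previous slide):

```java
import java.util.Locale;

public class GcOverhead {
    public static void main(String[] args) {
        double windowSecs = 60.0;
        // The three GC records shown in the slides
        double[] pauses = {4.3713348, 4.4044878, 4.5652409};
        long[] reclaimedK = {1365, 902, 770};

        double gcTime = 0;
        for (double p : pauses) gcTime += p;
        long reclaimed = 0;
        for (long k : reclaimedK) reclaimed += k;

        System.out.printf(Locale.ROOT, "GC time: %.1f sec (%.0f%% of window)%n",
                gcTime, 100.0 * gcTime / windowSecs);
        System.out.printf(Locale.ROOT, "reclaimed: %dK, i.e. %d bytes/sec%n",
                reclaimed, reclaimed * 1024 / (long) windowSecs);
    }
}
```

The reclaim rate comes out at 51831 bytes/second, matching the slide; anything persistently above the 5-15% overhead band is a tuning candidate.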

Page 68: Performance Concurrency Troubleshooting   Final

Heap Ranges – Xms to Xmx

• The heap range can be defined – VM args -Xmx & -Xms define the upper & lower bounds of the heap size

• What causes the VM to expand the heap? – Expansion of the heap is CPU intensive and causes a fragmented heap

– The VM tries GC, defragmentation, compaction, etc. to free up memory.

– If still unable to free up the required memory, the VM decides to expand the heap

– The VM may not wait till the brink; it keeps some free space for temporary objects

– By default, Sun tries to keep the proportion of free space to living objects at each garbage collection within the 40%-70% range.

• If less than 40% of the heap is free after GC, expand the heap

• If more than 70% of the heap is free after GC, contract the heap

– VM args that customize the default ratio

• -XX:MinHeapFreeRatio

• -XX:MaxHeapFreeRatio

Page 69: Performance Concurrency Troubleshooting   Final

Gross Heap Tuning

• Consequences of large heap sizes – GC cycles occur less frequently, but each sweep takes longer – Long GC cycles may induce perceptible pauses in the system. – If the heap grows to a size larger than available RAM, paging/swapping may occur.

• Consequences of small heap sizes – GC runs too frequently, with less recovery in each cycle – The cost of GC becomes higher – Since GC has to sweep less space each time, pauses are imperceptible.

• Max versus min heap sizes – Contraction & expansion of the heap is costly and should be worth the cause. – Frequent contraction and expansion also leads to a fragmented heap. – Keep Xmx=Xms for a transaction-oriented system that frequently peaks. – Keep Xms<Xmx if the application infrequently operates at upper capacity.

Page 70: Performance Concurrency Troubleshooting   Final

We Just Learnt Gross Heap Tuning

There might just be need for Fine Tuning

• We can fine-tune the GC considering the intricacies of the GC algorithm & heap structure. We will learn this shortly.

• Gross heap tuning is quite simple yet effective & empirically established.

• Gross techniques are fairly effective irrespective of the variables, and most importantly, we can always afford to apply them.

Page 71: Performance Concurrency Troubleshooting   Final

What is the advanced heap made of? The one that works with Generational Garbage Collector in JVM

• The HEAP is made up of – Old Space or Tenure Space

• Objects that grow old in the young space are promoted here.

– Young Space or Eden Space

• Young objects are held here.

– Scratch Space

• Working space for algorithms

– New Space

• <Young Space> + <Scratch Space>

Page 72: Performance Concurrency Troubleshooting   Final

jmap -heap

Page 73: Performance Concurrency Troubleshooting   Final

Generational Garbage Collector Modern Heap

Page 74: Performance Concurrency Troubleshooting   Final

Fine Tuning the Heap

Page 75: Performance Concurrency Troubleshooting   Final

Are there better GC implementations to choose? JDK 1.4.x Options

Young Generation

  Low Pause Collectors:
    1 CPU: Serial Copying Collector (default)
    2+ CPUs: Parallel Copying Collector (-XX:+UseParNewGC)

  Throughput Collectors:
    1 CPU: Copying Collector (default)
    2+ CPUs: Parallel Scavenge Collector
      (-XX:+UseParallelGC, -XX:+UseAdaptiveSizePolicy, -XX:+AggressiveHeap)

  Heap Sizes: -XX:NewSize, -XX:MaxNewSize, -XX:SurvivorRatio

Old Generation

  Low Pause Collectors: Mark-Compact Collector (default),
    or Concurrent Collector (-XX:+UseConcMarkSweepGC)

  Throughput Collectors: Mark-Compact Collector (default)

  Heap Sizes: -Xms, -Xmx

Permanent Generation

  Collector: Mark-Compact Collector (default)
    Can be turned off with -Xnoclassgc (use with care)

  Heap Sizes: -XX:PermSize, -XX:MaxPermSize

Page 76: Performance Concurrency Troubleshooting   Final

jstat

Reference http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jstat.html

Page 77: Performance Concurrency Troubleshooting   Final

Heap Dump (Java) – Snapshot of the memory at a point in time

The VM usually invokes a GC before dumping the heap

It contains

• Objects (Class, fields, primitive values and references)

• Classes (Classloader, name, super class, static fields)

• GC Roots (Objects defined to be reachable by the JVM)

• Thread Stacks (at the time of the dump, with per-frame information about local objects)

Does not Contain

• Allocation information

Who created the objects, and where were they created?

• Live & Stale

Used memory consists of both live and dead objects.

Tools may attempt to remove objects unreachable from the GC roots when loading the dump.

Page 78: Performance Concurrency Troubleshooting   Final

Heap Dump (Java) How to take it?

• On demand

VM arg (JDK > 1.4.2_12) # -XX:+HeapDumpOnCtrlBreak

Tools (JDK 6) # JConsole, VisualVM, MAT

jmap -d64 -dump:file=<file-ascii-hdump> <pid>

jmap -d64 -dump:format=b,file=<file-bin-hdump> <pid>

• Automatic on crash

VM arg # -XX:+HeapDumpOnOutOfMemoryError

• Postmortem after crash, from a core dump

jmap -d64 -dump:format=b,file=<file> <java-bin> <core-file>

Page 79: Performance Concurrency Troubleshooting   Final

Heap Dump (Java) Shallow vs Retained Heap

Shallow heap • Memory held by an object’s primitive fields and reference variables

• Excludes the referenced objects themselves; only the references (32/64 bits) are counted

Retained heap • An object’s shallow size plus the shallow sizes of the objects that are accessible, directly or indirectly, only from this object.

• The memory that is freed by the GC when this object is collected.

Garbage Collection Roots • A garbage collection root is an object accessible from outside the heap.

• GC root objects will not be collected by the garbage collector at the time of measuring: locals (Java/native), threads, system classes, JNI references, monitors, finalizers.

Page 80: Performance Concurrency Troubleshooting   Final

Shallow vs. Retained Heap

http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html

In general, the retained size of a GC root is an integral measure that helps in understanding memory consumption by object graphs.

Page 81: Performance Concurrency Troubleshooting   Final

Dominator Tree (Object Dependencies)

• Identifies chunks of retained memory & what keeps them alive

• In the dominator tree each object is the immediate dominator of its children, so dependencies between the objects are easily identified.

• The edges in the dominator tree do not directly correspond to object references from the object graph. The same object may actually be in the retained set of multiple roots.

• http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fconcepts%2Fshallowretainedheap.html

Page 82: Performance Concurrency Troubleshooting   Final

OQL (Object Query Language) – Heap Dump not just for Troubleshooting

• OQL is an Object Query Language that lets us query the heap dump in SQL fashion.

• This enables us not only to analyze the heap after problems but to proactively search for patterns. Example: a select to see if there are more than two objects for Boolean; ideally the two singletons TRUE and FALSE (like Enums) are sufficient –

select toHtml(a) + " = " + a.value from java.lang.Boolean a

where objectid(a.clazz.statics.TRUE) != objectid(a)

&& objectid(a.clazz.statics.FALSE) != objectid(a)

(Runs in VisualVM)

• Visual VM and MAT, both support nice interfaces for OQL http://docs.oracle.com/javase/6/docs/technotes/tools/share/jhat.html

http://help.eclipse.org/indigo/index.jsp?topic=%2Forg.eclipse.mat.ui.help%2Fwelcome.html

Page 83: Performance Concurrency Troubleshooting   Final

References

• Thread Dump Analyzer (thread dumps)

(http://java.net/projects/tda/)

• GC Viewer (GC logs)

(http://www.tagtraum.com/gcviewer.html)

• Eclipse Memory Analyzer Tool (heap dumps, OQL) (http://help.eclipse.org/indigo/topic/org.eclipse.mat.ui.help/welcome.html)

• VisualVM / JConsole / JMX – (inspect live applications, snapshots, dumps, OQL)

Bundled with the Java SDK