Analyzing Server Crashes Hangs

Upload: victoria-wilson

Post on 12-Jan-2016


Agenda
- Crashes versus hangs
- All about server crashes
- All about server hangs
- Analyzing thread dumps
- Analysis of thread dump samples
- Resources

Crash Versus Hang
- Distinction between a crash and a hang.
- A crash means that the WebLogic Server Java process no longer exists.
- A hang means that the WebLogic Server Java process still exists but is not responding.
- Customers tend to use these terms interchangeably.

All about crashes
- Determine all potential sources of native code used by WebLogic Server:
  - nativeIO
  - Type 2 JDBC driver
  - Native libraries accessed with JNI calls
  - SSL native libraries
  - The JVM itself. Most of the time the crash comes from the JVM.
- Sometimes the JVM produces a small log file (hs_err_pid*.log) that may contain useful information about which library the crash originated from.

Debugging with hs_err_pid.log
The hs_err_pid.log file gives the stack trace of the current thread. Depending on what the current thread shows, the issue can be debugged further:
- If the current thread shows a stack from nativeIO (performance pack):
  Workaround: Disable nativeIO.
  Fix: File a bug with CCE.
- If the current thread shows a stack from a native call in a type 2 driver:
  Workaround: Switch to a pure Java type 4 driver instead of the type 2 driver.
  Fix: Work with the vendor of the database driver.

Debugging with hs_err_pid.log
- If the current thread shows a stack from a JNI call in application code:
  Fix: Tell the customer that it is an application bug and must be fixed in their code.
- If the current thread shows a stack from native code in WebLogic SSL:
  Workaround: Use the pure Java version of SSL instead of the native version.
- If the current thread indicates a crash in compiled/optimized code:
  Workaround: Turn off compilation, and hence optimization, with -Xint.
  (Java code -> bytecode -> compilation -> optimization (HotSpot))
  Fix: Work with JVM vendor support.
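The triage described so far can be sketched as a small lookup. This is only an illustration: the function name suggest_workaround and the substring patterns (muxer, wlntio, oci, ssl, jvm) are assumptions, not names taken from the slides, and must be adjusted to the native libraries actually present in the installation being debugged.

```python
# Hypothetical substring patterns for native library file names.
# They are illustrative assumptions; adjust them to the real installation.
WORKAROUNDS = [
    ("muxer", "nativeIO: disable nativeIO; file a bug with CCE"),
    ("wlntio", "nativeIO: disable nativeIO; file a bug with CCE"),
    ("oci", "type 2 driver: switch to a pure Java type 4 driver; "
            "work with the database driver vendor"),
    ("ssl", "native SSL: use the pure Java version of SSL"),
    ("jvm", "JVM: work with JVM vendor support; "
            "if the frame is compiled code, try -Xint"),
]


def suggest_workaround(library_path):
    """Map the library in the crashing thread's top frame to a next step."""
    # Keep only the file name so directory names do not cause false matches.
    name = library_path.replace("\\", "/").rsplit("/", 1)[-1].lower()
    for pattern, advice in WORKAROUNDS:
        if pattern in name:
            return advice
    return ("unrecognized library: possibly JNI code from the application; "
            "the application owner must fix it")


if __name__ == "__main__":
    print(suggest_workaround("libjvm.so"))
```

The library name has to be read from the current-thread section of the hs_err_pid log by hand; the sketch does not parse the log because its layout differs between JVM versions.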

Debugging with hs_err_pid.log
- If the current thread indicates a crash in the threading library (applicable to Solaris):
  Workaround: Switch to the alternate thread library.
  The default thread library on Solaris 8 and below is /usr/lib/libthread.so.1.
  This can be switched to /usr/lib/lwp/libthread.so.1 (the default from Solaris 9).
  Add /usr/lib/lwp to your LD_LIBRARY_PATH and add -XX:+OverrideDefaultLibthread.

Crashes without core
Most crashes cause a core dump. However, sometimes the core file is not available. Possible reasons:
- Running out of disk space or quota to write the file.
- Not having the correct access permissions to create or write a file in the directory.
- The prior presence of a core dump of the same name that is read-only or write-protected.
What to check:
- Check "ulimit -c" (set it to unlimited).
- Use coreadm on Solaris ($ coreadm).
- On Solaris, also check the following parameter in the /etc/system file, which can be used to disable core files: set sys:coredumpsize=0
- On Linux, core dumps are turned off by default on all systems. In Red Hat Advanced Server 2.1 the setting should be under /etc/security, in a self-explanatory file called limits.conf; look for the word "core". If it is set to "0", core dumps are disabled.
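The "ulimit -c" check can also be made from a script. The following is a minimal sketch for Unix systems using Python's standard resource module; the function name core_dump_status is made up for this example. The limit is inherited from the parent process, so the script only tells you something if it runs in the same environment that starts the server.

```python
import resource


def core_dump_status(soft_limit, infinity=resource.RLIM_INFINITY):
    """Describe what a soft core-file size limit means for core dumps."""
    if soft_limit == 0:
        return "disabled"    # same situation as "ulimit -c" printing 0
    if soft_limit == infinity:
        return "unlimited"   # same situation as "ulimit -c unlimited"
    return "limited"         # a core larger than the limit is truncated


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    print("core dumps:", core_dump_status(soft), "| hard limit:", hard)
```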

Crashes with core
- The core file is available.
- A core file is a memory map of the running process; it saves the state of the application at the time of its termination.
- A core file depends on the exact shared libraries and OS, so it *must* be analyzed on the customer's machine.

Crashes with core
If a debugger is not available (Solaris 8, 9), use pstack and pmap:
- /usr/proc/bin/pstack core > pstack.txt
- /usr/proc/bin/pmap core > pmap.txt
Analyze pstack.txt and pmap.txt to understand which library caused the crash.

Crashes with core
Gather information from the core.
- Use a debugger. The debugger differs between operating systems, but the methodology is the same.
- Check what the current thread is.
- More information is available at http://supportlab.bea.com:8000/spWiki/attach?page=SystemCorePattern%2FCorePattern.html

Crash on Windows

- Get the Windows debugging tools from http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx
- Start up WebLogic, cd into and run (ignore messages saying that _NT_SYMBOL_PATH is not set).
- Wait until the process dies. When it does, a directory will be created with dump and log files.
- Open a case with Sun Support and send the dmp file. (If we have the symbols, we can run the debugger against the dmp file by opening it in the Windows debugger GUI.)

All about hangs
- The process still exists.
- The process is not responding.
- No response is sent to clients.
- The java weblogic.Admin PING command doesn't return a normal response.
- Take multiple thread dumps (kill -3 <pid> on Unix platforms, Ctrl+Break on Windows).
- On Linux, use ps -efHl | grep 'java' to identify the root pid.

All about hangs
- Thread dumps for the Sun JVM are sent to stdout.
- If you are using nohup, thread dumps are directed to nohup.out.
- For beasvc, use -log:"d:\bea\user_projects\domains\myWLSdomain\myWLSserver-stdout.txt"
- Use beasvc -dump -svcname:service-name

You can also use java weblogic.Admin THREAD_DUMP command.
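Taking several dumps in a row is easy to script. The sketch below is one possible way to do it on Unix, not a WebLogic tool: it sends SIGQUIT (the signal behind kill -3) a few times with a pause in between. The function name, the default of three dumps, and the five-second interval are assumptions for illustration. The dumps still land wherever the server's stdout goes.

```python
import os
import signal
import time


def take_thread_dumps(pid, count=3, interval_seconds=5.0,
                      send_signal=os.kill, sleep=time.sleep):
    """Send SIGQUIT to `pid` `count` times, pausing between the sends.

    send_signal and sleep are parameters only so the function can be
    exercised without signalling a real process.
    """
    sent = 0
    for attempt in range(count):
        send_signal(pid, signal.SIGQUIT)   # equivalent to: kill -3 <pid>
        sent += 1
        if attempt < count - 1:
            sleep(interval_seconds)
    return sent
```

For example, take_thread_dumps(server_pid) produces three dumps about five seconds apart, which is enough to see whether the same threads stay in the same frames.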

Not able to take thread dumps
- With the -Xrs JVM option, the JVM does not install its handlers for OS signals such as SIGQUIT, so kill -3 cannot produce a thread dump (the Sun JVM uses SIGQUIT to trigger thread dumps).
- If a process started without -Xrs does not respond to kill -3, then it is a JVM bug.

All about hangs
There are scenarios where the process appears to be hung (non-responsive) even though free threads are available:
- The process runs out of memory. If the Java heap is full, the server process appears hung and does not accept requests, because each request needs heap memory to allocate objects.
- The process runs out of file descriptors. The server cannot accept further requests because sockets cannot be created.
- GC takes a long time (more than 20 seconds). This looks like a hang to end users.

Thread queues and threads
- weblogic.kernel.Default: worker threads that serve external client requests.
- weblogic.kernel.System: internal system work such as RJVM heartbeats, HTTP state dumps for JNDI updates in a cluster, etc.
- weblogic.socket.Muxer: used for socket reads. Defaults to 3 threads on Unix systems and 2 on Windows.
- weblogic.admin.RMI: handles OA&M requests such as application deployment, the application poller, etc.
- weblogic.admin.HTTP: only on the admin server, to handle console requests.
- Core health monitor: monitors the runtime health of the server.
- JmsDispatcher, JMS.TimerTreePool, JMS.TimerClientPool: for JMS.

Analyzing Thread Dumps
Common thread states in a thread dump:
- Runnable [marked as R in some VMs]: the thread is either running currently or is ready to run the next time the OS thread scheduler schedules it.
- Object.wait() [marked as CW in some VMs]: the thread is waiting on an object using Object.wait(). It progresses either upon notify() by another thread or when the timeout of its wait() expires, for example wait(long timeout).
- Waiting for monitor entry [marked as MW in some VMs]: the thread is waiting to enter a synchronized block.
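A quick first pass over a dump is to count how many threads are in each state. The sketch below assumes Sun/HotSpot-style header lines of the form "name" ... nid=0x.. state [address range], as in the samples on the following slides; other VMs print different headers, and header lines without a bracketed address range are skipped. The function name is made up for this example.

```python
import re
from collections import Counter

# Thread header: "name" ... nid=0x.. <state text> [address range]
HEADER = re.compile(r'^"(?P<name>.+)"\s.*?nid=\S+\s+(?P<state>.*?)\s*\[')


def count_thread_states(dump_text):
    """Count threads per state text found in the thread header lines."""
    states = Counter()
    for line in dump_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            states[match.group("state")] += 1
    return states
```

Many threads in "waiting for monitor entry" point at lock contention or a deadlock; many in "in Object.wait()" are often just idle pool threads.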

Analyzing Thread Dumps
Analyze the thread dump for the following scenarios:
- Java deadlock: two or more threads each wait for a lock that another of them holds, so none of them can proceed.
- Threads blocked during network I/O: a database or remote process is not responding.
- Infinite looping in the code.
Taking multiple thread dumps a few seconds apart helps to debug slow response times.

Analyzing thread dumps
Classic deadlock
Look for the threads waiting for monitor entry. For example:

"ExecuteThread: '95' for queue: 'default'" daemon prio=5 tid=0x411cf8 nid=0x6c waiting for monitor entry [0xd0f80000..0xd0f819d8]
    at weblogic.common.internal.ResourceAllocator.release(ResourceAllocator.java:766)
    at weblogic.jdbc.common.internal.ConnectionEnv.destroy(ConnectionEnv.java:590)

The above thread is waiting to acquire the lock on the ResourceAllocator object. The next step is to identify the thread that is holding the ResourceAllocator object:

"ExecuteThread: '0' for queue: '__weblogic_admin_rmi_queue'" daemon prio=5 tid=0x41b978 nid=0x77 waiting for monitor entry [0xd0480000..0xd04819d8]
    at weblogic.jdbc.common.internal.ConnectionEnv.getPrepStmtCacheHits(ConnectionEnv.java:174)
    at weblogic.common.internal.ResourceAllocator.getPrepStmtCacheHitCount(ResourceAllocator.java:1525)

This thread holds the lock on the ResourceAllocator object but is waiting for the ConnectionEnv object. This is a classic deadlock.
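The search for "who holds the lock I am waiting for" can be automated when the dump contains lock lines. The sketch below assumes HotSpot-style "- waiting to lock <0x...>" and "- locked <0x...>" lines under each thread; the excerpts above do not show such lines, so treat this as an illustration for dumps that do. The function name find_deadlocks is made up for this example.

```python
import re

THREAD = re.compile(r'^"(.+?)"\s')                        # thread header
WAITING = re.compile(r'-\s+waiting to lock <(0x[0-9a-fA-F]+)>')
LOCKED = re.compile(r'-\s+locked <(0x[0-9a-fA-F]+)>')


def find_deadlocks(dump_text):
    """Return lists of thread names that wait on each other in a cycle."""
    waits_for = {}   # thread name -> address of the lock it is blocked on
    owner = {}       # lock address -> name of the thread holding it
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        header = THREAD.match(line)
        if header:
            current = header.group(1)
            continue
        if current is None:
            continue
        waiting = WAITING.search(line)
        if waiting:
            waits_for[current] = waiting.group(1)
        locked = LOCKED.search(line)
        if locked:
            owner[locked.group(1)] = current

    cycles, reported = [], set()
    for start in waits_for:
        path, thread = [], start
        # Follow "waits for lock -> lock owner" until the chain ends or loops.
        while thread in waits_for and thread not in path:
            path.append(thread)
            thread = owner.get(waits_for[thread])
        if thread in path:
            cycle = path[path.index(thread):]
            if frozenset(cycle) not in reported:
                reported.add(frozenset(cycle))
                cycles.append(cycle)
    return cycles
```

An empty result does not rule out a deadlock: locks held at the native layer do not appear in the Java-level lock lines.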

Analyzing Thread dumps
Threads in wait()
A sample dump:

"ExecuteThread: '10' for queue: 'SERV_EJB_QUEUE'" daemon prio=5 tid=0x005607f0 nid=0x30 in Object.wait() [83300000..83301998]
    at java.lang.Object.wait(Native Method)
    - waiting on (a weblogic.ejb20.pool.StatelessSessionPool)
    at weblogic.ejb20.pool.StatelessSessionPool.waitForBean(StatelessSessionPool.java:222)

The above thread comes out of wait() under one of two conditions (depending on application logic):
1) Another thread from the execute queue pool calls notify() on this object when an instance becomes available. If the wait() is indefinite, the thread can hang forever if the server never calls notify() on this object.
2) If the timeout expires, the thread throws an exception and goes back to the execute queue thread pool.

Analyzing Thread dumps
Threads waiting for monitor entry and culprit thread stuck on a remote call
- This issue is mostly observed when a thread acquires the lock on a synchronized object and then hangs on the database (for example, the database is not responding), while the rest of the threads that need the synchronized object wait for monitor entry.
- There are scenarios where the thread holding the lock is not apparent. In most of these cases the lock is held at the native layer, which is a JVM bug. In those cases, taking a pstack is the first step.
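The two ways out of wait() can be illustrated with a toy pool. This is a sketch in Python, not WebLogic's StatelessSessionPool; the class name InstancePool and its methods are hypothetical. With a timeout, a waiting caller gets an exception when no instance is released in time; with timeout_seconds=None it waits until some other thread calls release(), which is the case that can hang forever.

```python
import threading


class InstancePool:
    """Toy pool (hypothetical) showing wait/notify with an optional timeout."""

    def __init__(self, instances):
        self._free = list(instances)
        self._available = threading.Condition()

    def acquire(self, timeout_seconds=None):
        with self._available:
            # wait_for returns the predicate's last value: an empty list
            # (falsy) means the timeout expired before any release().
            if not self._available.wait_for(lambda: self._free,
                                            timeout=timeout_seconds):
                raise TimeoutError("no instance became available")
            return self._free.pop()

    def release(self, instance):
        with self._available:
            self._free.append(instance)
            self._available.notify()   # wakes one thread blocked in acquire()
```

In a thread dump, a caller blocked in acquire() would be the analogue of the thread shown in Object.wait() above.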

Tool for analyzing thread dumps
Samurai
http://yusuke.homeip.net/samurai/?english#content_1_0

Performance Tuning Overview

J2EE Tuning Zones

Platform (OS) Tuning
Key Tuning Parameters
TCP parameters
- tcp_time_wait_interval
- tcp_keepalive_interval
- ndd -set /dev/tcp parameter value

File descriptors (/etc/system)
- set rlim_fd_cur=8192 (soft limit)
- set rlim_fd_max=8192 (hard limit)

Platform (OS) Tuning
Key Tuning Parameters
Prior to Solaris 2.7, the tcp_time_wait_interval parameter was called tcp_close_wait_interval. This parameter determines the time interval that a TCP socket is kept alive after issuing a close call. The default value of this parameter on Solaris is four minutes. When many clients connect for a short period of time, holding these socket resources can have a significant negative impact on performance. Setting this parameter to a value of 60000 (60 seconds) has shown a significant throughput enhancement when running benchmark JSP tests on Solaris. You might want to reduce this setting further if the server gets backed up with a queue of half-opened connections.
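A back-of-the-envelope calculation shows why the interval matters. At steady state, the number of closed sockets still held is roughly the close rate multiplied by the time each one is held. The rate of 100 closes per second below is an assumed workload, not a figure from the slides.

```python
def sockets_in_time_wait(closes_per_second, time_wait_interval_ms):
    """Approximate steady-state count of closed sockets still being held."""
    return closes_per_second * time_wait_interval_ms / 1000.0


# Assumed workload: 100 short-lived connections closed per second.
print(sockets_in_time_wait(100, 240000))  # default of four minutes -> 24000.0
print(sockets_in_time_wait(100, 60000))   # tuned to 60 seconds     -> 6000.0
```

Under this assumption, dropping the interval from four minutes to 60 seconds cuts the held sockets to a quarter.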

Tip: Use the netstat -s -P tcp command to view all available TCP parameters

Platform (OS) Tuning
Key Tuning Parameters
- Hard limits are a kernel-configurable item, and users can't exceed them.
- Soft limits are the user defaults, and users can change them with the ulimit program or the limit/unlimit builtins (see man setrlimit(2)). Basically, soft limits can be changed to anything up to the hard limit.
- Think of soft limits as the warning barrier. When a user reaches the soft limit they get a warning message but are still allowed to use more space up to the hard limit. You can also configure the system to set expiration times for users who have exceeded their soft limit.
- Just remember that the maximum number of file descriptors is 1024.
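The rule that a soft limit can be moved anywhere up to the hard limit can be tried out with Python's standard resource module on Unix. This is a minimal sketch; the function name effective_soft_limit and the requested value of 8192 (taken from the /etc/system example earlier) are choices made for this illustration.

```python
import resource


def effective_soft_limit(requested, hard, infinity=resource.RLIM_INFINITY):
    """Largest soft limit an unprivileged process may set for `requested`."""
    if hard == infinity:
        return requested          # no ceiling imposed by the hard limit
    if requested == infinity:
        return hard               # "unlimited" is capped at the hard limit
    return min(requested, hard)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = effective_soft_limit(8192, hard)
    # Only ever raise the soft limit in this sketch, never lower it.
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    print("file descriptors: soft", soft, "target", target, "hard", hard)
```

The change applies only to the current process and its children, which is why such limits are normally raised in the script that starts the server.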

JVM Tuning
Options
- JVM vendor and version.
- Use certified versions.
- JVM heap size parameters.
- Garbage collection schemes (Sun 1.4.2 JVM):
  - Generational collector (default, stop-the-world)
  - Throughput collector
  - Concurrent low pause collector
  - Incremental low pause collector
- Unix threading model:
  - export LD_LIBRARY_PATH=/usr/lib/lwp
  - One-to-one mapping between Java and OS threads

JVM Tuning
Heap Sizing Parameters
- Heap size: -Xms, -Xmx
- Young generation space: -XX:NewRatio, -XX:NewSize, -XX:MaxNewSize
- Survivor space: -XX:SurvivorRatio
- Permanent generation: -XX:PermSize and -XX:MaxPermSize
- Aggressive heap: -XX:+AggressiveHeap
For more information and self-learning, see http://www.petefreitag.com/articles/gctuning/
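How the ratio flags divide the heap can be worked out by hand. -XX:NewRatio=N sets the old-to-young ratio to N:1, and -XX:SurvivorRatio=S sets the eden-to-one-survivor ratio to S:1, with two survivor spaces. The sketch below applies just those two rules; it assumes -Xms equals -Xmx and that -XX:NewSize/-XX:MaxNewSize do not override the ratio. The permanent generation is sized separately and is not part of this split. The example values are arbitrary.

```python
def generation_sizes(heap_mb, new_ratio, survivor_ratio):
    """Split a fixed-size heap according to NewRatio and SurvivorRatio."""
    young = heap_mb / (new_ratio + 1)        # old : young = new_ratio : 1
    old = heap_mb - young
    survivor = young / (survivor_ratio + 2)  # eden : survivor = ratio : 1, two survivors
    eden = young - 2 * survivor
    return {"young": young, "old": old, "eden": eden, "survivor": survivor}


# Example: -Xms1024m -Xmx1024m -XX:NewRatio=3 -XX:SurvivorRatio=6
print(generation_sizes(1024, 3, 6))
# {'young': 256.0, 'old': 768.0, 'eden': 192.0, 'survivor': 32.0}
```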

WebLogic Core Tuning
Options
- NativeIO performance packs.
- Tuning the default execute queue.
- Thread usage control.
- Stuck thread detection.
- Connection backlog buffering.

WebLogic Core Tuning
Performance Packs
- Uses a platform-optimized, native socket multiplexor.
- Uses its own socket reader threads and frees up WebLogic threads.
- Available for most platforms: Solaris, Linux, HP-UX, AIX, Windows.
- Can be configured using the WebLogic Admin Console.

WebLogic Core Tuning
Performance Packs
Benchmarks show major performance improvements when you use native performance packs on machines that host WebLogic Server instances. Performance packs use a platform-optimized, native socket multiplexor to improve server performance. For example, the native socket reader multiplexor threads have their own execute queue and do not borrow threads from the default execute queue, which frees up default execute threads to do application work.
However, if you must use the pure-Java socket reader implementation for host machines, you can still improve the performance of socket communication by configuring the proper number of socket reader threads for each server instance and client machine.

WebLogic Core Tuning
Default Execute Thread Tuning
- Number of simultaneous operations that can be performed by applications.
- Production mode default: 25.
- Tuning criteria:
  - Request turnaround time.
  - Number of CPUs.
  - % Socket Reader Threads (default 33%).
- In 8.1, the execute queue can be tuned for overflow conditions; this increases the thread count dynamically.

WebLogic Core Tuning
Thread Usage Control
- Thread usage can be controlled by creating additional execute queues:
  - Performance optimization for critical applications.
  - Throttle the performance.
  - To protect the application from deadlock.
- It can have a negative impact on overall performance.

WebLogic Core Tuning
StuckThreadDetection & Connection Backlog Buffering

StuckThread Detection
- Detects when an execute thread cannot complete its work or accept new work.
- For warning purposes only; it doesn't change the behaviour or state of the thread.
- Parameters: Stuck Thread Max Time, Stuck Thread Timer Interval.
Connection Backlog Buffering
- The number of backlogged TCP connection requests.
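The idea behind stuck thread detection can be sketched in a few lines: on every timer tick, compare how long each thread has been busy with its current request against the maximum time, and only report. This is an illustration of the concept, not WebLogic's implementation; the function name, the data layout, and the numbers in the example are assumptions.

```python
def find_stuck_threads(busy_since, now, stuck_thread_max_time):
    """Return names of threads busy on one request for longer than the limit.

    busy_since maps a thread name to the time (in seconds) at which its
    current request started, or to None when the thread is idle.
    """
    return sorted(
        name for name, started in busy_since.items()
        if started is not None and now - started > stuck_thread_max_time
    )


# A checker would call this once per timer interval and only log the result.
busy = {"ExecuteThread-0": 0.0, "ExecuteThread-1": 550.0, "ExecuteThread-2": None}
print(find_stuck_threads(busy, now=700.0, stuck_thread_max_time=600.0))
# ['ExecuteThread-0']
```

Note that nothing is interrupted or killed: matching the slide, the result is a warning only.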

WebLogic Core Tuning
Guidelines
- NativeIO gives better performance; consider Java IO if NativeIO is not stable.
- A high number of threads can have a negative impact on performance.
- More threads does not imply that you can process more work.
- Avoid application designs that require creating new threads.

JDBC Connection Pool Tuning
Options
- Connection pool sizing and testing.
- Caching statements.
- Connection pool request timeouts.
- Recovering leaked connections.
- PinnedToThread.

JDBC Connection Pool Tuning
Connection Pool Sizing and Testing
Sizing
- Initial capacity and maximum capacity.
- Shrink frequency.
Testing
- Test frequency.
- Test reserved/released connections.
- Maximum connections made unavailable.
- Test table name.

JDBC Connection Pool Tuning
Caching Statements
- Reuses callable and prepared statements in the cache.
- Reduces CPU usage on the database side and improves performance.
- Cache algorithms: LRU (least recently used), Fixed.
- Statement cache size: configured per connection pool; it is the cache size for each connection in the pool.

JDBC Connection Pool Tuning
Recovering Leaked Connections, Connection Request Timeout
- Leaked connection: forcibly reclaims unused connections.
  - Inactive Connection Timeout.
- Connection request timeout:
  - Connection Reserve Timeout.
  - Maximum number of requests that can wait for a connection.
PinnedToThread
- Pins a connection to an execute thread.
- Connection.close() doesn't return the connection to the pool.

JDBC Connection Pool Tuning
Guidelines
- Configure initial capacity = maximum capacity.
- In most cases, the maximum number of connections used does not exceed the number of execute threads.
- Configure connection refreshing if database calls fail because of stale connections.
- Try to avoid PinnedToThread if database resources are limited.

Common Performance Problems
Memory Leak
- java.lang.OutOfMemoryError is a symptom; however, it is not proof.
- Turn on -verbose:gc for GC logs, e.g. [Full GC 154K->99K(32576K), 0.0085354 secs]
- Analyze the GC logs for the following scenarios:
  - A full garbage collection does not get a chance to run before OutOfMemory is thrown.
  - OutOfMemory is thrown even though memory usage has not reached the upper limit of the heap.
  - OutOfMemory is thrown during the load test ramp-up.
- Tune -XX:MaxPermSize, -Xms, -Xmx, -XX:NewSize, -XX:MaxNewSize, -XX:SurvivorRatio to resolve OOM.

Common Performance Problems
Memory Leak
- Heap memory usage grows after each full GC at steady-state load: potential memory leak.
- Check for the more common leaking objects: caching in the application, e.g. EJB pool objects, HTTP session objects, JMS messages.
- Use a memory profiler to pinpoint the leaking code, e.g. JProbe and OptimizeIt.

Performance Standards and Tools
Standards
- ECperf: J2EE benchmark for application servers.
- SPECjAppServer2001: benchmark to measure application server performance.
- SPEC JBB2000: server-side JVM performance benchmark. http://www.spec.org/jbb2000/
Tools
- OptimizeIt, JProbe, PerformaSure.
- Mercury LoadRunner, WebLoad, Grinder (open source).
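Returning to the memory-leak slides: the check "heap usage grows after each full GC" can be scripted against the GC log. The sketch below parses lines in the format of the sample shown there ([Full GC 154K->99K(32576K), 0.0085354 secs]); other collectors and JVM versions print different formats, so the pattern is an assumption tied to that sample. The function names and the minimum of three samples are choices made for this illustration.

```python
import re

# Matches e.g. "[Full GC 154K->99K(32576K), 0.0085354 secs]"
GC_LINE = re.compile(
    r"\[(?P<kind>Full GC|GC) (?P<before>\d+)K->(?P<after>\d+)K"
    r"\((?P<total>\d+)K\), (?P<secs>\d+\.\d+) secs\]"
)


def parse_gc_line(line):
    """Return the fields of one verbose GC line, or None if it does not match."""
    match = GC_LINE.search(line)
    if match is None:
        return None
    return {
        "kind": match.group("kind"),
        "before_kb": int(match.group("before")),
        "after_kb": int(match.group("after")),
        "total_kb": int(match.group("total")),
        "seconds": float(match.group("secs")),
    }


def heap_after_full_gc(lines):
    """Heap occupancy (KB) left after each full GC, in log order."""
    events = (parse_gc_line(line) for line in lines)
    return [e["after_kb"] for e in events if e and e["kind"] == "Full GC"]


def looks_like_leak(after_values, min_samples=3):
    """True if retained heap rises after every full GC in the sample."""
    return (len(after_values) >= min_samples and
            all(later > earlier
                for earlier, later in zip(after_values, after_values[1:])))
```

A True result is a hint, not proof: the samples must come from steady-state load, since retained heap also rises during ramp-up.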

Application with Baseline Parameters
Performance Test
Monitor Test & Collect Data
Identify Bottlenecks
Determine Solutions
Apply Solutions

Complex & Iterative.

No Hard Rules.

Driven by performance requirements.
- Peak and sustained load.
- Response time per request.
- Scalability.

Tuned Application
- Operating systems (Solaris, Linux, AIX and more)
- Application server core
- JDBC connection pool and drivers
- Java Virtual Machine (Sun, JRockit, etc.)
- EJB container
- Servlet engine
- JTA
- JMS
- J2EE application