
Case Study: Using Real-Time Diagnostic Tools to Diagnose Intermittent Database Hangs

Authors: Carl Davis, Consulting Technical Advisor – Center of Expertise (COE), Oracle USA
Skill Level Rating for this Case Study: Intermediate

About Oracle Case Studies

Oracle Case Studies are intended as learning tools and for sharing information or knowledge related to a complex event, process, procedure, or a series of related events. Each case study is written based upon the experience that the writer(s) encountered. Each case study contains a skill level rating. The rating indicates what skill level the reader should have with respect to the information in the case study. Ratings are:

• Expert: significant experience with the subject matter
• Intermediate: some experience with the subject matter
• Beginner: little experience with the subject matter

Case Study Abstract

The purpose of this case study is to show how to deploy diagnostic tools from the Center of Expertise (COE), such as LTOM and show_sessions, to diagnose complex performance problems in real time. The case study focuses primarily on the Pre-Analysis phase, which is where the majority of the work was done: using real-time diagnostic tools to extract the necessary trace information is the most difficult part of this case study. Once we had collected the diagnostic trace files, analyzing them was quite simple. Intermittent performance problems happen without warning, last for a short duration, and are extremely difficult to diagnose. Traditional means of diagnosing these kinds of problems usually result in iterative attempts to capture the necessary data, leading to very long engagements between the customer and support. Frequently these problems never truly get resolved, as customers choose to upgrade in the hope that the problem goes away. This case study deals with one of the most difficult performance problems to diagnose: an intermittent database hang.


Performance problems can be divided into two categories: hangs and slowdowns. A true database hang, sometimes called a database freeze, is the most severe type of performance problem. In this case, existing database connections become non-responsive and any new connections to the database are impossible. Execution of code has either halted, become stuck in a tight loop, or is proceeding at an extremely slow rate, causing the user to perceive the hang as indefinite. A true database hang also prevents customers or support analysts from obtaining diagnostic data, as database connectivity is not possible. Fortunately, these types of hangs are rare.

Far more common is the database slowdown. The database slowdown differs from a true database hang in that database connections are still possible, especially when connecting as the SYS user. Database activity proceeds slowly, even to the point where the user may consider the database completely hung, but the execution of code is still proceeding. Slowdowns, by definition, do not severely limit the ability of the customer or support analyst to obtain at least some diagnostic data, as database connectivity is still possible.

A further distinction can be made between a true hang and an intermittent hang. A true hang will remain in the frozen state indefinitely; an intermittent hang will eventually free itself. Diagnosing either kind of hang requires using tools outside of Oracle in order to collect diagnostic traces. Operating system debuggers like GDB can be used to obtain systemstate information and, in some cases, hanganalyze trace files. If the database is experiencing a true hang, the user can take the time to use GDB or a similar debugger, as the database will remain in the frozen state while diagnostics are collected. The intermittent database hang, however, may not last long enough for the time-consuming approach of using an operating system debugger. By using LTOM's manual data recorder and automatic hang detector it is possible to detect the hang and issue external commands to collect diagnostic traces using operating system utilities like GDB or custom utilities like COE's show_sessions program.

Our task here is to show how to deploy real-time tools to diagnose complex performance problems. Utilizing the Oracle Diagnostic Methodology (ODM), we will step through the data collection, analysis, and resolution phases.
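To illustrate the debugger approach mentioned above, the following is a minimal sketch of how a systemstate dump might be requested by attaching GDB to an Oracle server process. The ksudss() call and its level argument are assumptions based on commonly documented techniques, not something taken from this engagement; consult Note 273324.1 (listed in the References) for the exact, platform-specific procedure.

$ gdb $ORACLE_HOME/bin/oracle <ospid>   # attach to an Oracle process by its OS PID
(gdb) call ksudss(10)                   # assumed call: write a systemstate dump to a trace file
(gdb) detach                            # release the process
(gdb) quit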

Case History

The customer had been suffering from intermittent database hangs for over 6 months. Oracle Support had been involved, but collecting the necessary diagnostic trace files had proved impossible due to the short duration of the hang (3-5 minutes). Collecting diagnostic traces was made even more difficult because the problem would occur without warning. The customer was able to collect statspack snapshots. Again these proved problematic, as 30-minute snapshots encapsulating the hang did not provide enough detail to determine what was causing it. Using statspack snapshots to diagnose intermittent performance problems presents its own challenges. The problem with using static data captures to solve intermittent problems is that any performance spike that occurs during the snapshot interval is averaged out over the entire interval. In our case we had a 3-5 minute hang averaged out over a 30-minute snapshot interval.

Pre-Analysis Work

Detail

Step 1) Problem Verification: As with any problem, the first step is to identify and clarify what problem needs to be solved and to verify its existence. To accomplish this we used LTOM's manual data recorder to collect information from the Oracle database together with operating system metrics. This presented us with an integrated picture of what was happening on both the database and the operating system before, during, and after the hang. We deployed LTOM and first set up the manual recorder to collect snapshots at 3-second intervals. When the database hung we were able to clearly determine the nature of the problem and confirm that the hang was the result of an Oracle resource issue and not something external to Oracle. The following LTOM snapshot, taken at 11:14:27, showed normal activity with no Oracle sessions waiting and adequate operating system resources available prior to the database hang:

---------------SNAPSHOT# 4751 Mon Feb 14 11:14:27 PST 2005
 r b w  swap     free     re   mf   pi  po fr de sr s0 s1 s2 s3 in   sy    cs   us sy id
 0 0 28 94723088 26511624 1026 5418 412 4  4  0  0  20 0  0  20 7242 49559 9029 16 9  75

SID PID SPID %CPU TCPU PROGRAM USERNAME EVENT SEQ SECS P1 P2 P3

The next LTOM snapshot, taken 3 seconds later at 11:14:30, captured the system just prior to the total database hang. Virtually all database sessions were waiting on the same library cache latch (latch #106).

---------------SNAPSHOT# 4752 Mon Feb 14 11:14:30 PST 2005
 r b w  swap     free     re   mf   pi  po fr de sr s0 s1 s2 s3 in   sy    cs   us sy id
 0 0 28 94723088 26511624 1026 5418 412 4  4  0  0  20 0  0  20 7242 49559 9029 16 9  75

SID PID SPID  %CPU TCPU PROGRAM USERNAME EVENT      SEQ   SECS P1          P2  P3
  8   9 21471 *    *    QMN0    null     latch free 2952  0    43487260472 106 5
 19  20 21804 *    *    TNS     DEV      latch free 62783 0    43487260472 106 2
 63 601 19039 *    *    TNS     KS4029   latch free 6313  0    43487260472 106 3
 74 447 20696 *    *    TNS     KG5770   latch free 1201  0    43487260472 106 3
 79 660 21950 *    *    TNS     AB0320   latch free 5442  0    43487260472 106 4
 81 443  3430 *    *    TNS     null     latch free 54    2    43487260472 106 53
 82 717 16672 *    *    TNS     JL3855   latch free 6661  0    43487260472 106 5
 95 664  6718 *    *    TNS     RP5398   latch free 32354 0    43487260472 106 5
 98 749 20654 *    *    TNS     PF5036   latch free 609   0    43487260472 106 1
104 740 23674 *    *    TNS     RF4846   latch free 1483  0    43487260472 106 3
108 166 24606 *    *    TNS     NM227    latch free 11903 0    43487260472 106 5
109 518 12420 *    *    TNS     AS5032   latch free 1839  0    43487260472 106 2
116 220  8864 *    *    TNS     AL4171   latch free 34631 0    43487260472 106 3
120 547 24207 *    *    TNS     FJ356    latch free 7192  0    43487260472 106 1
126 382 13454 *    *    TNS     BM249    latch free 5902  0    43487260472 106 4
129 306 21359 *    *    TNS     DI3891   latch free 2787  0    43487260472 106 3
130 665 24166 *    *    TNS     EW3761   latch free 2078  0    43487260472 106 4
131 530  3773 *    *    TNS     null     latch free 51    2    43487260472 106 50
…   …   …
817 813  5656 *    *    TNS     null     latch free 28    2    43487260472 106 27
818 814  5774 *    *    TNS     null     latch free 27    0    43487260472 106 26
819 815  5775 *    *    TNS     null     latch free 27    2    43487260472 106 26
820 816  5971 *    *    TNS     null     latch free 25    0    43487260472 106 24
821 817  6007 *    *    TNS     null     latch free 24    0    43487260472 106 23
822 818  6127 *    *    TNS     null     latch free 23    2    43487260472 106 22
823 819  6146 *    *    TNS     null     latch free 23    0    43487260472 106 22

The latch name can be retrieved from v$latch using the P2 value of the wait event data when the event is 'latch free'.

The query to retrieve the latch name would be as follows:

Select name from v$latch where latch# = 106;
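When the database is still responsive, the same lookup can be done for every waiting session in one pass. The following is a minimal sketch, not part of the original engagement, assuming a SYSDBA connection is available (connection syntax varies by release); it uses only the standard v$session_wait and v$latch views:

$ sqlplus -s "/ as sysdba" <<'EOF'
-- List current 'latch free' waiters with the name of the latch they wait on.
-- For the 'latch free' event, P2 is the latch number (latch#).
select w.sid, w.seq#, w.p2, l.name
from   v$session_wait w, v$latch l
where  w.event = 'latch free'
and    w.p2    = l.latch#;
EOF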

The next LTOM snapshot occurred 2 minutes and 25 seconds later, at 11:16:52. Here we see that the system had returned to a normal state, as the database hang had completed. Our own database connection was also hung between snapshots 4752 and 4753, which is why the next snapshot occurred 2 minutes and 25 seconds later rather than 3 seconds later as expected.

---------------SNAPSHOT# 4753 Mon Feb 14 11:16:52 PST 2005
 r b w  swap     free     re   mf   pi  po fr de sr s0 s1 s2 s3 in   sy    cs   us sy id
 0 0 25 94723012 26513624 1026 5418 412 4  4  0  0  20 0  0  20 7242 49559 9029 16 10 76

SID PID SPID %CPU TCPU PROGRAM USERNAME EVENT SEQ SECS P1 P2 P3

At this point we had successfully verified that the customer's problem was a database hang, and we had some high-level indication of what could be causing it. It was apparent that we had severe library cache latch contention; in particular, some process was holding the library cache parent latch for an extremely long time (in excess of 2 minutes). We also knew, from reviewing the LTOM vmstat data, that the hang had nothing to do with an operating system resource such as CPU or memory.


Step 2) Dig deeper to extract additional diagnostic information from the hung database. Once the problem had been verified we needed to continue collecting diagnostic traces to determine its cause, i.e. what was causing the library cache latch contention. The data collection from LTOM proved that the hang was due to processes waiting for the parent library cache latch. The next challenge was to determine which process was holding the latch and why it was being held for such a long time, causing all other processes to wait.

Efforts to use GDB, the operating system debugger, to attach to a process and take a systemstate proved unsuccessful. The customer could not get the information because GDB would also hang: even though it appeared to attach to the process and the command to generate the systemstate appeared to work, no trace file was ever produced. This approach was problematic even if it had succeeded, in that the systemstate could not complete before the hang ended. The hang would last 3-5 minutes, and it normally took over 5 minutes to generate a systemstate on the customer's production database even when the database had little activity. Hanganalyze was not a possibility because there is no way to call hanganalyze through GDB in Oracle version 8.

Because we could not gather further diagnostic data during the hang, COE created a program called show_sessions, which was used instead of GDB to gather data directly from the SGA. Show_sessions attaches to the SGA and reads information contained in Oracle's data structures, similar to the way a systemstate dump works. Show_sessions was able to gather comprehensive process and session data that would normally be gathered from systemstate, hanganalyze, or queries against v$session, v$process, v$session_wait, v$sql, etc., but that was not available because the hang prevented database access.

We again deployed LTOM, this time using LTOM's automatic hang detector to detect the hang and make a call to the show_sessions program. LTOM's automatic hang detector not only detects a hang but also allows the user to specify an optional file to run when the hang is detected. We configured LTOM to call the show_sessions program when the next hang occurred. From reviewing the output of show_sessions we could determine that the process holding the latch was the SMON process. We then reconfigured LTOM to call both show_sessions and the Unix utility pstack. We created a shell file to call both programs 3 times with a 30-second delay between calls (a sketch of such a script appears after the pstack example below). This would give us multiple samples to compare while the hang was occurring. The pstack utility produces a hexadecimal stack trace with a list of the function calls that the process was executing at the time the pstack command was issued. We waited for the next hang and collected the information from LTOM, show_sessions, and the pstack command on the blocking process (SMON).


To call pstack, issue the following command (where ospid is the operating system process ID of the target process):

$ pstack ospid
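The shell file mentioned in Step 2 was not published with this case study. The following is a minimal sketch of the idea, assuming the blocking OS PID is passed in as the first argument and that show_sessions is on the PATH and needs no arguments (its real invocation is not documented here):

#!/bin/sh
# Sketch of a wrapper that LTOM's hang detector could invoke: take three
# samples of show_sessions output and of the blocker's stack, 30 seconds apart.
SMON_PID=$1                      # OS PID of the blocking process (SMON)
OUTDIR=/tmp/hang_diag.$$         # hypothetical output directory
mkdir -p $OUTDIR

i=1
while [ $i -le 3 ]
do
    show_sessions > $OUTDIR/show_sessions_$i.out 2>&1   # hypothetical invocation
    pstack $SMON_PID > $OUTDIR/pstack_$i.out 2>&1
    sleep 30                     # 30-second delay between samples
    i=`expr $i + 1`
done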

Finally we had all the diagnostic trace information necessary to analyze and solve the problem.

Analysis

Summary

Now that all the required diagnostic traces had been obtained, we could determine the cause of the database hang. We took the following actions:

1. Reviewed the output from show_sessions. Found the process holding the parent library cache latch and the SQL that this process was currently executing.
2. Reviewed the pstack of the blocking process. This showed the underlying Oracle code functions that the blocking process was executing during the hang.
3. Reviewed the bug database to see if this was a known bug.
4. Identified effective solutions.
5. Delivered the best solution.

Detailed Analysis

1. Review the output from show_sessions to find which process was holding the library cache latch.

*** Process (0xa01e272e0) Serial: 1 OSPid: 28710 HOLDING LATCH: 0x380014980 ***
Latch: 106 (0x380014980) Level: 5
Gets: 1110 Misses: 0 ImmediateGets: 0 ImmediateMisses: 0
Sleeps: 0 SpinGets: 0 Sleeps1: 0 Sleeps2: 0 Sleeps3: 0

*** Session (5): a0269e608 User: SYS PID: 28710 blocker: 0
SQL Addr: 9e2a85958
SQL: delete from obj$ where owner#=:1 and name=:2 and namespace=:3 and(remoteowner=:4 or remoteowner is null and :4 is null)and(linkname=:5 or linkname is null and :5 is null)and(subname=:6 or subname is null and :6 is null)
pSQL Addr: 9e2a85958
pSQL: delete from obj$ where owner#=:1 and name=:2 and namespace=:3 and(remoteowner=:4 or remoteowner is null and :4 is null)and(linkname=:5 or linkname is null and :5 is null)and(subname=:6 or subname is null and :6 is null)

Session Waits: Seq: 632 Event#: 94 P1: 0x1 (1) P2: 0x51dd (20957) P3: 0x1 (1) Time: 3


Here we could see that the process at address 0xa01e272e0 was holding latch #106. This is what was preventing all the other processes from acquiring the latch and causing the database to hang. Other relevant information captured by show_sessions was the OS PID (28710), which we could map back to the SMON process by querying v$process and v$session. We could also see the actual SQL that the process was executing:

delete from obj$ where owner#=:1 and name=:2 and namespace=:3 and(remoteowner=:4 or remoteowner is null and :4 is null)and(linkname=:5 or linkname is null and :5 is null)and(subname=:6 or subname is null and :6 is null)
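The mapping from the captured OS PID back to the owning session uses the standard v$process/v$session join. The following is only a sketch of that lookup (run once the database is responsive again; SYSDBA connection syntax may vary by release):

$ sqlplus -s "/ as sysdba" <<'EOF'
-- Map the OS PID captured by show_sessions (28710 in this case) back to
-- the owning session and program name.
select s.sid, s.serial#, p.spid, s.username, s.program
from   v$session s, v$process p
where  s.paddr = p.addr
and    p.spid  = '28710';
EOF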

2. Review the process stack from the blocking process (SMON).

28710: ora_smon_live
0000000100fdb414 kglhdde (101e3c3e0, 9bf904bf8, a0001c1b8, 0, 9cce56774, 2000) + 114
0000000100fdb990 kglhdunp2 (1f, a200a2bc8, 3e8, 101ca1f30, 0, 1f0) + 2b0
0000000100fdb624 kglhdunp (101e3c3e0, 25, 0, 70000, 1, 7ffffffc) + 1a4
0000000100fcd9a4 kglobf0 (9cce56458, 0, 1, 1, 3c4, 0) + 1c4
0000000100fcbb08 kglhdiv (a2009f178, 9ef2dc040, 10000, ffffffffffffff98, 3c4, e) + 2c8
0000000100fd45b8 kglpndl (380003710, a2009f118, e, 101e3d748, ffffffff7fffc320, e) + a58
000000010041e54c kssdct (a1f6cb5c8, 24, 1, 0, 101e3ede8, 0) + 18c
000000010034dfac ktcrcm (0, 0, 0, 0, 0, 0) + aec
00000001005dd1e0 kqlclo (a03b86d24, 0, 3, 20000000, 0, 0) + a80
000000010023b46c ktmmon (0, 380005880, 6, a01e272e0, 3800050b8, 0) + 188c
000000010042330c ksbrdp (0, 101e401e0, 0, 0, 100909238, 100909204) + 2ec
000000010090939c opirip (32, 0, 0, 0, 0, 0) + 31c
000000010015f720 opidrv (32, 0, 0, 6c6f6700, 0, 0) + 6a0
0000000100149e10 sou2o (ffffffff7fffe890, 32, 0, 0, 101e7c5c8, 100134e0c) + 10
0000000100134f28 main (1, ffffffff7fffeab8, ffffffff7fffeac8, 0, 0, 100000000) + 128
0000000100134ddc _start (0, 0, 0, 0, 0, 0) + 17c

Comparing the three pstack samples of the SMON process revealed that SMON was stuck executing the same code during the hang: all three samples showed an identical stack, meaning the same code was executing (or stuck in execution) throughout the period in which the pstacks were collected.

3. Now that we knew which session was holding the latch (from step 1) and the SQL that was being executed, we could search the bug database to see if this was a known bug. The stack trace of SMON clearly matched the bug signature of bug 2791662. (NOTE: This bug is not viewable by customers.) This bug causes database hangs/freezes due to a process (in our case SMON) holding a library cache latch while executing a drop statement and invalidating read-only objects (cursors) dependent on the object being dropped. Where there is a large number of read-only dependent objects, this latch can be held for a very long time. This was the cause of the customer's intermittent hang.

4. A patch existed to correct the problems associated with bug 2791662. A less desirable workaround would have been to issue the drop command on the underlying object in off-hours.


5. Clearly the best solution was to apply the patch to fix the bug. The customer applied the patch and the problem was fixed.

Conclusion

The use of a structured methodology (ODM) will always lead to reduced resolution times. Problem identification and verification is an extremely important step and, unfortunately, one that is too often overlooked. In this case different diagnostic tools were available, but before the right diagnostic tool could be selected the problem needed to be identified and verified. The technical differences between a hang and a slowdown can appear insignificant to the end user, but this is a very important distinction when it comes to selecting the most appropriate diagnostic trace or tool. Using statspack, for example, to determine the root cause of a hang was far less effective than collecting systemstate or hanganalyze dumps. The need to deploy the right data collection tool early on cannot be overemphasized. The customer had tried for months to collect data during the hang but was unsuccessful. Had LTOM and show_sessions been deployed at the beginning of this engagement, a solution could have been obtained in days rather than months.

References

The Oracle Center of Expertise offers a suite of tools to assist customers in resolving performance issues. These tools include:

Note 352363.1: LTOM - The On-Board Monitor User Guide
Note 301137.1: OS Watcher User Guide
Note 362094.1: HANGFG User Guide
Note 362791.1: STACKX User Guide
Show_sessions: This is not currently available for download. However, if you are interested in getting information on this tool, please contact the author of this case study, Carl Davis.

Other relevant documents to this study include:

Note 312789.1: What is the Oracle Diagnostic Methodology (ODM)?
Note 215858.1: Interpreting HANGANALYZE trace files to diagnose hanging and performance problems
Note 273324.1: Using HP-UX Debugger GDB To Produce System State Dump
