8/14/2019 Oracle Press Case Study - The Mysterious Performance Drop
Case Study: The Mysterious Performance Drop
Author: Roderick Manalac, Consulting Technical Advisor, Oracle USA
Skill Level Rating for this Case Study: Expert
About Oracle Case Studies
Oracle Case Studies are intended as learning tools and for sharing information or
knowledge related to a complex event, process, procedure, or to a series of related
events. Each case study is written based upon the experiences that the writer(s)
encountered.
Each Case Study contains a skill level rating. The rating provides an indication of what
skill level the reader should have as it relates to the information in the case study.
Ratings are:
Expert: significant experience with the subject matter
Intermediate: some experience with the subject matter
Beginner: little experience with the subject matter
Case Study Abstract
Sometimes the simplest or seemingly innocent actions can have significant ramifications
for the performance of a very busy system. Diagnosing these types of problems
sometimes requires an understanding of obscure Oracle behaviors. This article describes
how two minor features combined to cause an interesting performance issue, and how
the issue was diagnosed and resolved.
Case History
A customer's Applications environment slowed down every afternoon for several
consecutive workdays. On most days, the slowdown would last only 10 or 20 minutes
and then performance would return to normal. However, on a few days, performance
would degrade and remain unacceptable, or continually worsen, until they were forced
to shut down and restart (bounce) the database during business hours. Good
performance would then return until the following afternoon. The Application had been
running fine for months prior to these events. The customer stated that nothing had
changed recently in the environment to trigger this behavior: no patches were applied;
no hardware was added or removed.
Analysis
Fortunately, the customer already had Statspack configured to capture performance
snapshots every 30 minutes, so it was time to glance at some Statspack reports. Ideally,
one would look for significant differences between a normal processing day with
acceptable performance and a bad day. In this case, we also had the luxury of analyzing
the periods immediately before, during, and after the performance issue.
On acceptable days before the problems appeared, the top Timed Events were CPU
time and some IO related events. In the first Statspack period including the problem
window, "latch free" jumped to the top. On days where the performance corrected itself,
CPU returned to the top and everything generally reverted to stats seen in the BEFORE
problem reports. However, on the days where the problem did not correct itself, IO
related events appeared on top followed by "latch free" and "buffer busy waits" while
CPU was not getting used as much.
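Even without Statspack, a similar top-events comparison can be approximated live from the cumulative wait interface. The following is only a sketch (it assumes 10g or later, where V$SYSTEM_EVENT carries a WAIT_CLASS column); since the counters are cumulative from instance startup, capture two samples and diff them to approximate a snapshot interval:

```sql
-- Sketch only: top 5 non-idle wait events since instance startup.
-- Diff two captures of this output to approximate a Statspack interval.
SELECT event, total_waits, time_waited_micro / 1e6 AS time_waited_s
FROM   (SELECT event, total_waits, time_waited_micro
        FROM   v$system_event
        WHERE  wait_class <> 'Idle'        -- WAIT_CLASS exists in 10g+
        ORDER  BY time_waited_micro DESC)
WHERE  rownum <= 5;
```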
So, even the summary info at the beginning of each statspack report was telling a
different story. The system was displaying three distinct performance profiles, which
could humorously be labeled The Good, The Bad, and The Ugly.
The Good, Statspack Report1:
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time 38,323 28.81
db file scattered read 4,115,952 28,707 21.58
db file sequential read 2,169,347 18,995 14.28
buffer busy waits 1,722,685 15,287 11.49
log file sync 208,260 12,209 9.18
The Bad, Statspack Report1:
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
latch free 413,223 1,667 35.14
db file sequential read 241,965 1,218 25.68
db file scattered read 485,092 601 12.68
buffer busy waits 38,232 259 5.45
CPU time 205 4.32
1 Unfortunately, the original data was not available at the time of publication. Some representative values were used
in these examples.
The Ugly, Statspack Report1:
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
db file scattered read 3,111,783 118,912 43.42
db file sequential read 1,408,059 43,565 15.91
latch free 4,281,865 30,869 11.27
buffer busy waits 1,146,414 23,682 8.65
CPU time 20,754 7.58
In The Good, a more detailed survey of the statspack report showed little CPU used for
parsing or recursive calls. This meant that most of the CPU time was likely being used
to process SQL, and the few IO waits meant that most of the popular data was residing
happily in cache.
The Good, Statspack Report, Statistics1:
. . .
Instance Activity Stats for DB: PROD Instance: prod1 Snaps: 3140 -3141
Statistic Total per Second per Trans
--------------------------------- ------------------ -------------- ------------
CPU time 3,832,300 11,207 7.58
. . .
parse count (hard) 21 0.0 0.0
parse time cpu 975 0.3 0.5
. . .
recursive cpu usage 312 0.1 0.1
. . .
In The Bad, the latch free waits led one to study the latch section of statspack in more
detail. In that section, it appeared most of the sleeps revolved around the library cache.
The statistics section also showed a higher parse count (hard) compared to The Good,
and the library cache section added more corroboration by reporting a large number of
reloads and invalidations in the SQL AREA (with reloads almost equal to invalidations).
The Bad, Statspack Report, Latch and Library Cache Statistics1:
. . .
Instance Activity Stats for DB: PROD Instance: prod1 Snaps: 3340 -3341
Statistic Total per Second per Trans
--------------------------------- ------------------ -------------- ------------
. . .
parse count (failures) 83 0.0 0.0
parse count (hard) 1,521 0.4 0.8
parse count (total) 10,780 8.3 14.9
. . .
Latch Sleep breakdown for DB: PROD Instance: prod1 Snaps: 3340 -3341
-> ordered by misses desc
Get Spin &
Latch Name Requests Misses Sleeps Sleeps 1->4
-------------------------- -------------- ----------- ----------- ------------
library cache 143,525,937 1,344,491 218,264 1161117/155199/22432/5743/0
shared pool 56,948,537 446,574 105,545 353300/81370/11553/351/0
. . .
Library Cache Activity for DB: PROD Instance: prod1 Snaps: 3340 -3341
->"Pct Misses" should be very low
Get Pct Pin Pct Invali-
Namespace Requests Miss Requests Miss Reloads dations
--------------- ------------ ------ -------------- ------ ---------- --------
BODY 6,679 0.0 6,679 0.0 0 0
CLUSTER 223 0.4 284 0.7 0 0
INDEX 781 11.8 710 13.0 0 0
SQL AREA 2,490,099 3.5 37,707,666 0.6 28,618 341
TABLE/PROCEDURE 3,015,349 0.2 940,923 7.2 23,303 0
TRIGGER 12,400 0.0 12,400 0.0 0 0
. . .
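The same reload/invalidation picture can also be pulled live from V$LIBRARYCACHE rather than waiting for the next snapshot. A sketch (the counters are cumulative, so compare two samples taken during the problem window):

```sql
-- Reloads WITHOUT invalidations suggest cursors aging out under shared
-- pool space pressure; reloads that track invalidations point at DDL
-- (or grants/revokes) against the underlying objects.
SELECT namespace, gets, pins, reloads, invalidations
FROM   v$librarycache
ORDER  BY reloads DESC;
```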
Finally, The Ugly report showed much higher IO activity than the other two profiles.
The hard parsing activity had disappeared. More telling, a regularly executed SQL
statement that appeared in the Top SQL sorted by most Reads was nowhere to be found
in the Top Reads section of the other reports, even though it was listed in the Top SQL
sorted by Executions in all reports.
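A live approximation of the "SQL ordered by Reads" section can be taken straight from V$SQLAREA; a sketch:

```sql
-- Counters are cumulative per cursor; a jump in reads-per-execution for
-- a familiar hash value is the signature of a plan flip like this one.
SELECT *
FROM   (SELECT hash_value, executions, disk_reads,
               disk_reads / GREATEST(executions, 1) AS reads_per_exec
        FROM   v$sqlarea
        ORDER  BY disk_reads DESC)
WHERE  rownum <= 5;
```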
The Ugly, Statspack Report, Top SQL1:
. . .
SQL ordered by Reads for DB: PROD Instance: prod1 Snaps: 3440 -3441
-> End Disk Reads Threshold: 1000
CPU Elapsd
Physical Reads Executions Reads per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
490,078 58 8,449.6 14.0 367.73 664.48 3381540416
SELECT col1, col2, col3 FROM large_table
. . .
SQL ordered by Executions for DB: PROD Instance: prod1 Snaps: 3440 -3441
-> End Executions Threshold: 10
CPU per Elap per
Executions Rows Processed Rows per Exec Exec (s) Exec (s) Hash Value
------------ --------------- ---------------- ----------- ---------- ----------
58 390,304 6729.4 0.00 0.00 1208562063
SELECT col1, col2, col3 FROM large_table
. . .
Conclusion and Learnings
At this point, it would be easy to jump to the conclusion that some user or job had
executed DBMS_STATS (or ANALYZE) against some key tables. This would invalidate
many SQL statements referencing those tables, so that they could be reparsed with the
new cost-based optimizer (CBO) statistics (explaining the high latch contention reported
in the Bad profile). Then on certain days, these statistics led CBO to choose poor
execution plans (causing the Ugly profile). Thus, one potential solution would be to
stop gathering table statistics in the middle of the day. Another possible fix would be to
preserve the good execution plan for the rogue SQL statement using an OUTLINE.
Unfortunately, there were a couple of holes in that theory that would at least rule out
the first solution. Chiefly, the customer was insistent that no DBA or scheduled job was
gathering new statistics against the objects. Secondly, even if this led to bad execution
plans, why would performance be restored after the database was shut down and
restarted? [SQL trace output later confirmed that the execution plans would change after
the SQL was invalidated, but good execution plans were restored after a database
instance bounce.]
Some activity was definitely invalidating SQL. So if it was not DBMS_STATS, then
what? If the cursors were just aging out under shared pool space pressure, the report
would just show reloads without invalidations. No patches were being applied; no
columns were being added, etc. No space maintenance was going on, so no indexes were
rebuilt or tables moved. It turns out that just granting or revoking privileges on objects
is sufficient to invalidate dependent SQL as well. At that point, the customer did
admit they had been giving database access to new employees over the last few weeks.
They did not consider that such an operational change could have this impact.
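Because a GRANT or REVOKE counts as DDL against the object, it moves LAST_DDL_TIME in DBA_OBJECTS even though the table data is untouched. A hedged sketch of a quick check for objects touched during the slowdown window:

```sql
-- Objects whose "DDL" timestamp moved in the last hour -- grants and
-- revokes bump LAST_DDL_TIME without changing the object itself.
SELECT owner, object_name, object_type, last_ddl_time
FROM   dba_objects
WHERE  last_ddl_time > SYSDATE - 1/24
ORDER  BY last_ddl_time DESC;
```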
But what could cause the execution plan to change once in a while when a particular
SQL statement was hard parsed again? Dynamic sampling was not enabled. This left a
more obscure feature called bind peeking.
If optimizer_features_enable = 9.0.0 or higher, then CBO will calculate some costs for
inequality predicates (or equality predicates against columns with histograms) based on
the bind values supplied by the first person who (re)loads SQL into the shared pool. The
SQL in question happened to fit the criteria where the bind value supplied could
significantly impact what CBO thought was the best execution plan. A simple example
appears at the end of this study.
With all the observations and evidence now falling into place, the consensus was that the
safest solution was to limit new user additions to off-hours maintenance windows. They
also created an OUTLINE for the problem SQL. Adding a HINT was not possible given
that it was a third-party application. Changing optimizer_features_enable to disable bind
peeking may have had negative impacts on other SQL.
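For reference, preserving a plan with a stored outline looks roughly like this. The names below are invented for illustration, and the statement text given to CREATE OUTLINE must match the application's SQL exactly:

```sql
-- Capture the current (good) plan for the statement under a category,
-- then enable that category so the optimizer reuses the stored plan.
CREATE OR REPLACE OUTLINE fix_large_table_sql
  FOR CATEGORY app_outlines
  ON SELECT col1, col2, col3 FROM large_table;

ALTER SYSTEM SET use_stored_outlines = app_outlines;
```

Note that USE_STORED_OUTLINES can only be set with ALTER SYSTEM or ALTER SESSION (it is not an initialization parameter), so it must be re-enabled after each instance restart, for example from a startup trigger.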
References
The scripts below illustrate how invalidation and bind peeking work. They were tested
against a recently installed, vanilla 10gR2 seed database. Ideally, run two SQL*Plus
sessions side by side and switch back and forth between them to best view what is
happening.
Session 1:

REM T1: Set up test case
var x number;
var y number;
create table bigtab as select * from all_objects;
create index bt_ix on bigtab (object_id);
execute dbms_stats.gather_table_stats (ownname=>'SCOTT', -
tabname=>'BIGTAB', CASCADE => TRUE, -
method_opt => 'FOR ALL COLUMNS SIZE 1');
REM Start with narrow range of values
begin :x := 1000; :y := 1001; end;
/
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
REM Go to session 2 - exec plan should be index range scan
REM T3: now let's re-execute with a wide range
begin :x := 0; :y := 50000; end;
/
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
REM Go to session 2 again - exec plan should still be the same
REM T5: So let's invalidate this puppy
GRANT SELECT ON BIGTAB TO ORDSYS;
REM now query against v$sql_plan and v$sql in session 2 shows
REM "no rows selected"
REM T7: so now let's reload the cursor
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
REM now session 2 shows a new plan with FAST FULL SCAN and it will
REM be used from now on no matter what the bind values are.
REM T9: so let's turn bind peeking off
alter session set "_optim_peek_user_binds" = false;
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
REM now v$sqlarea will show a version count of 2 (because some other
REM session may still have bind peeking enabled)
Session 2:

column operation format a20
column options format a20
column object_name format a20
REM T2: After initial load with narrow range binds
select sql_id, sql_text from v$sql where sql_text like
'SELECT COUNT(*) FROM BIGTAB%';
SQL_ID
-------------
SQL_TEXT
-----------------------------------------------------------------
2aa40mj45939v
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';
OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT AGGREGATE
FILTER
INDEX RANGE SCAN BT_IX
REM T4: Plan for second SQL still the same
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';
OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT AGGREGATE
FILTER
INDEX RANGE SCAN BT_IX
REM T6: after grant is issued
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';
no rows selected
select sql_id from v$sql where sql_id = '2aa40mj45939v';
no rows selected
REM T8: Now cursor is reloaded
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';
OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT AGGREGATE
FILTER
INDEX FAST FULL SCAN BT_IX
select loads, invalidations, executions, version_count
from v$sqlarea
where sql_id = '2aa40mj45939v';
LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
---------- ------------- ---------- -------------
2 1 1 1
REM T10: After session altered "_optim_peek_user_binds" = FALSE
REM and query re-executed.
select loads, invalidations, executions, version_count
from v$sqlarea
where sql_id = '2aa40mj45939v';
LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
---------- ------------- ---------- -------------
3 1 2 2
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v' order by plan_hash_value;
OPERATION OPTIONS OBJECT_NAME
-------------------- -------------------- --------------------
FILTER
SORT AGGREGATE
SELECT STATEMENT
INDEX FAST FULL SCAN BT_IX
FILTER
SORT AGGREGATE
SELECT STATEMENT
INDEX RANGE SCAN BT_IX