8/14/2019 Oracle Press Case Study - The Mysterious Performance Drop

Case Study: The Mysterious Performance Drop

Author: Roderick Manalac, Consulting Technical Advisor, Oracle USA

Skill Level Rating for this Case Study: Expert

About Oracle Case Studies

Oracle Case Studies are intended as learning tools and as a way to share information or
knowledge about a complex event, process, or procedure, or about a series of related
events. Each case study is based on the experience of its writer(s).

Each Case Study contains a skill level rating. The rating indicates the level of
experience the reader should have with the subject matter of the case study.
Ratings are:

Expert: significant experience with the subject matter
Intermediate: some experience with the subject matter
Beginner: little experience with the subject matter

Case Study Abstract

Sometimes the simplest or most innocent-seeming actions can have significant
ramifications for the performance of a very busy system. Diagnosing these problems
sometimes requires an understanding of obscure Oracle behaviors. This article
describes how two minor features combined to cause an interesting performance issue,
and how the issue was diagnosed and resolved.

Case History

A customer's Applications environment slowed down every afternoon for several
consecutive workdays. On most days, the slowdown lasted only 10 or 20 minutes and
performance then returned to normal. On a few days, however, performance degraded and
stayed unacceptable, or continually worsened, until the customer was forced to shut down
and restart (bounce) the database during business hours. Good performance then
returned until the following afternoon. The Application had been running fine for the
months prior to these events. The customer stated that nothing had changed recently in
the environment to trigger this behavior: no patches were applied, and no hardware was
added or removed.


Analysis

Fortunately, the customer already had Statspack configured to capture performance
snapshots every 30 minutes, so the first step was to review some Statspack reports.
Ideally, one looks for significant differences between a normal processing day with
acceptable performance and a bad day. In this case, we also had the luxury of analyzing
the periods immediately before, during, and after the performance issue.
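The same before-and-after comparison can also be made against the Statspack repository directly. The query below is a minimal sketch, assuming the default PERFSTAT schema, the 9.2-or-later STATS$SYSTEM_EVENT layout (with TIME_WAITED_MICRO), and two snapshots taken within the same instance startup. The substitution variables &begin_snap and &end_snap are placeholders, not values from this case.

```sql
-- Wait-event deltas between two Statspack snapshots, largest wait time first.
-- Assumption: both snapshots belong to the same dbid, instance, and startup.
SELECT e.event,
       e.total_waits - NVL(b.total_waits, 0) AS waits,
       ROUND((e.time_waited_micro - NVL(b.time_waited_micro, 0)) / 1000000) AS time_s
FROM   perfstat.stats$system_event e
       LEFT JOIN perfstat.stats$system_event b
              ON  b.event           = e.event
              AND b.dbid            = e.dbid
              AND b.instance_number = e.instance_number
              AND b.snap_id         = &begin_snap
WHERE  e.snap_id = &end_snap
ORDER  BY time_s DESC;
```

Unlike the formatted report, this sketch does not filter idle events and does not include CPU time, which Statspack derives from system statistics and not from wait events.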

On acceptable days before the problems appeared, the top Timed Events were CPU
time and some IO-related events. In the first Statspack period that included the problem
window, "latch free" jumped to the top. On days where the performance corrected itself,
CPU returned to the top and everything generally reverted to the statistics seen in the
reports from before the problem. However, on the days where the problem did not correct
itself, IO-related events appeared on top, followed by "latch free" and "buffer busy
waits", while CPU was used much less.

So even the summary information at the beginning of each Statspack report was telling a
different story. The system was displaying three distinct performance profiles, which
could humorously be labeled The Good, The Bad, and The Ugly.

The Good, Statspack Report [1]:

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time                                                       38,323    28.81
db file scattered read                          4,115,952      28,707    21.58
db file sequential read                         2,169,347      18,995    14.28
buffer busy waits                               1,722,685      15,287    11.49
log file sync                                     208,260      12,209     9.18

The Bad, Statspack Report [1]:

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
latch free                                        413,223       1,667    35.14
db file sequential read                           241,965       1,218    25.68
db file scattered read                            485,092         601    12.68
buffer busy waits                                  38,232         259     5.45
CPU time                                                          205     4.32

[1] Unfortunately, the original data was not available at the time of publication. Some
representative values were used in these examples.


The Ugly, Statspack Report [1]:

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
db file scattered read                          3,111,783     118,912    43.42
db file sequential read                         1,408,059      43,565    15.91
latch free                                      4,281,865      30,869    11.27
buffer busy waits                               1,146,414      23,682     8.65
CPU time                                                       20,754     7.58

In The Good, a more detailed survey of the Statspack report showed little CPU used for
parsing or recursive calls. This meant that most of the CPU time was likely being used
to process SQL, and the few IO waits meant that most of the popular data was residing
happily in cache.

The Good, Statspack Report, Statistics [1]:

. . .
Instance Activity Stats for DB: PROD  Instance: prod1  Snaps: 3140 -3141

Statistic                                      Total     per Second    per Trans
--------------------------------- ------------------ -------------- ------------
CPU time                                   3,832,300         11,207         7.58
. . .
parse count (hard)                                21            0.0          0.0
parse time cpu                                   975            0.3          0.5
. . .
recursive cpu usage                              312            0.1          0.1
. . .



In The Bad, the latch free waits led us to study the latch section of the Statspack
report in more detail. In that section, most of the sleeps revolved around the library
cache. The statistics section also showed a higher parse count (hard) than The Good,
and the library cache section added corroboration by reporting a large number of
reloads and invalidations in the SQL AREA (with reloads almost equal to invalidations).

The Bad, Statspack Report, Latch and Library Cache Statistics [1]:

. . .
Instance Activity Stats for DB: PROD  Instance: prod1  Snaps: 3340 -3341

Statistic                                      Total     per Second    per Trans
--------------------------------- ------------------ -------------- ------------
. . .
parse count (failures)                            83            0.0          0.0
parse count (hard)                             1,521            0.4          0.8
parse count (total)                           10,780            8.3         14.9
. . .

Latch Sleep breakdown for DB: PROD  Instance: prod1  Snaps: 3340 -3341
-> ordered by misses desc

                                      Get                            Spin &
Latch Name                       Requests      Misses      Sleeps Sleeps 1->4
-------------------------- -------------- ----------- ----------- ------------
library cache                 143,525,937   1,344,491     218,264 1161117/1551
                                                                  99/22432/574
                                                                  3/0
shared pool                    56,948,537     446,574     105,545 353300/81370
                                                                  /11553/351/0
. . .

Library Cache Activity for DB: PROD  Instance: prod1  Snaps: 3340 -3341
->"Pct Misses" should be very low

                         Get    Pct            Pin    Pct             Invali-
Namespace           Requests   Miss       Requests   Miss    Reloads  dations
--------------- ------------ ------ -------------- ------ ---------- --------
BODY                   6,679    0.0          6,679    0.0          0        0
CLUSTER                  223    0.4            284    0.7          0        0
INDEX                    781   11.8            710   13.0          0        0
SQL AREA           2,490,099    3.5     37,707,666    0.6     28,618      341
TABLE/PROCEDURE    3,015,349    0.2        940,923    7.2     23,303        0
TRIGGER               12,400    0.0         12,400    0.0          0        0
. . .
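The figures in these report sections come from dynamic performance views that can also be sampled live while a slowdown is in progress. The queries below are a hedged sketch, not part of the original investigation. The values are cumulative since instance startup, so two samples must be compared to get a rate. The latch names assume a release that still uses the library cache latch, as the one in this case did.

```sql
-- Library cache reloads and invalidations by namespace (cumulative).
SELECT namespace, gets, pins, reloads, invalidations
FROM   v$librarycache
ORDER  BY reloads DESC;

-- Sleeps on the two latches highlighted in the report above.
SELECT name, gets, misses, sleeps
FROM   v$latch
WHERE  name IN ('library cache', 'shared pool')
ORDER  BY sleeps DESC;

-- Parse counters: total, hard, and failures.
SELECT name, value
FROM   v$sysstat
WHERE  name LIKE 'parse count%';
```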



Finally, The Ugly report showed much higher IO activity than the other two profiles.
The hard parsing activity had disappeared. More telling, a regularly executed SQL
statement appeared in Top SQL sorted by Reads, yet it was nowhere to be found in the
Top Reads of the other reports. It was listed in the Top SQL sorted by Executions in
all reports.

The Ugly, Statspack Report, Top SQL [1]:

. . .
SQL ordered by Reads for DB: PROD  Instance: prod1  Snaps: 3440 -3441
-> End Disk Reads Threshold: 1000

                                                        CPU    Elapsd
 Physical Reads   Executions Reads per Exec %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
        490,078           58        8,449.6   14.0   367.73    664.48 3381540416
SELECT col1, col2, col3 FROM large_table
. . .

SQL ordered by Executions for DB: PROD  Instance: prod1  Snaps: 3440 -3441
-> End Executions Threshold: 10

                                                  CPU per   Elap per
  Executions  Rows Processed    Rows per Exec    Exec (s)   Exec (s) Hash Value
------------ --------------- ---------------- ----------- ---------- ----------
          58         390,304           6729.4        0.00       0.00 1208562063
SELECT col1, col2, col3 FROM large_table
. . .
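A similar ranking can be pulled from the shared pool without waiting for the next snapshot. This is a minimal sketch that assumes the statement is still cached. It uses HASH_VALUE because that is the identifier shown in these reports, and the cutoff of ten rows is arbitrary.

```sql
-- Cached statements ranked by physical reads, with reads per execution.
SELECT *
FROM  (SELECT hash_value,
              executions,
              disk_reads,
              ROUND(disk_reads / GREATEST(executions, 1), 1) AS reads_per_exec,
              sql_text
       FROM   v$sql
       ORDER  BY disk_reads DESC)
WHERE  rownum <= 10;
```

A statement that normally appears only in the Executions ranking and then suddenly shows a high reads_per_exec is a candidate for a changed execution plan.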

Conclusion and Learnings

At this point, it would be easy to jump to the conclusion that some user or job had
executed DBMS_STATS (or ANALYZE) against some key tables. This would invalidate
many SQL statements referencing those tables so that they could be reparsed with the
new cost-based optimizer (CBO) statistics, explaining the high latch contention reported
in the Bad profile. Then, on certain days, these statistics led the CBO to choose poor
execution plans, causing the Ugly profile. Thus, one potential solution would be to
stop gathering table statistics in the middle of the day. Another possible fix would be to
preserve the good execution plan for the rogue SQL statement using an OUTLINE.

Unfortunately, there were a couple of holes in that theory that would at least rule out
the first solution. Chiefly, the customer was insistent that no DBA or scheduled job was
gathering new statistics against the objects. Secondly, even if this led to bad execution
plans, why would performance be restored after the database was shut down and
restarted? [SQL trace output later confirmed that the execution plans would change after
the SQL was invalidated, but good execution plans were restored after a database
instance bounce.]

Some activity was definitely invalidating SQL. So if it was not DBMS_STATS, then
what? If the cursors were just aging out under shared pool space pressure, the report
would show reloads without invalidations. No patches were being applied, no
columns were being added, and so on. No space maintenance was going on, so no indexes
were rebuilt or tables moved. It turns out that just granting or revoking privileges on
objects is sufficient to invalidate dependent SQL as well. At that point, the customer
admitted they had been giving database access to new employees over the last few weeks.
They had not considered that a routine operational task could have this impact.
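Two dictionary checks can help tie invalidations to recent grants. They are offered as a sketch, not as the method used in this case. The first lists cached cursors that have been invalidated and reloaded. The second relies on the documented behavior that LAST_DDL_TIME in DBA_OBJECTS is updated by grants and revokes as well as by other DDL. The one-day window and the excluded owners are arbitrary assumptions.

```sql
-- Cursors invalidated and reloaded since instance startup.
-- An invalidated cursor that has not been reloaded yet will not appear here.
SELECT hash_value, loads, invalidations, executions, sql_text
FROM   v$sql
WHERE  invalidations > 0
ORDER  BY invalidations DESC;

-- Objects whose DDL timestamp moved in the last day (grants and revokes count).
SELECT owner, object_name, object_type, last_ddl_time
FROM   dba_objects
WHERE  last_ddl_time > SYSDATE - 1
AND    owner NOT IN ('SYS', 'SYSTEM')
ORDER  BY last_ddl_time DESC;
```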

But what could cause the execution plan to change once in a while when a particular
SQL statement was hard parsed again? Dynamic sampling was not enabled. This left a
more obscure feature called bind peeking.

If optimizer_features_enable is 9.0.0 or higher, the CBO calculates some costs for
inequality predicates (or equality predicates against columns with histograms) based on
the bind values supplied by the first session that (re)loads the SQL into the shared
pool. The SQL in question happened to fit the criteria where the bind value supplied
could significantly affect what the CBO thought was the best execution plan. A simple
example appears at the end of this study.

With all the observations and evidence now falling into place, the consensus was that the
safest solution was to limit new user additions to off-hours maintenance windows. The
customer also created an OUTLINE for the problem SQL. Adding a HINT was not possible
because it was a third-party application. Changing optimizer_features_enable to disable
bind peeking might have had negative impacts on other SQL.
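For readers unfamiliar with stored outlines, the general shape of that fix is sketched below. The outline name and category are hypothetical, and the statement is the placeholder text shown in the reports above, not the customer's actual SQL. The text after ON must match the application's statement exactly. The outline captures whatever plan the optimizer chooses at creation time, so the plan should be verified before relying on it. Creating an outline requires the CREATE ANY OUTLINE privilege.

```sql
-- Hypothetical names: large_table_ol (outline), app_outlines (category).
CREATE OR REPLACE OUTLINE large_table_ol
FOR CATEGORY app_outlines
ON SELECT col1, col2, col3 FROM large_table;

-- USE_STORED_OUTLINES is not an initialization parameter, so this
-- setting has to be re-issued after each instance startup.
ALTER SYSTEM SET use_stored_outlines = app_outlines;

-- Confirm the outline exists and whether it has been used yet.
SELECT name, category, used, sql_text
FROM   dba_outlines
WHERE  category = 'APP_OUTLINES';
```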

References

The scripts below illustrate how invalidation and bind peeking work. They were tested
against a recently installed vanilla 10gR2 seed database. Ideally, you should manually go
back and forth between two SQL*Plus sessions sitting side by side to best view what is
happening.

Session 1:

REM T1: Set up test case
var x number;
var y number;
create table bigtab as select * from all_objects;
create index bt_ix on bigtab (object_id);
execute dbms_stats.gather_table_stats (ownname=>'SCOTT', -
tabname=>'BIGTAB', CASCADE => TRUE, -
method_opt => 'FOR ALL COLUMNS SIZE 1');

REM Start with narrow range of values
begin :x := 1000; :y := 1001; end;
/
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y;
REM Go to session 2 (T2) - exec plan should be index range scan

REM T3: now let's re-execute with a wide range
begin :x := 0; :y := 50000; end;
/
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y;
REM Go to session 2 again (T4) - exec plan should still be the same

REM T5: So let's invalidate this puppy
GRANT SELECT ON BIGTAB TO ORDSYS;
REM now query against v$sql_plan and v$sql in session 2 (T6) shows
REM "no rows selected"

REM T7: so now let's reload the cursor
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y;
REM now session 2 (T8) shows a new plan with FAST FULL SCAN and it will
REM be used from now on no matter what the bind values are.

REM T9: so let's turn bind peeking off
alter session set "_optim_peek_user_binds" = false;
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y;
REM now v$sqlarea (T10) will show version count of 2 (because some other
REM session may still have bind peeking enabled)

Session 2:

column operation format a20
column options format a20
column object_name format a20

REM T2: After initial load with narrow range binds
select sql_id, sql_text from v$sql where sql_text like
'SELECT COUNT(*) FROM BIGTAB%';

SQL_ID
-------------
SQL_TEXT
-----------------------------------------------------------------
2aa40mj45939v
SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y

select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

OPERATION            OPTIONS              OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT                 AGGREGATE
FILTER
INDEX                RANGE SCAN           BT_IX

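REM T4: Plan for second SQL still the same
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

OPERATION            OPTIONS              OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT                 AGGREGATE
FILTER
INDEX                RANGE SCAN           BT_IX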

REM T6: after grant is issued
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

no rows selected

select sql_id from v$sql where sql_id = '2aa40mj45939v';

no rows selected

REM T8: Now cursor is reloaded
select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v';

OPERATION            OPTIONS              OBJECT_NAME
-------------------- -------------------- --------------------
SELECT STATEMENT
SORT                 AGGREGATE
FILTER
INDEX                FAST FULL SCAN       BT_IX

select loads, invalidations, executions, version_count
from v$sqlarea
where sql_id = '2aa40mj45939v';

     LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
---------- ------------- ---------- -------------
         2             1          1             1

REM T10: After session altered "_optim_peek_user_binds" = FALSE
REM and query re-executed.
select loads, invalidations, executions, version_count
from v$sqlarea
where sql_id = '2aa40mj45939v';

     LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
---------- ------------- ---------- -------------
         3             1          2             2

select operation, options, object_name from v$sql_plan
where sql_id = '2aa40mj45939v' order by plan_hash_value;

OPERATION            OPTIONS              OBJECT_NAME
-------------------- -------------------- --------------------
FILTER
SORT                 AGGREGATE
SELECT STATEMENT
INDEX                FAST FULL SCAN       BT_IX
FILTER
SORT                 AGGREGATE
SELECT STATEMENT
