oracle press case study - the mysterious performance drop

Upload: jpaulino

Post on 30-May-2018

  • 8/14/2019 Oracle Press Case Study - The Mysterious Performance Drop

    Case Study: The Mysterious Performance Drop

    Author: Roderick Manalac, Consulting Technical Advisor, Oracle USA

    Skill Level Rating for this Case Study: Expert

    About Oracle Case Studies

    Oracle Case Studies are intended as learning tools and for sharing information or

    knowledge related to a complex event, process, procedure, or to a series of related

    events. Each case study is written based upon the experience that the writer/s

    encountered.

    Each Case Study contains a skill level rating. The rating provides an indication of what

    skill level the reader should have as it relates to the information in the case study.

    Ratings are:

    Expert: significant experience with the subject matter

    Intermediate: some experience with the subject matter

    Beginner: little experience with the subject matter

    Case Study Abstract

    Sometimes the simplest or seemingly innocent actions can have significant ramifications

    on the performance of a very busy system. Diagnosing these types of problems

    sometimes requires some understanding of obscure Oracle behaviors. This article will

    describe how two minor features combined can cause an interesting performance issue,

    and how the issue was diagnosed and resolved.

    Case History

    A customer's Applications environment slowed down every afternoon for several

    consecutive workdays. On most days, the slowdown would last only 10 or 20 minutes

    and then performance would return to normal. However, on a few days, performance would degrade and

    remain unacceptable or continually worsen until they were forced to shut down and

    restart (bounce) the database during business hours. Then, good performance would

    return until the following afternoon. The Application had been running fine for the

    months prior to these events. The customer stated that nothing had changed recently in

    the environment to trigger this behavior: no patches were applied; no hardware was

    added or removed.

    Analysis

    Fortunately, the customer already had Statspack configured to capture performance

    snapshots every 30 minutes, so it was time to look at some Statspack reports. Ideally,

    one would look for significant differences between a normal processing day with

    acceptable performance and a bad day. In this case, we also had the luxury of analyzing

    the periods immediately before, during, and after the performance issue.

    On acceptable days before the problems appeared, the top Timed Events were CPU

    time and some IO related events. In the first Statspack period including the problem

    window, "latch free" jumped to the top. On days where the performance corrected itself,

    CPU returned to the top and everything generally reverted to stats seen in the BEFORE

    problem reports. However, on the days where the problem did not correct itself, IO

    related events appeared on top followed by "latch free" and "buffer busy waits" while

    CPU was not getting used as much.

    So, even the summary info at the beginning of each Statspack report was telling a

    different story. The system was displaying three distinct performance profiles, which

    could humorously be labeled The Good, The Bad, and The Ugly.

    The Good, Statspack Report [1]:

    Top 5 Timed Events
    ~~~~~~~~~~~~~~~~~~                                                     % Total
    Event                                               Waits    Time (s) Ela Time
    -------------------------------------------- ------------ ----------- --------
    CPU time                                                       38,323    28.81
    db file scattered read                          4,115,952      28,707    21.58
    db file sequential read                         2,169,347      18,995    14.28
    buffer busy waits                               1,722,685      15,287    11.49
    log file sync                                     208,260      12,209     9.18

    The Bad, Statspack Report [1]:

    Top 5 Timed Events
    ~~~~~~~~~~~~~~~~~~                                                     % Total
    Event                                               Waits    Time (s) Ela Time
    -------------------------------------------- ------------ ----------- --------
    latch free                                        413,223       1,667    35.14
    db file sequential read                           241,965       1,218    25.68
    db file scattered read                            485,092         601    12.68
    buffer busy waits                                  38,232         259     5.45
    CPU time                                                          205     4.32

    [1] Unfortunately, the original data was not available at the time of publication. Some representative values were used

    in these examples.

    The Ugly, Statspack Report [1]:

    Top 5 Timed Events
    ~~~~~~~~~~~~~~~~~~                                                     % Total
    Event                                               Waits    Time (s) Ela Time
    -------------------------------------------- ------------ ----------- --------
    db file scattered read                          3,111,783     118,912    43.42
    db file sequential read                         1,408,059      43,565    15.91
    latch free                                      4,281,865      30,869    11.27
    buffer busy waits                               1,146,414      23,682     8.65
    CPU time                                                       20,754     7.58
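
    One way to compare the three profiles beyond the event ranking is to divide Time (s) by Waits to get an average time per wait. The sketch below is only an illustration: it uses the representative figures from the tables above (not the customer's original data), and the helper name avg_wait_ms is an assumption made for this example.

```python
# Representative figures copied from the Top 5 Timed Events tables above:
# event -> (waits, time waited in seconds) for each profile.
PROFILES = {
    "Good": {
        "db file scattered read": (4_115_952, 28_707),
        "db file sequential read": (2_169_347, 18_995),
        "buffer busy waits": (1_722_685, 15_287),
    },
    "Bad": {
        "latch free": (413_223, 1_667),
        "db file sequential read": (241_965, 1_218),
        "db file scattered read": (485_092, 601),
        "buffer busy waits": (38_232, 259),
    },
    "Ugly": {
        "db file scattered read": (3_111_783, 118_912),
        "db file sequential read": (1_408_059, 43_565),
        "latch free": (4_281_865, 30_869),
        "buffer busy waits": (1_146_414, 23_682),
    },
}


def avg_wait_ms(waits, seconds):
    """Average time per wait in milliseconds (hypothetical helper)."""
    return 1000.0 * seconds / waits


def report():
    """Print the average wait per event for each profile."""
    for name, events in PROFILES.items():
        print(name)
        for event, (waits, seconds) in events.items():
            print(f"  {event:<26}{avg_wait_ms(waits, seconds):8.2f} ms")


if __name__ == "__main__":
    report()
```

    With these representative numbers, the average db file scattered read rises from roughly 7 ms in The Good to roughly 38 ms in The Ugly, which is consistent with an IO subsystem under much heavier load.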

    In The Good, a more detailed survey of the Statspack report showed little CPU used for

    parsing or recursive calls. This meant that most of the CPU time was likely being used

    to process SQL, and the few IO waits meant that most of the popular data was residing

    happily in cache.

    The Good, Statspack Report, Statistics [1]:

    . . .
    Instance Activity Stats for DB: PROD  Instance: prod1  Snaps: 3140 -3141

    Statistic                                      Total     per Second    per Trans
    --------------------------------- ------------------ -------------- ------------
    CPU time                                   3,832,300         11,207         7.58
    . . .
    parse count (hard)                                21            0.0          0.0
    parse time cpu                                   975            0.3          0.5
    . . .
    recursive cpu usage                              312            0.1          0.1
    . . .

    In The Bad, the latch free waits led us to study the latch section of Statspack in more

    detail. In that section, most of the sleeps appeared to revolve around the library cache. The

    statistics section also showed a higher parse count (hard) compared to The Good, and

    the library cache section added more corroboration by reporting a large number of

    reloads and invalidations in the SQL AREA (with reloads almost equal to invalidations).

    The Bad, Statspack Report, Latch and Library Cache Statistics [1]:

    . . .
    Instance Activity Stats for DB: PROD  Instance: prod1  Snaps: 3340 -3341

    Statistic                                      Total     per Second    per Trans
    --------------------------------- ------------------ -------------- ------------
    . . .
    parse count (failures)                            83            0.0          0.0
    parse count (hard)                             1,521            0.4          0.8
    parse count (total)                           10,780            8.3         14.9
    . . .

    Latch Sleep breakdown for DB: PROD  Instance: prod1  Snaps: 3340 -3341
    -> ordered by misses desc

                                          Get                            Spin &
    Latch Name                       Requests      Misses      Sleeps Sleeps 1->4
    -------------------------- -------------- ----------- ----------- ------------
    library cache                 143,525,937   1,344,491     218,264 1161117/1551
                                                                      99/22432/574
                                                                      3/0
    shared pool                    56,948,537     446,574     105,545 353300/81370
                                                                      /11553/351/0
    . . .

    Library Cache Activity for DB: PROD  Instance: prod1  Snaps: 3340 -3341
    ->"Pct Misses" should be very low

                             Get    Pct            Pin    Pct             Invali-
    Namespace           Requests   Miss       Requests   Miss    Reloads  dations
    --------------- ------------ ------ -------------- ------ ---------- --------
    BODY                   6,679    0.0          6,679    0.0          0        0
    CLUSTER                  223    0.4            284    0.7          0        0
    INDEX                    781   11.8            710   13.0          0        0
    SQL AREA           2,490,099    3.5     37,707,666    0.6     28,618      341
    TABLE/PROCEDURE    3,015,349    0.2        940,923    7.2     23,303        0
    TRIGGER               12,400    0.0         12,400    0.0          0        0
    . . .
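
    The latch and parse figures above are easier to judge as ratios than as raw counts. The following sketch derives a miss rate (misses per get), a sleep rate (sleeps per miss), and the share of parses that were hard parses. The numbers are the representative values from the report excerpt, and the function names pct and latch_ratios are assumptions made for this illustration, not part of Statspack.

```python
def pct(part, whole):
    """Return part as a percentage of whole (0.0 when whole is zero)."""
    return 100.0 * part / whole if whole else 0.0


# Representative values copied from The Bad report excerpt above.
LATCHES = {
    # latch name: (get requests, misses, sleeps)
    "library cache": (143_525_937, 1_344_491, 218_264),
    "shared pool": (56_948_537, 446_574, 105_545),
}
PARSES = {"hard": 1_521, "total": 10_780}


def latch_ratios(gets, misses, sleeps):
    """Miss rate (% of gets) and sleep rate (% of misses) for one latch."""
    return pct(misses, gets), pct(sleeps, misses)


if __name__ == "__main__":
    for name, (gets, misses, sleeps) in LATCHES.items():
        miss_pct, sleep_pct = latch_ratios(gets, misses, sleeps)
        print(f"{name:<14} miss {miss_pct:5.2f}%  sleeps/miss {sleep_pct:5.2f}%")
    print(f"hard parses: {pct(PARSES['hard'], PARSES['total']):.1f}% of all parses")
```

    In this sample, fewer than 1% of library cache gets miss, but about one in six of those misses ends in a sleep, and about 14% of parses are hard parses. This is the kind of pattern the text associates with cursors being invalidated and rebuilt.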

    Finally, The Ugly report showed much higher IO activity than the other two profiles.

    The hard parsing activity had disappeared. More telling, a regularly executed SQL

    statement that appeared in Top SQL sorted by most Reads was nowhere to be found in

    the Top Reads in other reports. It was listed in the Top SQL sorted by Executions in

    all reports.

    The Ugly, Statspack Report, Top SQL [1]:

    . . .
    SQL ordered by Reads for DB: PROD  Instance: prod1  Snaps: 3440 -3441
    -> End Disk Reads Threshold: 1000

                                                            CPU    Elapsd
     Physical Reads   Executions Reads per Exec %Total Time (s)  Time (s) Hash Value
    --------------- ------------ -------------- ------ -------- --------- ----------
            490,078           58        8,449.6   14.0   367.73    664.48 3381540416
    SELECT col1, col2, col3 FROM large_table
    . . .

    SQL ordered by Executions for DB: PROD  Instance: prod1  Snaps: 3440 -3441
    -> End Executions Threshold: 10

                                                      CPU per   Elap per
      Executions  Rows Processed    Rows per Exec    Exec (s)   Exec (s) Hash Value
    ------------ --------------- ---------------- ----------- ---------- ----------
              58         390,304           6729.4        0.00       0.00 1208562063
    SELECT col1, col2, col3 FROM large_table
    . . .

    Conclusion and Learnings

    At this point, it would be easy to jump to the conclusion that some user or job had

    executed DBMS_STATS (or ANALYZE) against some key tables. This would invalidate

    many SQL statements referencing those tables, so that they could be reparsed with the

    new cost-based optimizer (CBO) statistics (explaining the high latch contention reported

    in the Bad profile). Then on certain days, these statistics led CBO to choose poor

    execution plans (causing the Ugly profile). Thus, one potential solution would be to

    stop gathering table statistics in the middle of the day. Another possible fix would be to

    preserve the good execution plan for the rogue SQL statement using an OUTLINE.

    Unfortunately, there were a couple of holes with that theory that would at least rule out

    the first solution. Chiefly, the customer was insistent that no DBA or scheduled job was

    gathering new statistics against the objects. Secondly, even if this led to bad execution

    plans, why would performance be restored after the database was shut down and

    restarted? [SQL trace output later confirmed that the execution plans would change after

    the SQL was invalidated, but good execution plans were restored after a database

    instance bounce.]

    Some activity was definitely invalidating SQL. So if it was not DBMS_STATS, then

    what? If the cursors were just aging out under shared pool space pressure, the report

    would show reloads without invalidations. No patches were being applied; no

    columns were being added, etc. No space maintenance was going on, so no indexes were

    rebuilt or tables moved. It turns out that just granting or revoking privileges on objects to

    users is sufficient to invalidate dependent SQL as well. At that point, the customer did

    admit they had been giving database access to new employees over the last few weeks.

    They had not considered that a routine operational change could have this impact.

    But what could cause the execution plan to change once in a while when a particular

    SQL statement was hard parsed again? Dynamic sampling was not enabled. This left a

    more obscure feature called bind peeking.

    If optimizer_features_enable = 9.0.0 or higher, then CBO will calculate some costs for

    inequality predicates (or equality predicates against columns with histograms) based on

    the bind values supplied by the first person who (re)loads SQL into the shared pool. The

    SQL in question happened to fit the criteria where the bind value supplied could

    significantly impact what CBO thought was the best execution plan. A simple example

    appears at the end of this study.
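
    The interaction described above can also be pictured with a small toy model, separate from the SQL*Plus scripts in this study. The sketch below is not Oracle's cost model: the class names, the 50,000-row table size, and the 5% selectivity cut-off are all hypothetical values chosen for illustration. It only demonstrates the mechanism: the plan is chosen from the bind values seen at hard parse, reused for later executions, and chosen again from whatever binds arrive first after an invalidation.

```python
from dataclasses import dataclass

TABLE_ROWS = 50_000      # hypothetical number of distinct OBJECT_ID values
INDEX_THRESHOLD = 0.05   # hypothetical selectivity cut-off, not Oracle's costing


def choose_plan(lo, hi):
    """Pick a plan from the estimated fraction of rows in [lo, hi]."""
    matched = max(0, min(hi, TABLE_ROWS) - max(lo, 0) + 1)
    selectivity = matched / TABLE_ROWS
    if selectivity <= INDEX_THRESHOLD:
        return "INDEX RANGE SCAN"
    return "INDEX FAST FULL SCAN"


@dataclass
class Cursor:
    plan: str
    loads: int = 1
    invalidations: int = 0
    valid: bool = True


class SharedPool:
    """Toy cursor cache: the plan is fixed by the binds seen at hard parse."""

    def __init__(self):
        self.cursors = {}

    def execute(self, sql, lo, hi):
        cur = self.cursors.get(sql)
        if cur is None:
            # First load: peek at this caller's binds.
            cur = self.cursors[sql] = Cursor(plan=choose_plan(lo, hi))
        elif not cur.valid:
            # Reload after invalidation: peek again at the current binds.
            cur.plan = choose_plan(lo, hi)
            cur.loads += 1
            cur.valid = True
        return cur.plan  # otherwise the cached plan is reused as-is

    def invalidate(self, sql):
        """Model a GRANT or REVOKE on an object the cursor depends on."""
        cur = self.cursors.get(sql)
        if cur is not None and cur.valid:
            cur.valid = False
            cur.invalidations += 1


if __name__ == "__main__":
    pool = SharedPool()
    sql = "SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y"
    print(pool.execute(sql, 1000, 1001))  # narrow binds at first parse
    print(pool.execute(sql, 0, 50000))    # wide binds reuse the cached plan
    pool.invalidate(sql)                  # e.g. a GRANT during business hours
    print(pool.execute(sql, 0, 50000))    # reparse peeks at the wide binds
    print(pool.execute(sql, 1000, 1001))  # narrow binds inherit that plan
```

    In this model, as in the case history, whether the afternoon goes badly depends on which bind values happen to be supplied by the first execution after the invalidation.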

    With all the observations and evidence now falling into place, the consensus was that the

    safest solution was to limit new user additions to off-hours maintenance windows. They

    also created an OUTLINE for the problem SQL. Adding a HINT was not possible given

    that it was a third-party application. Changing optimizer_features_enable to disable bind

    peeking might have had negative impacts on other SQL.

    References

    The scripts below illustrate how invalidation and bind peeking work. They were tested

    against a recently installed vanilla 10gR2 seed database. Ideally you need to manually go

    back and forth between two SQL*Plus sessions sitting side-by-side to best view what is

    happening.

    Session 1:

    REM T1: Set up test case
    var x number;
    var y number;
    create table bigtab as select * from all_objects;
    create index bt_ix on bigtab (object_id);
    execute dbms_stats.gather_table_stats (ownname=>'SCOTT', -
        tabname=>'BIGTAB', CASCADE => TRUE, -
        method_opt => 'FOR ALL COLUMNS SIZE 1');

    REM Start with narrow range of values
    begin :x := 1000; :y := 1001; end;
    /
    SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
    REM Go to session 2 - exec plan should be index range scan

    REM T3: now let's re-execute with a wide range
    begin :x := 0; :y := 50000; end;
    /
    SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
    REM Go to session 2 again - exec plan should still be the same

    REM T5: So let's invalidate this puppy
    GRANT SELECT ON BIGTAB TO ORDSYS;
    REM now query against v$sql_plan and v$sql in session 2 shows
    REM "no rows selected"

    REM T7: so now let's reload the cursor
    SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
    REM now session 2 shows a new plan with FAST FULL SCAN and it will
    REM be used from now on no matter what the bind values are.

    REM T9: so let's turn bind peeking off
    alter session set "_optim_peek_user_binds" = false;
    SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x and :y;
    REM now v$sqlarea will show version count of 2 (because some other
    REM session may still have bind peeking enabled)

    Session 2:

    column operation format a20
    column options format a20
    column object_name format a20

    REM T2: After initial load with narrow range binds
    select sql_id, sql_text from v$sql where sql_text like
    'SELECT COUNT(*) FROM BIGTAB%';

    SQL_ID
    -------------
    SQL_TEXT
    -----------------------------------------------------------------
    2aa40mj45939v
    SELECT COUNT(*) FROM BIGTAB WHERE OBJECT_ID BETWEEN :x AND :y

    select operation, options, object_name from v$sql_plan
    where sql_id = '2aa40mj45939v';

    OPERATION            OPTIONS              OBJECT_NAME
    -------------------- -------------------- --------------------
    SELECT STATEMENT
    SORT                 AGGREGATE
    FILTER
    INDEX                RANGE SCAN           BT_IX

    REM T4: Plan for second SQL still the same
    select operation, options, object_name from v$sql_plan
    where sql_id = '2aa40mj45939v';

    OPERATION            OPTIONS              OBJECT_NAME
    -------------------- -------------------- --------------------
    SELECT STATEMENT
    SORT                 AGGREGATE
    FILTER
    INDEX                RANGE SCAN           BT_IX

    REM T6: after grant is issued
    select operation, options, object_name from v$sql_plan
    where sql_id = '2aa40mj45939v';

    no rows selected

    select sql_id from v$sql where sql_id = '2aa40mj45939v';

    no rows selected

    REM T8: Now cursor is reloaded
    select operation, options, object_name from v$sql_plan
    where sql_id = '2aa40mj45939v';

    OPERATION            OPTIONS              OBJECT_NAME
    -------------------- -------------------- --------------------
    SELECT STATEMENT
    SORT                 AGGREGATE
    FILTER
    INDEX                FAST FULL SCAN       BT_IX

    select loads, invalidations, executions, version_count
    from v$sqlarea
    where sql_id = '2aa40mj45939v';

         LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
    ---------- ------------- ---------- -------------
             2             1          1             1

    REM T10: After session altered "_optim_peek_user_binds" = FALSE
    REM and query re-executed.
    select loads, invalidations, executions, version_count
    from v$sqlarea
    where sql_id = '2aa40mj45939v';

         LOADS INVALIDATIONS EXECUTIONS VERSION_COUNT
    ---------- ------------- ---------- -------------
             3             1          2             2

    select operation, options, object_name from v$sql_plan
    where sql_id = '2aa40mj45939v' order by plan_hash_value;

    OPERATION            OPTIONS              OBJECT_NAME
    -------------------- -------------------- --------------------
    FILTER
    SORT                 AGGREGATE
    SELECT STATEMENT
    INDEX                FAST FULL SCAN       BT_IX
    FILTER
    SORT                 AGGREGATE
    SELECT STATEMENT
    INDEX                RANGE SCAN           BT_IX
