A Smart Optimizer Gets Even Smarter: Demystifying Oracle 12c Database’s New Histograms

My Credentials

• 30+ years of database-centric IT experience
• Oracle DBA since 2001
• Oracle 9i, 10g, 11g OCP
• Oracle ACE Director
• > 100 articles on databasejournal.com and ioug.org
• Oracle-centric blog (Generally, It Depends)
• Regular speaker at Oracle OpenWorld, IOUG COLLABORATE, and OTN ACE Tours
• Oracle University instructor – core Oracle DBA courses

Coming Soon To a Bookstore Near You …

Coming in June 2015 from Oracle Press: Oracle Database Upgrade, Migration & Transformation Tips & Techniques

• Covers everything you need to know to upgrade, migrate, and transform any Oracle 10g or 11g database to Oracle 12c

• Discusses strategy and tactics of planning Oracle migration, transformation, and upgrade projects

• Explores latest transformation features:
  • Recovery Manager (RMAN)
  • Oracle GoldenGate
  • Cross-Platform Transportable Tablespaces
  • Cross-Platform Transport (CPT)
  • Full Transportable Export (FTE)

• Includes detailed sample code

Our Agenda

• Histograms: What Good Are They?
• Frequency vs. Height-Balanced Histograms
• Histograms: Epic Fail?
• Top Frequency Histograms
• Hybrid Histograms
• Histograms: Practical Examples
• Q+A

It’s Vegas, Baby!


Bottom line: The Oracle query optimizer is like a Vegas oddsmaker.

If it gets the query’s execution plan right, we’re on Easy Street …

… but if it gets it wrong, then we get to meet Tony in a dark alley.

So ... how do we tilt the odds in the optimizer’s favor?

Simple! We teach the optimizer to “count cards”

Histograms: What Good Are They?

FACT: Australia, India, and several other countries now legally recognize Trans-Gender as a completely distinct gender …

… but the number of people declaring themselves as Trans-Gender is microscopic when compared to those declaring themselves as either (M)ale or (F)emale.

Without a histogram, the Oracle optimizer would assume that all three gender values are equally distributed …

[Chart: assumed distribution without a histogram: Male 33.3%, Female 33.3%, TransGender 33.4%]

… but with a histogram, the skewed distribution of Trans-Gender is dramatically obvious, and the optimizer now knows the proper cardinality whenever a predicate includes that value.

[Chart: Indian Gender (in Millions): M 646.8, F 558, X 0.01]
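The gap between the uniform assumption and the real distribution can be put in numbers. The sketch below is only an illustration in Python, not anything the optimizer runs: it takes the three population figures from the slide and compares them with the "total divided by NDV" guess that applies when no histogram exists. The variable names are made up for this example.

```python
# Population figures from the slide, in millions (M, F, X).
population = {"M": 646.8, "F": 558.0, "X": 0.01}

total = sum(population.values())
ndv = len(population)  # number of distinct values

# Without a histogram, every distinct value is assumed to match total / NDV rows.
uniform_estimate = total / ndv

for value, actual in population.items():
    error = uniform_estimate / actual
    print(f"{value}: uniform estimate {uniform_estimate:.2f}M, "
          f"actual {actual:.2f}M, estimate is {error:.1f}x the actual")
```

For the X value the uniform guess overshoots by several orders of magnitude, which is exactly the kind of misestimate a histogram is meant to prevent.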

Histograms: Terminology

Histograms: Frequency distribution methods
• Frequency: Plots how often a value occurs
• Height-Balanced: Distributes counts of values equally per bucket

NDV: Number of Distinct Values
• Calculated while statistics are gathered

NB: Number of histogram Buckets
• Maximum of 254, but can be lower

Skewness: A measure of how evenly data values are distributed within a population or sample set

Pre-12c

Classic (Pre-12c) Histograms


Histograms: A Sample Use Case

• Table AP.RANDOMIZED_SORTED: 100,000 rows
• Column KEY_STS clustered around just two values: 06 (CA) = 50% and 36 (NY) = 10%
• All other KEY_STS values distributed randomly but at much lower frequency among 15 other values
• Therefore, NDV = 17 and NB = 17

[Chart: Distribution of KEY_STS Values (Note: FIPS decoded to US State for clarity). X-axis: CA, NY, IL, NJ, MT, OH, MI, WI, VA, DE, IN, DC, ID, MD, AK, WA, OR; Y-axis: 0 to 50,000]

Pre-12c

Initial Test Data Population

. . .
FOR ctr IN 1..100000 LOOP
  INSERT INTO ap.randomized_sorted
  VALUES(
     ctr
    ,(TO_DATE('12/31/2013','mm/dd/yyyy') - DBMS_RANDOM.VALUE(1,3650))
    ,LPAD(' ',DBMS_RANDOM.VALUE(1,32)
         ,SUBSTR('abcdefghijklmnopqrstuvwxyz'
                ,DBMS_RANDOM.VALUE(1,26), 1))
     -- KEY_STS: slots 50..99 fall through to the default 06 (about 50%);
     -- every fifth slot from 4 to 49 maps to 36 (about 10%)
    ,DECODE(MOD(ROUND(DBMS_RANDOM.VALUE(1,100000),0),100)
           , 0,10,  1,30,  2,17,  3,10,  4,36
           , 5,11,  6,02,  7,18,  8,11,  9,36
           ,10,24, 11,30, 12,26, 13,17, 14,36
           ,15,34, 16,41, 17,39, 18,34, 19,36
           ,20,51, 21,34, 22,55, 23,51, 24,36
           ,25,17, 26,10, 27,30, 28,17, 29,36
           ,30,18, 31,16, 32,16, 33,18, 34,36
           ,35,26, 36,55, 37,30, 38,26, 39,36
           ,40,39, 41,34, 42,53, 43,39, 44,36
           ,45,55, 46,51, 47,34, 48,17, 49,36, 06));
  IF MOD(ctr, 5000) = 0 THEN
    COMMIT;
  END IF;
END LOOP;
COMMIT;
. . .
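To see which distribution the DECODE call is designed to produce, the slot list can be tallied outside the database. The Python sketch below copies the 50 slot values from the loader and assumes each of the 100 MOD slots is equally likely (only approximately true for the ROUND/MOD expression); the names DECODE_SLOTS and DEFAULT_CODE are invented for this illustration.

```python
from collections import Counter

# Slots 0..49 of MOD(..., 100) exactly as listed in the DECODE call.
DECODE_SLOTS = [
    10, 30, 17, 10, 36, 11, 2, 18, 11, 36,
    24, 30, 26, 17, 36, 34, 41, 39, 34, 36,
    51, 34, 55, 51, 36, 17, 10, 30, 17, 36,
    18, 16, 16, 18, 36, 26, 55, 30, 26, 36,
    39, 34, 53, 39, 36, 55, 51, 34, 17, 36,
]
DEFAULT_CODE = 6  # slots 50..99 fall through to the DECODE default, 06


def expected_percentages():
    """Return {FIPS code: expected percent of rows}, assuming uniform slots."""
    slots = DECODE_SLOTS + [DEFAULT_CODE] * (100 - len(DECODE_SLOTS))
    # With exactly 100 slots, the count per code is also its percentage.
    return dict(Counter(slots))


pct = expected_percentages()
for code, share in sorted(pct.items(), key=lambda kv: -kv[1]):
    print(f"{code:02d}: {share}%")
```

Under that assumption the tally gives 17 distinct codes, with 06 at 50% and 36 at 10%, in line with the use case described earlier.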

Frequency Histograms

Histogram targets are automatically chosen:
• Oracle recommends leaving the ESTIMATE_PERCENT parameter at its default (AUTO_SAMPLE_SIZE) whenever DBMS_STATS.GATHER_* procedures are used to gather statistics
• Thus whenever possible, a frequency histogram will be created
• Frequency histograms are most effective whenever NDV <= maximum NB of 254

[Chart: Frequency Histogram Values Distribution. X-axis (KEY_STS): 06, 36, 17, 34, 30, 39, 26, 55, 51, 10, 18, 11, 16, 24, 02, 53, 41; Y-axis: 0 to 50,000]

Pre-12c

Generating a Frequency Histogram

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS (
     ownname    => 'AP'
    ,tabname    => 'RANDOMIZED_SORTED'
    ,method_opt => 'FOR COLUMNS KEY_STS'
  );
END;
/

TTITLE "Histogram Endpoints|for AP.RANDOMIZED_SORTED.KEY_STS|(from DBA_HISTOGRAMS)"
SELECT
     endpoint_number
    ,endpoint_value
    ,endpoint_repeat_count
  FROM dba_histograms
 WHERE owner = 'AP'
   AND table_name = 'RANDOMIZED_SORTED'
   AND column_name = 'KEY_STS'
 ORDER BY 1;
TTITLE OFF

TTITLE "Histogram Metadata|for AP.RANDOMIZED_SORTED.KEY_STS|(from DBA_TAB_COL_STATISTICS)"
SELECT
     histogram
    ,num_distinct
    ,num_buckets
  FROM dba_tab_col_statistics
 WHERE owner = 'AP'
   AND table_name = 'RANDOMIZED_SORTED'
   AND column_name = 'KEY_STS';

Histogram Metadata
for AP.RANDOMIZED_SORTED.KEY_STS
(from DBA_TAB_COL_STATISTICS)

                     # of      # of
                 Distinct Histogram
Histogram Type     Values   Buckets
--------------- --------- ---------
FREQUENCY              17        17

Histogram Endpoints
for AP.RANDOMIZED_SORTED.KEY_STS
(from DBA_HISTOGRAMS)

                     Endpoint
           Endpoint    Repeat
 Endpoint     Value     Count
--------- --------- ---------
     1000        02         0
    51000        06         0
    54000        10         0
    56000        11         0
    58000        16         0
    63000        17         0
    66000        18         0
    67000        24         0
    70000        26         0
    74000        30         0
    79000        34         0
    89000        36         0
    92000        39         0
    93000        41         0
    96000        51         0
    97000        53         0
   100000        55         0
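In a frequency histogram the endpoint numbers are cumulative, so the row count for each value is the difference between its endpoint number and the previous one. The following Python sketch is just an illustration of that arithmetic, using the endpoint pairs from the DBA_HISTOGRAMS listing; the function name is hypothetical.

```python
# (endpoint_number, endpoint_value) pairs copied from the DBA_HISTOGRAMS listing.
endpoints = [
    (1000, 2), (51000, 6), (54000, 10), (56000, 11), (58000, 16), (63000, 17),
    (66000, 18), (67000, 24), (70000, 26), (74000, 30), (79000, 34), (89000, 36),
    (92000, 39), (93000, 41), (96000, 51), (97000, 53), (100000, 55),
]


def frequencies(cumulative_endpoints):
    """Turn cumulative endpoint numbers into per-value row counts."""
    freq = {}
    previous = 0
    for number, value in cumulative_endpoints:
        freq[value] = number - previous
        previous = number
    return freq


freq = frequencies(endpoints)
for value, rows in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(f"KEY_STS {value:02d}: {rows:>6,} rows")
```

The two largest differences belong to 06 and 36, matching the 50% and 10% shares described in the use case.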

Pre-12c

Height-Balanced Histograms

However, if NB < NDV, then a height-balanced histogram will be created instead of a frequency histogram:
• All histogram buckets contain an equal number of values
• When forecasting cardinality, the query optimizer considers the bucket endpoints of the height-balanced histogram to determine just how “popular” values are

[Chart: Height-Balanced Histogram (Only 10 Buckets). X-axis (KEY_STS endpoints): 02, 06, 11, 17, 26, 34, 36, 39, 41, 51, 55; Y-axis: 0 to 10,000]

Pre-12c

Generating a Height-Balanced Histogram

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS (
     ownname          => 'AP'
    ,tabname          => 'RANDOMIZED_SORTED'
    ,method_opt       => 'FOR COLUMNS KEY_STS SIZE 10'
    ,estimate_percent => 100
  );
END;
/

Histogram Metadata
for AP.RANDOMIZED_SORTED.KEY_STS
(from DBA_TAB_COL_STATISTICS)

                     # of      # of
                 Distinct Histogram
Histogram Type     Values   Buckets
--------------- --------- ---------
HEIGHT BALANCED        17        10

Histogram Endpoints
for AP.RANDOMIZED_SORTED.KEY_STS
(from DBA_HISTOGRAMS)

                     Endpoint
           Endpoint    Repeat
 Endpoint     Value     Count
--------- --------- ---------
        0        02         0
        8        06         0
        9        16         0
       10        17         0
       11        26         0
       12        34         0
       14        36         0
       15        51         0
       16        55         0

Because NDV > NB, only a height-balanced histogram can be generated …

… thus, many values are invisible within the histogram, and will be misinterpreted when cardinality is estimated.

10? 11? 30?
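One way to read the height-balanced listing: a value whose endpoint number jumps by more than one from the previous row spans several buckets and is treated as popular, while any value that never appears as an endpoint is invisible. The Python sketch below applies that reading to the endpoint rows shown in the listing; the set of all 17 KEY_STS codes comes from the sample data, and the function name is made up for illustration.

```python
# (endpoint_number, endpoint_value) rows from the height-balanced listing.
hb_endpoints = [
    (0, 2), (8, 6), (9, 16), (10, 17), (11, 26),
    (12, 34), (14, 36), (15, 51), (16, 55),
]

# All 17 distinct KEY_STS codes present in the table.
all_values = {2, 6, 10, 11, 16, 17, 18, 24, 26, 30, 34, 36, 39, 41, 51, 53, 55}


def popular_values(endpoints):
    """Return {value: buckets spanned} for values spanning more than one bucket."""
    popular = {}
    for (prev_number, _), (number, value) in zip(endpoints, endpoints[1:]):
        span = number - prev_number
        if span > 1:
            popular[value] = span
    return popular


popular = popular_values(hb_endpoints)
invisible = sorted(all_values - {value for _, value in hb_endpoints})

print("popular:", popular)
print("invisible:", invisible)
```

Only 06 and 36 show up as popular, and values such as 10, 11, and 30 do not appear as endpoints at all, which is the blind spot the slide points out.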

Pre-12c

Histograms: Unpleasant Conundrums

Histograms before 12c R1 still have some shortfalls:
• Even though DBMS_STATS should choose the best histogram automatically by default, this can still be undone:
  • DBAs sometimes chose the wrong value for NB
  • This could result in a height-balanced instead of frequency histogram being generated
• If NDV > 254, only a height-balanced histogram could be generated
  • Almost-popular values thus tended to be ignored
  • Truly popular values might just barely miss spanning two or more histogram buckets

Pre-12c

New Histograms in 12c R1


Needed: A New Breed of Histograms

Oracle 12c R1 adds two new histograms to address the shortcomings of Frequency and Height-Balanced histograms:
• Top Frequency histograms help address situations when just a few distinct values dominate the population
• Hybrid histograms mitigate situations when many almost-popular values just barely miss being considered through a height-balanced histogram

12c R1

New Histogram Decision Tree

12c R1

New Histogram Formulas

Formula #1: % of Top Popular Values (%TPV) = [SUM(Top N Buckets) / SUM(All Buckets)] * 100

Formula #2: Popular Values Threshold (P) = (1 - (1 / NB)) * 100

Rule #1: When %TPV > P, a Top Frequency Histogram will be generated …

Rule #2: … otherwise, a Hybrid Histogram will be generated instead.
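The two formulas and two rules can be written as a few lines of Python. This is a sketch of the slide's arithmetic only, not Oracle's internal code: the function names are invented here, N is taken to equal NB, and the two row-count lists are hypothetical data chosen to land on either side of the threshold.

```python
def popular_values_threshold(nb):
    """Formula #2: P = (1 - 1/NB) * 100."""
    return (1 - 1 / nb) * 100


def top_popular_values_pct(counts, nb):
    """Formula #1: share of rows held by the NB most frequent values, in percent."""
    top = sorted(counts, reverse=True)[:nb]
    return sum(top) / sum(counts) * 100


def choose_histogram(counts, nb):
    """Rules #1 and #2 as stated on the slide (%TPV > P means Top Frequency)."""
    if top_popular_values_pct(counts, nb) > popular_values_threshold(nb):
        return "TOP FREQUENCY"
    return "HYBRID"


# Hypothetical row counts per distinct value, for illustration only.
skewed = [50, 30, 10, 5, 5]
flat = [20, 20, 20, 20, 20]

print(choose_histogram(skewed, 3))  # top 3 hold 90% of rows, threshold about 66.7%
print(choose_histogram(flat, 3))    # top 3 hold 60% of rows, threshold about 66.7%
```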

I’m really good at math - it’s just numbers I have trouble with.

- Anonymous

12c R1

Top Popular Values Percent: An Example

Popular Values Threshold (P) = (1 - (1 / NB)) * 100

% of Top Popular Values (%TPV) = [SUM(Top N Buckets) / SUM(All Buckets)] * 100

Whenever %TPV > P, a Top Frequency Histogram will be generated …

… otherwise, a Hybrid Histogram will be generated instead.

Highly Popular Values Factoring

NB %TPV P

17  100.0%  94.1%
16   99.0%  93.8%
15   98.0%  93.3%
14   97.0%  92.9%
13   96.0%  92.3%
12   94.0%  91.7%
11   92.0%  90.9%
10   89.0%  90.0%
 9   86.0%  88.9%
 8   83.0%  87.5%
 7   80.0%  85.7%
 6   77.0%  83.3%
 5   74.0%  80.0%
 4   70.0%  75.0%
 3   65.0%  66.7%
 2   60.0%  50.0%

Data Values

US State  FIPS Code   Count
CA        06         50,000
NY        36         10,000
IL        17          5,000
NJ        34          5,000
MT        30          4,000
OH        39          3,000
MI        26          3,000
WI        55          3,000
VA        51          3,000
DE        10          3,000
IN        18          3,000
DC        11          2,000
ID        16          2,000
MD        24          1,000
AK        02          1,000
WA        53          1,000
OR        41          1,000
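The NB, %TPV, and P columns can be recomputed from the data values with a short loop. The Python sketch below does that for NB from 17 down to 2, taking N equal to NB and applying the slide's "%TPV > P" comparison; it is an illustration of the arithmetic only, and the variable names are assumptions of this example.

```python
# Row counts per KEY_STS value, from the Data Values table (CA first, OR last).
counts = [50000, 10000, 5000, 5000, 4000, 3000, 3000, 3000, 3000,
          3000, 3000, 2000, 2000, 1000, 1000, 1000, 1000]

total = sum(counts)
ranked = sorted(counts, reverse=True)

rows = []
for nb in range(len(ranked), 1, -1):
    tpv = sum(ranked[:nb]) / total * 100   # % of Top Popular Values
    p = (1 - 1 / nb) * 100                 # Popular Values Threshold
    rows.append((nb, tpv, p, tpv > p))

for nb, tpv, p, above in rows:
    verdict = "%TPV > P" if above else "%TPV <= P"
    print(f"{nb:>2}  {tpv:5.1f}%  {p:5.1f}%  {verdict}")
```

For this data set %TPV stays above P down to NB = 11 and falls below it from NB = 10 through NB = 3.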

12c R1

Top Frequency Histograms

A top frequency histogram is created whenever highly-popular values dominate:
• Most useful when a small number of distinct values account for the majority of rows
• Gathered using a full table scan
• Appropriate NB setting is automatically chosen
• Non-popular (i.e. statistically insignificant) values are automatically discarded

12c R1

Data Values

US State  FIPS Code   Count
NY        36         15,000
IL        17         15,000
CA        06         15,000
OH        39         15,000
VA        51         10,000
MI        26         10,000
ID        16          3,000
DC        11          2,000
WI        55          2,000
NJ        34          2,000
OR        41          2,000
MT        30          2,000
DE        10          2,000
WA        53          2,000
AK        02          1,000
MD        24          1,000
IN        18          1,000

Advantage of Top Frequency Histograms

Query against AP.RANDOMIZED_SORTED for only extremely popular values:

SQL> EXPLAIN PLAN FOR
     SELECT
         MIN(key_date), MAX(key_date)
       FROM ap.randomized_sorted
      WHERE key_sts IN (39,36,17,6,51,26);

Explained.

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY());

Using Standard Frequency Histogram:

Plan hash value: 4178216183

----------------------------------------------------------------------------------------
| Id  | Operation          | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                   |     1 |    11 |   171   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                   |     1 |    11 |            |          |
|*  2 |   TABLE ACCESS FULL| RANDOMIZED_SORTED | 80000 |   859K|   171   (1)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("KEY_STS"=6 OR "KEY_STS"=17 OR "KEY_STS"=26 OR "KEY_STS"=36 OR
              "KEY_STS"=39 OR "KEY_STS"=51)

Using Top Frequency Histogram:

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("KEY_STS"=6 OR "KEY_STS"=17 OR "KEY_STS"=26 OR "KEY_STS"=36 OR
              "KEY_STS"=39 OR "KEY_STS"=51)
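The Rows figure of 80000 in the plan shown for this query is consistent with simply adding up the per-value counts of the six values in the IN list. The Python sketch below performs that addition on the data values listed for this example (NY 15,000 through IN 1,000). It is a simplification: the function name is invented, and treating values missing from the histogram as zero rows is an assumption of this sketch, not a statement about how the optimizer handles them.

```python
# Row counts per FIPS code, from the data values listed for this example.
counts_by_fips = {
    36: 15000, 17: 15000, 6: 15000, 39: 15000, 51: 10000, 26: 10000,
    16: 3000, 11: 2000, 55: 2000, 34: 2000, 41: 2000, 30: 2000,
    10: 2000, 53: 2000, 2: 1000, 24: 1000, 18: 1000,
}


def estimate_in_list(freq, values):
    """Sum the known frequencies of the distinct values in an IN list.

    Values absent from freq count as 0 rows (an assumption of this sketch).
    """
    return sum(freq.get(value, 0) for value in set(values))


estimate = estimate_in_list(counts_by_fips, [39, 36, 17, 6, 51, 26])
share = estimate / sum(counts_by_fips.values()) * 100
print(f"estimated rows: {estimate:,} ({share:.0f}% of the table)")
```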

Hybrid Histograms

When a top frequency histogram cannot be created, then a hybrid histogram will be created instead:
• As its name suggests, it combines the best features of height-balanced and frequency histograms
• Records each bucket’s endpoint as well as how often that endpoint value repeats (ENDPOINT_REPEAT_COUNT)
• Best when almost-popular values dominate multiple histogram buckets
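The two ideas in the list above (bucket endpoints plus a repeat count for each endpoint) can be shown with a toy construction. The Python sketch below is deliberately simplified and is not Oracle's internal algorithm: it walks the values in sorted order, closes a bucket once it holds at least total / NB rows, never splits a value across buckets, and records how many rows carry the endpoint value. The function name, the bucket-closing rule, and the choice of NB = 10 are all assumptions; the row counts are the sample KEY_STS data (06 = 50,000, 36 = 10,000, and so on).

```python
def build_toy_hybrid(value_counts, num_buckets):
    """Return a list of (cumulative_rows, endpoint_value, endpoint_repeat_count)."""
    total = sum(value_counts.values())
    target = total / num_buckets  # rows aimed for per bucket
    buckets, cumulative, in_bucket = [], 0, 0
    items = sorted(value_counts.items())
    for index, (value, count) in enumerate(items):
        # All rows of a value are added at once, so no value spans two buckets.
        cumulative += count
        in_bucket += count
        is_last = index == len(items) - 1
        if in_bucket >= target or is_last:
            buckets.append((cumulative, value, count))
            in_bucket = 0
    return buckets


# Sample KEY_STS row counts keyed by FIPS code.
key_sts_counts = {
    2: 1000, 6: 50000, 10: 3000, 11: 2000, 16: 2000, 17: 5000, 18: 3000,
    24: 1000, 26: 3000, 30: 4000, 34: 5000, 36: 10000, 39: 3000, 41: 1000,
    51: 3000, 53: 1000, 55: 3000,
}

for cumulative, endpoint, repeat in build_toy_hybrid(key_sts_counts, 10):
    print(f"endpoint {endpoint:02d}: cumulative {cumulative:>7,}, repeat count {repeat:>6,}")
```

Even in this toy form, the repeat count lets an endpoint value such as 36 report its own 10,000 rows instead of being averaged with its neighbours in the bucket.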

12c R1

What About Height-Balanced Histograms?

As of Oracle 12c, Height-Balanced histograms have been renamed to Legacy histograms

Hybrid histograms are more effective than Height-Balanced histograms and are essentially taking their place

Legacy histograms can still be generated by requesting a full sample of all data and specifying a smaller NB than NDV

Generating Legacy Histograms

To create a Legacy (Height-Balanced) histogram, just specify a non-default sample size when calling a GATHER_*_STATS procedure of DBMS_STATS:

SQL> BEGIN
       DBMS_STATS.GATHER_TABLE_STATS (
          ownname          => 'AP'
         ,tabname          => 'RANDOMIZED_SORTED'
         ,method_opt       => 'FOR COLUMNS KEY_STS SIZE 16'
         ,estimate_percent => 100
       );
     END;
     /

PL/SQL procedure successfully completed.

Histogram Metadata
for AP.RANDOMIZED_SORTED.KEY_STS
(from DBA_TAB_COL_STATISTICS)

                     # of      # of
                 Distinct Histogram
Histogram Type     Values   Buckets
--------------- --------- ---------
HEIGHT BALANCED        17        16

12c R1

Data Values

US State  FIPS Code   Count
CA        06         50,000
NY        36         10,000
IL        17          5,000
NJ        34          5,000
MT        30          4,000
OH        39          3,000
MI        26          3,000
WI        55          3,000
VA        51          3,000
DE        10          3,000
IN        18          3,000
DC        11          2,000
ID        16          2,000
MD        24          1,000
AK        02          1,000
WA        53          1,000
OR        41          1,000

Advantage of Hybrid vs. Legacy Histograms

Query against AP.RANDOMIZED_SORTED for popular and nearly-popular values:

SQL> EXPLAIN PLAN FOR
     SELECT MIN(key_date), MAX(key_date)
       FROM ap.randomized_sorted
      WHERE key_sts IN (06,36);

Explained.

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY());

Using Legacy Histogram:

Plan hash value: 4178216183

----------------------------------------------------------------------------------------
| Id  | Operation          | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                   |     1 |    11 |   171   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                   |     1 |    11 |            |          |
|*  2 |   TABLE ACCESS FULL| RANDOMIZED_SORTED | 48864 |   524K|   171   (1)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("KEY_STS"=06 OR "KEY_STS"=36)

Using Hybrid Histogram:

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("KEY_STS"=06 OR "KEY_STS"=36)
