A Smart Optimizer Gets Even Smarter:
Demystifying Oracle 12c Database’s New Histograms
My Credentials
• 30+ years of database-centric IT experience
• Oracle DBA since 2001
• Oracle 9i, 10g, 11g OCP
• Oracle ACE Director
• > 100 articles on databasejournal.com and ioug.org
• Oracle-centric blog (Generally, It Depends)
• Regular speaker at Oracle OpenWorld, IOUG COLLABORATE, and OTN ACE Tours
• Oracle University instructor – core Oracle DBA courses
Coming Soon To a Bookstore Near You …
Coming in June 2015 from Oracle Press: Oracle Database Upgrade, Migration & Transformation Tips & Techniques
• Covers everything you need to know to upgrade, migrate, and transform any Oracle 10g or 11g database to Oracle 12c
• Discusses strategy and tactics of planning Oracle migration, transformation, and upgrade projects
• Explores latest transformation features:
  • Recovery Manager (RMAN)
  • Oracle GoldenGate
  • Cross-Platform Transportable Tablespaces
  • Cross-Platform Transport (CPT)
  • Full Transportable Export (FTE)
• Includes detailed sample code
Our Agenda
• Histograms: What Good Are They?
• Frequency vs. Height-Balanced Histograms
• Histograms: Epic Fail?
• Top Frequency Histograms
• Hybrid Histograms
• Histograms: Practical Examples
• Q+A
It’s Vegas, Baby!
Bottom line: The Oracle query optimizer is like a Vegas oddsmaker.
If it gets the query’s execution plan right, we’re on Easy Street …
… but if it gets it wrong, then we get to meet Tony in a dark alley.
So ... how do we tilt the odds in the optimizer’s favor?
Simple! We teach the optimizer to “count cards”
Histograms: What Good Are They?
FACT: Australia, India, and several other countries now legally recognize Trans-Gender as a completely distinct gender …
… but the number of people declaring themselves as Trans-Gender is microscopic when compared to those declaring themselves as either (M)ale or (F)emale.
Without a histogram, the Oracle optimizer would assume that all three gender values occur equally often …
[Chart: assumed distribution without a histogram: Male 33.3%, Female 33.3%, Trans-Gender 33.4%]
… but with a histogram, the skewed distribution of Trans-Gender is dramatically obvious, and the optimizer now knows the proper cardinality whenever a predicate includes that value.
[Chart: Indian Gender (in Millions): M = 646.8, F = 558, X = 0.01]
Histograms: Terminology
• Histograms: Frequency distribution methods
  • Frequency: Plots how often a value occurs
  • Height-Balanced: Distributes counts of values equally per bucket
• NDV: Number of Distinct Values
  • Calculated while statistics are gathered
• NB: Number of histogram Buckets
  • Maximum of 254, but can be lower
• Skewness: A measure of how evenly data values are distributed within a population or sample set
Pre-12c
Classic (Pre-12c) Histograms
Histograms: A Sample Use Case
• Table AP.RANDOMIZED_SORTED: 100,000 rows
• Column KEY_STS clustered around just two values: 06 (CA) = 50% and 36 (NY) = 10%
• All other KEY_STS values distributed randomly but at much lower frequency among 15 other values
• Therefore, NDV = 17 and NB = 17
[Chart: Distribution of KEY_STS Values for CA, NY, IL, NJ, MT, OH, MI, WI, VA, DE, IN, DC, ID, MD, AK, WA, and OR; vertical axis 0 to 50,000. Note: FIPS decoded to US State for clarity]
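To see the skew described above directly, a query along the following lines could be run once the table is populated. It is only a sketch: it assumes the AP.RANDOMIZED_SORTED table and KEY_STS column from this use case, and the `uniform_estimate` column is an illustrative name for the rows-divided-by-NDV figure the optimizer would fall back on without a histogram.

```sql
-- Actual rows per KEY_STS value next to the "everything is equally likely"
-- estimate (total rows / number of distinct values).
SELECT key_sts
      ,COUNT(*)                                            AS actual_rows
      ,ROUND(100 * RATIO_TO_REPORT(COUNT(*)) OVER (), 1)   AS pct_of_rows
      ,ROUND(SUM(COUNT(*)) OVER () / COUNT(*) OVER ())     AS uniform_estimate
  FROM ap.randomized_sorted
 GROUP BY key_sts
 ORDER BY actual_rows DESC;
```

With 100,000 rows and 17 distinct values, the uniform estimate works out to roughly 5,882 rows for every value, far below the roughly 50,000 rows held by 06 (CA) and well above the roughly 1,000 rows held by the rarest values.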
Initial Test Data Population
. . .
FOR ctr IN 1..100000 LOOP
  INSERT INTO ap.randomized_sorted
  VALUES (
     ctr
    ,(TO_DATE('12/31/2013','mm/dd/yyyy') - DBMS_RANDOM.VALUE(1,3650))
    ,LPAD(' ', DBMS_RANDOM.VALUE(1,32)
         ,SUBSTR('abcdefghijklmnopqrstuvwxyz', DBMS_RANDOM.VALUE(1,26), 1))
    -- Remainders 0..49 map to explicit FIPS codes (every fifth one to 36 = NY);
    -- remainders 50..99 fall through to the default 06 (CA), about 50% of rows.
    ,DECODE(MOD(ROUND(DBMS_RANDOM.VALUE(1,100000),0),100)
           , 0,10,  1,30,  2,17,  3,10,  4,36
           , 5,11,  6,02,  7,18,  8,11,  9,36
           ,10,24, 11,30, 12,26, 13,17, 14,36
           ,15,34, 16,41, 17,39, 18,34, 19,36
           ,20,51, 21,34, 22,55, 23,51, 24,36
           ,25,17, 26,10, 27,30, 28,17, 29,36
           ,30,18, 31,16, 32,16, 33,18, 34,36
           ,35,26, 36,55, 37,30, 38,26, 39,36
           ,40,39, 41,34, 42,53, 43,39, 44,36
           ,45,55, 46,51, 47,34, 48,17, 49,36
           ,06)
  );
  -- Commit in batches of 5,000 rows
  IF MOD(ctr, 5000) = 0 THEN
    COMMIT;
  END IF;
END LOOP;
COMMIT;
. . .
Frequency Histograms
• Histogram targets are automatically chosen:
  • Oracle recommends leaving the ESTIMATE_PERCENT parameter at its default (DBMS_STATS.AUTO_SAMPLE_SIZE) whenever DBMS_STATS.GATHER_* procedures are used to gather statistics
  • Thus, whenever possible, a frequency histogram will be created
• Frequency histograms are most effective whenever NDV <= maximum NB of 254
[Chart: Frequency Histogram Values Distribution for KEY_STS values 06, 36, 17, 34, 30, 39, 26, 55, 51, 10, 18, 11, 16, 24, 02, 53, and 41; vertical axis 0 to 50,000]
Generating a Frequency Histogram
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS (
     ownname    => 'AP'
    ,tabname    => 'RANDOMIZED_SORTED'
    ,method_opt => 'FOR COLUMNS KEY_STS'
  );
END;
/

TTITLE "Histogram Endpoints|for AP.RANDOMIZED_SORTED.KEY_STS|(from DBA_HISTOGRAMS)"
SELECT
     endpoint_number
    ,endpoint_value
    ,endpoint_repeat_count
  FROM dba_histograms
 WHERE owner = 'AP'
   AND table_name = 'RANDOMIZED_SORTED'
   AND column_name = 'KEY_STS'
 ORDER BY 1;
TTITLE OFF

TTITLE "Histogram Metadata|for AP.RANDOMIZED_SORTED.KEY_STS|(from DBA_TAB_COL_STATISTICS)"
SELECT
     histogram
    ,num_distinct
    ,num_buckets
  FROM dba_tab_col_statistics
 WHERE owner = 'AP'
   AND table_name = 'RANDOMIZED_SORTED'
   AND column_name = 'KEY_STS';
Histogram Metadata
for AP.RANDOMIZED_SORTED.KEY_STS
(from DBA_TAB_COL_STATISTICS)

                     # of      # of
                 Distinct Histogram
Histogram Type     Values   Buckets
--------------- --------- ---------
FREQUENCY              17        17

Histogram Endpoints
for AP.RANDOMIZED_SORTED.KEY_STS
(from DBA_HISTOGRAMS)

                     Endpoint
           Endpoint    Repeat
 Endpoint     Value     Count
--------- --------- ---------
     1000        02         0
    51000        06         0
    54000        10         0
    56000        11         0
    58000        16         0
    63000        17         0
    66000        18         0
    67000        24         0
    70000        26         0
    74000        30         0
    79000        34         0
    89000        36         0
    92000        39         0
    93000        41         0
    96000        51         0
    97000        53         0
   100000        55         0
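Because the endpoint numbers of a frequency histogram are cumulative, the row count the optimizer holds for each individual value can be recovered by subtracting the previous endpoint. The query below is a minimal sketch of that idea, assuming the same AP.RANDOMIZED_SORTED.KEY_STS frequency histogram; the `rows_for_value` alias is illustrative only.

```sql
-- In a frequency histogram, ENDPOINT_NUMBER is a running (cumulative) row count,
-- so rows for one value = its endpoint number minus the previous endpoint number.
SELECT endpoint_value                                        AS key_sts
      ,endpoint_number
      ,endpoint_number
         - LAG(endpoint_number, 1, 0)
             OVER (ORDER BY endpoint_number)                 AS rows_for_value
  FROM dba_histograms
 WHERE owner = 'AP'
   AND table_name = 'RANDOMIZED_SORTED'
   AND column_name = 'KEY_STS'
 ORDER BY endpoint_number;
```

Applied to the endpoints listed above, value 06 would come out as 51000 - 1000 = 50,000 rows and value 36 as 89000 - 79000 = 10,000 rows, which matches the distribution loaded into the table.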
Height-Balanced Histograms
However, if NB < NDV, then a height-balanced histogram will be created instead of a frequency histogram:
• All histogram buckets contain an equal number of values
• When forecasting cardinality, the query optimizer considers the bucket endpoints of the height-balanced histogram to determine just how “popular” values are
[Chart: Height-Balanced Histogram (Only 10 Buckets) with endpoint values 02, 06, 11, 17, 26, 34, 36, 39, 41, 51, and 55; vertical axis 0 to 10,000]
Generating a Height-Balanced Histogram
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS (
     ownname          => 'AP'
    ,tabname          => 'RANDOMIZED_SORTED'
    ,method_opt       => 'FOR COLUMNS KEY_STS SIZE 10'
    ,estimate_percent => 100
  );
END;
/
Histogram Metadata
for AP.RANDOMIZED_SORTED.KEY_STS
(from DBA_TAB_COL_STATISTICS)

                     # of      # of
                 Distinct Histogram
Histogram Type     Values   Buckets
--------------- --------- ---------
HEIGHT BALANCED        17        10

Histogram Endpoints
for AP.RANDOMIZED_SORTED.KEY_STS
(from DBA_HISTOGRAMS)

                     Endpoint
           Endpoint    Repeat
 Endpoint     Value     Count
--------- --------- ---------
        0        02         0
        8        06         0
        9        16         0
       10        17         0
       11        26         0
       12        34         0
       14        36         0
       15        51         0
       16        55         0
Because NDV > NB, only a height-balanced histogram can be generated …
… thus, many values are invisible within the histogram, and will be misinterpreted when cardinality is estimated.
10? 11? 30?
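One way to see which values a height-balanced histogram treats as popular is to count how many buckets each endpoint value closes. The sketch below assumes the height-balanced histogram on AP.RANDOMIZED_SORTED.KEY_STS gathered above; the `buckets_spanned` alias is an illustrative name, not an Oracle column.

```sql
-- In a height-balanced histogram every bucket holds the same number of rows,
-- so a value whose endpoint number jumps by more than 1 spans several buckets
-- and is treated as popular. Values that never appear as an endpoint
-- are not visible here at all.
SELECT endpoint_value                                        AS key_sts
      ,endpoint_number
      ,endpoint_number
         - LAG(endpoint_number, 1, 0)
             OVER (ORDER BY endpoint_number)                 AS buckets_spanned
  FROM dba_histograms
 WHERE owner = 'AP'
   AND table_name = 'RANDOMIZED_SORTED'
   AND column_name = 'KEY_STS'
 ORDER BY endpoint_number;
```

Against the endpoints shown above, 06 would span 8 buckets and 36 would span 2, while every other listed endpoint spans at most 1. Values such as 10 and 30 do not appear as endpoints, so they have no row of their own in this output.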
Histograms: Unpleasant Conundrums
Histograms before 12c R1 still have some shortfalls:
• Even though DBMS_STATS should choose the best histogram automatically by default, this can still be undone:
  • DBAs sometimes choose the wrong value for NB
  • This could result in a height-balanced histogram being generated instead of a frequency histogram
• If NDV > 254, only a height-balanced histogram could be generated
  • Almost-popular values thus tended to be ignored
  • Truly popular values might just barely miss spanning two or more histogram buckets
New Histograms in 12c R1
Needed: A New Breed of Histograms
Oracle 12c R1 adds two new histograms to address the shortcomings of Frequency and Height-Balanced histograms:
• Top Frequency histograms help address situations when just a few of the distinct values account for most of the rows
• Hybrid histograms mitigate situations when many almost-popular values just barely miss being considered through a height-balanced histogram
12c R1
New Histogram Decision Tree
New Histogram Formulas
Formula #1: % of Top Popular Values (%TPV) = [SUM(Top N Buckets) / SUM(All Buckets)] * 100
Formula #2: Popular Values Threshold (P) = (1 - (1 / NB)) * 100
Rule #1: When %TPV > P, a Top Frequency Histogram will be generated …
Rule #2: … otherwise, a Hybrid Histogram will be generated instead.
I’m really good at math - it’s just numbers I have trouble with.
- Anonymous
Top Popular Values Percent: An Example
Popular Values Threshold (P) = (1 - (1 / NB)) * 100
% of Top Popular Values (%TPV) = [SUM(Top N Buckets) / SUM(All Buckets)] * 100
Whenever %TPV > P, a Top Frequency Histogram will be generated …
… otherwise, a Hybrid Histogram will be generated instead.
Highly Popular Values Factoring

 NB    %TPV      P
---  ------  -----
 17  100.0%  94.1%
 16   99.0%  93.8%
 15   98.0%  93.3%
 14   97.0%  92.9%
 13   96.0%  92.3%
 12   94.0%  91.7%
 11   92.0%  90.9%
 10   89.0%  90.0%
  9   86.0%  88.9%
  8   83.0%  87.5%
  7   80.0%  85.7%
  6   77.0%  83.3%
  5   74.0%  80.0%
  4   70.0%  75.0%
  3   65.0%  66.7%
  2   60.0%  50.0%

Data Values

US State  FIPS Code   Count
--------  ---------  ------
CA        06         50,000
NY        36         10,000
IL        17          5,000
NJ        34          5,000
MT        30          4,000
OH        39          3,000
MI        26          3,000
WI        55          3,000
VA        51          3,000
DE        10          3,000
IN        18          3,000
DC        11          2,000
ID        16          2,000
MD        24          1,000
AK        02          1,000
WA        53          1,000
OR        41          1,000
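The %TPV and P columns above can be recomputed from the live data. The following query is a sketch under stated assumptions: it reads AP.RANDOMIZED_SORTED directly, applies the slide's rule (%TPV > P) as written, and adds an NB >= NDV branch of its own to reflect the earlier point that a plain frequency histogram is possible once every distinct value gets a bucket. The aliases (`pct_tpv`, `p`, `expected_histogram`) are illustrative names.

```sql
-- Recompute %TPV and P for each candidate bucket count (NB).
WITH counts AS (
  SELECT key_sts, COUNT(*) AS cnt
    FROM ap.randomized_sorted
   GROUP BY key_sts
), ranked AS (
  SELECT ROW_NUMBER() OVER (ORDER BY cnt DESC, key_sts)            AS nb
        ,SUM(cnt) OVER (ORDER BY cnt DESC, key_sts
                        ROWS BETWEEN UNBOUNDED PRECEDING
                                 AND CURRENT ROW)                  AS top_n_rows
        ,SUM(cnt) OVER ()                                          AS all_rows
        ,COUNT(*) OVER ()                                          AS ndv
    FROM counts
)
SELECT nb
      ,ROUND(100 * top_n_rows / all_rows, 1)   AS pct_tpv
      ,ROUND((1 - (1 / nb)) * 100, 1)          AS p
      ,CASE
         WHEN nb >= ndv THEN 'FREQUENCY'
         WHEN 100 * top_n_rows / all_rows > (1 - (1 / nb)) * 100
           THEN 'TOP FREQUENCY'
         ELSE 'HYBRID'
       END                                     AS expected_histogram
  FROM ranked
 WHERE nb >= 2
 ORDER BY nb DESC;
```

For the counts in the table above, NB = 10 gives 89.0% against a threshold of 90.0%, so the rule points to a Hybrid histogram, while NB = 11 gives 92.0% against 90.9% and points to Top Frequency. Because the test data is generated randomly, the live percentages will differ slightly from the rounded figures on the slide.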
Top Frequency Histograms
A top frequency histogram is created whenever highly-popular values dominate:
• Most useful when a small number of distinct values account for the majority of rows
• Gathered using a full table scan
• Appropriate NB setting is automatically chosen
• Non-popular (i.e. statistically insignificant) values are automatically discarded
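As a hedged illustration of how such a histogram might be requested, the block below asks for fewer buckets than there are distinct values and leaves ESTIMATE_PERCENT at its default. The SIZE 15 value is an assumption chosen for this example (it is not taken from the slides); with the sample distribution used earlier, the top 15 values cover about 98% of rows, which exceeds the 93.3% threshold for NB = 15.

```sql
-- Request 15 buckets for a column with 17 distinct values and keep the
-- default ESTIMATE_PERCENT, so a top frequency histogram is possible.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS (
     ownname    => 'AP'
    ,tabname    => 'RANDOMIZED_SORTED'
    ,method_opt => 'FOR COLUMNS KEY_STS SIZE 15'
  );
END;
/

-- Check which histogram type was actually created.
SELECT histogram
      ,num_distinct
      ,num_buckets
  FROM dba_tab_col_statistics
 WHERE owner = 'AP'
   AND table_name = 'RANDOMIZED_SORTED'
   AND column_name = 'KEY_STS';
```

If the rule applies as described, the HISTOGRAM column should report TOP-FREQUENCY; any other value suggests the data or the bucket count did not satisfy the threshold.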
Data Values

US State  FIPS Code   Count
--------  ---------  ------
NY        36         15,000
IL        17         15,000
CA        06         15,000
OH        39         15,000
VA        51         10,000
MI        26         10,000
ID        16          3,000
DC        11          2,000
WI        55          2,000
NJ        34          2,000
OR        41          2,000
MT        30          2,000
DE        10          2,000
WA        53          2,000
AK        02          1,000
MD        24          1,000
IN        18          1,000
Advantage of Top Frequency Histograms
Query against AP.RANDOMIZED_SORTED for only extremely popular values:
SQL> EXPLAIN PLAN FOR
     SELECT MIN(key_date), MAX(key_date)
       FROM ap.randomized_sorted
      WHERE key_sts IN (39,36,17,6,51,26);

Explained.

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY());

Using Standard Frequency Histogram:

Plan hash value: 4178216183

----------------------------------------------------------------------------------------
| Id  | Operation          | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                   |     1 |    11 |   171   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                   |     1 |    11 |            |          |
|*  2 |   TABLE ACCESS FULL| RANDOMIZED_SORTED | 80000 |   859K|   171   (1)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("KEY_STS"=6 OR "KEY_STS"=17 OR "KEY_STS"=26 OR "KEY_STS"=36 OR
              "KEY_STS"=39 OR "KEY_STS"=51)

Using Top Frequency Histogram:

Plan hash value: 4178216183

----------------------------------------------------------------------------------------
| Id  | Operation          | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                   |     1 |    11 |   171   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                   |     1 |    11 |            |          |
|*  2 |   TABLE ACCESS FULL| RANDOMIZED_SORTED | 80477 |   864K|   171   (1)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("KEY_STS"=6 OR "KEY_STS"=17 OR "KEY_STS"=26 OR "KEY_STS"=36 OR
              "KEY_STS"=39 OR "KEY_STS"=51)
Hybrid Histograms
When a top frequency histogram cannot be created, then a hybrid histogram will be created instead:
• As its name suggests, it combines the best features of height-balanced and frequency histograms
• Records each bucket’s endpoint as well as how often that endpoint value repeats (the endpoint repeat count)
• Best when almost-popular values dominate multiple histogram buckets
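A hybrid histogram could be requested along the following lines. This is a sketch, not the presenter's script: SIZE 10 is borrowed from the earlier height-balanced example, but here ESTIMATE_PERCENT is left at its default. For the sample data, 10 buckets give a %TPV of 89.0% against a threshold of 90.0%, so a hybrid histogram would be expected rather than a top frequency one.

```sql
-- SIZE 10 is below the 17 distinct values, and ESTIMATE_PERCENT stays at its
-- default, so (assuming the %TPV rule is not met) a hybrid histogram results.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS (
     ownname    => 'AP'
    ,tabname    => 'RANDOMIZED_SORTED'
    ,method_opt => 'FOR COLUMNS KEY_STS SIZE 10'
  );
END;
/

-- For a hybrid histogram, ENDPOINT_REPEAT_COUNT shows how many times each
-- bucket's endpoint value occurs.
SELECT endpoint_number
      ,endpoint_value
      ,endpoint_repeat_count
  FROM dba_histograms
 WHERE owner = 'AP'
   AND table_name = 'RANDOMIZED_SORTED'
   AND column_name = 'KEY_STS'
 ORDER BY endpoint_number;
```

Unlike the frequency and height-balanced listings shown earlier, where the repeat count was 0 for every row, a hybrid histogram should show non-zero repeat counts, and those are what let the optimizer size up almost-popular endpoint values.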
What About Height-Balanced Histograms?
As of Oracle 12c, Height-Balanced histograms have been renamed to Legacy histograms
Hybrid histograms are more effective than Height-Balanced histograms and are essentially taking their place
Legacy histograms can still be generated by requesting a full sample of all data and specifying a smaller NB than NDV
Generating Legacy Histograms
To create a Legacy (Height-Balanced) histogram, just specify a non-default sample size (ESTIMATE_PERCENT) when calling one of the GATHER_*_STATS procedures of DBMS_STATS:
SQL> BEGIN
       DBMS_STATS.GATHER_TABLE_STATS (
          ownname          => 'AP'
         ,tabname          => 'RANDOMIZED_SORTED'
         ,method_opt       => 'FOR COLUMNS KEY_STS SIZE 16'
         ,estimate_percent => 100
       );
     END;
     /

PL/SQL procedure successfully completed.

Histogram Metadata
for AP.RANDOMIZED_SORTED.KEY_STS
(from DBA_TAB_COL_STATISTICS)

                     # of      # of
                 Distinct Histogram
Histogram Type     Values   Buckets
--------------- --------- ---------
HEIGHT BALANCED        17        16
Data Values

US State  FIPS Code   Count
--------  ---------  ------
CA        06         50,000
NY        36         10,000
IL        17          5,000
NJ        34          5,000
MT        30          4,000
OH        39          3,000
MI        26          3,000
WI        55          3,000
VA        51          3,000
DE        10          3,000
IN        18          3,000
DC        11          2,000
ID        16          2,000
MD        24          1,000
AK        02          1,000
WA        53          1,000
OR        41          1,000
Advantage of Hybrid vs. Legacy Histograms
Query against AP.RANDOMIZED_SORTED for popular and nearly-popular values:
SQL> EXPLAIN PLAN FOR
     SELECT MIN(key_date), MAX(key_date)
       FROM ap.randomized_sorted
      WHERE key_sts IN (06,36);

Explained.

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY());

Using Legacy Histogram:

Plan hash value: 4178216183

----------------------------------------------------------------------------------------
| Id  | Operation          | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                   |     1 |    11 |   171   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                   |     1 |    11 |            |          |
|*  2 |   TABLE ACCESS FULL| RANDOMIZED_SORTED | 48864 |   524K|   171   (1)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("KEY_STS"=06 OR "KEY_STS"=36)

Using Hybrid Histogram:

Plan hash value: 4178216183

----------------------------------------------------------------------------------------
| Id  | Operation          | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                   |     1 |    11 |   171   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                   |     1 |    11 |            |          |
|*  2 |   TABLE ACCESS FULL| RANDOMIZED_SORTED | 60305 |   647K|   171   (1)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("KEY_STS"=06 OR "KEY_STS"=36)
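To judge which estimate is closer to reality, the optimizer's estimated rows can be set beside the rows actually returned. The session below is a sketch of one common way to do that; it assumes the session can query the cursor through DBMS_XPLAN.DISPLAY_CURSOR (which needs access to the underlying V$ views) and that the statement is run in SQL*Plus.

```sql
-- SERVEROUTPUT must be off so DISPLAY_CURSOR reports on the query just run.
SET SERVEROUTPUT OFF

-- Run the query for real, collecting row-source statistics.
SELECT /*+ GATHER_PLAN_STATISTICS */
       MIN(key_date), MAX(key_date)
  FROM ap.randomized_sorted
 WHERE key_sts IN (6, 36);

-- Show estimated (E-Rows) next to actual (A-Rows) for the last statement.
SELECT *
  FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));
```

With the sample data, the actual row count for values 06 and 36 should be close to 60,000 (50,000 + 10,000), which is much nearer to the 60,305 estimate in the plans above than to the 48,864 estimate.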