histograms in 12c era
TRANSCRIPT
Agenda
• Concepts
• Life pre-12c• Available histograms• Known challenges
• Life post-12c• Available histograms• How the old challenges are solved
2
What is a histogram?
• “Special” type of column statistics
• Provides info about data distribution in a column
• Allows the optimizer to get (usually) more accurate estimations
• Gathered *automatically* starting 10g
3
Why histograms have bad reputation?
• Not always very clear where to put them
• Histograms + bind peeking => plan instability in 10g
• Adaptive Cursor Sharing (usually) solves the problem in 11g/12c but has a “warm-up” phase
• Old histograms had limitations
4
Life pre-12c – Available histograms
• Two histogram type available before 12c
• Frequency histogram• Very accurate• Available only when number of distinct values (NDV) small
• Height-Balanced histogram• Less accurate• Always available but not always wanted
8
Life pre-12c – Frequency histogram
• Each column value gets its own bucket
• Available only when NDV <= num buckets (up to 254)
• Accurate sel/card estimation, just compute the bucket size
• Bucket size = endpoint_number – previous endpoint_number
• Value in range with no bucket gets 50% smallest bucket
9
Life pre-12c – Frequency histogram
select count(*),n1 from test_freq
group by n1
COUNT(*) N1-------- ----
300 050 1
100 2150 350 45 5
345 6
10
select endpoint_number, endpoint_number - lag( endpoint_number,1,0) over (order by endpoint_number) mybucketsize, endpoint_valuefrom user_tab_histograms
where table_name = 'TEST_FREQ' and column_name = 'N1';
ENDPOINT_NUMBER MYBUCKETSIZE ENDPOINT_VALUE--------------- ------------ --------------
300 300 0350 50 1450 100 2600 150 3650 50 4655 5 5
1000 345 6
Life pre-12c – Height-balanced histogram
• Available only when NDV > num_buckets (or NDV > 254)
• Range [min,max] divided into 254 buckets of equal size
• Values spanning two or more buckets are considered popular
• Good accuracy for popular values, everybody else gets density
11
Life pre-12c – Height-balanced histogram
select count(*),n1 from test_hb
group by n1
COUNT(*) N1---------- ---
50 010 1
100 2150 350 4
540 5100 6
12
select endpoint_number en, endpoint_value ep,case when endpoint_number - lag(endpoint_number) over (order by endpoint_number) > 1 then 'YES'
else 'NO' end popularfrom user_tab_histograms
where table_name = 'TEST_HB' and column_name = 'N1';
EN EP POP---- ---- ---
0 0 NO1 3 NO3 5 YES4 6 NO
Old limitations – Almost popular values in HB histograms
• In HB a popular value accounts for at least 0.78% of the data• Every other value gets a much smaller selectivity (density)
• “Almost popular value” is a value close to span two buckets
• Example: 1B rows table, 20M NDV• Popular value from aprox 3.8M rows up• Everybody else estimated as aprox 50k rows• Almost popular value could have 3M rows
13
Old limitations – 254 may not be enough
• Freq histograms provides MUCH better estimations than HB
• But Freq only available when NDV <= 254 then we go HB
• Even with HB 254 might not enough to describe a distribution with no big skewness
14
Old limitations – Buckets on first 32 chars
• Histogram buckets computed on substr(col,1,32)
• Incorrect buckets for values that start with same 32 chars
• Incorrect selectivity caused by incorrect bucket size
15
C1----------------------------------------/home/mpagano/Oracle/files/test/q1.sql/home/mpagano/Oracle/files/test/q2.sql/home/mpagano/Oracle/files/test/q3.sql/home/mpagano/Oracle/files/test/q4.sql/home/mpagano/Oracle/files/test/q5.sql/home/mpagano/Oracle/files/test/q6.sql/home/mpagano/Oracle/files/test/q7.sql/home/mpagano/Oracle/files/test/q8.sql/home/mpagano/Oracle/files/test/q9.sql
COLUMN_NAME NUM_DISTINCT NUM_BUCKETS HISTOGRAM------------- ------------ ----------- ----------C1 10000 1 FREQUENCY
It’s time for AUTO_SAMPLE_SIZE
• In 10g recommendation was to use a fixed percentage
• In 11g new NDV algorithm gives much better results
• Several new features requires AUTO_SAMPLE_SIZE• Incremental stats gathering• Top Frequency Histograms• Hybrid Histograms
16
Life post-12c Available histograms
• Three histogram type available starting 12c
• Frequency histogram
• Top Frequency Histogram• Accurate as a freq histogram for almost every value
• Hybrid histogram• Combines Freq and HB histograms approach
17
Life post-12c – Common improvements
• Number of buckets increased to 2048• Rarely useful when using new methods
• Buckets build on first 64 chars instead of 32• Reduce impact on having values starting with same chars
• Buckets number identified during first scan of the table• Same as how NDV is computed in 11g
18
Life post-12c – Top Frequency HistogramWhat’s the idea
• Each value in old Frequency histogram is important even if it’s very unpopular
• Old Frequency histo provide very accurate info for each value
• But limited to only 254 (2048) values
• What if most of the data belong to those 254 values?
19
Life post-12c – Top Frequency HistogramHow it works
• Frequency histogram created for the Top N values
• Each Top N value gets its own bucket => accurate estimations
• Every other value not statistically relevant is not stored
• Density used to all the non-popular values
20
Life post-12c – Top Frequency HistogramWhen it’s created
• estimate_percent => AUTO_SAMPLE_SIZE
• Number of Distinct Values in the column > Number buckets
• The top N values accounts for (1-(1/N)) * 100 of the data• Example: Number of rows 1M, N 200, NDV 10k
(1-1/200)*100 = 99.5% so when top 200 values >= 99.5% of the data
21
Life post-12c – Top Frequency Histogram
22
select count(*),n1 from test_top_freq
group by n1
COUNT(*) N1-------- ----
300 050 1
100 2150 350 45 5
345 6
Top Frequency => (1-(1/6))*100 => 83.3% => 833 rows
select endpoint_number, endpoint_number - lag( endpoint_number,1,0) over (order by endpoint_number) mybucketsize, endpoint_value from user_tab_histograms where table_name = 'TEST_TOP_FREQ' and column_name = 'N1';
ENDPOINT_NUMBER MYBUCKETSIZE ENDPOINT_VALUE--------------- ------------ --------------
300 300 0350 50 1450 100 2600 150 3650 50 4995 345 6
Life post-12c – Hybrid HistogramWhat’s the idea
• Biggest limitation in HB is almost popular value => poor plans
• Split the range in buckets of equal size is unfair
• Need to account for popularity
• What if we can “adjust” the bucket size based onpopularity?
23
Life post-12c – Hybrid HistogramHow it works
• Create HB histogram but no value spans more than one bucket
• For each endpoint value track how popular the value is
• Accurate estimation for endpoint values
• Non endpoint values get the density
24
Life post-12c – Hybrid HistogramWhen it’s created
25
• estimate_percent => AUTO_SAMPLE_SIZE
• Number of Distinct Values in the column > Number buckets
• Top Frequency not available• Example: Number of rows 1M, N 200, NDV 10k
(1-1/200)*100 = 99.5% so when top 200 values < 99.5% of the data
Life post-12c – Hybrid Histogram
26
select count(*),n1 from test_hybrid
group by n1
COUNT(*) N1-------- ----
118 051 189 2
113 359 466 5
126 6131 7131 8116 9
select endpoint_number en, endpoint_value ep,endpoint_repeat_count
from user_tab_histogramswhere table_name = 'TEST_HYBRID'
and column_name = 'N1';
EN EP ERC---------- ---------- -------
118 0 118371 3 113430 4 59496 5 66622 6 126753 7 131
1000 9 116
Histogram selection
• Assuming default parameters then usually:• If NDV <= 254 then Frequency Histogram• Else if top values account for most of the data then Top Freq Histogram• Else Hybrid Histogram
• No HB histogram considered when using AUTO_SAMPLE_SIZE
• Max number of buckets is 254 in AUTO (manual limit is 2048)
27
Summary
• Three types of histograms in 12c • introduced two new types (top frequency and hybrid)• Obsolete height-balanced
• Leveraged only when using AUTO_SAMPLE_SIZE
• Most of the old limitations have been addressed
• Additional granularity (2048 buckets) available if needed
31
Reference
• Oracle® Database SQL Tuning Guide 12c Release 1 (12.1)
• Master Note: Optimizer Statistics (Doc ID 1369591.1)• How to Gather Optimizer Statistics on 12c (Doc ID 1445302.1)
33