histograms in 12c era

Histograms life in a pre and post 12c eraMauro Pagano

Agenda

• Concepts

• Life pre-12c• Available histograms• Known challenges

• Life post-12c• Available histograms• How the old challenges are solved

2

What is a histogram?

• “Special” type of column statistics

• Provides info about data distribution in a column

• Allows the optimizer to get (usually) more accurate estimations

• Gathered *automatically* starting 10g

3

Why histograms have bad reputation?

• Not always very clear where to put them

• Histograms + bind peeking => plan instability in 10g

• Adaptive Cursor Sharing (usually) solves the problem in 11g/12c but has a “warm-up” phase

• Old histograms had limitations

4

Do we need a histogram here?

5

0

5

10

15

20

25

1 2 3 4 5 6 7 8 9 10


6

0

50

100

150

200

250

1 2 3 4 5 6 7 8 9 10


7

0

20

40

60

80

100

120

1 2 3 4 5 6 7 8 9 10

Life pre-12c – Available histograms

• Two histogram type available before 12c

• Frequency histogram• Very accurate• Available only when number of distinct values (NDV) small

• Height-Balanced histogram• Less accurate• Always available but not always wanted

8

Life pre-12c – Frequency histogram

• Each column value gets its own bucket

• Available only when NDV <= num buckets (up to 254)

• Accurate sel/card estimation, just compute the bucket size

• Bucket size = endpoint_number – previous endpoint_number

• Value in range with no bucket gets 50% smallest bucket

9

Life pre-12c – Frequency histogram

select count(*),n1 from test_freq

group by n1

COUNT(*) N1-------- ----

300 050 1

100 2150 350 45 5

345 6

10

select endpoint_number, endpoint_number - lag( endpoint_number,1,0) over (order by endpoint_number) mybucketsize, endpoint_valuefrom user_tab_histograms

where table_name = 'TEST_FREQ' and column_name = 'N1';

ENDPOINT_NUMBER MYBUCKETSIZE ENDPOINT_VALUE--------------- ------------ --------------

300 300 0350 50 1450 100 2600 150 3650 50 4655 5 5

1000 345 6

Life pre-12c – Height-balanced histogram

• Available only when NDV > num_buckets (or NDV > 254)

• Range [min,max] divided into 254 buckets of equal size

• Values spanning two or more buckets are considered popular

• Good accuracy for popular values, everybody else gets density

11

Life pre-12c – Height-balanced histogram

select count(*),n1 from test_hb

group by n1

COUNT(*) N1---------- ---

50 010 1

100 2150 350 4

540 5100 6

12

select endpoint_number en, endpoint_value ep,case when endpoint_number - lag(endpoint_number) over (order by endpoint_number) > 1 then 'YES'

else 'NO' end popularfrom user_tab_histograms

where table_name = 'TEST_HB' and column_name = 'N1';

EN EP POP---- ---- ---

0 0 NO1 3 NO3 5 YES4 6 NO

Old limitations – Almost popular values in HB histograms

• In HB a popular value accounts for at least 0.78% of the data• Every other value gets a much smaller selectivity (density)

• “Almost popular value” is a value close to span two buckets

• Example: 1B rows table, 20M NDV• Popular value from aprox 3.8M rows up• Everybody else estimated as aprox 50k rows• Almost popular value could have 3M rows

13

Old limitations – 254 may not be enough

• Freq histograms provides MUCH better estimations than HB

• But Freq only available when NDV <= 254 then we go HB

• Even with HB 254 might not enough to describe a distribution with no big skewness

14

Old limitations – Buckets on first 32 chars

• Histogram buckets computed on substr(col,1,32)

• Incorrect buckets for values that start with same 32 chars

• Incorrect selectivity caused by incorrect bucket size

15

C1----------------------------------------/home/mpagano/Oracle/files/test/q1.sql/home/mpagano/Oracle/files/test/q2.sql/home/mpagano/Oracle/files/test/q3.sql/home/mpagano/Oracle/files/test/q4.sql/home/mpagano/Oracle/files/test/q5.sql/home/mpagano/Oracle/files/test/q6.sql/home/mpagano/Oracle/files/test/q7.sql/home/mpagano/Oracle/files/test/q8.sql/home/mpagano/Oracle/files/test/q9.sql

COLUMN_NAME NUM_DISTINCT NUM_BUCKETS HISTOGRAM------------- ------------ ----------- ----------C1 10000 1 FREQUENCY

It’s time for AUTO_SAMPLE_SIZE

• In 10g recommendation was to use a fixed percentage

• In 11g new NDV algorithm gives much better results

• Several new features requires AUTO_SAMPLE_SIZE• Incremental stats gathering• Top Frequency Histograms• Hybrid Histograms

16

Life post-12c Available histograms

• Three histogram type available starting 12c

• Frequency histogram

• Top Frequency Histogram• Accurate as a freq histogram for almost every value

• Hybrid histogram• Combines Freq and HB histograms approach

17

Life post-12c – Common improvements

• Number of buckets increased to 2048• Rarely useful when using new methods

• Buckets build on first 64 chars instead of 32• Reduce impact on having values starting with same chars

• Buckets number identified during first scan of the table• Same as how NDV is computed in 11g

18

Life post-12c – Top Frequency HistogramWhat’s the idea

• Each value in old Frequency histogram is important even if it’s very unpopular

• Old Frequency histo provide very accurate info for each value

• But limited to only 254 (2048) values

• What if most of the data belong to those 254 values?

19

Life post-12c – Top Frequency HistogramHow it works

• Frequency histogram created for the Top N values

• Each Top N value gets its own bucket => accurate estimations

• Every other value not statistically relevant is not stored

• Density used to all the non-popular values

20

Life post-12c – Top Frequency HistogramWhen it’s created

• estimate_percent => AUTO_SAMPLE_SIZE

• Number of Distinct Values in the column > Number buckets

• The top N values accounts for (1-(1/N)) * 100 of the data• Example: Number of rows 1M, N 200, NDV 10k

(1-1/200)*100 = 99.5% so when top 200 values >= 99.5% of the data

21

Life post-12c – Top Frequency Histogram

22

select count(*),n1 from test_top_freq

group by n1

COUNT(*) N1-------- ----

300 050 1

100 2150 350 45 5

345 6

Top Frequency => (1-(1/6))*100 => 83.3% => 833 rows

select endpoint_number, endpoint_number - lag( endpoint_number,1,0) over (order by endpoint_number) mybucketsize, endpoint_value from user_tab_histograms where table_name = 'TEST_TOP_FREQ' and column_name = 'N1';

ENDPOINT_NUMBER MYBUCKETSIZE ENDPOINT_VALUE--------------- ------------ --------------

300 300 0350 50 1450 100 2600 150 3650 50 4995 345 6

Life post-12c – Hybrid HistogramWhat’s the idea

• Biggest limitation in HB is almost popular value => poor plans

• Split the range in buckets of equal size is unfair

• Need to account for popularity

• What if we can “adjust” the bucket size based onpopularity?

23

Life post-12c – Hybrid HistogramHow it works

• Create HB histogram but no value spans more than one bucket

• For each endpoint value track how popular the value is

• Accurate estimation for endpoint values

• Non endpoint values get the density

24

Life post-12c – Hybrid HistogramWhen it’s created

25

• estimate_percent => AUTO_SAMPLE_SIZE

• Number of Distinct Values in the column > Number buckets

• Top Frequency not available• Example: Number of rows 1M, N 200, NDV 10k

(1-1/200)*100 = 99.5% so when top 200 values < 99.5% of the data

Life post-12c – Hybrid Histogram

26

select count(*),n1 from test_hybrid

group by n1

COUNT(*) N1-------- ----

118 051 189 2

113 359 466 5

126 6131 7131 8116 9

select endpoint_number en, endpoint_value ep,endpoint_repeat_count

from user_tab_histogramswhere table_name = 'TEST_HYBRID'

and column_name = 'N1';

EN EP ERC---------- ---------- -------

118 0 118371 3 113430 4 59496 5 66622 6 126753 7 131

1000 9 116

Histogram selection

• Assuming default parameters then usually:• If NDV <= 254 then Frequency Histogram• Else if top values account for most of the data then Top Freq Histogram• Else Hybrid Histogram

• No HB histogram considered when using AUTO_SAMPLE_SIZE

• Max number of buckets is 254 in AUTO (manual limit is 2048)

27

SQLT and histograms - Frequency

28

SQLT and histograms – Top Frequency

29

SQLT and histograms – Hydrid

30

Summary

• Three types of histograms in 12c • introduced two new types (top frequency and hybrid)• Obsolete height-balanced

• Leveraged only when using AUTO_SAMPLE_SIZE

• Most of the old limitations have been addressed

• Additional granularity (2048 buckets) available if needed

31

Reference

• Oracle® Database SQL Tuning Guide 12c Release 1 (12.1)

• Master Note: Optimizer Statistics (Doc ID 1369591.1)• How to Gather Optimizer Statistics on 12c (Doc ID 1445302.1)

33

Contact info

• Email [email protected]

• Twitter @mautro

• Blog http://mauro-pagano.com

34

histograms in 12c era

Software