Interpreting Extended Statistics
Chinar Aliyev
As you know, Oracle 11g introduced extended statistics to improve selectivity estimation for correlated columns. But when and how does the query optimizer (QO) use these statistics? What are its restrictions? Let's see, step by step.
Correlated Columns
We will use the CUSTOMERS table in the SH schema.
SQL> select count(*) from customers
  2  where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790;

  COUNT(*)
----------
      3341

SQL>
Without histograms, the QO estimates the cardinality as below:

SQL> BEGIN
  2    DBMS_STATS.gather_table_stats ('SH',
  3                                   'CUSTOMERS',
  4                                   estimate_percent => null,
  5                                   cascade          => true,
  6                                   method_opt       => 'FOR ALL COLUMNS SIZE 1'
  7    );
  8  END;
  9  /

SQL> SELECT *
  2    FROM customers a
  3   where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790;   (Q1)

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |    20 |  3620 |   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |    20 |  3620 |   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

SQL>
With histograms, the QO estimates the cardinality as below:

SQL> BEGIN
  2    DBMS_STATS.gather_table_stats ('SH',
  3                                   'CUSTOMERS',
  4                                   estimate_percent => null,
  5                                   cascade          => true,
  6                                   method_opt       => 'FOR ALL COLUMNS SIZE skewonly'
  7    );
  8  END;
  9  /

SQL> SELECT *
  2    FROM customers a
  3   where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790;

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  1115 |   197K|   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  1115 |   197K|   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

SQL>
Even when we use histograms, the QO cannot estimate the correct cardinality, because there is column correlation here. To solve this problem, the RDBMS should gather statistics for the combination (concatenation) of these columns. For example, one of the most important statistics is the NDV (number of distinct values); to find and store this statistic in the data dictionary, the RDBMS should analyze both columns together (the number of distinct row groups), like below.

SQL> SELECT COUNT (*) ndv
  2    FROM (SELECT cust_state_province, country_id
  3            FROM customers
  4           GROUP BY cust_state_province, country_id)
  5  /

       NDV
----------
       145

SQL>
But the NDV alone is not enough to estimate the correct cardinality, because the data for the concatenated correlated columns can be skewed. We are still talking only about the simple predicate "where col1=a1 and col2=a2 and ... coln=an" (p1); the QO should be able to estimate the selectivity of column groups that contain skewed data. Therefore, in the Oracle RDBMS these (correlated) column groups are mapped to an equivalent virtual column, and the QO then uses the statistics of this virtual column to estimate the selectivity of predicate p1. So what happens when extended statistics are created? To create them you can use CREATE_EXTENDED_STATS, or GATHER_TABLE_STATS with the METHOD_OPT option of the DBMS_STATS package. In our example the cust_state_province and country_id columns are correlated, so we can create a column group for these columns as:

SQL> begin
  2    DBMS_STATS.GATHER_TABLE_STATS (
  3      'SH',
  4      'CUSTOMERS',
  5      estimate_percent => null,
  6      METHOD_OPT => 'FOR COLUMNS (CUST_STATE_PROVINCE, COUNTRY_ID) size 1');
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL>
If we enable SQL trace when creating the column group, we can find the statement below in the trace file.
Alter table "SH"."CUSTOMERS" add (SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ as (sys_op_combined_hash (CUST_STATE_PROVINCE, COUNTRY_ID)) virtual BY USER for statistics)
It means that when creating extended statistics, Oracle first adds a virtual column and then gathers statistics for this column. Why use a hash function? So far we have seen only p1 (known as "point" correlation). It means that every combination of col1, col2, ..., coln values must correspond to exactly one unique value when the virtual column is created. So there must be a Y = F(col1, col2, ..., coln) relationship between the correlated columns and the virtual column; if it exists, the QO can estimate the selectivity of these column groups. Function F must generate a unique value for every combination of input values. Of course, simply concatenating the columns would be enough in most cases, but using a hash function guarantees unique values, and this is the better option. Now we have one column group without a histogram, while each column separately has a histogram. Let's see what happens in this case for query (Q1):

Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  1115 |   210K|   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  1115 |   210K|   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)
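The virtual-column idea can be sketched in Python. This is a toy analogue only: the real sys_op_combined_hash algorithm is internal to Oracle, so the SHA-256-based hash below is an assumption used purely to illustrate why a combining function must map each distinct value combination to a single value.

```python
import hashlib

def combined_hash(*cols):
    # Toy analogue of a combined column hash (NOT Oracle's actual
    # sys_op_combined_hash): map a tuple of column values to a single
    # integer so the column group behaves like one virtual column.
    key = '\x00'.join(str(c) for c in cols)
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], 'big')

# Distinct (province, country) combinations map to distinct virtual
# values, so the NDV of the virtual column equals the NDV of the group.
rows = [('CA', 52790), ('CA', 52790), ('CO', 52790), ('CT', 52791)]
virtual = [combined_hash(p, c) for p, c in rows]
assert len(set(virtual)) == 3   # three distinct combinations
```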
And from the trace file:

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for CUSTOMERS[A]
  Column (#11): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#11): CUST_STATE_PROVINCE(
    AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  Column (#13): NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19
  Column (#13): COUNTRY_ID(
    AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791
    Histogram: Freq  #Bkts: 19  UncompBkts: 55500  EndPtVals: 19
  Column (#24): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_(
    AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.006897
  ColGroup (#1, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_
    Col#: 11 13    CorStregth: 19.00
  ColGroup Usage:: PredCnt: 2  Matches Full:  Partial:
  Table: CUSTOMERS  Alias: A
    Card: Original: 55500.000000  Rounded: 1115  Computed: 1114.87  Non Adjusted: 1114.87
  Access Path: TableScan
    Cost:  405.71  Resp: 405.71  Degree: 0
      Cost_io: 404.00  Cost_cpu: 35392510
      Resp_io: 404.00  Resp_cpu: 35392510
As you see, the QO detected the column group, but it did not use the virtual column statistics: in this case the QO did not estimate the selectivity from it (in the trace file, "Matches Full" and "Partial" for this column group (CG) are empty). It used the traditional method to estimate the selectivity.

SQL> select num_rows, blocks from user_tables where table_name='CUSTOMERS';

  NUM_ROWS     BLOCKS
---------- ----------
     55500       1486

SQL> select num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name in ('CUST_STATE_PROVINCE','COUNTRY_ID');

NUM_DISTINCT HISTOGRAM
------------ ---------------
         145 FREQUENCY
          19 FREQUENCY

SQL>
From the histogram for column CUST_STATE_PROVINCE:
Endpoint_number Endpoint_Actual_value
           7650 Brittany
           7980 Buenos Aires
          11321 CA
          12098 CO
          12255 CT
From the histogram for column COUNTRY_ID:
Endpoint_number Endpoint_value
          29169          52786
          29244          52787
          29335          52788
          36892          52789
          55412          52790
          55500          52791
These are frequency histograms, therefore the selectivities will be:

Sel(cust_state_province) = (11321 - 7980)/55500  = 3341/55500  = 0.06019
Sel(country_id)          = (55412 - 36892)/55500 = 18520/55500 = 0.33369
Sel(cust_state_province and country_id) = 0.06019 * 0.33369 = 0.02008
Card = num_rows * sel = 55500 * 0.02008 = 1114.87 ~ 1115
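The arithmetic above can be reproduced with a short Python sketch: the per-column selectivities come from the frequency-histogram bucket sizes, and without extended statistics the optimizer multiplies them as if the columns were independent.

```python
# Reproducing the article's arithmetic: selectivities from the two
# frequency histograms, combined under the independence assumption
# the optimizer falls back to without extended statistics.
num_rows = 55500
sel_province = (11321 - 7980) / num_rows    # 'CA' bucket:  3341 rows
sel_country  = (55412 - 36892) / num_rows   # 52790 bucket: 18520 rows
sel_combined = sel_province * sel_country   # ~ 0.0201
card = sel_combined * num_rows              # ~ 1114.87, rounded to 1115
```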
Now let's gather histogram statistics for this column group and see what happens.
SQL> BEGIN
  2    DBMS_STATS.gather_table_stats
  3    ('SH',
  4     'CUSTOMERS',
  5     estimate_percent => NULL,
  6     method_opt => 'FOR COLUMNS (CUST_STATE_PROVINCE,COUNTRY_ID) size skewonly'
  7    );
  8  END;
  9  /

SQL> select column_name, num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name='SYS_STU#S#WF25Z#QAHIHE#MOFFMM_';
COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_          145 FREQUENCY

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  3341 |   629K|   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  3341 |   629K|   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

SQL>
As you see in this case QO estimate cardinality correctly and from trace file SINGLE TABLE ACCESS PATH Single Table Cardinality Estimation for CUSTOMERS[A] Column (#11): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145 Column (#11): CUST_STATE_PROVINCE( AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144 Histogram: Freq #Bkts: 145 UncompBkts: 55500 EndPtVals: 145 Column (#13): NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19 Column (#13): COUNTRY_ID( AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791 Histogram: Freq #Bkts: 19 UncompBkts: 55500 EndPtVals: 19 Column (#24): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145 Column (#24): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_( AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.000144 Min: 22231259 Max: 9992664766 Histogram: Freq #Bkts: 145 UncompBkts: 55500 EndPtVals: 145 ColGroup (#1, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ Col#: 11 13 CorStregth: 19.00 ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0602 Table: CUSTOMERS Alias: A Card: Original: 55500.000000 Rounded: 3341 Computed: 3341.00 Non Adjusted: 3341.00 Access Path: TableScan Cost: 405.74 Resp: 405.74 Degree: 0 Cost_io: 404.00 Cost_cpu: 35837710 Resp_io: 404.00 Resp_cpu: 35837710
How does the QO estimate the selectivity? This is a frequency histogram; if we enable SQL trace while the histogram is created, we can see that Oracle builds it using the statement below.

select substrb(dump(val,16,0,32),1,120) ep, cnt
from (select /*+ no_expand_table(t) index_rs(t) no_parallel(t)
              no_parallel_index(t) dbms_stats cursor_sharing_exact
              use_weak_name_resl dynamic_sampling(0) no_monitoring
              no_substrb_pad */
             mod("SYS_STU#S#WF25Z#QAHIHE#MOFFMM_",9999999999) val, count(*) cnt
      from "SH"."CUST" t
      where mod("SYS_STU#S#WF25Z#QAHIHE#MOFFMM_",9999999999) is not null
      group by mod("SYS_STU#S#WF25Z#QAHIHE#MOFFMM_",9999999999))
order by val
Therefore we can use the SQL below, and the result is:

SELECT endpoint_number e
  FROM user_tab_histograms
 WHERE table_name = 'CUSTOMERS'
   AND column_name = 'SYS_STU#S#WF25Z#QAHIHE#MOFFMM_'
   AND endpoint_value = MOD (sys_op_combined_hash ('CA', 52790), 9999999999)

column_name                    endpoint_number endpoint_value
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           20225     4701058945
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           21244     4752431017
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           24585     4800861232
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           24813     4861997875

Sel  = (24585 - 21244)/55500 = 0.06019
Card = num_rows * sel = 3341
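This frequency-histogram lookup can be sketched in Python: endpoint_number is a cumulative row count, so the bucket size for a value is the difference from the previous endpoint. The endpoint pairs below are the ones shown above; the helper function name is my own.

```python
def freq_histogram_selectivity(histogram, value, num_rows):
    # histogram: list of (endpoint_number, endpoint_value) pairs,
    # ordered by endpoint_value; endpoint_number is cumulative.
    prev = 0
    for endpoint_number, endpoint_value in histogram:
        if endpoint_value == value:
            return (endpoint_number - prev) / num_rows
        prev = endpoint_number
    return None  # value not present in the histogram

# Endpoints around the hashed value of ('CA', 52790) from the listing:
hist = [(20225, 4701058945), (21244, 4752431017),
        (24585, 4800861232), (24813, 4861997875)]
sel = freq_histogram_selectivity(hist, 4800861232, 55500)
card = round(sel * 55500)   # 3341, matching the plan
```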
Two Column Groups Case 1
Now assume we have two column groups.
CG1= (cust_state_province, country_id) CG2= (CUST_CITY_ID, cust_state_province, country_id)
Consider the case where statistics were gathered for column group CG1 but not for CG2, so the selectivity of CG2 has to be estimated. In this case the execution plan was:
SQL> SELECT *
  2    FROM customers a
  3   where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790 and cust_city_id=51919;   (Q2)

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |    39 |  8424 |   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |    39 |  8424 |   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_CITY_ID"=51919 AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790)

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for CUSTOMERS[A]
  Column (#11): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#11): CUST_STATE_PROVINCE(
    AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  Column (#13): NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19
  Column (#13): COUNTRY_ID(
    AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791
    Histogram: Freq  #Bkts: 19  UncompBkts: 55500  EndPtVals: 19
  Column (#10): NewDensity:0.001189, OldDensity:0.002179 BktCnt:254, PopBktCnt:77, PopValCnt:34, NDV:620
  Column (#10): CUST_CITY_ID(
    AvgLen: 5 NDV: 620 Nulls: 0 Density: 0.001189 Min: 51040 Max: 52531
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 212
  Column (#26): SYS_STU4RAPXUESG1VO3#Q7ZH365D7(
    NO STATISTICS (using defaults)
    AvgLen: 13 NDV: 1734 Nulls: 0 Density: 0.000577
  Column (#25): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#25): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_(
    AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.000144 Min: 22231259 Max: 9992664766
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  ColGroup (#1, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_
    Col#: 11 13    CorStregth: 19.00
  ColGroup Usage:: PredCnt: 3  Matches Full: #1  Partial:  Sel: 0.0602
  Table: CUSTOMERS  Alias: A
    Card: Original: 55500.000000  Rounded: 39  Computed: 39.46  Non Adjusted: 39.46
  Access Path: TableScan
    Cost:  405.70  Resp: 405.70  Degree: 0
      Cost_io: 404.00  Cost_cpu: 35045008
      Resp_io: 404.00  Resp_cpu: 35045008
For column group CG1 we already know how the selectivity is calculated, and:

sel(CG2) = sel(CG1) * sel(cust_city_id)

SQL> select column_name, num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name='CUST_CITY_ID';

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
CUST_CITY_ID                            620 HEIGHT BALANCED

SQL>
From the histogram information for this column:

Endpoint_number endpoint_value
            157          51916
            158          51917
            161          51919
            162          51924
            163          51930
            165          51934
            166          51971

sel(CUST_CITY_ID) = (161 - 158)/num_buckets = 3/254 = 0.0118110236
sel(CG2) = sel(CG1) * sel(CUST_CITY_ID) = 7.1102362204724409448818897637795e-4
card = sel(CG2) * num_rows = 39.46
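The same calculation as a sketch: the known column-group selectivity is multiplied by the remaining column's selectivity from its height-balanced histogram.

```python
# Reproducing the CG2 estimate from the article's numbers.
num_rows    = 55500
sel_cg1     = 3341 / 55500          # from the CG1 frequency histogram
sel_city_id = (161 - 158) / 254     # height-balanced histogram, 254 buckets
sel_cg2     = sel_cg1 * sel_city_id
card        = sel_cg2 * num_rows    # ~ 39.46, rounded to 39 in the plan
```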
Actually this method is clear: even without CG2, the QO would estimate the cardinality as 39, because our predicate is CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790 and cust_city_id=51919, and the optimizer detected that there is a column group with sufficient statistics here. Therefore we can rewrite the predicate as

SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ = MOD (sys_op_combined_hash ('CA', 52790), 9999999999) and cust_city_id=51919

so the selectivity will be sel(SYS_STU#S#WF25Z#QAHIHE#MOFFMM_) * sel(cust_city_id).
Two Column Groups Case 2
Assume we have three column groups, as below:
CG1 = ("CUST_STATE_PROVINCE","COUNTRY_ID","CUST_CITY_ID") CG2 = ("CUST_STATE_PROVINCE","COUNTRY_ID") CG3 = ("CUST_STATE_PROVINCE","CUST_CITY_ID")
And our predicate is CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790 and CUST_CITY_ID=51919 (P2). So how will the QO estimate the selectivity in this case? The selectivity of CG1 has to be estimated, but CG1 has no statistics. However, statistics were gathered for the two column groups CG2 and CG3. According to the previous example, the selectivity of CG1 can be estimated as below:

Sel(CG1) = Sel(CG2) * Sel(CUST_CITY_ID)   (F1)
or

Sel(CG1) = Sel(CG3) * Sel(COUNTRY_ID)   (F2)
So which formula will the QO choose, and based on what? Let's see the execution plan and trace file.

SQL> select * from customers
  2  where CUST_STATE_PROVINCE='CA' and CUST_CITY_ID=51919 and COUNTRY_ID=52790;

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |   219 | 43800 |   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |   219 | 43800 |   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_CITY_ID"=51919 AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790)

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for CUSTOMERS[CUSTOMERS]
  Column (#13): NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19
  Column (#13): COUNTRY_ID(
    AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791
    Histogram: Freq  #Bkts: 19  UncompBkts: 55500  EndPtVals: 19
  Column (#11): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#11): CUST_STATE_PROVINCE(
    AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  Column (#10): NewDensity:0.001189, OldDensity:0.002179 BktCnt:254, PopBktCnt:77, PopValCnt:34, NDV:620
  Column (#10): CUST_CITY_ID(
    AvgLen: 5 NDV: 620 Nulls: 0 Density: 0.001189 Min: 51040 Max: 52531
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 212
  Column (#26): SYS_STU14HX98$V3_$3Z$ZSWQ0O8O0(
    AvgLen: 12 NDV: 620 Nulls: 0 Density: 0.001613
  Column (#25): NewDensity:0.001277, OldDensity:0.002351 BktCnt:254, PopBktCnt:62, PopValCnt:28, NDV:620
  Column (#25): SYS_STULHUROKG217F9$OWA1IEIZLA(
    AvgLen: 12 NDV: 620 Nulls: 0 Density: 0.001277 Min: 29269004 Max: 9981124071
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 221
  Column (#24): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#24): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_(
    AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.000144 Min: 22231259 Max: 9992664766
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  ColGroup (#1, VC) SYS_STU14HX98$V3_$3Z$ZSWQ0O8O0
    Col#: 10 11 13    CorStregth: 2755.00
  ColGroup (#2, VC) SYS_STULHUROKG217F9$OWA1IEIZLA
    Col#: 10 11    CorStregth: 145.00
  ColGroup (#3, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_
    Col#: 11 13    CorStregth: 19.00
  ColGroup Usage:: PredCnt: 3  Matches Full: #2  Partial:  Sel: 0.0118
  Table: CUSTOMERS  Alias: CUSTOMERS
    Card: Original: 55500.000000  Rounded: 219  Computed: 218.74  Non Adjusted: 218.74
  Access Path: TableScan

Extension_name                 Extension                              Histogram
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ ("CUST_STATE_PROVINCE","COUNTRY_ID")   FREQUENCY
SYS_STULHUROKG217F9$OWA1IEIZLA ("CUST_STATE_PROVINCE","CUST_CITY_ID") HEIGHT BALANCED

SQL> select column_name, num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name in ('CUST_STATE_PROVINCE','COUNTRY_ID','CUST_CITY_ID',
  4  'SYS_STULHUROKG217F9$OWA1IEIZLA','SYS_STU#S#WF25Z#QAHIHE#MOFFMM_')
  5  ;

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_          145 FREQUENCY
SYS_STULHUROKG217F9$OWA1IEIZLA          620 HEIGHT BALANCED
CUST_CITY_ID                            620 HEIGHT BALANCED
CUST_STATE_PROVINCE                     145 FREQUENCY
COUNTRY_ID                               19 FREQUENCY
As you see, the QO chose the SYS_STULHUROKG217F9$OWA1IEIZLA virtual column (Matches Full: #2). Why this one? Because the correlation strength of this CG is greater than the other's (145 > 19). CorStrength indicates how deeply the columns in a column group are correlated. It seems the QO identifies the correlation strength using NDVs, so:

CorStrength(col1, col2, ..., coln) = NDV(col1)*NDV(col2)*...*NDV(coln) / NDV(col1, col2, ..., coln)

CorStrength(cust_state_province, cust_city_id) = 145*620/620 = 145
CorStrength(cust_state_province, country_id)   = 145*19/145  = 19
From the histogram for COUNTRY_ID:

Endpoint_number Endpoint_value
          36892          52789
          55412          52790
          55500          52791
Therefore sel(country_id) = (55412 - 36892)/55500 = 0.33369.

sel(P2) = sel(CG3) * sel(country_id) = 0.0118 * 0.33369 = 0.00393758
Card = sel(P2) * num_rows = 0.00393758 * 55500 = 218.5 ~ 219
If CorStrength(col1, col2, ..., coln) = 1, it means the columns col1, col2, ..., coln are not correlated.
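The CorStrength heuristic described above can be sketched as a small function (the name `cor_strength` is my own):

```python
# CorStrength as described in the text: the product of the individual
# column NDVs divided by the NDV of the column group.
def cor_strength(ndvs, group_ndv):
    prod = 1
    for n in ndvs:
        prod *= n
    return prod / group_ndv

assert cor_strength([145, 620], 620) == 145  # (cust_state_province, cust_city_id)
assert cor_strength([145, 19], 145) == 19    # (cust_state_province, country_id)
assert cor_strength([10, 10], 100) == 1      # == 1 means no correlation
```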
Extended Statistics and Equijoin

The QO can also detect extended statistics in equijoin operations and use them.
SQL> create table t
  2  as
  3  select
  4      trunc(dbms_random.value(0,25)) n1,
  5      trunc(dbms_random.value(0,20)) n2,
  6      lpad(rownum,10,'0') small_vc
  7  from
  8      all_objects
  9  where
 10      rownum <= 10000
 11  ;

Table created.

SQL> update t set n2=n1 where rownum<=9955;

9955 rows updated.

SQL> commit;

Commit complete.

SQL> begin
  2    dbms_stats.gather_table_stats(
  3      user,
  4      't',
  5      cascade => true,
  6      estimate_percent => null,
  7      method_opt => 'for all columns size 1 FOR COLUMNS (n1,n2) size 1');
  8  end;
  9  /

PL/SQL procedure successfully completed.

SQL> select
  2      count(*)
  3  from
  4      t t1,
  5      t t2
  6  where
  7      t1.n1 = t2.n1
  8  and t1.n2 = t2.n2
  9  ;

Execution Plan
----------------------------------------------------------
Plan hash value: 791582492

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |    12 |    32  (25)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |    12 |            |          |
|*  2 |   HASH JOIN         |      |  1470K|    16M|    32  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T    | 10000 | 60000 |    12   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| T    | 10000 | 60000 |    12   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1" AND "T1"."N2"="T2"."N2")
I do not give the full text of the trace file here, only the needed information.
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T[T2]
  Table: T  Alias: T2
    Card: Original: 10000.000000  Rounded: 10000  Computed: 10000.00  Non Adjusted: 10000.00
  Access Path: TableScan
    Cost:  12.09  Resp: 12.09  Degree: 0
      Cost_io: 12.00  Cost_cpu: 1956372
      Resp_io: 12.00  Resp_cpu: 1956372
  Best:: AccessPath: TableScan
         Cost: 12.09  Degree: 1  Resp: 12.09  Card: 10000.00  Bytes: 0

  Column (#4): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(
    AvgLen: 12 NDV: 68 Nulls: 0 Density: 0.014706
  ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2    CorStregth: 9.19
  Column (#4): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(
    AvgLen: 12 NDV: 68 Nulls: 0 Density: 0.014706
  ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2    CorStregth: 9.19
  Join ColGroups for T[T1] and T[T2] : (#1, #1)
Therefore the join selectivity will be 0.014706, and the final cardinality 0.014706 * 10000 * 10000 = 1470600 ~ 1470K.
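The join estimate can be reproduced as a sketch: with equijoins on both n1 and n2, the optimizer can use the column group's density (1/NDV of the virtual column) instead of multiplying per-column join selectivities.

```python
# Join selectivity via the column-group virtual column.
ndv_cg   = 68                      # NDV of the (n1, n2) column group
join_sel = 1 / ndv_cg              # ~ 0.014706, the Density in the trace
card     = join_sel * 10000 * 10000  # ~ 1470588, shown as 1470K in the plan
```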
Projections
The QO can also estimate cardinality using extended statistics during GROUP BY operations.
SQL> select count(*) from (
  2  select count(*) from customers
  3  group by CUST_STATE_PROVINCE, COUNTRY_ID);

  COUNT(*)
----------
       145

SQL> select column_name, num_distinct from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name in ('CUST_STATE_PROVINCE','COUNTRY_ID');

COLUMN_NAME                    NUM_DISTINCT
------------------------------ ------------
CUST_STATE_PROVINCE                     145
COUNTRY_ID                               19

SQL> select count(*) from customers
  2  group by CUST_STATE_PROVINCE, COUNTRY_ID;

Execution Plan
----------------------------------------------------------
Plan hash value: 1577413243

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |  1949 | 31184 |   408   (1)| 00:00:05 |
|   1 |  HASH GROUP BY     |           |  1949 | 31184 |   408   (1)| 00:00:05 |
|   2 |   TABLE ACCESS FULL| CUSTOMERS | 55500 |   867K|   406   (1)| 00:00:05 |
--------------------------------------------------------------------------------

SQL>
Without extended statistics, the QO estimates the group-by cardinality as 145*19/sqrt(2) ~ 1948.1, shown as 1949 in the plan.
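This default estimate can be reproduced as a sketch, following the formula the article gives (the product of the grouping columns' NDVs, scaled by sqrt(2) for the extra grouping column):

```python
import math

# Default group-by estimate without extended statistics, per the
# article: NDV(province) * NDV(country) / sqrt(2).
ndv_province, ndv_country = 145, 19
est = ndv_province * ndv_country / math.sqrt(2)   # ~ 1948.1
# With the column group's statistics, the estimate is simply its NDV: 145.
```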
But with extended statistics, the estimated cardinality will be correct:
SQL> begin DBMS_STATS.GATHER_TABLE_STATS(
2 'SH',
3 'CUSTOMERS',
4 estimate_percent=>null,
5 METHOD_OPT =>'FOR COLUMNS (cust_state_province,country_id) size 1');
6 end;
7 /
PL/SQL procedure successfully completed.
SQL> select count(*) from customers
2 group by CUST_STATE_PROVINCE,COUNTRY_ID;
Execution Plan
----------------------------------------------------------
Plan hash value: 1577413243
--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   145 |  2320 |   408   (1)| 00:00:05 |
|   1 |  HASH GROUP BY     |           |   145 |  2320 |   408   (1)| 00:00:05 |
|   2 |   TABLE ACCESS FULL| CUSTOMERS | 55500 |   867K|   406   (1)| 00:00:05 |
--------------------------------------------------------------------------------
Identifying Candidate Columns for Column Groups Based on Workload Statistics
Oracle provides some procedures for finding candidate columns for column groups. But this method does not work based on statistics or real data; it looks like it just finds candidate columns from the workload captured in dynamic performance views (like v$sql, v$sql_plan). It means that in this case Oracle does not investigate the real column correlation. Let's see the example below.
SQL> create table t_candidate
2 as
3 select
4 trunc(dbms_random.value(0,25)) p1,
5 trunc(dbms_random.value(0,20)) p2,
6 lpad(rownum,10,'0') padding
7 from
8 all_objects
9 where
10 rownum <= 10000
11 ;
Table created.
SQL>
SQL> begin
2 dbms_stats.gather_table_stats(
3 user,
4 't_candidate',
5 cascade => true,
6 estimate_percent=>null,
7 method_opt=> 'for all columns size 1');
8 end;
9 /
PL/SQL procedure successfully completed.
SQL> Exec DBMS_STATS.SEED_COL_USAGE(null,null,120);
PL/SQL procedure successfully completed.
SQL> select count(*) from t_candidate where p1=19 and p2=14;
COUNT(*)
----------
19
SQL>
SQL> select * from table(dbms_xplan.display_cursor);
PLAN_TABLE_OUTPUT
SQL_ID 9g4vdacy7pc62, child number 0
-------------------------------------
select count(*) from t_candidate where p1=19 and p2=14
Plan hash value: 374408457
--------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |    12 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_CANDIDATE |    20 |   120 |    12   (0)| 00:00:01 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
2 - filter(("P1"=19 AND "P2"=14))
19 rows selected.
SQL> SET LONG 7000
SQL> SET LONGCHUNKSIZE 7000
SQL> SET LINESIZE 500
SQL> Select dbms_stats.report_col_usage('SH','t_candidate') from dual ;
DBMS_STATS.REPORT_COL_USAGE('SH','T_CANDIDATE')
--------------------------------------------------------------------------------
--------------------
LEGEND:
.......
EQ : Used in single table EQuality predicate
RANGE : Used in single table RANGE predicate
LIKE : Used in single table LIKE predicate
NULL : Used in single table is (not) NULL predicate
EQ_JOIN : Used in EQuality JOIN predicate
NONEQ_JOIN : Used in NON EQuality JOIN predicate
FILTER : Used in single table FILTER predicate
JOIN : Used in JOIN predicate
DBMS_STATS.REPORT_COL_USAGE('SH','T_CANDIDATE')
--------------------------------------------------------------------------------
--------------------
GROUP_BY : Used in GROUP BY expression
...............................................................................
###############################################################################
COLUMN USAGE REPORT FOR SH.T_CANDIDATE
......................................
1. P1 : EQ
2. P2 : EQ
3. (P1, P2) : FILTER
DBMS_STATS.REPORT_COL_USAGE('SH','T_CANDIDATE')
SQL> select dbms_stats.create_extended_stats('SH','t_candidate') from dual;
DBMS_STATS.CREATE_EXTENDED_STATS('SH','T_CANDIDATE')
--------------------------------------------------------------------------------
--------------------
###############################################################################
EXTENSIONS FOR SH.T_CANDIDATE
.............................
1. (P1, P2) : SYS_STUIV1F__U9NUVZ7#MDKL81$SY created
###############################################################################
SQL> exec dbms_stats.gather_table_stats('SH','t_candidate',method_opt=>'for all columns size skewonly for columns (p1,p2) size skewonly');
PL/SQL procedure successfully completed.
SQL> select column_name,num_distinct,histogram from user_tab_col_statistics where table_name='T_CANDIDATE';
COLUMN_NAME NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
P1 25 FREQUENCY
P2 20 FREQUENCY
PADDING 10000 HEIGHT BALANCED
SYS_STUIV1F__U9NUVZ7#MDKL81$SY 500 NONE
SQL> select count(*) from t_candidate where p1=19 and p2=14;
COUNT(*)
----------
19
SQL> select * from table(dbms_xplan.display_cursor);
PLAN_TABLE_OUTPUT
------------------------------------------------------------
SQL_ID 9g4vdacy7pc62, child number 0
-------------------------------------
select count(*) from t_candidate where p1=19 and p2=14
Plan hash value: 374408457
--------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |    12 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_CANDIDATE |    20 |   120 |    12   (0)| 00:00:01 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
2 - filter(("P1"=19 AND "P2"=14))
SQL>
So this method does not discover only correlated columns; the result is that the candidate columns can also contain non-correlated/independent columns.
The QO can also use column-group statistics through a composite index, without the column group being explicitly added to the data dictionary. In this case the selectivity will be calculated based on the DISTINCT_KEYS of the index (but I have not fully investigated that). Another question concerns SQL profiles (SQP) and correlated data. Of course, if there is column correlation, you can use a SQL profile if "accept SQL profile" appears as the result of a SQL Tuning Advisor task. A SQL profile is a collection of internal hints (like OPT_ESTIMATE); using an "offline" optimization method, it estimates selectivity/cardinality accurately and gives that information to the "online" optimizer so it can choose the best plan. Finally, note that Oracle's QO still cannot use extended statistics (to estimate the selectivity of correlated columns) for non-equality, range and out-of-bound predicates; maybe such cases need additional statistics (and a gathering method) and will be solved in future releases.