Interpreting Extended Statistics
Chinar Aliyev
As you know, Oracle 11g introduced extended statistics to improve selectivity estimation for correlated columns. But when and how does the query optimizer (QO) use these statistics? What are its restrictions? Let's see, step by step.
Correlated Columns
We will use the CUSTOMERS table in the SH schema.
SQL> select count(*) from customers
  2  where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790;

  COUNT(*)
----------
      3341

SQL>
Without histograms, the QO estimates the cardinality as below:

SQL> BEGIN
  2    DBMS_STATS.gather_table_stats ('SH',
  3                                   'CUSTOMERS',
  4                                   estimate_percent => null,
  5                                   cascade          => true,
  6                                   method_opt       => 'FOR ALL COLUMNS SIZE 1'
  7    );
  8  END;
  9  /

SQL> SELECT *
  2    FROM customers a
  3   where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790;   (Q1)

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |    20 |  3620 |   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |    20 |  3620 |   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

SQL>
With histograms, the QO estimates the cardinality as below:

SQL> BEGIN
  2    DBMS_STATS.gather_table_stats ('SH',
  3                                   'CUSTOMERS',
  4                                   estimate_percent => null,
  5                                   cascade          => true,
  6                                   method_opt       => 'FOR ALL COLUMNS SIZE skewonly'
  7    );
  8  END;
  9  /

SQL> SELECT *
  2    FROM customers a
  3   where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790;

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  1115 |   197K|   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  1115 |   197K|   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

SQL>
Even when we use histograms, the QO cannot estimate the correct cardinality, because there is column correlation here. To solve this problem, the RDBMS should gather statistics for the combination (concatenation) of these columns. For example, one of the most important statistics is the NDV (number of distinct values); to find and store this statistic in the data dictionary, the RDBMS should analyze both columns together (the number of distinct row groups), like below.

SQL> SELECT COUNT (*) ndv
  2    FROM (SELECT cust_state_province, country_id
  3            FROM customers
  4           GROUP BY cust_state_province, country_id)
  5  /

       NDV
----------
       145

SQL>
But the NDV alone is not enough to estimate the correct cardinality, because the data for the concatenated correlated columns can be skewed. We are still talking only about the simple predicate "where col1=a1 and col2=a2 and ... coln=an" (p1); the QO should be able to estimate the selectivity of column groups that contain skewed data. Therefore, in the Oracle RDBMS these (correlated) column groups are mapped to an equivalent virtual column, and the QO then uses the statistics of this virtual column to estimate the selectivity of predicate p1. So what happens when extended statistics are created? To create them you can use CREATE_EXTENDED_STATS, or GATHER_TABLE_STATS with the METHOD_OPT option of the DBMS_STATS package. In our example the cust_state_province and country_id columns are correlated, so we can create a column group for these columns as:

SQL> begin
  2    DBMS_STATS.GATHER_TABLE_STATS (
  3      'SH',
  4      'CUSTOMERS',
  5      estimate_percent => null,
  6      METHOD_OPT => 'FOR COLUMNS (CUST_STATE_PROVINCE, COUNTRY_ID) size 1');
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL>
If we enable SQL trace when creating the column group, we can find the statement below in the trace file.
Alter table "SH"."CUSTOMERS" add (SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ as (sys_op_combined_hash (CUST_STATE_PROVINCE, COUNTRY_ID)) virtual BY USER for statistics)
It means that when creating extended statistics, Oracle first adds a virtual column and then gathers statistics for this column. Why use a hash function? So far we have seen only p1 (known as "point" correlation). It means that every combination of col1, col2, ..., coln values must correspond to exactly one unique value when the virtual column is created. So there must be a Y = F(col1, col2, ..., coln) relationship between the correlated columns and the virtual column; if it exists, the QO can estimate the selectivity of these column groups. Function F must generate a unique value for every combination of input values. Of course, simply concatenating the columns would be enough in most cases, but using a hash function guarantees unique values, and this is the better option. Now we have one column group without a histogram, while each column separately has a histogram. Let's see what happens in this case for query (Q1):

Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  1115 |   210K|   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  1115 |   210K|   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)
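The virtual-column idea can be sketched in Python. This is a toy analogue only: the real sys_op_combined_hash algorithm is internal to Oracle, so the SHA-256-based hash below is an assumption used purely to illustrate why a combining function must map each distinct value combination to a single value.

```python
import hashlib

def combined_hash(*cols):
    # Toy analogue of a combined column hash (NOT Oracle's actual
    # sys_op_combined_hash): map a tuple of column values to a single
    # integer so the column group behaves like one virtual column.
    key = '\x00'.join(str(c) for c in cols)
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], 'big')

# Distinct (province, country) combinations map to distinct virtual
# values, so the NDV of the virtual column equals the NDV of the group.
rows = [('CA', 52790), ('CA', 52790), ('CO', 52790), ('CT', 52791)]
virtual = [combined_hash(p, c) for p, c in rows]
assert len(set(virtual)) == 3   # three distinct combinations
```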
And from the trace file:

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for CUSTOMERS[A]
  Column (#11): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#11): CUST_STATE_PROVINCE(
    AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  Column (#13): NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19
  Column (#13): COUNTRY_ID(
    AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791
    Histogram: Freq  #Bkts: 19  UncompBkts: 55500  EndPtVals: 19
  Column (#24): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_(
    AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.006897
  ColGroup (#1, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_
    Col#: 11 13    CorStregth: 19.00
  ColGroup Usage:: PredCnt: 2  Matches Full:  Partial:
  Table: CUSTOMERS  Alias: A
    Card: Original: 55500.000000  Rounded: 1115  Computed: 1114.87  Non Adjusted: 1114.87
  Access Path: TableScan
    Cost:  405.71  Resp: 405.71  Degree: 0
      Cost_io: 404.00  Cost_cpu: 35392510
      Resp_io: 404.00  Resp_cpu: 35392510
As you see, the QO detected the column group, but it did not use the virtual column statistics: in this case the QO did not estimate the selectivity from it (in the trace file, "Matches Full" and "Partial" for this column group (CG) are empty). It used the traditional method to estimate the selectivity.

SQL> select num_rows, blocks from user_tables where table_name='CUSTOMERS';

  NUM_ROWS     BLOCKS
---------- ----------
     55500       1486

SQL> select num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name in ('CUST_STATE_PROVINCE','COUNTRY_ID');

NUM_DISTINCT HISTOGRAM
------------ ---------------
         145 FREQUENCY
          19 FREQUENCY

SQL>
From the histogram for column CUST_STATE_PROVINCE:
Endpoint_number Endpoint_Actual_value
           7650 Brittany
           7980 Buenos Aires
          11321 CA
          12098 CO
          12255 CT
From the histogram for column COUNTRY_ID:
Endpoint_number Endpoint_value
          29169          52786
          29244          52787
          29335          52788
          36892          52789
          55412          52790
          55500          52791
These are frequency histograms, therefore the selectivities will be:

Sel(cust_state_province) = (11321 - 7980)/55500  = 3341/55500  = 0.06019
Sel(country_id)          = (55412 - 36892)/55500 = 18520/55500 = 0.33369
Sel(cust_state_province and country_id) = 0.06019 * 0.33369 = 0.02008
Card = num_rows * sel = 55500 * 0.02008 = 1114.87 ~ 1115
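The arithmetic above can be reproduced with a short Python sketch: the per-column selectivities come from the frequency-histogram bucket sizes, and without extended statistics the optimizer multiplies them as if the columns were independent.

```python
# Reproducing the article's arithmetic: selectivities from the two
# frequency histograms, combined under the independence assumption
# the optimizer falls back to without extended statistics.
num_rows = 55500
sel_province = (11321 - 7980) / num_rows    # 'CA' bucket:  3341 rows
sel_country  = (55412 - 36892) / num_rows   # 52790 bucket: 18520 rows
sel_combined = sel_province * sel_country   # ~ 0.0201
card = sel_combined * num_rows              # ~ 1114.87, rounded to 1115
```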
Now let's gather histogram statistics for this column group and see what happens.
SQL> BEGIN
  2    DBMS_STATS.gather_table_stats
  3    ('SH',
  4     'CUSTOMERS',
  5     estimate_percent => NULL,
  6     method_opt => 'FOR COLUMNS (CUST_STATE_PROVINCE,COUNTRY_ID) size skewonly'
  7    );
  8  END;
  9  /

SQL> select column_name, num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name='SYS_STU#S#WF25Z#QAHIHE#MOFFMM_';
COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_          145 FREQUENCY

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  3341 |   629K|   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  3341 |   629K|   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

SQL>
As you see in this case QO estimate cardinality correctly and from trace file SINGLE TABLE ACCESS PATH Single Table Cardinality Estimation for CUSTOMERS[A] Column (#11): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145 Column (#11): CUST_STATE_PROVINCE( AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144 Histogram: Freq #Bkts: 145 UncompBkts: 55500 EndPtVals: 145 Column (#13): NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19 Column (#13): COUNTRY_ID( AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791 Histogram: Freq #Bkts: 19 UncompBkts: 55500 EndPtVals: 19 Column (#24): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145 Column (#24): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_( AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.000144 Min: 22231259 Max: 9992664766 Histogram: Freq #Bkts: 145 UncompBkts: 55500 EndPtVals: 145 ColGroup (#1, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ Col#: 11 13 CorStregth: 19.00 ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0602 Table: CUSTOMERS Alias: A Card: Original: 55500.000000 Rounded: 3341 Computed: 3341.00 Non Adjusted: 3341.00 Access Path: TableScan Cost: 405.74 Resp: 405.74 Degree: 0 Cost_io: 404.00 Cost_cpu: 35837710 Resp_io: 404.00 Resp_cpu: 35837710
How does the QO estimate the selectivity? This is a frequency histogram; if we enable SQL trace while the histogram is created, we can see that Oracle builds it using the statement below.

select substrb(dump(val,16,0,32),1,120) ep, cnt
from (select /*+ no_expand_table(t) index_rs(t) no_parallel(t)
              no_parallel_index(t) dbms_stats cursor_sharing_exact
              use_weak_name_resl dynamic_sampling(0) no_monitoring
              no_substrb_pad */
             mod("SYS_STU#S#WF25Z#QAHIHE#MOFFMM_",9999999999) val, count(*) cnt
      from "SH"."CUST" t
      where mod("SYS_STU#S#WF25Z#QAHIHE#MOFFMM_",9999999999) is not null
      group by mod("SYS_STU#S#WF25Z#QAHIHE#MOFFMM_",9999999999))
order by val
Therefore we can use the SQL below, and the result is:

SELECT endpoint_number e
  FROM user_tab_histograms
 WHERE table_name = 'CUSTOMERS'
   AND column_name = 'SYS_STU#S#WF25Z#QAHIHE#MOFFMM_'
   AND endpoint_value = MOD (sys_op_combined_hash ('CA', 52790), 9999999999)

column_name                    endpoint_number endpoint_value
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           20225     4701058945
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           21244     4752431017
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           24585     4800861232
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           24813     4861997875

Sel  = (24585 - 21244)/55500 = 0.06019
Card = num_rows * sel = 3341
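This frequency-histogram lookup can be sketched in Python: endpoint_number is a cumulative row count, so the bucket size for a value is the difference from the previous endpoint. The endpoint pairs below are the ones shown above; the helper function name is my own.

```python
def freq_histogram_selectivity(histogram, value, num_rows):
    # histogram: list of (endpoint_number, endpoint_value) pairs,
    # ordered by endpoint_value; endpoint_number is cumulative.
    prev = 0
    for endpoint_number, endpoint_value in histogram:
        if endpoint_value == value:
            return (endpoint_number - prev) / num_rows
        prev = endpoint_number
    return None  # value not present in the histogram

# Endpoints around the hashed value of ('CA', 52790) from the listing:
hist = [(20225, 4701058945), (21244, 4752431017),
        (24585, 4800861232), (24813, 4861997875)]
sel = freq_histogram_selectivity(hist, 4800861232, 55500)
card = round(sel * 55500)   # 3341, matching the plan
```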
Two Column Groups Case 1
Now assume we have two column groups.
CG1= (cust_state_province, country_id) CG2= (CUST_CITY_ID, cust_state_province, country_id)
Consider the case where statistics were gathered for column group CG1 but not for CG2, so the selectivity of CG2 has to be estimated. In this case the execution plan was:
SQL> SELECT *
  2    FROM customers a
  3   where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790 and cust_city_id=51919;   (Q2)

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |    39 |  8424 |   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |    39 |  8424 |   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_CITY_ID"=51919 AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790)

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for CUSTOMERS[A]
  Column (#11): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#11): CUST_STATE_PROVINCE(
    AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  Column (#13): NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19
  Column (#13): COUNTRY_ID(
    AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791
    Histogram: Freq  #Bkts: 19  UncompBkts: 55500  EndPtVals: 19
  Column (#10): NewDensity:0.001189, OldDensity:0.002179 BktCnt:254, PopBktCnt:77, PopValCnt:34, NDV:620
  Column (#10): CUST_CITY_ID(
    AvgLen: 5 NDV: 620 Nulls: 0 Density: 0.001189 Min: 51040 Max: 52531
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 212
  Column (#26): SYS_STU4RAPXUESG1VO3#Q7ZH365D7(
    NO STATISTICS (using defaults)
    AvgLen: 13 NDV: 1734 Nulls: 0 Density: 0.000577
  Column (#25): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#25): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_(
    AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.000144 Min: 22231259 Max: 9992664766
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  ColGroup (#1, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_
    Col#: 11 13    CorStregth: 19.00
  ColGroup Usage:: PredCnt: 3  Matches Full: #1  Partial:  Sel: 0.0602
  Table: CUSTOMERS  Alias: A
    Card: Original: 55500.000000  Rounded: 39  Computed: 39.46  Non Adjusted: 39.46
  Access Path: TableScan
    Cost:  405.70  Resp: 405.70  Degree: 0
      Cost_io: 404.00  Cost_cpu: 35045008
      Resp_io: 404.00  Resp_cpu: 35045008
For column group CG1 we already know how the selectivity is calculated, and:

sel(CG2) = sel(CG1) * sel(cust_city_id)

SQL> select column_name, num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name='CUST_CITY_ID';

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
CUST_CITY_ID                            620 HEIGHT BALANCED

SQL>
From the histogram information for this column:

Endpoint_number endpoint_value
            157          51916
            158          51917
            161          51919
            162          51924
            163          51930
            165          51934
            166          51971

sel(CUST_CITY_ID) = (161 - 158)/num_buckets = 3/254 = 0.0118110236
sel(CG2) = sel(CG1) * sel(CUST_CITY_ID) = 7.1102362204724409448818897637795e-4
card = sel(CG2) * num_rows = 39.46
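The same calculation as a sketch: the known column-group selectivity is multiplied by the remaining column's selectivity from its height-balanced histogram.

```python
# Reproducing the CG2 estimate from the article's numbers.
num_rows    = 55500
sel_cg1     = 3341 / 55500          # from the CG1 frequency histogram
sel_city_id = (161 - 158) / 254     # height-balanced histogram, 254 buckets
sel_cg2     = sel_cg1 * sel_city_id
card        = sel_cg2 * num_rows    # ~ 39.46, rounded to 39 in the plan
```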
Actually this method is clear: even without CG2, the QO would estimate the cardinality as 39, because our predicate is CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790 and cust_city_id=51919, and the optimizer detected that there is a column group with sufficient statistics here. Therefore we can rewrite the predicate as

SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ = MOD (sys_op_combined_hash ('CA', 52790), 9999999999) and cust_city_id=51919

so the selectivity will be sel(SYS_STU#S#WF25Z#QAHIHE#MOFFMM_) * sel(cust_city_id).
Two Column Groups Case 2
Assume we have three column groups, as below:
CG1 = ("CUST_STATE_PROVINCE","COUNTRY_ID","CUST_CITY_ID") CG2 = ("CUST_STATE_PROVINCE","COUNTRY_ID") CG3 = ("CUST_STATE_PROVINCE","CUST_CITY_ID")
And our predicate is CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790 and CUST_CITY_ID=51919 (P2). So how will the QO estimate the selectivity in this case? The selectivity of CG1 has to be estimated, but CG1 has no statistics. However, statistics were gathered for the two column groups CG2 and CG3. According to the previous example, the selectivity of CG1 can be estimated as below:

Sel(CG1) = Sel(CG2) * Sel(CUST_CITY_ID)   (F1)
or

Sel(CG1) = Sel(CG3) * Sel(COUNTRY_ID)   (F2)
So which formula will the QO choose, and based on what? Let's see the execution plan and trace file.

SQL> select * from customers
  2  where CUST_STATE_PROVINCE='CA' and CUST_CITY_ID=51919 and COUNTRY_ID=52790;

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |   219 | 43800 |   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |   219 | 43800 |   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_CITY_ID"=51919 AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790)

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for CUSTOMERS[CUSTOMERS]
  Column (#13): NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19
  Column (#13): COUNTRY_ID(
    AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791
    Histogram: Freq  #Bkts: 19  UncompBkts: 55500  EndPtVals: 19
  Column (#11): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#11): CUST_STATE_PROVINCE(
    AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  Column (#10): NewDensity:0.001189, OldDensity:0.002179 BktCnt:254, PopBktCnt:77, PopValCnt:34, NDV:620
  Column (#10): CUST_CITY_ID(
    AvgLen: 5 NDV: 620 Nulls: 0 Density: 0.001189 Min: 51040 Max: 52531
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 212
  Column (#26): SYS_STU14HX98$V3_$3Z$ZSWQ0O8O0(
    AvgLen: 12 NDV: 620 Nulls: 0 Density: 0.001613
  Column (#25): NewDensity:0.001277, OldDensity:0.002351 BktCnt:254, PopBktCnt:62, PopValCnt:28, NDV:620
  Column (#25): SYS_STULHUROKG217F9$OWA1IEIZLA(
    AvgLen: 12 NDV: 620 Nulls: 0 Density: 0.001277 Min: 29269004 Max: 9981124071
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 221
  Column (#24): NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#24): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_(
    AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.000144 Min: 22231259 Max: 9992664766
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  ColGroup (#1, VC) SYS_STU14HX98$V3_$3Z$ZSWQ0O8O0
    Col#: 10 11 13    CorStregth: 2755.00
  ColGroup (#2, VC) SYS_STULHUROKG217F9$OWA1IEIZLA
    Col#: 10 11    CorStregth: 145.00
  ColGroup (#3, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_
    Col#: 11 13    CorStregth: 19.00
  ColGroup Usage:: PredCnt: 3  Matches Full: #2  Partial:  Sel: 0.0118
  Table: CUSTOMERS  Alias: CUSTOMERS
    Card: Original: 55500.000000  Rounded: 219  Computed: 218.74  Non Adjusted: 218.74
  Access Path: TableScan

Extension_name                 Extension                              Histogram
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ ("CUST_STATE_PROVINCE","COUNTRY_ID")   FREQUENCY
SYS_STULHUROKG217F9$OWA1IEIZLA ("CUST_STATE_PROVINCE","CUST_CITY_ID") HEIGHT BALANCED

SQL> select column_name, num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name in ('CUST_STATE_PROVINCE','COUNTRY_ID','CUST_CITY_ID',
  4  'SYS_STULHUROKG217F9$OWA1IEIZLA','SYS_STU#S#WF25Z#QAHIHE#MOFFMM_')
  5  ;

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_          145 FREQUENCY
SYS_STULHUROKG217F9$OWA1IEIZLA          620 HEIGHT BALANCED
CUST_CITY_ID                            620 HEIGHT BALANCED
CUST_STATE_PROVINCE                     145 FREQUENCY
COUNTRY_ID                               19 FREQUENCY
As you see, the QO chose the SYS_STULHUROKG217F9$OWA1IEIZLA virtual column (Matches Full: #2). Why this one? Because the correlation strength of this CG is greater than the other's (145 > 19). CorStrength indicates how deeply the columns in a column group are correlated. It seems the QO identifies the correlation strength using NDVs, so:

CorStrength(col1, col2, ..., coln) = NDV(col1)*NDV(col2)*...*NDV(coln) / NDV(col1, col2, ..., coln)

CorStrength(cust_state_province, cust_city_id) = 145*620/620 = 145
CorStrength(cust_state_province, country_id)   = 145*19/145  = 19
From the histogram for COUNTRY_ID:

Endpoint_number Endpoint_value
          36892          52789
          55412          52790
          55500          52791
Therefore sel(country_id) = (55412 - 36892)/55500 = 0.33369.

sel(P2) = sel(CG3) * sel(country_id) = 0.0118 * 0.33369 = 0.00393758
Card = sel(P2) * num_rows = 0.00393758 * 55500 = 218.5 ~ 219
If CorStrength(col1, col2, ..., coln) = 1, it means the columns col1, col2, ..., coln are not correlated.
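The CorStrength heuristic described above can be sketched as a small function (the name `cor_strength` is my own):

```python
# CorStrength as described in the text: the product of the individual
# column NDVs divided by the NDV of the column group.
def cor_strength(ndvs, group_ndv):
    prod = 1
    for n in ndvs:
        prod *= n
    return prod / group_ndv

assert cor_strength([145, 620], 620) == 145  # (cust_state_province, cust_city_id)
assert cor_strength([145, 19], 145) == 19    # (cust_state_province, country_id)
assert cor_strength([10, 10], 100) == 1      # == 1 means no correlation
```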
Extended Statistics and Equijoin

The QO can also detect extended statistics in equijoin operations and use them.
SQL> create table t
  2  as
  3  select
  4      trunc(dbms_random.value(0,25)) n1,
  5      trunc(dbms_random.value(0,20)) n2,
  6      lpad(rownum,10,'0') small_vc
  7  from
  8      all_objects
  9  where
 10      rownum <= 10000
 11  ;

Table created.

SQL> update t set n2=n1 where rownum<=9955;

9955 rows updated.

SQL> commit;

Commit complete.

SQL> begin
  2    dbms_stats.gather_table_stats(
  3      user,
  4      't',
  5      cascade => true,
  6      estimate_percent => null,
  7      method_opt => 'for all columns size 1 FOR COLUMNS (n1,n2) size 1');
  8  end;
  9  /

PL/SQL procedure successfully completed.

SQL> select
  2      count(*)
  3  from
  4      t t1,
  5      t t2
  6  where
  7      t1.n1 = t2.n1
  8  and t1.n2 = t2.n2
  9  ;

Execution Plan
----------------------------------------------------------
Plan hash value: 791582492

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |    12 |    32  (25)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |    12 |            |          |
|*  2 |   HASH JOIN         |      |  1470K|    16M|    32  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T    | 10000 | 60000 |    12   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| T    | 10000 | 60000 |    12   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1" AND "T1"."N2"="T2"."N2")
I do not give the full text of the trace file here, only the needed information.
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T[T2]
  Table: T  Alias: T2
    Card: Original: 10000.000000  Rounded: 10000  Computed: 10000.00  Non Adjusted: 10000.00
  Access Path: TableScan
    Cost:  12.09  Resp: 12.09  Degree: 0
      Cost_io: 12.00  Cost_cpu: 1956372
      Resp_io: 12.00  Resp_cpu: 1956372
  Best:: AccessPath: TableScan
         Cost: 12.09  Degree: 1  Resp: 12.09  Card: 10000.00  Bytes: 0

  Column (#4): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(
    AvgLen: 12 NDV: 68 Nulls: 0 Density: 0.014706
  ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2    CorStregth: 9.19
  Column (#4): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(
    AvgLen: 12 NDV: 68 Nulls: 0 Density: 0.014706
  ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2    CorStregth: 9.19
  Join ColGroups for T[T1] and T[T2] : (#1, #1)
Therefore the join selectivity will be 0.014706, and the final cardinality 0.014706 * 10000 * 10000 = 1470600 ~ 1470K.
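The join estimate can be reproduced as a sketch: with equijoins on both n1 and n2, the optimizer can use the column group's density (1/NDV of the virtual column) instead of multiplying per-column join selectivities.

```python
# Join selectivity via the column-group virtual column.
ndv_cg   = 68                      # NDV of the (n1, n2) column group
join_sel = 1 / ndv_cg              # ~ 0.014706, the Density in the trace
card     = join_sel * 10000 * 10000  # ~ 1470588, shown as 1470K in the plan
```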
Projections
The QO can also estimate cardinality using extended statistics during GROUP BY operations.
SQL> select count(*) from (
  2  select count(*) from customers
  3  group by CUST_STATE_PROVINCE, COUNTRY_ID);

  COUNT(*)
----------
       145

SQL> select column_name, num_distinct from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name in ('CUST_STATE_PROVINCE','COUNTRY_ID');

COLUMN_NAME                    NUM_DISTINCT
------------------------------ ------------
CUST_STATE_PROVINCE                     145
COUNTRY_ID                               19

SQL> select count(*) from customers
  2  group by CUST_STATE_PROVINCE, COUNTRY_ID;

Execution Plan
----------------------------------------------------------
Plan hash value: 1577413243

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |  1949 | 31184 |   408   (1)| 00:00:05 |
|   1 |  HASH GROUP BY     |           |  1949 | 31184 |   408   (1)| 00:00:05 |
|   2 |   TABLE ACCESS FULL| CUSTOMERS | 55500 |   867K|   406   (1)| 00:00:05 |
--------------------------------------------------------------------------------

SQL>
Without extended statistics, the QO estimates the group-by cardinality as 145*19/sqrt(2) ~ 1948.1, shown as 1949 in the plan.
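This default estimate can be reproduced as a sketch, following the formula the article gives (the product of the grouping columns' NDVs, scaled by sqrt(2) for the extra grouping column):

```python
import math

# Default group-by estimate without extended statistics, per the
# article: NDV(province) * NDV(country) / sqrt(2).
ndv_province, ndv_country = 145, 19
est = ndv_province * ndv_country / math.sqrt(2)   # ~ 1948.1
# With the column group's statistics, the estimate is simply its NDV: 145.
```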
But with extended statistics, the estimated cardinality will be correct:
SQL> begin DBMS_STATS.GATHER_TABLE_STATS(
2 'SH',
3 'CUSTOMERS',
4 estimate_percent=>null,
5 METHOD_OPT =>'FOR COLUMNS (cust_state_province,country_id) size 1');
6 end;
7 /
PL/SQL procedure successfully completed.
SQL> select count(*) from customers
2 group by CUST_STATE_PROVINCE,COUNTRY_ID;
Execution Plan
----------------------------------------------------------
Plan hash value: 1577413243
--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   145 |  2320 |   408   (1)| 00:00:05 |
|   1 |  HASH GROUP BY     |           |   145 |  2320 |   408   (1)| 00:00:05 |
|   2 |   TABLE ACCESS FULL| CUSTOMERS | 55500 |   867K|   406   (1)| 00:00:05 |
--------------------------------------------------------------------------------
Identifying Candidate Columns for Column Groups Based on Workload Statistics
Oracle provides some procedures for finding candidate columns for column groups. But this method does not work based on statistics or real data; it looks like it just finds candidate columns from the workload captured in dynamic performance views (like v$sql, v$sql_plan). It means that in this case Oracle does not investigate the real column correlation. Let's see the example below.
SQL> create table t_candidate
2 as
3 select
4 trunc(dbms_random.value(0,25)) p1,
5 trunc(dbms_random.value(0,20)) p2,
6 lpad(rownum,10,'0') padding
7 from
8 all_objects
9 where
10 rownum <= 10000
11 ;
Table created.
SQL>
SQL> begin
2 dbms_stats.gather_table_stats(
3 user,
4 't_candidate',
5 cascade => true,
6 estimate_percent=>null,
7 method_opt=> 'for all columns size 1');
8 end;
9 /
PL/SQL procedure successfully completed.
SQL> Exec DBMS_STATS.SEED_COL_USAGE(null,null,120);
PL/SQL procedure successfully completed.
SQL> select count(*) from t_candidate where p1=19 and p2=14;
COUNT(*)
----------
19
SQL>
SQL> select * from table(dbms_xplan.display_cursor);
PLAN_TABLE_OUTPUT
SQL_ID 9g4vdacy7pc62, child number 0
-------------------------------------
select count(*) from t_candidate where p1=19 and p2=14
Plan hash value: 374408457
--------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |    12 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_CANDIDATE |    20 |   120 |    12   (0)| 00:00:01 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
2 - filter(("P1"=19 AND "P2"=14))
19 rows selected.
SQL> SET LONG 7000
SQL> SET LONGCHUNKSIZE 7000
SQL> SET LINESIZE 500
SQL> Select dbms_stats.report_col_usage('SH','t_candidate') from dual ;
DBMS_STATS.REPORT_COL_USAGE('SH','T_CANDIDATE')
--------------------------------------------------------------------------------
--------------------
LEGEND:
.......
EQ : Used in single table EQuality predicate
RANGE : Used in single table RANGE predicate
LIKE : Used in single table LIKE predicate
NULL : Used in single table is (not) NULL predicate
EQ_JOIN : Used in EQuality JOIN predicate
NONEQ_JOIN : Used in NON EQuality JOIN predicate
FILTER : Used in single table FILTER predicate
JOIN : Used in JOIN predicate
DBMS_STATS.REPORT_COL_USAGE('SH','T_CANDIDATE')
--------------------------------------------------------------------------------
--------------------
GROUP_BY : Used in GROUP BY expression
...............................................................................
###############################################################################
COLUMN USAGE REPORT FOR SH.T_CANDIDATE
......................................
1. P1 : EQ
2. P2 : EQ
3. (P1, P2) : FILTER
DBMS_STATS.REPORT_COL_USAGE('SH','T_CANDIDATE')
SQL> select dbms_stats.create_extended_stats('SH','t_candidate') from dual;
DBMS_STATS.CREATE_EXTENDED_STATS('SH','T_CANDIDATE')
--------------------------------------------------------------------------------
--------------------
###############################################################################
EXTENSIONS FOR SH.T_CANDIDATE
.............................
1. (P1, P2) : SYS_STUIV1F__U9NUVZ7#MDKL81$SY created
###############################################################################
SQL> exec dbms_stats.gather_table_stats('SH','t_candidate',method_opt=>'for all columns size skewonly for columns (p1,p2) size skewonly');
PL/SQL procedure successfully completed.
SQL> select column_name,num_distinct,histogram from user_tab_col_statistics where table_name='T_CANDIDATE';
COLUMN_NAME NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
P1 25 FREQUENCY
P2 20 FREQUENCY
PADDING 10000 HEIGHT BALANCED
SYS_STUIV1F__U9NUVZ7#MDKL81$SY 500 NONE
SQL> select count(*) from t_candidate where p1=19 and p2=14;
COUNT(*)
----------
19
SQL> select * from table(dbms_xplan.display_cursor);
PLAN_TABLE_OUTPUT
------------------------------------------------------------
SQL_ID 9g4vdacy7pc62, child number 0
-------------------------------------
select count(*) from t_candidate where p1=19 and p2=14
Plan hash value: 374408457
--------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |    12 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_CANDIDATE |    20 |   120 |    12   (0)| 00:00:01 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
2 - filter(("P1"=19 AND "P2"=14))
SQL>
So this method does not discover only correlated columns; the result is that the candidate columns can also contain non-correlated/independent columns.
The QO can also use column-group statistics through a composite index, without the column group being explicitly added to the data dictionary. In this case the selectivity will be calculated based on the DISTINCT_KEYS of the index (but I have not fully investigated that). Another question concerns SQL profiles (SQP) and correlated data. Of course, if there is column correlation, you can use a SQL profile if "accept SQL profile" appears as the result of a SQL Tuning Advisor task. A SQL profile is a collection of internal hints (like OPT_ESTIMATE); using an "offline" optimization method, it estimates selectivity/cardinality accurately and gives that information to the "online" optimizer so it can choose the best plan. Finally, note that Oracle's QO still cannot use extended statistics (to estimate the selectivity of correlated columns) for non-equality, range and out-of-bound predicates; maybe such cases need additional statistics (and a gathering method) and will be solved in future releases.