index clustering factor - causing full table scan for small subset of rows

What is the clustering factor? It is an index statistic, calculated when statistics are gathered on the index. The optimizer uses this value to judge how efficient the index is: it indicates the number of I/Os needed to access rows from the table in the order in which they are stored in the index.

How is it calculated? It all comes down to rowids. Rows are located in the index and then fetched from the table using the ROWID stored in each index entry. Walking the index entries in order, if the block number of the current ROWID is the same as the block number of the previous one, the clustering factor is NOT incremented. If the block number differs from the previous one, the clustering factor is incremented. Each increment means that at least one extra visit to another block is needed to get the next row.

Let's take an example.

Table name: CF_TEST
Number of rows: 10M
Number of blocks: 100000

In the ideal case, fetching all the rows via the index requires access to 100000 blocks (the same as for a full table scan, FTS). But as I said, this is the ideal case.
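The counting rule described above can be sketched in a few lines. This is a minimal illustration, not Oracle's internal code: the function name `clustering_factor` and the toy block lists are assumptions, and plain integers stand in for the block part of each ROWID.

```python
from typing import Iterable


def clustering_factor(block_numbers_in_index_order: Iterable[int]) -> int:
    """Count how often the table block changes while walking the index in key order."""
    cf = 0
    previous_block = None
    for block in block_numbers_in_index_order:
        # Increment only when this index entry points to a different block
        # than the previous entry did.
        if block != previous_block:
            cf += 1
        previous_block = block
    return cf


# Toy data (hypothetical): 9 rows, 3 rows per block.
well_clustered = [1, 1, 1, 2, 2, 2, 3, 3, 3]  # table rows stored in key order
scattered = [1, 2, 3, 1, 2, 3, 1, 2, 3]       # consecutive keys live in different blocks

print(clustering_factor(well_clustered))  # 3, the number of blocks
print(clustering_factor(scattered))       # 9, the number of rows
```

The two toy lists mark the two extremes: a CF close to the number of blocks when the table is stored in index order, and a CF close to the number of rows when it is not.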

In the above figure, you can see that fetching the 1st row needs 1 I/O to the table. But the 2nd row is in block 6, so a different block has to be read for it. The data in the figure shows that the table rows are scattered relative to the index order: the index and the table are not in sync in how the data is laid out, which leads to a poor (high) clustering factor. In the worst case, the CF of an index can approach the number of rows. In that case, fetching 100 rows from the table through the index requires reading 100 blocks from the table.

Why do we need to discuss CF in detail? We often see that, even though the WHERE clause selects only 5% of the data from a large table, the optimizer still chooses an FTS over an index scan. This can happen because the cost of selecting via the index is higher than the cost of the FTS. But since we are selecting only 5% of the records, how can the cost be more than selecting 100% of the records via FTS? Because CF is an integral part of the optimizer's cost calculation, and a high CF leads to a high cost. The picture below is an example where the index scan is costlier than the FTS.

Check the cost of the index scan:

Cost (approx) = CF * selectivity = 25052763 * 0.057833 = 1448876.442579

This is only the cost of accessing the rows in the table. If you add the cost of scanning the index itself, you arrive at the total index scan cost circled in red. This justifies the optimizer's decision to choose the FTS over the index scan.
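The arithmetic above is easy to reproduce. The sketch below uses the approximation from this post (CF * selectivity). The helper name `table_access_cost` is an assumption, and so is comparing the result with the raw block count of CF_TEST: that is only a rough proxy for the FTS cost, since it ignores multiblock reads and the index-access part of the cost.

```python
def table_access_cost(clustering_factor: int, selectivity: float) -> float:
    """Approximate table-access part of an index range scan cost: CF * selectivity."""
    return clustering_factor * selectivity


# Figures quoted in the post.
quoted = table_access_cost(25_052_763, 0.057833)
print(round(quoted, 6))  # 1448876.442579

# CF_TEST from the earlier example: 10M rows in 100000 blocks.
# The 5% selectivity is an assumed value for illustration.
CF_TEST_ROWS = 10_000_000
CF_TEST_BLOCKS = 100_000
best_case = table_access_cost(CF_TEST_BLOCKS, 0.05)  # CF equal to the block count
worst_case = table_access_cost(CF_TEST_ROWS, 0.05)   # CF equal to the row count

print(best_case < CF_TEST_BLOCKS)   # True: far fewer block visits than the table has blocks
print(worst_case > CF_TEST_BLOCKS)  # True: more block visits than the table has blocks
```

With a CF near the row count, selecting 5% of the rows through the index touches more table blocks than the table contains, which is how a "small" selection ends up costlier than a full scan.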

How can it be rectified? One simple solution is partitioning, although it needs an extra license and requires the current table and its dependent objects, such as indexes, to be dropped and recreated as partitioned objects. Another way is to reorganize the data in the table to match the index. To do this:

● Create a new table (a copy of the existing table).
● Empty the current table.
● Reinsert the data into the original table with an ORDER BY clause. Mostly this should be done for primary keys, since we know these are unique and can be kept in sorted form.
● Rebuild the indexes (recommended) if you chose DELETE over TRUNCATE.
● Gather statistics for the table with the "cascade => TRUE" option. The idea is to gather the index stats, which recalculates the CF for the index.

Test case for this: we had a case where a DELETE statement was taking 50+ minutes in production, deleting 6M records out of 110M (roughly 5%).

DELETE FROM ADMUSER.udfvalue
WHERE (
       udf_type_id = :"SYS_B_0"
    OR udf_type_id = :"SYS_B_1"
    OR udf_type_id = :"SYS_B_2"
    OR udf_type_id = :"SYS_B_3"
    OR udf_type_id = :"SYS_B_4"
    OR udf_type_id = :"SYS_B_5"
    OR udf_type_id = :"SYS_B_6"
    OR udf_type_id = :"SYS_B_7"
    OR udf_type_id = :"SYS_B_8"
    OR udf_type_id = :"SYS_B_9"
)

● UDF_TYPE_ID is part of the primary key, so there is a unique index defined on it.
● Number of distinct values: 390 (approx)
● Histograms: Frequency
● Skewed data
● Table: 9G
● Number of rows: 110M (approx)
● Table blocks: 1070863
● Index: PK_UDFVALUE
● CF: 23023767 (about 21.5 times the number of table blocks, or roughly 21% of the number of rows)
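A quick way to judge a clustering factor is to compare it with the table's block count (the best case) and its row count (the worst case). The sketch below does that with the figures listed above. The helper name `cf_ratios` is an assumption, and the row count is the approximate 110M quoted in the list.

```python
def cf_ratios(cf: int, table_blocks: int, table_rows: int) -> tuple[float, float]:
    """Return (CF / blocks, CF / rows) for a quick sanity check of an index."""
    return cf / table_blocks, cf / table_rows


# Figures from the list above; the row count is approximate.
blocks_ratio, rows_ratio = cf_ratios(23_023_767, 1_070_863, 110_000_000)

print(f"CF / blocks = {blocks_ratio:.1f}")  # CF / blocks = 21.5
print(f"CF / rows   = {rows_ratio:.0%}")    # CF / rows   = 21%
```

A CF / blocks ratio close to 1 means the table is stored almost in index order. The further the ratio climbs toward rows per block, the more scattered the data is.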

Execution plan chosen by the optimizer:

Estimated time: 48 minutes
Cost: 242K

Since the column has an index defined on it, I tried to force the index. Here is the new plan:

Estimated time: increased to almost 4 hours
Cost: 1161K (about 4.8 times the FTS cost; put the other way, the FTS cost is only about 21% of it)

Remember that the CF is about 21.5 times the number of table blocks, so choosing the FTS over the index range scan makes sense.

I then created a new table with the data ordered on the primary key and created the same index on it. For the new table:

Number of blocks: 1062156
CF for index: 1406980 (close to the number of blocks)

Check the new plan:

The cost has come down from 1161K to 18539. That is a huge improvement.
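The effect of the reorganization can be reproduced on a toy model. This is a simulation under stated assumptions, not Oracle code: a "table" is just a mapping from key to block number, blocks are filled in insert order, and all names (`load_table`, `cf_for`, `ROWS_PER_BLOCK`) are hypothetical.

```python
import random


def clustering_factor(blocks_in_key_order) -> int:
    """Count block changes while walking the index entries in key order."""
    cf, previous = 0, None
    for block in blocks_in_key_order:
        if block != previous:
            cf += 1
        previous = block
    return cf


def load_table(keys_in_insert_order, rows_per_block: int) -> dict:
    """Return {key: block} for a heap table whose blocks fill up in insert order."""
    return {key: position // rows_per_block
            for position, key in enumerate(keys_in_insert_order)}


def cf_for(table: dict) -> int:
    """Clustering factor of an index on the key of the simulated table."""
    return clustering_factor(table[key] for key in sorted(table))


random.seed(42)
ROWS_PER_BLOCK = 100
keys = list(range(10_000))
random.shuffle(keys)  # rows originally arrived in random key order

original = load_table(keys, ROWS_PER_BLOCK)
reorganized = load_table(sorted(keys), ROWS_PER_BLOCK)  # re-inserted "ORDER BY key"

print(cf_for(original))     # close to the number of rows
print(cf_for(reorganized))  # 100, the number of blocks
```

The same 10,000 rows occupy the same 100 blocks in both cases. Only the insert order changes, and with it the clustering factor.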

Note: This approach cannot be applied to every index without first looking in detail at the data, the table, and other factors such as constraints.
