
Tuning the Optimizer Statistics

> Eric Miner, Senior Engineer, Data Server Technology, [email protected]

There are Two Kinds of Optimizer Statistics

> Table/Index level - describes a table and its index(es)

> Page/row counts, cluster ratios, deleted and forwarded rows

> Some are updated dynamically as DML occurs

• page/row counts, deleted rows, forwarded rows, cluster ratios

> Stored in systabstats

> Column level - describes the data to the optimizer

> Histogram (distribution), density values, default selectivity values

> Static, need to be updated or written directly

> Stored in sysstatistics

> This presentation deals with the column level statistics

What Are These Column Level Statistics Used For?

> The Histogram Values

> Describes the distribution of values in the column

> Belongs to a column, not an index

> Used in costing SARGs

> A step is the point in the column where a value is read to obtain a ‘boundary value’

> A cell represents the rows that fall between two steps

> Each cell has a weight which is the fraction of rows in the column it represents - read as a percentage of all rows

> There are approximately the same number of rows in each cell - except Frequency count cells

Some Quick Definitions

Range cell density: 0.0037264745412389
Total density: 0.3208892191740000
Range selectivity: default used (0.33)
In between selectivity: default used (0.25)

Histogram for column: "A"
Column datatype: integer
Requested step count: 20
Actual step count: 10

Step  Weight      Value
1     0.00000000  <=  154143023
2     0.05263700  <=  154220089
3     0.05245500  <=  171522361
4     0.00000000  <   800000000
5     0.34489399  =   800000000
6     0.04968300  <=  859388217
7     0.00000000  <   860000000

What Are These Column Level Statistics Used For? cont.

> A cell can represent either a single value or multiple values of the column

> Range cell - more than one value

2 0.05263982 <= 31423
3 0.05263316 <= 63045

• All values between 31424 and 63045 are in cell 3

> Frequency count cell - only one value (very accurate)

5 0.00000000 < 170016
6 0.15908001 = 170016
7 0.05263815 <= 201861
8 0.10263316 = 201862
9 0.05264576 <= 317462

• Cells 6 and 8 represent only one value

> Cell (step) 1 represents the NULL values in the column

Statistics On Inner Columns of Composite Indexes

Stats on inner columns of composite indexes

Think of a composite index as a 3D object: columns with statistics are transparent, those without statistics are opaque

> Columns with statistics give the optimizer a clearer picture of an index – sometimes good, sometimes not

> This is a fairly common practice

> Does add maintenance

> update index statistics is the command most commonly used to do this

update index statistics tab_name [ind_name]

Statistics On Inner Columns of Composite Indexes cont.

Index on columns E and B – no statistics on column B

select * from TW4 where E = "yes" and b >= 959789065 and id >= 600000 and F > "May 14, 2002" and A_A = 959000000

Beginning selection of qualifying indexes for table 'TW4', varno = 0, objectid 464004684. The table (Allpages) has 1000000 rows, 24098 pages,

Estimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436.

No statistics available for B, using the default range selectivity to estimate selectivity.

Estimated selectivity for B, selectivity = 0.330000.

The best qualifying index is 'E_B' (indid 7) costing 49264 pages, with an estimate of 191 rows to be returned per scan of the table

FINAL PLAN (total cost = 481960):

varno=0 (TW4) indexid=0 () path=0xfbccc120 pathtype=sclause method=NESTED ITERATION

Table: TW4 scan count 1, logical reads: (regular=24098 apf=0 total=24098), physical reads: (regular=16468 apf=0 total=16468), apf IOs used=0


Statistics are now on column B

Estimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436.

Estimated selectivity for B, selectivity = 0.022199, upper limit = 0.074835.

The best qualifying index is 'E_B' (indid 7) costing 3317 pages, with an estimate of 13 rows to be returned per scan of the table

varno=0 (TW4) indexid=7 (E_B) path=0xfbd1da08 pathtype=sclause method=NESTED ITERATION

Table: TW4 scan count 1, logical reads: (regular=4070 apf=0 total=4070), physical reads: (regular=820 apf=0 total=820),

Statistics On Non-Indexed Columns and Joins

Stats on non-indexed columns

Can’t help with index selection but can affect join ordering

> Columns with statistics give the optimizer a clearer picture of the column – no hard coded assumptions have to be used

> When costing joins of non-indexed columns, having statistics may result in better plans than using the default values

> Without statistics there will be no Total density or histogram that the optimizer can use to cost the column in the join

> Yes, in some circumstances histograms can be used in costing joins – if there is a SARG on the joining column and that column is also in the join table then the SARG from the joining table can be used to filter the join table

> If there is no SARG on the join column or on the joining column the Total density value (with stats) or the default value (w/o stats) will be used

Statistics On Non-Indexed Columns and Joins cont.

SARG example

select .... from TW1, TW4 where TW1.A = TW4.A and TW1.A = 10

Selecting best index for the JOIN CLAUSE:
TW4.A = TW1.A
TW4.A = 10

Estimated selectivity for a, selectivity = 0.003726, upper limit = 0.049683.

> Histogram values used

select .... from TW1, TW4 where TW1.A = TW4.A and TW1.B = 10

Selecting best index for the JOIN CLAUSE:
TW4.A = TW1.A

Estimated selectivity for a, selectivity = 0.320889.

> Total density value used

Statistics On Non-Indexed Columns and Joins - Example

select * from TW1, TW2 where TW1.A = TW2.A and TW1.A = 805975090

> A simple join with a SARG on the join column of one table

> Table TW2 column A has no statistics, TW1 column A does

Selecting best index for the JOIN CLAUSE: (for TW2.A)
TW2.A = TW1.A
TW2.A = 805975090 (inherited from the SARG on TW1)

> But, can’t help… no stats

Estimated selectivity for A, selectivity = 0.100000.

The best qualifying access is a table scan, costing 13384 pages, with an estimate of 50000 rows to be returned per scan of the table, using no data prefetch (size 2K I/O), in data cache 'default data cache' (cacheid 0) with MRU replacement

Join selectivity is 0.100000.

Inherited SARG from other table doesn’t help in this case

Statistics On Non-Indexed Columns and Joins – Example cont.

Without statistics on TW2.A the plan includes a reformat with TW1 as the outer table


varno=0 (TW1) indexid=2 (A_E_F) path=0xfbd46800 pathtype=sclause method=NESTED ITERATION

varno=1 (TW2) indexid=0 () path=0xfbd0bb10 pathtype=join method=REFORMATTING

> Not the best plan – but the optimizer had little to go on


Table TW2 column A now has statistics. The inherited SARG on TW1.A can now be used to help filter the join on TW2.A

Selecting best index for the JOIN CLAUSE:
TW2.A = TW1.A
TW2.A = 805975090

Estimated selectivity for A, selectivity = 0.001447, upper limit = 0.052948.

The best qualifying access is a table scan, costing 13384 pages, with an estimate of 724 rows to be returned per scan of the table, using no data prefetch (size 2K I/O), in data cache 'default data cache' (cacheid 0) with MRU replacement

Join selectivity is 0.001447.
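The two row estimates in the trace output above (50000 and 724) are consistent with multiplying the selectivity by the table's row count. The following is a small Python sketch of that arithmetic, not the optimizer's actual code. It assumes TW2 holds 500,000 rows (inferred from 50000 rows at selectivity 0.100000; the slides do not state the row count) and that fractional estimates are rounded up. The function name is hypothetical.

```python
from decimal import Decimal, ROUND_CEILING

# Assumption: TW2 row count inferred from the trace (50000 rows / 0.100000).
TW2_ROWS = 500_000


def estimated_rows(selectivity: str, table_rows: int) -> int:
    """Rows expected to qualify: selectivity * row count, rounded up (assumed)."""
    exact = Decimal(selectivity) * table_rows
    return int(exact.to_integral_value(rounding=ROUND_CEILING))


without_stats = estimated_rows("0.100000", TW2_ROWS)  # default selectivity
with_stats = estimated_rows("0.001447", TW2_ROWS)     # histogram-based selectivity
print(without_stats, with_stats)
```

Under these assumptions the sketch gives 50000 rows without statistics and 724 rows with them, the same figures the trace reports.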


With statistics on TW2.A reformatting is not used and the join order has changed


varno=1 (TW2) indexid=0 () path=0xfbd0b800 pathtype=sclause method=NESTED ITERATION

varno=0 (TW1) indexid=2 (A_E_F) path=0xfbd46800 pathtype=sclause method=NESTED ITERATION

The Effects of Changing the Number of Steps (Cells)

The Number of Cells (Steps) Affects SARG Costing – As the Number of Steps Changes, Costing Does Too

Cell weights and Range cell density are used in costing SARGs

> Cell weight is used as the column’s ‘upper limit’; Range cell density is used as ‘selectivity’ for Equi-SARGs – as seen in traceon 302 output

> The result of interpolation is used as the column’s ‘selectivity’ for Range SARGs

> Increasing the number of steps narrows the average cell width, thus the weight of Range cells decreases

> Can also result in more Frequency count cells and thus change the Range cell density value

> More cells means more granular cells

The Effects of Changing the Number of Steps (Cells) cont.

Average cell width = # of rows/(# of requested steps –1)

> Table has 1 million rows

> Requested 20 steps: 1,000,000/19 = 52,632 rows per cell

> Requested 200 steps: 1,000,000/199 = 5,025 rows per cell

> What does this mean?

> As you increase the number of steps (cells) they become narrower – representing fewer values

> We’ll see that this has an effect on how the optimizer estimates the cost of a SARG

> update statistics ……. using X values

> create index ….. using X values
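The average cell width formula above can be checked with a few lines of Python. This is only a sketch of the slide's arithmetic; the function name is made up for illustration and rounding to the nearest row is an assumption.

```python
def average_cell_width(row_count: int, requested_steps: int) -> int:
    """Average rows per cell: rows / (requested steps - 1), rounded to nearest."""
    if requested_steps < 2:
        raise ValueError("need at least 2 steps to form one cell")
    return round(row_count / (requested_steps - 1))


for steps in (20, 200):
    print(steps, average_cell_width(1_000_000, steps))
```

For a 1 million row table this gives 52,632 rows per cell at 20 requested steps and 5,025 rows per cell at 200.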


Changing the number of steps – effects on Equi-SARGs

select A from TW2 where B = 842000000

With 20 cells (steps) in the histogram

Range cell density: 0.0012829768785739

9 0.05263200 <= 825569337
10 0.05264200 <= 842084405

> SARG value falls into cell 10

Estimated selectivity for B, selectivity = 0.001283, upper limit = 0.052642.

> ‘selectivity’ is the Range cell density; ‘upper limit’ is the weight of the qualifying cell

With 200 cells (steps) in the histogram

Range cell density: 0.0002303825911991

77 0.00507200 <= 839463989
78 0.00506000 <= 842019895

> SARG value falls into cell 78

Estimated selectivity for B, selectivity = 0.000230, upper limit = 0.005060.

In this case more cells result in a lower estimated selectivity

> Increasing the number of steps has decreased the average width and lowered the Range cell density and the average cell weight.

> Range cell density decreased because Frequency count cells appeared in the histogram
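As a rough illustration of the equi-SARG costing described above, the Python sketch below finds the Range cell that holds the SARG value, then reports the Range cell density as the selectivity and the cell's weight as the upper limit. It is a simplification, not the optimizer's code: it only handles the two-cell excerpts shown on the slides, treats every listed cell as a Range cell, and the function and variable names are hypothetical.

```python
from bisect import bisect_left

# (step number, weight, upper boundary value) for the cells shown on the slides.
HIST_20 = [(9, 0.05263200, 825569337), (10, 0.05264200, 842084405)]
HIST_200 = [(77, 0.00507200, 839463989), (78, 0.00506000, 842019895)]


def cost_equi_sarg(histogram, range_cell_density, value):
    """Return (cell, selectivity, upper_limit) for `column = value`.

    Assumes every listed cell is a Range cell with a '<=' boundary.
    """
    boundaries = [upper for _, _, upper in histogram]
    idx = bisect_left(boundaries, value)  # first cell whose boundary >= value
    if idx == 0 or idx == len(histogram):
        raise ValueError("value lies outside the cells listed here")
    step, weight, _ = histogram[idx]
    return step, range_cell_density, weight


print(cost_equi_sarg(HIST_20, 0.0012829768785739, 842000000))
print(cost_equi_sarg(HIST_200, 0.0002303825911991, 842000000))
```

With the slide values, B = 842000000 lands in cell 10 of the 20-step histogram and cell 78 of the 200-step histogram, and both the selectivity and the upper limit are lower in the 200-step case.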


Changing the number of steps – effects on Range SARGs

select * from TW2 where B between 825570000 and 830000000

With 20 cells (steps) in the histogram

Range cell density: 0.0012829768785739

9 0.05263200 <= 825569337
10 0.05264200 <= 842084405

> SARG values fall into cell 10

Estimated selectivity for B, selectivity = 0.014121, upper limit = 0.052642.

> Here ‘selectivity’ is the product of interpolation, ‘upper limit’ is the weight of the qualifying cell.

> Interpolation estimates how much of cell will qualify for the range SARG
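The interpolation step can be sketched in Python under the assumption that values are spread evenly across the cell (simple linear interpolation). The slides do not spell out the exact method, so treat this as an illustration with a hypothetical function name. The inputs are cell 10 of the 20-step histogram, whose lower boundary is the step 9 value 825569337.

```python
def interpolate_range_sarg(lower_bound, upper_bound, weight, low, high):
    """Fraction of one Range cell covered by `between low and high`, times its weight.

    Assumes values are spread evenly across the cell (linear interpolation).
    """
    if not (lower_bound <= low <= high <= upper_bound):
        raise ValueError("range must lie within the cell")
    fraction = (high - low) / (upper_bound - lower_bound)
    return fraction * weight


# Cell 10 of the 20-step histogram: 825569337 < B <= 842084405, weight 0.052642
selectivity = interpolate_range_sarg(825569337, 842084405, 0.05264200,
                                     825570000, 830000000)
print(f"{selectivity:.6f}")
```

With these inputs the linear estimate comes out at 0.014121, the same selectivity the trace reports for this query, while the upper limit stays at the full cell weight of 0.052642.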


select * from TW2 where B between 825570000 and 830000000

With 200 cells (steps) in the histogram

Range cell density: 0.0002303825911991

67 0.00505200 <= 825505843
68 0.00503000 <= 825570611
69 0.00508000 <= 825635378
70 0.00504000 <= 825690418
71 0.00506400 <= 825702450
72 0.00503200 <= 825767218
73 0.00510200 <= 825831945
74 0.00425800 <= 825833785
75 0.00598400 <= 839462921

Estimated selectivity for B, selectivity = 0.029624, upper limit = 0.034606.

> The SARG values now span multiple cells

> Interpolation estimates how much of cells 68 and 75 to use, since not all of those two cells qualify

Some Statistics Related FAQs cont.

How many steps should I request?

> It will depend on your data and your queries

> Increase requested steps to get Frequency count cells when there are highly duplicated values

> A Frequency count (FC) cell represents only one value - very accurate weight

> Range SARGs will estimate what portion of a cell qualifies for the SARG

> More cells means narrower cells (represent fewer values)

> Narrower cells mean more accurate estimates

> Can have an effect on equi-SARGs - lower selectivity

Removing Statistics Can Affect Query Plans

Sometimes having no statistics is better than having them

This will usually be an issue when very dense columns are involved

Histogram for column: "E"

Step  Weight      Value
1     0.00000000  <  "no"
2     0.47256401  =  "no"
3     0.00000000  <  "yes"
4     0.52743602  =  "yes"

This can also show up when you have ‘spikes’ (Frequency count cells) in the distribution

Removing Statistics Can Affect Query Plans cont.

select count(*) from TW4 where E = "yes" and C = 825765940

The table…has 1000000 rows, 24098 pages,

Estimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436.

Estimating selectivity of index 'E_AA_B', indid 6
scan selectivity 0.52743602, filter selectivity 0.527436
527436 rows, 174107 pages

The best qualifying index is 'E_AA_B' (indid 6) costing 174107 pages, with an estimate of 526 rows

FROM TABLE
TW4
Nested iteration.
Table Scan.

Removing Statistics Can Affect Query Plans cont.

delete statistics TW4(E)

Estimated selectivity for E, selectivity = 0.100000.

Estimating selectivity of index 'E_AA_B', indid 6
scan selectivity 0.100000, filter selectivity 0.100000
100000 rows, 20584 pages

The best qualifying index is 'E_AA_B' (indid 6) costing 20584 pages, with an estimate of 92 rows

FROM TABLE
TW4
Nested iteration.
Index : E_AA_B
Forward scan.
Positioning by key.

Maintaining Tuned Statistics

Tuned statistics will add to your maintenance

Any statistical value you write to sysstatistics either via optdiag or sp_modifystats will be overwritten by update statistics

> Keep optdiag input files for reuse

> If needed get an optdiag output file, edit it and read it in

> Keep scripts that run sp_modifystats

> Rewrite tuned statistics after running update statistics that affects the column with the modified statistics

Monitoring Table/Index Level Fragmentation Using The Statistics

Can Be Both An Optimizer and Space Management Concern

The more fragmentation the less efficient page reads are

> Deleted rows – fewer rows per page, affects costing

> Forwarded rows – 2 I/O each, optimizer adds to costing

> Empty data/leaf pages – more reads may be necessary

> Clustering can get worse

> Watch the data page cluster ratio (DPCR) of the table or APL clustered index

> In general the Cluster Ratios are not a good indicator of fragmentation since they are often normally low

> Use optdiag outputs to monitor these values

Monitoring Table/Index Level Fragmentation Using The Statistics cont.

> In ASE 12.0 and above, check the ‘Space utilization’ value

> ‘Large I/O efficiency’ is another value to watch

Empty data page count: 0
Forwarded row count: 0.0000000000000000
Deleted row count: 0.0000000000000000

Derived statistics:
Data page cluster ratio: 0.9994653835872761
Space utilization: 0.9403543288085808
Large I/O efficiency: 1.0000000000000000

> ‘Space utilization’ and ‘Large I/O efficiency’ are not used by the optimizer

> The further from 1 the more fragmentation there is

Maintaining the Statistics

When data changes the statistics become out of date

In general up to date statistics are needed to get the best query plans

> Statistics are usually updated using update statistics commands

> The more statistics you have the more maintenance

> It’s a trade off between the gain in query performance and the increased statistics maintenance

> There’s no point in updating statistics if the table is static

Update Statistics

> Update statistics has been extended to allow for placement of statistics on columns

> update statistics table_name (col_name)

> update index statistics table_name [ind_name]

> update all statistics table_name

> Specify the requested number of steps (cells) to use when building the column’s histogram

> update statistics table_name (col_name) using 200 values

How Update Statistics Works

Column and table/index values have to be read in order to gather the statistics

> What does it do?

> Reads the column to gather information for density and histogram, writes the column level statistics

> While reading the column it gathers index/table level statistics – row & page count, forwarded rows, deleted rows, the cluster ratios, etc.

> Takes a sample value every X rows for a histogram boundary value - (based on the number of rows and requested steps)

• If the same value is read for multiple steps, save it to make a Frequency count (FC) cell
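The boundary-sampling idea described above can be sketched in Python: walk the sorted column, read a value every N rows, and mark a value as a Frequency count candidate when it shows up as the boundary for more than one step. This is a much simplified model, not ASE's algorithm; it ignores weights and NULL handling, and the function name and sample data are invented for illustration.

```python
from collections import Counter


def build_steps(sorted_values, requested_steps):
    """Pick a boundary value every N rows; repeated boundaries become frequency cells."""
    if requested_steps < 2:
        raise ValueError("need at least 2 steps")
    rows_per_cell = max(1, len(sorted_values) // (requested_steps - 1))
    # Read the value at the end of each run of `rows_per_cell` rows.
    boundaries = sorted_values[rows_per_cell - 1::rows_per_cell]
    repeats = Counter(boundaries)
    steps = []
    for value in sorted(repeats):
        kind = "frequency" if repeats[value] > 1 else "range"
        steps.append((value, kind))
    return steps


column = [1] * 10 + list(range(2, 12))  # 20 sorted rows, value 1 is highly duplicated
print(build_steps(column, requested_steps=5))
# [(1, 'frequency'), (6, 'range'), (11, 'range')]
```

In the sample data the value 1 fills two whole steps, so it is reported as a frequency cell, while 6 and 11 remain ordinary range boundaries.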

How Update Statistics Works cont.

> Values have to be in sorted order for statistics gathering

> If it’s the leading column of an index no sort is necessary

> Just scan index leaf for statistics

> If not the leading column of an index - create a worktable, read values in, sort and scan for statistics

update statistics tab_name (col_name) - a table scan will be done to read the column

update index statistics tab_name ind_name - only an index scan will be done (with a sort of the inner columns)

> The sort is done in a worktable in tempdb.

update index and update all statistics will use a lot of tempdb space unless sampling is used

Some Statistics Related Myths & Legends

“Update statistics will result in improved performance”

> Only guarantees up to date statistics

> Due to the data distribution, statistics may not give a ‘pretty’ picture of the column

“Always use update all statistics”

> Rarely need statistics on all columns of a table

> Can take a VERY long time to run, makes maintenance difficult at best

> Should consider adding stats to composite index columns

Statistics Tools

> Some useful tools for working with the statistics

> Some are by Sybase, some are by users

> Optdiag - read, write and simulate the statistics

> Well known and documented

> sp_modifystats - make modifications to density values (more functionality coming soon - 11.9.2.4, 12.0.0.4, 12.5)

> sp__optdiag (that’s a double underscore) -

> by Kevin Sherlock

> Displays the statistics à la optdiag output - very handy

> http://www.sypron.nl/download.html

Sampling for Update Statistics

A new feature in 12.5.0.3

Designed to dramatically reduce the time it takes to run update statistics

> ‘Opens’ your maintenance window

> Decreases the cost of using ASE

Randomly selected pages are read instead of reading all pages to gather the column level statistics – less I/O

> The percentage of pages to be sampled can be specified

update statistics tab_name with sampling = X percent

> X is the percentage of pages you want to sample

• Can be between 1 and 100

Definitions

> Column Level Statistics – those statistics that describe the values in the column to the optimizer – an attribute of a column (i.e., the histogram and density values)

> Sampling – randomly reading rows from a specified percentage (subset) of pages rather than all pages of the table in order to gather column level statistics

> Sampling Rate – the specified percentage of pages to read

> Full Scan – to gather statistics by reading all pages of the object (table or index)

> Major Attribute of an Index – the ‘leading’ column of an index as listed in the create index command

Sampling for Update Statistics cont.

> Unofficial tests show that a sampling rate of 10% on a 1 million row numeric column reduces the time for update statistics to run from 9 minutes to 30 seconds

Sampling for Update Statistics cont.

> The Resulting histogram will be based on the values that are sampled

• It will differ from a histogram obtained from a ‘full scan’ update statistics

• The lower the specified percentage of sampling the more the histogram will differ from a full scan histogram

• Test your queries against sampled statistics. In most cases you won’t see any major changes

• Density values are not updated by sampling

> In most cases this won’t be an issue.

Why Sampling for Update Statistics?

> As datasets have grown the time it takes to run update statistics has also grown – Dramatically!!

> This became more of an issue with ‘update index statistics’ introduced in 11.9.x due to extra sort in worktable

> TCO and auto-tuning/admin require a faster way to run update statistics

> Without a faster update statistics neither effort would succeed

> Speeding up update statistics is a long-standing customer feature request

> Random page sampling is the most I/O efficient method

> Dramatically decreased the time to run update statistics

Why Sampling for Update Statistics? cont.

> Some time test results – Not official, not for general release

> ‘your mileage may vary’

> Timings are from tests run by Sybase QA

> 1 million row int column – timings based on ‘elapsed time’

20% sampling rate –
Full scan time: 2465850
Sampling time: 398783
Percentage of savings time (elapsed time): 83%

10% sampling rate –
Full scan time: 2139013
Sampling time: 153130
Percentage of savings time (elapsed time): 92%

> Variations in full scan time are taken into account
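The savings percentages above can be reproduced from the two elapsed times. The Python sketch below assumes the percentage is truncated to a whole number rather than rounded, which is what matches the slide's 83% and 92%. The slide does not give the unit of the timings, and the function name is hypothetical.

```python
def percent_saved(full_scan_time: int, sampling_time: int) -> int:
    """Whole-number percentage of elapsed time saved, truncated (not rounded)."""
    return (full_scan_time - sampling_time) * 100 // full_scan_time


print(percent_saved(2465850, 398783))  # 20% sampling rate
print(percent_saved(2139013, 153130))  # 10% sampling rate
```

Integer arithmetic is used so the truncation is explicit and free of floating-point rounding.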

How Does It Work?

> Specify the percentage of pages to read via update statistics

> ‘with sampling = X percent’

• Percent value can be between 1 and 100

> ‘with’ extensions must follow ‘using’

• with sampling = X percent and/or with consumers = X must follow using X values

update statistics authors(auth_id) using 40 values with sampling = 10 percent

> Sampling reads all rows from each page read

> Row values are moved to the worktable to be sorted and the statistics gathered

> This saves tempdb space since the sampled sets of values are smaller than if the whole column was read into the worktable
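To make the mechanism above concrete, here is a small Python simulation of page sampling: pick a random percentage of pages, read every row on each chosen page into a worktable, and sort it. It only models the idea on an in-memory list of pages. The table layout, function name and fixed random seed are assumptions made for the sketch and say nothing about how ASE picks pages internally.

```python
import random


def sample_column(pages, percent, seed=0):
    """Read every row from a random `percent` of pages and return the sorted values."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    page_count = max(1, len(pages) * percent // 100)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    chosen = rng.sample(range(len(pages)), page_count)
    worktable = [value for page_no in chosen for value in pages[page_no]]
    worktable.sort()  # statistics are gathered from sorted values
    return worktable


# Hypothetical table: 100 pages with 10 integer rows each.
pages = [[page_no * 10 + row for row in range(10)] for page_no in range(100)]
sampled = sample_column(pages, percent=10)
print(len(sampled), "of", sum(len(p) for p in pages), "rows read")
```

At a 10 percent sampling rate the worktable holds 100 of the 1,000 rows, which is why the sort needs less tempdb space than a full read of the column.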

How Does It Work? cont.

Specific update statistics syntax and its effects

update statistics table_name [index_name] with sampling = X percent

> Will full scan index pages to update/create statistics on the major attribute(s) of the specified index or all indexes on the table, ignoring the specified sampling rate – sampling will not be done


update index statistics tab_name [ind_name] with sampling = X percent

> Will full scan index pages to update/create statistics for the major attribute(s) of the indexes or specified index on the table, ignoring sampling.

> For minor index attributes will use sampling to scan the requested percentage of pages, read those values into a worktable, sort and gather statistics from there.

> The space used in tempdb will decrease as the sampling rate decreases

update statistics tab_name (col_name) with sampling = X percent

> Will use sampling to update/create statistics for the specified column using the specified sampling rate. This applies to all columns whether major attributes of an index or not

> Will not affect multi-column density values


update all statistics table_name with sampling = X percent

> Will full scan index pages to gather statistics for the major attribute of all indexes – will not use sampling on these columns

> Will use sampling to gather statistics for all columns that are not the major attribute of an index

> The space used in tempdb will decrease as the sampling rate decreases


> Sampling is not used for create index

> Since a full scan is required to build an index there is no additional cost for building the statistics

Trade Offs

> A sampled set of anything is not as accurate as examining the whole set; testing is needed to find the most effective sampling rate for a given dataset

> A histogram created with sampling is not likely to match a histogram created via a full scan

> Histogram boundary values will vary

> Cell weights will vary

> Minimum and maximum histogram boundary values will vary

> Since cell weight(s) and Range cell density are used to cost all SARGs, a histogram from a sampled set will have an effect on SARG costing

> Variations in the upper and lower histogram values may result in ‘out of bounds costing’ by the optimizer

> The smaller the sampling rate the greater the variance is likely to be

Trade Offs cont.

> If there are existing density values they will not be overwritten. If there are no density values a default value of 0.100000 will be used for both Range cell and Total density values

> There is currently no information saved about the use of sampling (whether or not it was used and the sampling rate)

> Different cell types may appear

> As the sampling rate decreases it is possible that Frequency count and/or Range cells may appear where they didn’t exist prior to sampling

> The same pages will be resampled if the dataset is static and the same sampling rate used

Examples of Variations in the Histogram

Full scan histogram -

Step Weight Value

1 0.00000000 <= 154218543

2 0.05315000 <= 805909305

3 0.05305000 <= 808793353

4 0.05311000 <= 822687028

5 0.05304000 <= 825700873

6 0.05314000 <= 839464505

7 0.05292000 <= 842544649

8 0.05305000 <= 858863369

<edited>

20 0.04621000 <= 960051465

> Note boundary values, cell weights and the upper and lower boundary values

> Variations within the histogram are the main issue that needs to be tested

Examples of Variations in the Histogram cont.

10% sampled histogram -

Step Weight Value

1 0.00000000 <= 154218799

2 0.05253968 <= 805909300

3 0.05253968 <= 808728585

4 0.05253968 <= 822686772

5 0.05269841 <= 825636617

6 0.05349206 <= 839464498

7 0.05253968 <= 842543113

8 0.05253968 <= 858797321

<edited>

20 0.04888889 <= 960050979

> Note variations in the boundary values, cell weights and the upper and lower boundary values

Tuning and Troubleshooting

> “Trial-and-error” testing/tuning will need to be done to determine the optimal sampling rate for a given dataset

> In most cases variations in the statistics will have no effect

> In other cases small variations may change query plans

> There is no ‘rule of thumb’ on what sampling rate to use

> In some cases the same sampling rate may be fine across all or most tables/columns.

> In some cases sampling may not result in efficient plans

> Use showplan and traceon 302/310 outputs to track changes to the query plan as the sampling rate changes

> Using sample queries, get the above outputs with statistics gathered by a full scan. Then update statistics with the sampling rate, rerun the queries and compare the outputs

Tuning and Troubleshooting cont.

> Use optdiag to monitor changes to the histogram

> Check the optdiag output of the full scan histogram for the upper and lower boundary values; these can be inserted into the histogram if needed

> Keep a copy of the optdiag output file as a backup of the statistics in case old values need to be reloaded

Future Enhancements

> This first implementation of sampling will require some enhancements

> ‘Scale’ density values gathered by sampling so that they are more accurate

> Track the min/max values in the column in order to maintain the upper and lower boundary values of the histogram

> Sampling index pages

> Will help decrease the time of running update statistics even further

> Add a mechanism to record if sampling was used and what sampling rate was last used

> Add this information to optdiag and traceon 302 (and future optimizer diagnostics)

Where To Get More Information

> The Sybase Customer newsgroups

> http://support.sybase.com/newsgroups

> The Sybase list server

> [email protected]

> The external Sybase FAQ

> http://www.isug.com/Sybase_FAQ/

> Join the ISUG, ISUG Technical Journal, feature requests

> http://www.isug.com

Where To Get More Information

> The latest Performance and Tuning Guide

> Don’t be put off by the ASE 12.0 in the title, it covers the 11.9.2 features/functionality too

> http://sybooks.sybase.com/onlinebooks/group-as/asg1200e

> Any “What’s New” docs for a new ASE release

> Tech Docs at Sybase Support

> http://techinfo.sybase.com/css/techinfo.nsf/Home

> Upgrade/Migration help page

> http://www.sybase.com/support/techdocs/migration

Sybase Developer Network (SDN)

Additional Resources for Developers/DBAs

> Single point of access to developer software, services, and up-to-date technical information:

> White papers and documentation

> Collaboration with other developers and Sybase engineers

> Code samples and beta programs

> Technical recordings

> Free software
