06/16/22 1 Unlocking the Mysteries Behind Update Statistics John F. Miller III

Page 1: Unlocking the Mysteries Behind Update Statistics

04/19/23 1

Unlocking the Mysteries

Behind Update Statistics

John F. Miller III

Page 2: Unlocking the Mysteries Behind Update Statistics


The Dice Problem

• Throw dice, how many will be 1?

Page 3: Unlocking the Mysteries Behind Update Statistics


Questions about the Dice

• How many dice are you throwing?

• How many sides does each die have?

• Are all the dice the same?

The better the information, the more accurate the estimate.

Page 4: Unlocking the Mysteries Behind Update Statistics


What does Update Statistics do?

• Collects information for the optimizer
– Statistics: LOW
– Distributions: MEDIUM and HIGH

• Drops distributions

• Compiles stored procedures

Page 5: Unlocking the Mysteries Behind Update Statistics


Statistics Collected

• systables
– Number of rows
– Number of pages to store the data

• syscolumns
– Second largest value for a column
– Second smallest value for a column

• sysindexes
– Number of unique values for the lead key
– How highly clustered the values for the lead key are

Page 6: Unlocking the Mysteries Behind Update Statistics


Update Statistics Low: Basic Algorithm

• Walk the leaf pages in each index
• Submit btree cleaner requests when deleted items are found, causing re-balancing of indexes
• Collect the following information
– Number of unique items
– Number of leaf pages
– How clustered the data is
– Second highest and lowest value

Page 7: Unlocking the Mysteries Behind Update Statistics


How to Read Distributions

--- DISTRIBUTION ---
   ( -1)
1: ( 868317, 70, 75)
2: ( 868317, 24, 100)
3: ( 868317, 12, 116)
4: ( 868317, 30, 147)
5: ( 868317, 39, 194)
6: ( 868317, 28, 222)
--- OVERFLOW ---
1: ( 779848, 43)
2: ( 462364, 45)

• Each distribution entry lists: # of rows represented in this bin, # of unique values, highest value in this bin.

• Each overflow entry lists: # of rows for this value, the value.

• To get the range of values for a bin, look at the highest value in the previous bin.

Page 8: Unlocking the Mysteries Behind Update Statistics


Example - Approximating a Value

--- DISTRIBUTION ---
   ( -1)
1: ( 868317, 70, 75)
2: ( 868317, 24, 100)
3: ( 868317, 12, 116)
4: ( 868317, 30, 147)
5: ( 868317, 39, 194)
6: ( 868317, 28, 222)
--- OVERFLOW ---
1: ( 779848, 43)
2: ( 462364, 45)

• There are 868317 rows containing a value between -1 and 75

• There are 70 unique values in this range

• The optimizer will deduce 868317 / 70 = about 12,404 records for each value between -1 and 75

Page 9: Unlocking the Mysteries Behind Update Statistics


Example - Dealing with Data Skew

--- DISTRIBUTION ---
   ( -1)
1: ( 868317, 70, 75)
2: ( 868317, 24, 100)
3: ( 868317, 12, 116)
4: ( 868317, 30, 147)
5: ( 868317, 39, 194)
6: ( 868317, 28, 222)
--- OVERFLOW ---
1: ( 779848, 43)
2: ( 462364, 45)

• Data skew

• For the value 43, how many records will the optimizer estimate exist?

• Answer: 779,848 records

• Any value whose row count exceeds 25% of the bin size will be placed in an overflow bin
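
The two readings above (a value inside a range bin, and a skewed value in the overflow list) can be sketched in a few lines of Python. This is only an illustration of how to read the numbers, not the optimizer's actual code; the names `BINS`, `OVERFLOW`, `LOWEST` and `estimate_rows` are made up for the example, and the data is copied from the distribution shown above.

```python
# Range bins copied from the distribution above: (rows, unique values, highest value).
BINS = [
    (868317, 70, 75),
    (868317, 24, 100),
    (868317, 12, 116),
    (868317, 30, 147),
    (868317, 39, 194),
    (868317, 28, 222),
]
# Overflow entries copied from the distribution above: value -> rows.
OVERFLOW = {43: 779848, 45: 462364}
LOWEST = -1  # lowest value, shown before bin 1


def estimate_rows(value):
    """Estimate how many rows hold `value` (hypothetical helper, not an Informix API)."""
    if value in OVERFLOW:
        # Skewed values carry their own exact count.
        return OVERFLOW[value]
    if value < LOWEST:
        return 0
    for rows, unique_values, highest in BINS:
        if value <= highest:
            # Inside a range bin, assume rows are spread evenly over the unique values.
            return rows / unique_values
    return 0  # above the highest value in the last bin


print(int(estimate_rows(10)))  # value in bin 1: 868317 / 70, truncated
print(estimate_rows(43))       # overflow value
```

The even-spread assumption inside a bin is exactly why heavily duplicated values are pulled out into overflow entries: without them, 43 would be estimated at about 12,404 rows instead of 779,848.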

Page 10: Unlocking the Mysteries Behind Update Statistics


Basic Algorithm for Medium and High

• Develop a scan plan based on available resources

• Scan the table
– High = all rows
– Medium = sample of rows

• Sort each column

• Build distributions

• Begin transaction
– Delete old column distributions
– Insert new column distributions

• Commit transaction

Page 11: Unlocking the Mysteries Behind Update Statistics


Scan

• The table is scanned in its entirety for update stats high, while it is only sampled for update stats medium (see Sample Size)

• The reading of rows is done in dirty read isolation, regardless of what the user has set for their transaction level.

Page 12: Unlocking the Mysteries Behind Update Statistics


Scan

• This scan of the table may occur several times depending on the amount of sort memory available and the number of columns to collect statistics about.

• The approximate number of table scans is defined by the (size of the data to sort) / (amount of sort memory)
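
As a rough illustration of that ratio, the sketch below rounds it up to whole scans. The cap at the number of columns is an assumption added here (a column is sorted in a single pass, so there cannot be more passes than columns); the function name `estimated_scans` is made up for the example. The inputs used, 101.4 MB of sort data across 10 columns, are the figures from the jmiller.t9 example output later in this deck.

```python
import math


def estimated_scans(sort_data_mb, sort_memory_mb, column_count):
    """Approximate table scans: (size of data to sort) / (sort memory), rounded up.

    Assumption: never fewer than 1 scan and never more than one scan per column.
    """
    by_memory = math.ceil(sort_data_mb / sort_memory_mb)
    return max(1, min(column_count, by_memory))


for memory_mb in (4.0, 15.0, 106.5):
    print(memory_mb, "MB ->", estimated_scans(101.4, memory_mb, 10), "scans")
```

Under these assumptions the three memory settings give 10, 7 and 1 estimated scans, which agrees with the "Estimated number of table scans" lines in the example output.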

Page 13: Unlocking the Mysteries Behind Update Statistics


Sort

• The rows processed by the scan phase are passed directly to the sort package.

• Each column in the row for which statistics are being generated is passed to a unique invocation of a sort.

Page 14: Unlocking the Mysteries Behind Update Statistics


Build

• After the sort completes, the sorted column data is read to count the duplicates and unique values, creating approximately 200 range bins by default.

• Any value whose duplicate count exceeds 25% of the bin size is placed in an overflow bin.
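
A simplified sketch of this build step, in Python, may help make the two rules concrete. It is not the server's implementation: the function name `build_distribution` and its parameters are invented for the example, a resolution of 0.5% is assumed because it yields the roughly 200 bins mentioned above, and (as a simplification) the rows of one value are never split across two bins.

```python
from collections import Counter


def build_distribution(values, resolution_pct=0.5, overflow_fraction=0.25):
    """Return (bins, overflow) for one column.

    bins:     list of (rows, unique values, highest value)
    overflow: list of (rows, value)
    """
    bin_size = max(1, round(len(values) * resolution_pct / 100))
    counts = sorted(Counter(values).items())  # (value, duplicates), sorted by value
    limit = overflow_fraction * bin_size

    # Heavily duplicated values get their own overflow entry.
    overflow = [(n, v) for v, n in counts if n > limit]
    regular = [(v, n) for v, n in counts if n <= limit]

    bins, rows, unique_values = [], 0, 0
    for value, n in regular:
        rows += n
        unique_values += 1
        if rows >= bin_size:  # bin is full: close it at this value
            bins.append((rows, unique_values, value))
            rows, unique_values = 0, 0
    if rows:  # partially filled last bin
        bins.append((rows, unique_values, regular[-1][0]))
    return bins, overflow


# 100 distinct values plus one value repeated 50 times, with 10% bins.
bins, overflow = build_distribution(list(range(100)) + [500] * 50, resolution_pct=10)
print(bins[0], bins[-1], overflow)
```

With 150 rows and 10% bins, each bin holds 15 rows, so any value with more than 3.75 duplicates is treated as skewed; only the value 500 qualifies.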

Page 15: Unlocking the Mysteries Behind Update Statistics


Insert

• Now we have to delete the old distributions and insert the new distributions. As long as the user was not in a transaction this will be done as its own transaction. This transaction will last for less than 1 second and will hold NO locks on the tables, but locks on the system catalogs while the update occurs.

Page 16: Unlocking the Mysteries Behind Update Statistics


Sample Size

• HIGH
– The entire table is scanned and all rows are used.

• MEDIUM
– A common misconception is that the number of rows sampled is based on the number of rows in the table. This is incorrect.
– The number of samples depends on the confidence and resolution. See the following chart.

Page 17: Unlocking the Mysteries Behind Update Statistics


Update Statistics Medium Sample Size

Resolution   Confidence   Samples
2.5          .95          2,963
2.5          .99          4,273
1.0          .95          18,516
1.0          .99          26,569
0.5          .95          74,064
0.5          .99          106,276
0.25         .95          296,255
0.25         .99          425,104
0.1          .95          1,851,593
0.1          .99          2,656,900
0.05         .95          7,406,375
0.05         .99          10,627,600
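
One pattern is visible in the chart: halving the resolution roughly quadruples the sample count, and the table's row count does not appear anywhere. The Python sketch below reproduces the chart to within about 1% by scaling the resolution 1.0 figures by 1 / resolution². This is an observation fitted to the chart, not the formula the server uses; `SAMPLES_AT_RES_1` and `approx_samples` are names made up for the example.

```python
# Sample counts at resolution 1.0, copied from the chart above.
SAMPLES_AT_RES_1 = {0.95: 18_516, 0.99: 26_569}


def approx_samples(resolution, confidence):
    """Approximate UPDATE STATISTICS MEDIUM sample count from the chart's scaling.

    Assumption: samples grow with 1 / resolution**2; only the two confidence
    levels listed in the chart are supported.
    """
    return round(SAMPLES_AT_RES_1[confidence] / resolution ** 2)


for resolution in (2.5, 1.0, 0.5, 0.25):
    print(resolution, approx_samples(resolution, 0.95), approx_samples(resolution, 0.99))
```

Because the table size is not an input, a 10,000-row table and a 100-million-row table are sampled the same number of times at a given resolution and confidence.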

Page 18: Unlocking the Mysteries Behind Update Statistics


Update Statistics Medium Memory Requirements

Confidence .99

            Resolution
Row Size    2.5       2.0       1.5        1.0
100         .96 MB    1.2 MB    1.8 MB     3.5 MB
200         1.3 MB    1.9 MB    3.0 MB     6.1 MB
300         1.8 MB    2.5 MB    4.2 MB     8.7 MB
400         2.2 MB    3.2 MB    5.3 MB     11.3 MB
500         2.6 MB    3.9 MB    6.4 MB     13.9 MB
600         3.0 MB    4.5 MB    7.6 MB     16.5 MB
700         3.4 MB    5.1 MB    8.7 MB     19.1 MB
800         3.8 MB    5.8 MB    9.9 MB     21.7 MB
900         4.3 MB    6.4 MB    11.1 MB    24.2 MB
1000        4.7 MB    7.1 MB    12.2 MB    26.9 MB

Page 19: Unlocking the Mysteries Behind Update Statistics


Update Statistics High Memory Requirements

• In-memory sort
– Approximate memory = number of rows * sum(column widths + 2 * sizeof(pointer))

Page 20: Unlocking the Mysteries Behind Update Statistics


Memory Rules

• Estimated update statistics memory is below 100 MB
– Hard-coded limit of 4 MB
– Attempts to minimize the scans by fitting as many columns as possible into 4 MB

• Estimated update statistics memory is above 100 MB
– Memory is requested from MGM
– Attempts to minimize the scans by fitting as many columns as possible into the MGM memory

Page 21: Unlocking the Mysteries Behind Update Statistics


Examples

• Customer table

Column      Type
Cust_id     integer
Fname       char(50)
Lname       char(50)
Address1    char(200)
Address2    char(200)
State       char(2)
Zipcode     integer

• Number of rows: 500,000

Page 22: Unlocking the Mysteries Behind Update Statistics


Examples: Memory for In-Core Sort

Column      Data Type    Size         Sort Memory
Cust_id     Integer      4 bytes      2 MB
Fname       Char(50)     50 bytes     25 MB
Lname       Char(50)     50 bytes     25 MB
Address1    Char(200)    200 bytes    100 MB
Address2    Char(200)    200 bytes    100 MB
State       Char(2)      2 bytes      1 MB
Zipcode     Integer      4 bytes      2 MB
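
The sort memory figures in this table can be recomputed from the row count and the column widths. In the Python sketch below, the table's numbers come out when the per-row pointer overhead from the in-memory sort formula is left at zero and 1 MB is taken as 1,000,000 bytes; both of those are assumptions inferred from the figures, and `sort_memory_bytes` is a name made up for the example.

```python
ROWS = 500_000

# Column widths in bytes, from the customer table example.
COLUMN_WIDTHS = {
    "Cust_id": 4,
    "Fname": 50,
    "Lname": 50,
    "Address1": 200,
    "Address2": 200,
    "State": 2,
    "Zipcode": 4,
}


def sort_memory_bytes(rows, width, pointer_size=0):
    """Approximate memory = rows * (column width + 2 * sizeof(pointer)).

    pointer_size defaults to 0 because the table above appears to ignore
    the pointer overhead; pass 4 or 8 to include it.
    """
    return rows * (width + 2 * pointer_size)


for name, width in COLUMN_WIDTHS.items():
    print(f"{name:10} {sort_memory_bytes(ROWS, width) / 1_000_000:6.0f} MB")
```

Including the overhead changes the picture for narrow columns: with 8-byte pointers, the 4-byte Cust_id column needs 10 MB rather than 2 MB.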

Page 23: Unlocking the Mysteries Behind Update Statistics


Examples: Number of Table Scans

PDQPRIORITY 0
Scan #1    Cust_id, State
Scan #2    Fname
Scan #3    Lname
Scan #4    Address1
Scan #5    Address2
Scan #6    ZipCode

PDQPRIORITY 0 with 100 MB of memory
Scan #1    Cust_id, Fname, Lname, State, ZipCode
Scan #2    Address1
Scan #3    Address2
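
Both scan plans can be reproduced with a simple greedy packing, sketched below in Python. Each pass starts with the first column not yet sorted (even if it is larger than the budget), then adds any later column that still fits in the remaining memory. This is one rule that happens to match the plans shown here, under the assumption of a 4 MB budget for the first plan; how the server actually builds its plan is not described in this deck, and `plan_scans` is a made-up name.

```python
# (column, sort memory in MB) in table order, from the in-core sort example.
COLUMNS = [
    ("Cust_id", 2), ("Fname", 25), ("Lname", 25), ("Address1", 100),
    ("Address2", 100), ("State", 1), ("Zipcode", 2),
]


def plan_scans(columns, budget_mb):
    """Group columns into table scans with a greedy fit (illustrative only)."""
    remaining = list(columns)
    passes = []
    while remaining:
        # Seed the pass with the first unsorted column, whatever its size.
        name, size = remaining.pop(0)
        current, used = [name], size
        leftover = []
        for name, size in remaining:
            if used + size <= budget_mb:
                current.append(name)  # fits in what is left of the budget
                used += size
            else:
                leftover.append((name, size))
        remaining = leftover
        passes.append(current)
    return passes


for budget in (4, 100):
    print(f"{budget} MB:")
    for number, names in enumerate(plan_scans(COLUMNS, budget), start=1):
        print(f"  Scan #{number}: {', '.join(names)}")
```

Note how Zipcode ends up in the last scan of the 4 MB plan: after Cust_id and State, only 1 MB is left, and every later pass is seeded by a column that already exceeds the budget.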

Page 24: Unlocking the Mysteries Behind Update Statistics


Confidence

• A factor in the number of samples used by update statistics medium

Page 25: Unlocking the Mysteries Behind Update Statistics


Resolution

• Percentage of data that is represented in a distribution bin

• Example
– 100,000 rows in the table
– Resolution of 2%
– Each bin will represent 2,000 rows
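
The arithmetic behind this example is short enough to show directly. The Python sketch below derives both the rows per bin and the implied number of bins from a resolution percentage; `bin_layout` is a name made up for the example, and treating the bin count as exactly 100 / resolution ignores overflow bins.

```python
def bin_layout(total_rows, resolution_pct):
    """Return (rows per bin, number of range bins) for a resolution in percent."""
    rows_per_bin = total_rows * resolution_pct / 100
    bin_count = 100 / resolution_pct
    return rows_per_bin, bin_count


print(bin_layout(100_000, 2))    # the example above
print(bin_layout(100_000, 0.5))  # a 0.5% resolution gives 200 bins
```

A smaller resolution therefore means more, finer-grained bins, at the cost of a larger distribution to build and store.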

Page 26: Unlocking the Mysteries Behind Update Statistics


Improvements in update statistics in 7.31.UD2

• UPDATE STATISTICS CAN NOT ALLOCATE SORT MEMORY BETWEEN 4 AND 100 MB
– The default has been raised from 4 MB to 15 MB
– User can now configure the amount of memory

• UPDATE STATISTICS USES SLOW SCANNING TECHNOLOGY WHEN SCANNING A TABLE -- ENABLE LIGHT SCANS
– Implemented light scans
– Set-oriented reads

Page 27: Unlocking the Mysteries Behind Update Statistics


Improvements in update statistics in 7.31.UD2

• THE PLAN UPDATE STATISTICS MAKES WHEN SCANNING IS NOT VIEWABLE BY THE DBA
– Set explain will now print the scan path and resource usage

• UPDATE STATISTICS LOW ON FRAGMENTED INDEXES RUNS SERIALLY AND VERY SLOWLY
– With PDQ turned on, each index fragment will be scanned in parallel
– PDQ at 1 means 10% of the index fragments are scanned in parallel, while PDQ at 10 means all the index fragments are scanned in parallel

Page 28: Unlocking the Mysteries Behind Update Statistics


Improvements in update statistics in 7.31.UD2

• ERROR 126 WHEN EXECUTING UPDATE STATISTICS AND STORED PROCEDURE (ALSO ERRORS 312/100)
– Errors occurred when trying to insert the distributions because set lock mode to wait was not handled properly inside update statistics

• SCANNING AN INDEX WHICH IS FRAGMENTED IS SLOW DUE TO THE INEFFICIENT MERGE IN THE FRAGMENT MANAGER
– Binary search is used instead of the previous nested loop merge when ordering index fragments
– Most noticeable when the number of fragments in an index is large

Page 29: Unlocking the Mysteries Behind Update Statistics


Tuning with the New Statistics

• Turn on PDQ when running update statistics, but only for tables
– Avoid PDQ when updating statistics for procedures

• When running high or medium, increase the memory update statistics has to work with

• Enable parallel sorting (i.e. PSORT_NPROCS)

Page 30: Unlocking the Mysteries Behind Update Statistics


Considerations

• Change the RESOLUTION to 1.5
– Increases the number of bins for the distributions
– Increases the sample size for update statistics medium

Page 31: Unlocking the Mysteries Behind Update Statistics


Example

• The following example uses:
– Table size: 215,000 rows
– Row size: 445 bytes
– Uniprocessor

Page 32: Unlocking the Mysteries Behind Update Statistics


Example of the current update statistics

Table: jmiller.t9

Mode: HIGH

Number of Bins: 267 Bin size 1082

Sort data 101.4 MB

Sort memory granted 4.0 MB

Estimated number of table scans 10

PASS #1 c9

PASS #2 c5

PASS #3 c7

PASS #4 c6

…..

PASS #10 c4

Completed pass 1 in 0 minutes 24 seconds

Completed pass 2 in 0 minutes 20 seconds

Completed pass 3 in 0 minutes 17 seconds

Completed pass 4 in 0 minutes 17 seconds

Completed pass 5 in 0 minutes 17 seconds

Completed pass 6 in 0 minutes 15 seconds

Completed pass 7 in 0 minutes 14 seconds

Completed pass 8 in 0 minutes 15 seconds

Completed pass 9 in 0 minutes 16 seconds

Completed pass 10 in 0 minutes 14 seconds

Total Time 146 seconds

Page 33: Unlocking the Mysteries Behind Update Statistics


The new Defaults in 7.31.UD2

Table: jmiller.t9
Mode: HIGH
Number of Bins: 267 Bin size 1082
Sort data 101.4 MB
Sort memory granted 15.0 MB
Estimated number of table scans 7
PASS #1 c9,c8,c10,c5,c7
PASS #2 c6,c1
PASS #3 c3
PASS #4 c2
PASS #5 c4
Completed pass 1 in 0 minutes 34 seconds
Completed pass 2 in 0 minutes 19 seconds
Completed pass 3 in 0 minutes 16 seconds
Completed pass 4 in 0 minutes 14 seconds
Completed pass 5 in 0 minutes 15 seconds
Total Time 98 seconds

• New memory default: the sort memory granted is now 15.0 MB

Page 34: Unlocking the Mysteries Behind Update Statistics


Enabling PDQ with update statistics

Table: jmiller.t9

Mode: HIGH

Number of Bins: 267 Bin size 1082

Sort data 101.4 MB

PDQ memory granted 106.5 MB

Estimated number of table scans 1

PASS #1 c1,c2,c3,c4,c5,c6,c7,c8,c9,c10

Index scans disabled

Light scans enabled

Completed pass 1 in 0 minutes 29 seconds

Total Time 29 seconds

• PDQ memory: 106.5 MB granted

• Features enabled: light scans (index scans disabled)