
IBM® DB2® for Linux®, UNIX®, and Windows®

Best Practices Data Life Cycle Management

Christopher Tsounis

Executive IT Specialist

Information Management Technical Sales

Enzo Cialini

Senior Technical Staff Member

DB2 Data Server Development

Last updated: 2009-10-23

IBM®


Executive Summary
Introduction
Partitioning techniques
  What is database partitioning?
  What is table partitioning?
Multi-dimensional clustering
  Features of MDC that benefit roll-in and roll-out of data
Using database partitioning, table partitioning and multi-dimensional clustering in the same database design
Additional techniques to support life cycle management
  Large table spaces
  SET INTEGRITY operation
  Asynchronous index cleanup
Designing and implementing your table partitioning strategy
  Design best practices
  Maximizing the benefits of partition elimination
  Operational considerations
Rolling in data: Which solution to use?
Best practices for roll-in of compressed table partitions
Best practice for roll-in and roll-out with continuous updates
After roll-out: How to manage data growth and retention?
  Using UNION ALL views
  Using IBM Optim Data Growth Solution
Best Practices
Conclusion
Further reading
  Contributors
Notices
  Trademarks


Executive Summary

Today's database applications frequently require scalability and rapid roll-in and roll-out of data with minimal disruption to data access by applications. Roll-in of data refers to the addition of new data as it becomes available, while roll-out refers to moving out (usually archiving) historic data. Many applications today are accessed 24x7, eliminating the batch window previously available for data updates. Many applications also require a continuous feed of data updates while the data is accessed concurrently.

The DB2 database system provides a variety of facilities that enable scalability and

facilitate the continuous feed or roll-in and roll-out of data, with minimal interruption of

data access. This document recommends best practices to design and implement these

DB2 facilities to achieve these goals.


Introduction

This paper describes the best DB2 design practices to facilitate the life-cycle management of DB2 data. Life-cycle management is the efficient addition (roll-in) of new data and the archival (roll-out) of data no longer required in the main database. The DB2 database system provides the following features that you can use in combination to facilitate life-cycle management:

• Database partitioning

• Table partitioning

• Multi-dimensional clustering

• UNION ALL views

In addition to these DB2 features, the IBM Optim Data Growth solution facilitates

archiving for data life cycle management.

An important benefit of DB2 database system partitioning facilities is the ability to

deploy and modify these facilities without impacting existing application code.

This paper is part of a family of related best practice papers; you would also benefit from reading the following best practice papers:

• Physical Database Design

• Minimizing Planned Outages

• Row Compression

The target audience for this paper is personnel responsible for database design for DB2

applications. Database personnel who want to achieve scalability and efficient life cycle

management of data should also find it valuable. This paper assumes you have moderate

experience in designing DB2 databases.

This paper is based on the facilities available in DB2 Version 9.5 and DB2 Version 9.7.

Subsequent releases of the DB2 database system might provide enhancements that alter

the best practices recommendations in this document.


Partitioning techniques

What is database partitioning?

Database partitioning (provided by the Database Partitioning Feature, or DPF) distributes data across logical nodes of the database by using a key-hashing algorithm. The goal of database partitioning is to maximize scalability by distributing data evenly across clusters of computers. Database partitioning further enhances scalability by reducing the granularity of DB2 utility operations, and it parallelizes query and update operations on the database.

The following example demonstrates how to specify database partitioning:

CREATE TABLE Test
  (Account_Number INTEGER,
   Trade_date DATE)
DISTRIBUTE BY HASH (Account_Number)

Note: In DB2 Version 9.1, the PARTITIONING KEY clause was renamed to DISTRIBUTE BY.

Database partitioning is completely transparent, so it does not impact existing

application code. Also, you can modify partitioning online using the redistribution

utility, without affecting application code.

When you design your database partitioning strategy, use a partitioning key column

with high cardinality to help ensure even distribution of data across logical nodes. A

column with high cardinality has many unique values (rather than most values being the

same). Also, unique indexes must be a superset of the partitioning key.

Try to use the same partitioning key on tables that get joined. This increases the

collocation of joins.
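A minimal sketch of this collocation guideline follows (table and column names are hypothetical; because both tables hash on the join column, joins on that column are collocated):

```sql
CREATE TABLE Account
  (Account_Number INTEGER NOT NULL,
   Branch_Id      INTEGER)
DISTRIBUTE BY HASH (Account_Number);

CREATE TABLE Trade
  (Account_Number INTEGER NOT NULL,
   Trade_date     DATE)
DISTRIBUTE BY HASH (Account_Number);

-- A join on Account_Number can be resolved within each database
-- partition, avoiding cross-partition data shipping:
SELECT t.Trade_date, a.Branch_Id
FROM Trade t JOIN Account a
  ON t.Account_Number = a.Account_Number;
```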

What is table partitioning?

Table partitioning (frequently called range partitioning) splits data by specific ranges of key values over one or more physical objects within a logical database partition. The goal of table partitioning is to organize data to facilitate optimal data access and the roll-out of data. Table partitioning might also facilitate the roll-in of data for certain applications; however, multi-dimensional clustering (discussed in the next section, "Multi-dimensional clustering") is often a better choice to enhance roll-in. Database partitioning remains the best practice for reducing the granularity of utility operations for the scalability of very large databases.

Table partitioning has the following benefits:


• Improved query performance through the elimination of irrelevant partitions. The optimizer can limit SQL access to only the partitions relevant to the WHERE clause.

• Optimized roll-in and roll-out processing of ranges. Table partitioning allows

easy addition and removal of table partitions with no data movement.

Applications can perform read and write operations against older data when

partitions are added (queries are drained for a brief period).

• Maintained compression ratios across data that changes over time. Each table

partition has its own compression dictionary. Thus, compressed data in older

partitions is not affected by the changing characteristics of newly inserted data.

• Optimized management of very large tables. Table-partitioned tables can be

virtually unlimited in size, because the limits are per partition (not per table).

You can place ranges of data across multiple table spaces to facilitate the backup

and restore of this data.

• Greater flexibility of index placement in SMS table spaces. You can store indexes

in separate SMS large table spaces (supported for table-partitioned tables only).

Separate index placement into DMS table spaces is available for all tables.

DB2 Version 9.7 enhances table partitioning with the ability to create partitioned indexes, which are stored locally with the table partitions. The benefits of partitioned indexes are:

• Avoiding the overhead of maintaining global indexes during SET INTEGRITY processing when attaching table partitions

• Avoiding asynchronous index cleanup when detaching table partitions

• Improving the performance of reorganize-by-partition operations

• Potentially improving query performance by reducing the cost of index processing, due to more compact indexes

The following example demonstrates specifying table partitioning:

CREATE TABLE Test
  (Account_Number INTEGER,
   Trade_date DATE)
IN ts1, ts2, ts3
PARTITION BY RANGE (Trade_date)
  (STARTING '1/1/2000' ENDING '3/31/2000',
   STARTING '4/1/2000' ENDING '6/30/2000',
   STARTING '7/1/2000' ENDING '9/30/2000')

The following example demonstrates the creation of a partitioned table with a partitioned

index. Partitioned indexes are created by default in DB2 Version 9.7 whenever possible:

CREATE TABLE T1 (I1 INTEGER, I2 INTEGER)
  PARTITION BY RANGE (I1) (STARTING (1) ENDING (10) EVERY (5));

CREATE INDEX IND1 ON T1(I1) PARTITIONED;

The following example demonstrates the creation of a non-partitioned index:

CREATE INDEX IND2 ON T1(I2) NOT PARTITIONED;

There are many additional techniques available to specify how a table is partitioned that

are described in your DB2 documentation.

Multi-dimensional clustering

Multi-dimensional clustering (MDC) is a unique capability available only with the DB2 database system. MDC organizes data in a table by multiple key values (cells). The goal of MDC is to facilitate access to data by using multiple dimensions, keeping data access to only the relevant cells. MDC helps to ensure that the data is always clustered by its dimensions, avoiding the need for data reorganization (MDC is designed to keep data in order).

MDC also uses block indexes on each dimension (and on the combined dimensions) rather than row ID (RID) indexes. This can substantially reduce index size and the number of index levels. For example, if 100 rows fit into a DB2 cell, the block index points only to the cell rather than to each of the 100 rows. This reduces I/O for reading and updating data (the index is updated only when the block is full).

MDC facilitates the roll-in and roll-out of data and is completely transparent to applications.

The following example demonstrates how to specify multi-dimensional clustering:

CREATE TABLE Orders
  (Account_Number INTEGER,
   Trade_Date DATE,
   Region CHAR(10),
   Order_Month INTEGER GENERATED ALWAYS AS (MONTH(Trade_Date)))
IN ts1
ORGANIZE BY DIMENSIONS (Region, Order_Month)

When designing your MDC strategy, specify low-cardinality columns to avoid sparsely

populated cells. Sparsely populated cells can significantly increase disk space usage. A

column with low cardinality is likely to have many values that are the same (rather than

many unique values). You can also use a generated column to produce a highly clustered dimension. For example, a generated column using a built-in function can convert a date into a month, which reduces the cardinality significantly (for a year of data, from 365 to 12).

Features of MDC that benefit roll-in and roll-out of data

MDC is designed to maintain clustering in all dimensions, avoiding the need for data reorganization. This can greatly reduce I/O during the roll-in process (MDC does use sequential big-block I/O). Also, because indexes on MDC dimensions are block indexes, MDC avoids excessive index I/O during roll-in. Block indexes are smaller and shallower than normal RID-based indexes, because the index entries point to a block rather than to individual rows.

Also, during the roll-in process, MDC reduces index maintenance because the block

index is only updated once when the block is full (not for each row inserted as with other

indexes). This also helps to reduce I/O.

INSERT statements run faster when you use MDC, because MDC reuses existing empty

blocks without the need for page splitting. Locking is also reduced for inserts because

they occur at a block level rather than a row level.

MDC improves the roll-out of data, because entire pages are deleted rather than each

row. Logging is also reduced with MDC deletes (just a few bytes per page).

Use a single-column MDC design to facilitate roll-in and roll-out and minimize an

increase in disk space usage.
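A minimal sketch of such a single-column design (table and column names are hypothetical; the generated month column keeps the single dimension's cardinality low):

```sql
CREATE TABLE Sales_Fact
  (Trade_Date  DATE,
   Amount      DECIMAL(12,2),
   Sales_Month INTEGER GENERATED ALWAYS AS (MONTH(Trade_Date)))
ORGANIZE BY DIMENSIONS (Sales_Month);

-- Roll-out removes whole blocks for the cell rather than
-- individual rows, with reduced logging:
DELETE FROM Sales_Fact WHERE Sales_Month = 1;
```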

See the section called “Best practice for roll-in and roll-out with continuous updates” for

a hypothetical application with characteristics that benefit from using MDC for rolling in

data.


Using database partitioning, table partitioning and multi-dimensional clustering in the same database design

The best practice approach for deploying large-scale applications is to implement database partitioning, table partitioning, and MDC simultaneously in the same database design. Database partitioning provides scalability and helps ensure the even distribution of data across logical partitions; table partitioning facilitates query partition elimination and roll-out of data; and MDC improves query performance and facilitates the roll-in of data.

For example:

CREATE TABLE Test
  (A INT, B INT, C INT, D INT …)
IN TablespaceA, TablespaceB, TablespaceC …
INDEX IN TablespaceB
DISTRIBUTE BY HASH (A)
PARTITION BY RANGE (B) (STARTING FROM (100) ENDING (300) EVERY (100))
ORGANIZE BY DIMENSIONS (C, D)

Table partitioning may not fully solve scaling issues in DB2. Continue to use database partitioning to address scalability for large-scale data warehouses. DB2 database partitioning and its shared-nothing architecture are the best way to provide linear scaling of your application while minimizing software bottlenecks.


Additional techniques to support life cycle management

Large table spaces

Using large table spaces (the default in DB2 Version 9.1) better accommodates larger tables and indexes, and allows more rows per page within the DB2 server.

Use large table spaces for tables that use deep compression (many rows per page) and for global indexes on partitioned tables that are expected to exceed 64 GB with a 4 KB page size. If neither applies, large table spaces are not required. You can also avoid the need for large table spaces by placing each global index into a separate table space (highly recommended). Local partitioned indexes further reduce the need for large table spaces.

The following table compares the space available for a regular table space (4-byte RIDs) and a large table space (6-byte RIDs), in terms of maximum table space size and records per page at various page sizes. Each records entry shows maximum records per page / minimum record length:

Page size   Regular max size   Regular recs/min len   Large max size   Large recs/min len
4 KB        64 GB              251 / 14               2 TB             287 / 12
8 KB        128 GB             253 / 30               4 TB             580 / 12
16 KB       256 GB             254 / 62               8 TB             1165 / 12
32 KB       512 GB             253 / 127              16 TB            2335 / 12

Note: If you alter a table space to LARGE, the change does not take effect until all indexes for the tables in that table space have been reorganized.


SET INTEGRITY operation

Running SET INTEGRITY is required when you attach a new partition to a table and when you detach a partition from a table that has a materialized query table (MQT). (Note that data in the new partition is not visible until the SET INTEGRITY process completes.) SET INTEGRITY is a potentially long-running operation that validates data and maintains global indexes. This maintenance activity is logged and might produce a large volume of log entries. DB2 Version 9.7 supports partitioned indexes that can be created prior to attaching a new partition; this greatly reduces the time required for the SET INTEGRITY operation.

The key benefit of SET INTEGRITY is that existing data remains available for read and write access during its operation. You can minimize the impact of SET INTEGRITY for large volumes of data by using MDC, implementing partitioned indexes, and minimizing your use of global indexes and MQTs. User-maintained MQTs are an alternative that you can specify to speed up SET INTEGRITY.

The section “Designing and implementing your table partitioning strategy” contains

recommendations on the use of SET INTEGRITY.

The section “Best practices for roll-in of compressed table partitions” describes how to

attach a table partition without requiring the execution of SET INTEGRITY.
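As a sketch of the attach-and-validate sequence (table, partition, and staging-table names are hypothetical; existing partitions stay available during SET INTEGRITY):

```sql
-- Attach a pre-loaded staging table as a new range partition
ALTER TABLE Sales ATTACH PARTITION q4_2000
  STARTING '10/1/2000' ENDING '12/31/2000'
  FROM TABLE Sales_Staging;
COMMIT;

-- Validate the new data and maintain global indexes;
-- existing data remains available for reads and writes
SET INTEGRITY FOR Sales ALLOW WRITE ACCESS IMMEDIATE CHECKED;
COMMIT;
```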

Asynchronous index cleanup

Asynchronous index cleanup (AIC) is a DB2 feature that reclaims space in an index after a table partition is detached. The cleanup runs automatically as a low-priority background process, so detaching a table partition is near-instantaneous: the DETACH does not wait for index cleanup to complete. AIC is not performed for partitioned (local) indexes.


Designing and implementing your table partitioning strategy

Applications that benefit from table partitioning use the following kinds of tables:

• Very large tables

• Tables with queries accessing range-subsets of a table

• Tables with roll-out requirements

• Tables with roll-in requirements (as an alternative, consider MDC for roll-in of data)

Design best practices

When designing your table partitioning strategy, consider the following design best practices:

• Partition on a date column (or columns) to facilitate roll-out

• Partition on a column (or columns) that assists partition elimination, as discussed later in this section

• Match the granularity of ranges with roll-in and roll-out criteria. This avoids the

need for reorganization to reclaim space when you run DETACH.

• Consider placing different ranges in separate table spaces to facilitate backup and recovery. The DB2 database system can back up and restore an entire range partition when it is placed in a separate table space.

• Consider separating active from historical data.

• To position for enhancements in a future release of the DB2 database system,

consider making unique indexes a superset of the table partitioning key. Non-

unique indexes can be on any columns.

• Specify the placement of each of your global indexes into their own table space

(use large table spaces, if required). It is a good practice to minimize the size of

the table spaces containing global indexes, in order to improve backup time.

Also, use database partitioning to reduce the granularity of global indexes.

• Use partitioned indexes instead of global (non-partitioned) indexes where possible.

• Split up the global index into multiple table spaces to ensure that a single table

space does not grow too large.


• Consider strategies to minimize the impact of SET INTEGRITY. Consider the

logging impact and elapsed time of SET INTEGRITY when attaching large

ranges. SET INTEGRITY can also impact restart time if there is a failure.

Prototype in your environment to see if the elapsed time is acceptable.

Otherwise, consider alternative design strategies, discussed in the sections

“Rolling in data: Which solution to use?” and “Best practices for roll-in of

compressed table partitions”.

• For deep compression, DB2 Version 9.5 is strongly recommended, because of its

ability to automatically build compression dictionaries during LOAD, IMPORT

or INSERT operations. DB2 Version 9.1 requires table reorganization to compress

data in a table partition if a compression dictionary is not present.

With DB2 Version 9.7, consider the following partitioned index design best practices:

• Partitioned indexes improve ATTACH and DETACH processing time and may improve query performance in a large database environment. The design guidelines for creating partitioned (local) indexes are:

o Non-unique indexes are partitioned by default. There are no design restrictions on creating non-unique indexes as partitioned.

o Unique indexes can be partitioned only if the index key is a superset of the table partitioning key (for DPF configurations, the key must also be a superset of the database partitioning key). For example:

Database partition key: Account_Num
Table partition key: Sales_Month
Potential unique index that can be partitioned: Account_Num, Sales_Month, Store_Num

• To gain the benefits of partitioned indexes, verify whether uniqueness is actually required by the application:

o A downstream data source may already enforce uniqueness

o Non-unique indexes may increase sorting time for DISTINCT, ORDER BY, and GROUP BY predicates

o Uniqueness may simply not be required

• Create unique partitioned indexes prior to the ATTACH to avoid index maintenance overhead

• Placing index partitions in a separate table space is a best practice.

A major benefit of placing partitioned indexes in their own table space is that there is no data movement when attaching a partition, if the separate index was built prior to the ATTACH.

Partitioned indexes are placed in the same table space as the table by default, but they may be placed in a separate table space. To do so, use the partition-level INDEX IN clause of the CREATE TABLE DDL, or ALTER the table to ADD the partition (DMS storage only). The table-level INDEX IN clause of the CREATE TABLE DDL applies to non-partitioned indexes only.

• Partitioned index migration considerations and best practices

Indexes created in DB2 Version 9.5 and migrated to DB2 Version 9.7 are placed in the same table space as the table. Data movement is required to put an index into a separate table space after a migration from DB2 Version 9.5.

To migrate to partitioned indexes in separate table spaces in DB2 Version 9.7:

1. Create a new partitioned index in a separate table space:

create index date_part on sales(date, status) partitioned;

2. Drop the existing original index, which is in the same table space as the data:

drop index dateidx;

3. Rename the new partitioned index to the same name as the original index:

rename index date_part to dateidx;

An alternative method is to create a new table with partitioned indexes and move the data using an online table move, in order to place the partitioned indexes into a separate table space.

• Other partitioned index design considerations

o Indexes on the source table to be attached need to match indexes on the target table:
  - Some correction is possible before a failure occurs
  - Indexes that do not match will be dropped

o Although several catalog statistics are moved during an ATTACH, the best practice is to run RUNSTATS after an ATTACH.
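The superset-key rule for unique partitioned indexes can be illustrated as follows (table and column names are hypothetical, following the Account_Num/Sales_Month example above):

```sql
-- The table is partitioned by Sales_Month; because the unique key
-- includes it, the index can be created as PARTITIONED
CREATE UNIQUE INDEX ix_sales_unique
  ON Sales (Account_Num, Sales_Month, Store_Num)
  PARTITIONED;
```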

Maximizing the benefits of partition elimination

Table partitioning provides a powerful facility to limit data access to the partitions that are required to satisfy the SQL WHERE clause. To benefit from partition elimination, do the following tasks:

• Prefix the cluster key with the partition key

• Ensure the range partitioning column is frequently used in the WHERE clause

• Ensure the leading columns of the composite partition key are in the WHERE

clause

• If you are using generated columns, use them where appropriate to assist in

partition elimination. Generated columns can be partition keys.

• Use generated columns as MDC dimensions, where appropriate to reduce the

granularity of the dimension.

• Use multiple, separate ranges to eliminate unnecessary searches, if possible. For

example, partition elimination could access only the months of January and

December instead of the whole year.

• If you are using joins, partition elimination is used for inner access of the nested

loop join only.

• Partition elimination of parameter markers is pushed down at run time when

values are bound at execution time
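For example, against a table partitioned by quarter on Trade_date (as in the earlier Test example), a range predicate on the partitioning column lets the optimizer limit access to the matching range:

```sql
-- Only the partition covering Q1 2000 needs to be accessed
SELECT COUNT(*)
FROM Test
WHERE Trade_date BETWEEN '1/1/2000' AND '3/31/2000';
```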

Operational considerations

Use the following best practices to enhance the operational characteristics of table partitioning:

• Issue a COMMIT statement after each step of your roll-in or roll-out procedure to

release locks (for example, after ATTACH, DETACH, or SET INTEGRITY, and so

on)

• Explicitly name each table partition. These names are easier to manage than the

system generated names.

• Always terminate a failed LOAD utility run. Subsequent operations (for

example, DROP TABLE) cannot proceed until LOAD is terminated.

• If you are appending data to a partition, specify LOAD INSERT. Performing

LOAD REPLACE of a partition replaces an entire table (all partitions).


• Avoid attaching a partition with the same name as a detached partition. This

results in a duplicate name until asynchronous index cleanup (AIC) completes.
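The commit guidance above can be sketched for a roll-out step as follows (table and partition names are hypothetical):

```sql
-- Detach the oldest range into a standalone table for archiving
ALTER TABLE Sales DETACH PARTITION q1_2000 INTO TABLE Sales_2000_Q1;
COMMIT;  -- release locks promptly after the step
```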


Rolling in data: Which solution to use?

Several factors affect how you choose the best roll-in solution for your installation:

• Minimizing the time it takes to bring new data into the system and make it available

• Minimizing the amount of logging activity that occurs as part of the SET INTEGRITY operation during roll-in

• Whether you have a requirement for continuous updates rather than a daily batch process

• Maximizing compression for new ranges to effectively manage data skew

There are two techniques for the roll-in of data with table partitions.

1. ALTER/ATTACH

With the ALTER/ATTACH method, you first populate the table offline and then attach it as a partition. You must run SET INTEGRITY (a potentially long-running operation for large data volumes). The impact of running SET INTEGRITY can be reduced by using partitioned indexes in DB2 Version 9.7.

Advantages:

• Concurrent access

• All previous partitions are available for updates

• No partial data view (new data cannot be seen until SET INTEGRITY completes)

Disadvantages:

• Additional log space is required

• Long elapsed times

• Draining of queries is required

2. ALTER/Add

With the ALTER/Add method, you add an empty partition to the table and then populate it using the LOAD utility or INSERT statements. You do not need to run SET INTEGRITY.


Advantages:

• Faster elapsed times

• SET INTEGRITY is not required

• Less log space is needed for global index maintenance

Disadvantages:

• A partial data view occurs when you use INSERT statements (not with the LOAD utility)

• While the LOAD utility is running, access to the older partitions is read-only

Recommendation:

For larger data volumes, use the ALTER/Add method for roll-in of a table partition, or use MDC for roll-in if many non-partitioned indexes are deployed.
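The two techniques can be sketched in SQL as follows. This is a minimal sketch under assumptions: the sales table, partition names, boundary dates, and the sales_2009m01 and sales_staging source tables are all hypothetical.

```sql
-- Method 1: ALTER/ATTACH -- populate a standalone table offline, then attach it
ALTER TABLE sales
  ATTACH PARTITION p2009m01
  STARTING ('2009-01-01') ENDING ('2009-01-31')
  FROM sales_2009m01;

-- Validate the newly attached range (potentially long-running for large volumes)
SET INTEGRITY FOR sales IMMEDIATE CHECKED;

-- Method 2: ALTER/Add -- add an empty range, then populate it in place
ALTER TABLE sales
  ADD PARTITION p2009m02
  STARTING ('2009-02-01') ENDING ('2009-02-28');

-- Populate with the LOAD utility or INSERT statements; SET INTEGRITY is not required
INSERT INTO sales SELECT * FROM sales_staging;
```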


Best practices for roll-in of compressed table partitions

These best practices use the methods of attaching a table partition that are described in the preceding section.

For Version 9.1, rapidly attach a table partition with compressed data (large data volumes) by using the following technique:

1. Load a subset of data (a true random sample) into a separate DB2 table.

2. Alter the standalone table to enable compression.

3. Reorganize the subset of data to build a compression dictionary.

4. Empty the table or retain minimal data (so that the dictionary is retained).

5. ALTER/ATTACH the table as a new table partition (the dictionary is retained).

6. Execute SET INTEGRITY (this is rapid, due to the minimal data).

7. Populate the partition by using the LOAD utility or INSERT statements (compression will occur). For applications with continuous updates, load data into a staging table using the LOAD utility. Then use an insert with a sub-select from the staging table, or run an ETL (extract, transform, and load) job to update the primary tables (compression will occur). The roll-in of data can be improved further if you exploit the benefits of MDC within the table partition.
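The numbered steps above can be sketched as follows, assuming a range-partitioned sales table; the sales_new, sales_staging, and partition names are hypothetical.

```sql
-- 1. Create a standalone table with the same shape as the target partition,
--    then load a representative random sample into it (for example, with LOAD)
CREATE TABLE sales_new LIKE sales;

-- 2./3. Enable compression and build the dictionary with an offline reorganization
ALTER TABLE sales_new COMPRESS YES;
REORG TABLE sales_new;

-- 4. Empty the table; the compression dictionary is retained
DELETE FROM sales_new;

-- 5. Attach the table as the new range
ALTER TABLE sales
  ATTACH PARTITION p2009m03
  STARTING ('2009-03-01') ENDING ('2009-03-31')
  FROM sales_new;

-- 6. SET INTEGRITY completes rapidly because the partition holds minimal data
SET INTEGRITY FOR sales IMMEDIATE CHECKED;

-- 7. Populate the range; rows are compressed using the retained dictionary
INSERT INTO sales SELECT * FROM sales_staging;
```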

For Version 9.5, the technique to rapidly attach a table partition is simplified by automatic dictionary creation:

1. ALTER/Add the empty table partition.

2. Populate the table with data by using the LOAD utility or an INSERT/SELECT statement (data is compressed with automatic dictionary creation).

Note that a full offline reorganization of a fully loaded partition is likely to achieve better compression than this method. DB2 Version 9.7 Fix Pack 1 supports rapid reorganization by partition when you use partitioned indexes, which improves compression results.
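A sketch of the simplified Version 9.5 flow, assuming the table was created with COMPRESS YES; the table, partition, and staging-table names are hypothetical.

```sql
-- Add an empty range, then populate it; a compression dictionary is built
-- automatically once enough data has been inserted
ALTER TABLE sales
  ADD PARTITION p2009m04
  STARTING ('2009-04-01') ENDING ('2009-04-30');

INSERT INTO sales SELECT * FROM sales_staging;

-- Optional (DB2 9.7 Fix Pack 1, with partitioned indexes): reorganize just the
-- new partition to improve the compression achieved
REORG TABLE sales ON DATA PARTITION p2009m04;
```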


Best practice for roll-in and roll-out with continuous updates

This database design combines various features of the DB2 database system to facilitate roll-in and roll-out of data with continuous update requirements.

This design is for applications with the following characteristics:

• Continuous updates occur all day long (which prevents performing an ALTER/Add to attach a partition).

• Data is added daily.

• Queries frequently access a certain day.

• Table partitioning on day results in too many partitions (for example, 365 days times 3 years).

• Roll-out occurs weekly or monthly (typically on a reporting boundary).

Recommended database design:

To facilitate the roll-in of data, specify a single-dimension MDC on day (see the section “Features of MDC that benefit roll-in and roll-out of data”).

To facilitate the roll-out of data, specify a table partition range per week or month. This provides the same time dimension as MDC, but at a coarser scale.
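A minimal sketch of such a design, with hypothetical table and column names: monthly table-partition ranges support roll-out, while a single-dimension MDC on the date column clusters each day's inserts for roll-in.

```sql
CREATE TABLE sales (
  sale_date DATE          NOT NULL,
  store_id  INTEGER       NOT NULL,
  amount    DECIMAL(12,2)
)
-- One range per month: roll-out detaches a whole month at a reporting boundary
PARTITION BY RANGE (sale_date)
  (STARTING ('2009-01-01') ENDING ('2009-12-31') EVERY (1 MONTH))
-- One MDC cell per day: continuous inserts for a given day cluster together
ORGANIZE BY DIMENSIONS (sale_date);

-- Roll-out of a month: detach the range into a standalone table
-- (PART0 is the generated name of the first range in this sketch)
ALTER TABLE sales DETACH PARTITION PART0 INTO sales_2009m01;
```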

Applications with long-running reports might not be able to drain queries for the execution of the DB2 LOAD utility. The best practice in this case is to use the LOAD utility to rapidly load data into staging tables, and then populate the primary tables using an insert with a sub-select.
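For example (table names are hypothetical), after the LOAD utility has filled the staging table, the rows are moved under normal transactional concurrency:

```sql
-- The LOAD into the staging table does not interfere with queries on the
-- primary table; run from the CLP, for example:
--   LOAD FROM sales.del OF DEL INSERT INTO sales_staging

-- Then populate the primary table with an insert from a sub-select
INSERT INTO sales
  SELECT * FROM sales_staging;

-- Clear the staging table for the next cycle
DELETE FROM sales_staging;
```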


After roll-out: How to manage data growth and retention?

To satisfy corporate policy, government regulations, or audit requirements, you might need to retain your data and keep it accessible for long durations. For example, the Health Insurance Portability and Accountability Act (HIPAA) contains medical record retention requirements for health-care organizations, and the Sarbanes-Oxley Act sets out certain record retention requirements for corporate accountants. Additionally, some enterprises are finding value in performing analytics on historical data and are therefore retaining data for longer durations.

Therefore, in addition to implementing a suitable roll-in and roll-out strategy and an appropriate database design, you need to consider the complete lifespan of your data and include a policy for data retention and retrieval. You could do nothing and continually add hardware capacity and resources to accommodate the additional data growth for retention purposes; however, there are better practices for data retention, as described in this paper.

Using UNION ALL views

One practice is to keep all the data in the database but roll out certain ranges for retention, and create UNION ALL views over the ranges that require easy accessibility.

The following example demonstrates how to create a UNION ALL view:

CREATE VIEW all_sales AS
(
SELECT * FROM sales_0105
WHERE sales_date BETWEEN '01-01-2005' AND '01-31-2005'
UNION ALL
SELECT * FROM sales_0205
WHERE sales_date BETWEEN '02-01-2005' AND '02-28-2005'
UNION ALL
...
UNION ALL
SELECT * FROM sales_1207
WHERE sales_date BETWEEN '12-01-2007' AND '12-31-2007'
);


Using UNION ALL views addresses data retention and real-time accessibility while keeping all the data maintained online in the database on primary storage. A problem with this method is that you might be unnecessarily maintaining this data in associated backup images. Also, historical data typically does not require high performance, so it does not need the indexing or other high-cost factors associated with your primary data.

There are a variety of ways you could use UNION ALL views:

• Access active data using UNION ALL views and keep your historical data compressed in a range-partitioned table.

• Keep active data in a range-partitioned table and use a UNION ALL view to access historical data in another range-partitioned table.

Using UNION ALL views has some limitations. When you have a large number of ranges, use range-partitioned tables instead, because some complex predicates and joins are not pushed down into UNION ALL views.

However, in some situations UNION ALL views are advantageous. For example, a UNION ALL view can work in a federated environment, whereas a range-partitioned table cannot.

Although UNION ALL views may be useful in some environments, DB2 Version 9.7 users should strongly consider migrating to table partitioning.
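Such a migration can reuse the existing per-range tables directly. A sketch, assuming a range-partitioned table sales_rp already exists with columns matching the branch tables of the view; all names are hypothetical:

```sql
-- Attach one former UNION ALL branch table as a partition of the
-- range-partitioned table; its data is reused in place, not copied
ALTER TABLE sales_rp
  ATTACH PARTITION m0105
  STARTING ('2005-01-01') ENDING ('2005-01-31')
  FROM sales_0105;

-- Validate the attached range
SET INTEGRITY FOR sales_rp IMMEDIATE CHECKED;

-- Repeat for each remaining branch table, then drop the UNION ALL view
```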

Using IBM Optim Data Growth Solution

Depending on your service level agreement (SLA) objectives for your historical data, usually the best practice to address both data growth and retention is to implement data archiving with IBM Optim™ Data Growth Solution.

IBM Optim Data Growth Solution is a leading solution for addressing the growth, compliance, and management of data. It preserves application integrity by archiving complete business objects rather than single tables. For example, it retains foreign keys and preserves metadata within the archive. These features enable you to have:

• Flexible access to data.

• The ability to selectively or fully restore archived data into the original database table, into a new table, or even into an alternate database.

The following steps guide you through the process of determining how best to implement your archiving strategy.

STEP 1: Classify your applications

First, you need to classify your applications according to their archival requirements. By understanding which transactions you need to retain from your application data,


you can group applications with similar data requirements for archive accessibility and performance. Some applications require that only current transactions be retained; some require access to only historical transactions; and others require access to a mix of current and historical transactions (with a varying current-to-historical ratio).

Also, consider the service level agreement (SLA) objectives for your archived data. An SLA is a formal agreement between groups that defines the expectations between them and includes objectives for items such as services, priorities, and responsibilities. SLA objectives are often formulated as response-time goals. For example, a specific human resources report might need to run, on average, within 5 minutes.

STEP 2: Assess the temperature of your data

Data derives its “temperature” from the following criteria:

• How frequently the data is accessed

• How long it takes to access the data

• How rapidly the data changes (volatility)

• User and application requirements

The temperature varies from enterprise to enterprise, but data temperatures typically fall into common classifications across industries. The following table provides guidelines for data temperatures.

Data Temperature   Factoid

Hot                Tactical Data – The bulk of the queries are for current data, accessed frequently and heavily, and requiring quick response-time turnaround.

Warm               Traditional Decision Support Data – Queries access this data less frequently, and data retrieval does not require the urgency of a quick turnaround in response time.

Cold               Deep Historical Data – Queries rarely access this data, but it must be available for periodic access.

Dormant            Regulatory Data – Data that needs to be available on an exception basis.

There are various means of assessing the temperature of data. Consider business and application definitions and requirements, roll-out criteria, and workload and query tracking statistics as potential methods for determining how to classify your data according to temperature. Gather the following workload and query information to assess the data temperature:

• Which objects are (and are not) being accessed


• The frequency with which each object is accessed

• The common time intervals at which objects are accessed (for example, THIS_WEEK, LAST_WEEK, THIS_QUARTER, LAST_QUARTER)

• Which data within an object is being accessed

You can use DB2 Version 9.5 workload management (WLM) to assist in discovering data temperatures. The WLM historical analysis tool provides statistics on which tables, indexes, and columns have, or have not, been accessed, along with the associated frequency.

The WLM historical analysis tool consists of two scripts:

• wlmhist.pl: generates historical data

• wlmhistrep.pl: produces reports from the historical data

To discover which data within an object is being accessed, analyze the SQL statements by using an ACTIVITIES event monitor to collect data on workload activities, including the SQL statement text. You might want to collect information about workload management objects such as workloads, service classes, and work classes (through work actions). Enable activity collection by using the COLLECT ACTIVITY DATA … WITH DETAILS clause of the CREATE or ALTER statements for the workload management objects for which you want to collect information, as shown in the following example:

ALTER SERVICE CLASS sysdefaultsubclass
UNDER sysdefaultuserclass
COLLECT ACTIVITY DATA ON ALL WITH DETAILS

The WITH DETAILS clause enables collection of the statement text for both static and dynamic SQL.

If applications use parameter markers within the statement text, you should also include the AND VALUES clause (so that you have COLLECT ACTIVITY DATA … WITH DETAILS AND VALUES). The AND VALUES clause collects the data values associated with the parameter markers, in addition to the detailed statement information.
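For the collected activities to be captured anywhere, an ACTIVITIES event monitor must also exist and be active. A minimal sketch; the monitor name act_mon is arbitrary:

```sql
-- Write activity records, including statement text, to tables
CREATE EVENT MONITOR act_mon FOR ACTIVITIES WRITE TO TABLE;
SET EVENT MONITOR act_mon STATE 1;

-- Collect statement text plus parameter-marker values for the default subclass
ALTER SERVICE CLASS sysdefaultsubclass UNDER sysdefaultuserclass
  COLLECT ACTIVITY DATA ON ALL WITH DETAILS AND VALUES;
```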

STEP 3: Discover and classify your business objects

Business objects, such as insurance claims, invoices, or purchase orders, represent business transactions. By classifying your business objects, you can begin to define


rules and associated business drivers for managing these objects at different stages in the data life cycle.

From a database perspective, a business object represents a group of related rows from related tables.

Simplified example of a business object:

Given the following three tables:

PROJECT

PROJNO   PROJNAME               DEPTNO   PRJENDATE
IF2000   USER EDUCATION         C01      2/1/2006
MA2100   WELD LINE AUTOMATION   D01      2/1/2006
MA2110   W L PROGRAMMING        D11      2/1/2006
MA2111   W L PROGRAM DESIGN     D11      12/1/2002
MA2112   W L ROBOT DESIGN       D11      12/1/2002
MA2113   W L PROD CONT PROGS    D11      12/1/2002
OP1000   OPERATION SUPPORT      E01      2/1/2006
OP1010   OPERATION              E11      2/1/2006

EMPLOYEE

EMPNO   LASTNAME    WORKDEPT
60      TSOUNIS     D11
130     VINCENT     C01
140     TYRRELL     C01
150     GOODMAN     D11
160     CASSELLS    D11
170     CIALINI     D11
310     O’CONNELL   E11

DEPARTMENT

DEPTNO   DEPTNAME
C01      INFORMATION CENTER
D11      MANUFACTURING SYSTEMS
D21      ADMINISTRATION SYSTEMS
E11      OPERATIONS

The business object is the PROJECT data together with its related DEPARTMENT and EMPLOYEE rows.

For data retention and archiving purposes, you want the complete business object to be represented, such that you have a historical “point-in-time” snapshot of a business transaction. Creating a historical snapshot requires both transactional detail and related master information, which involves multiple tables in the database.

Archiving complete business objects allows the archives to be intact and accurate and to provide a standalone repository of transaction history. To respond to inquiries or discovery requests, you can query this repository without the need to access “hot” data.


In this example, to ensure that the complete object is available, the archived business object must consist of the PROJECT data together with the associated data from the DEPARTMENT and EMPLOYEE tables. After archiving, you would delete only the data in the production PROJECT table, not the associated EMPLOYEE and DEPARTMENT data.

You can discover business objects based on data relationships within the schema, as demonstrated in this example. However, you might also want to include other related tables that do not have any schema relationship but might, for example, be related through use of an application. In addition, you might elect to remove certain discovered relationships from the business object.

STEP 4: Produce your comprehensive data classification

After you have classified your applications and business objects and determined their associated data temperatures, you can produce a data classification table to summarize this information. This table articulates the aging of the data.

The following table provides a sample data classification:

Application   Business Object   Production   Online Archive   Offline Archive   Delete

AppA          Claims            0-2 yrs      3-5 yrs          6-10 yrs          >10 yrs

STEP 5: Determine the post-archive storage type

To determine which storage type is most appropriate for your aged data, consider the following questions:

• Who needs to access the archive data, and for what purpose?

• What are the response-time expectations?

• How will the archive data age?

• How many storage tiers, and what type of storage, should be deployed (for example, SAN, WORM, or tape)?

For example, for an online archive you could use ATA disks or large-capacity slower drives. For an offline archive, you could use tape or WORM devices (IBM DR550, EMC Centera).


[Figure: Archive storage tiers and universal access. The production database holds current data (0-2 years). An online archive on a non-DBMS retention platform (ATA file server, IBM DR550, EMC Centera) holds data aged 3-5 years, and an offline retention platform (CD, tape, optical) holds data aged 6+ years, with restore paths back from the archives. An application-independent access layer (native application, ODBC/JDBC, XML, report writer, IBM federation) provides universal access to the application data.]

STEP 6: Access to archived data

The Optim Data Growth Solution access layer uses SQL92 capability and various protocols (as shown in the preceding figure) to provide access to the archived data. This accessibility is out-of-line from the production database, and so does not use any resources of the production database system.

Alternatively, you can use a federated system (using IBM DB2 Federated Server) to provide transparent access to the archive from the production database.

Both methods allow direct access to archived data, without the need to retrieve or restore the archived data.

The following example demonstrates how to use a UNION ALL view to access both active and archived data. The example renames the database table called project to a different name, and then creates a UNION ALL view that is also named project:

RENAME TABLE project TO project_active

CREATE VIEW project AS
SELECT * FROM project_active
WHERE prjendate >= (CURRENT_DATE - 5 YEARS)
UNION ALL
SELECT * FROM project_arch


WHERE prjendate < (CURRENT_DATE - 5 YEARS)

As an alternative, the following example avoids the need to rename the table in the database. Instead, it creates a UNION ALL view called project_all that the application can query to get the complete project data set:

CREATE VIEW project_all AS
SELECT * FROM project
WHERE prjendate >= (CURRENT_DATE - 5 YEARS)
UNION ALL
SELECT * FROM project_arch
WHERE prjendate < (CURRENT_DATE - 5 YEARS)


Best Practices

• For database partitioning, use a partitioning key column that has high cardinality and is frequently used by a join predicate.

• Use database partitioning to improve scalability for large-scale data warehouses.

• Use table partitioning for very large tables, for tables with queries that access range-subsets of data, and for roll-out requirements.

• For MDC, specify low-cardinality columns or use generated columns to reduce cardinality.

• Use a single-column MDC design to facilitate roll-in and roll-out while minimizing increased disk space usage.

• For large-scale applications, implement database partitioning, table partitioning, and MDC simultaneously.

• Use large table spaces for tables with deep compression if you believe you will have very small row sizes. For table partitioning, place each table partition’s global index in a separate table space (this might avoid the need for large table spaces), or use partitioned local indexes.

• For larger data volumes, use the ALTER/Add method to roll in a table partition, or use MDC.

• For Version 9.1, to attach a table partition with compressed data, build a dictionary with minimal data prior to ALTER/ATTACH to avoid table reorganization.

• For Version 9.5, to attach a table partition, use the ALTER/Add method.


• For continuous updates, facilitate roll-in of data by specifying a single-dimension MDC on day.

• Use federation to facilitate access to archived data from production databases.

• Use UNION ALL views for transparent access to archived data.

• IBM Optim Data Growth Solution is the recommended tool for data retention and retrieval.


Conclusion

Careful selection of the most appropriate partitioning method for your DB2 database, and use of the most efficient roll-in and roll-out techniques for your system, can maximize your system’s overall performance and efficiency.

Devote sufficient time to analyzing and understanding your data so that you can make the best use of the guidelines in this paper and take advantage of the features the DB2 database system provides to help make your system as efficient as possible.

You can use database partitioning to provide scalability and to help ensure even distribution of data across partitions. Follow the guidelines in the section “Designing and implementing your table partitioning strategy” to devise the most effective table partitioning strategy. Use MDC to help improve the performance of queries and to facilitate the roll-in of data.

If you need to roll in large volumes of data from compressed table partitions, upgrade to Version 9.5 of the DB2 database system and use the ALTER/Add method to attach a table partition.

If you need to accommodate continuous updates, your best strategy is to use MDC to facilitate the roll-in process.

To determine how to handle the needs of your historical data, follow the guidelines in the section “After roll-out: How to manage data growth and retention?”.

Before you are ready to roll out your data and archive it, you need to determine a policy for data retention and retrieval from archive that suits your organization.

You can better understand your organization’s technical requirements for retention and retrieval by analyzing the following factors:

• The kind of transactions you need to retain

• The “temperature” of your data

• How your business objects are composed

Your policy should include what kind of post-archive storage is most appropriate, and how best to access the archived data. The guidelines in the section “After roll-out: How to manage data growth and retention?” can assist you in producing your policy.


Further reading

• DB2 Best Practices - http://www.ibm.com/developerworks/db2/bestpractices/

• Leveraging DB2 Data Warehouse Edition for Business Intelligence - http://www.redbooks.ibm.com/redbooks/SG247274/wwhelp/wwhimpl/java/html/wwhelp.htm

• Database Partitioning, Table Partitioning, and MDC for DB2 9 - http://www.redbooks.ibm.com/redbooks/SG247467/wwhelp/wwhimpl/java/html/wwhelp.htm

• DB2 V9.5 Information Center - http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp

• DB2 V9.7 Information Center - http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp

• Optim Data Growth Management - http://www.optimsolution.com/solutions/DataGrowth.asp

Contributors

Tim Vincent, Chief Architect, DB2 LUW

Bill O’Connell, Data Warehousing CTO

Miriam Goodwin, Technical Sales Specialist

Tim Smith, Optim Product Manager

Phrederick Tyrrell, Data Warehousing Competitive Specialist

Aamer Sachedina, Senior Technical Staff Member, DB2 Technology Development

Matthew Huras, Chief Architect, DB2 LUW Kernel

Joyce Simmonds, DB2 Information Management


Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other

countries. Consult your local IBM representative for information on the products and services

currently available in your area. Any reference to an IBM product, program, or service is not

intended to state or imply that only that IBM product, program, or service may be used. Any

functionally equivalent product, program, or service that does not infringe any IBM

intellectual property right may be used instead. However, it is the user's responsibility to

evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in

this document. The furnishing of this document does not grant you any license to these

patents. You can send license inquiries, in writing, to:

IBM Director of Licensing

IBM Corporation

North Castle Drive

Armonk, NY 10504-1785

U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where

such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES

CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER

EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-

INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do

not allow disclaimer of express or implied warranties in certain transactions, therefore, this

statement may not apply to you.

Without limiting the above disclaimers, IBM provides no representations or warranties

regarding the accuracy, reliability or serviceability of any information or recommendations

provided in this publication, or with respect to any results that may be obtained by the use of

the information or observance of any recommendations provided herein. The information

contained in this document has not been submitted to any formal IBM test and is distributed

AS IS. The use of this information or the implementation of any recommendations or

techniques herein is a customer responsibility and depends on the customer’s ability to

evaluate and integrate them into the customer’s operational environment. While each item

may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee

that the same or similar results will be obtained elsewhere. Anyone attempting to adapt

these techniques to their own environment does so at their own risk.

This document and the information contained herein may be used solely in connection with

the IBM products discussed in this document.

This information could include technical inaccuracies or typographical errors. Changes are

periodically made to the information herein; these changes will be incorporated in new

editions of the publication. IBM may make improvements and/or changes in the product(s)

and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only

and do not in any manner serve as an endorsement of those Web sites. The materials at

those Web sites are not part of the materials for this IBM product and use of those Web sites is

at your own risk.

IBM may use or distribute any of the information you supply in any way it believes

appropriate without incurring any obligation to you.

Any performance data contained herein was determined in a controlled environment.

Therefore, the results obtained in other operating environments may vary significantly. Some

measurements may have been made on development-level systems and there is no

guarantee that these measurements will be the same on generally available systems.

Furthermore, some measurements may have been estimated through extrapolation. Actual

results may vary. Users of this document should verify the applicable data for their specific

environment.


Information concerning non-IBM products was obtained from the suppliers of those products,

their published announcements or other publicly available sources. IBM has not tested those

products and cannot confirm the accuracy of performance, compatibility or any other

claims related to non-IBM products. Questions on the capabilities of non-IBM products should

be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal

without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To

illustrate them as completely as possible, the examples include the names of individuals,

companies, brands, and products. All of these names are fictitious and any similarity to the

names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate

programming techniques on various operating platforms. You may copy, modify, and

distribute these sample programs in any form without payment to IBM, for the purposes of

developing, using, marketing or distributing application programs conforming to the

application programming interface for the operating platform for which the sample

programs are written. These examples have not been thoroughly tested under all conditions.

IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these

programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall

not be liable for any damages arising out of your use of the sample programs.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International

Business Machines Corporation in the United States, other countries, or both. If these and

other IBM trademarked terms are marked on their first occurrence in this information with a

trademark symbol (® or ™), these symbols indicate U.S. registered or common law

trademarks owned by IBM at the time this information was published. Such trademarks may

also be registered or common law trademarks in other countries. A current list of IBM

trademarks is available on the Web at “Copyright and trademark information” at

www.ibm.com/legal/copytrade.shtml

Windows is a trademark of Microsoft Corporation in the United States, other countries, or

both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.