
Effective Planning and Use of IBM Tivoli Storage Manager V6 and V7 Deduplication

Date: 12/09/2013
Version: 2.0

Authors: Jason Basler, Dan Wolfe


Document Location

This is a snapshot of an on-line document. Paper copies are valid only on the day they are printed. The document is stored at the following location:

    https://www.ibm.com/developerworks/community/wikis/home?lang=en#/wiki/Tivoli Storage Manager/page/Deduplication

Revision History

Revision Number | Revision Date | Summary of Changes
1.0 | 08/17/12 | Initial publication
1.1 | 08/31/12 | Clarification on deduprequiresbackup option and other minor edits
1.2 | 06/10/13 | General updates on best practices
1.3 | 06/27/13 | Add information covering deduplication of Exchange data
2.0 | 12/09/13 | Major revision to reflect scalability and best practice improvements provided by TSM 6.3.4.200 and 7.1.0

Disclaimer

The information contained in this document is distributed on an "as is" basis without any warranty either expressed or implied.

    This document has been made available as part of IBM developerWorks WIKI, and is hereby governed by the terms of use of the WIKI as defined at the following location:

    https://www.ibm.com/developerworks/community/terms/

Acknowledgements

The authors would like to express their gratitude to the following people for contributions in the form of adding content, editing, and providing insight into TSM technology.

    Matt Anglin, Tivoli Storage Manager Server Development

    Dave Cannon, Tivoli Storage Manager Architect

Robert Elder, Tivoli Storage Manager Performance Evaluation

    Tom Hughes, Executive; WW Storage Software

    Kathy Mitton, Tivoli Storage Manager Server Development

    Harley Puckett, Tivoli Storage Software Development - Executive Consultant

    Michael Sisco, Tivoli Storage Manager Server Development

    Richard Spurlock, CEO and Founder, Cobalt Iron


Contents

1 Introduction

    1.1 Overview

    1.1.1 Description of deduplication technology

    1.1.2 Data reduction and data deduplication

    1.1.3 Server-side and client-side deduplication

    1.1.4 Pre-requisites for configuring TSM deduplication

    1.1.5 Comparing TSM deduplication and appliance deduplication

    1.2 Conditions for effective use of TSM deduplication

    1.2.1 Traditional TSM architectures compared with deduplication architectures

    1.2.2 Examples of appropriate use of TSM deduplication

    1.2.3 Data characteristics for effective deduplication

    1.3 When is it not appropriate to use TSM deduplication?

    1.3.1 Primary storage of backup data is on VTL or physical tape

    1.3.2 No flexibility with the backup processing window

    1.3.3 Restore performance considerations

    2 Resource requirements for TSM deduplication

    2.1 Database and log size requirements

    2.1.1 TSM database capacity estimation

    2.1.2 TSM database log size estimation

    2.2 Estimating capacity for deduplicated storage pools

    2.2.1 Estimating storage pool capacity requirements

    2.3 Hardware recommendations and requirements

    2.3.1 Database I/O requirements

    2.3.2 CPU

    2.3.3 Memory

    2.3.4 Considerations for the storage pool disk

    2.3.5 Hardware requirements for TSM client deduplication

    3 Implementation guidelines

    3.1 Deciding between client and server deduplication

    3.2 TSM Deduplication configuration recommendations

    3.2.1 Recommendations for deduplicated storage pools

    3.2.2 Recommended options for deduplication

    3.2.3 Best practices for ordering backup ingestion and data maintenance tasks


    4 Estimating deduplication savings

    4.1 Factors that influence the effectiveness of deduplication

    4.1.1 Characteristics of the data

    4.1.2 Impacts from backup strategy decisions

    4.2 Effectiveness of deduplication combined with progressive incremental backup

    4.3 Interaction of compression and deduplication

    4.3.1 How deduplication and compression interact with TSM

    4.3.2 Considerations related to compression when choosing between client-side and server-side deduplication

    4.4 Understanding the TSM deduplication tiering implementation

    4.4.1 Controls for deduplication tiering

    4.4.2 The impact of tiering to deduplication storage reduction

    4.4.3 Client controls that optimize deduplication efficiency

    4.5 What kinds of savings can I expect for different application types

    4.5.1 IBM DB2

    4.5.2 Microsoft Exchange

    4.5.3 Microsoft SQL

    4.5.4 Oracle

    4.5.5 VMware

    5 How to determine deduplication results

    5.1 Simple TSM Server Queries

    5.1.1 QUERY STGPOOL

    5.1.2 Other server queries affected by deduplication

    5.2 TSM client reports

    5.3 TSM deduplication report script


1 Introduction

Data deduplication is a technology that removes redundant data to reduce the storage capacity requirement for retaining the data. When deduplication technology is applied to data protection it can provide a highly effective means for reducing overall cost of a data protection solution. Tivoli Storage Manager introduced deduplication technology beginning with TSM V6.1. This document describes the benefits of deduplication and provides guidance on how to make effective use of the TSM deduplication feature as part of a well-designed data protection solution. The information provided by this document is relevant to both TSM Version 6 and Version 7. Significant enhancements have been made that impact the scalability of TSM deduplication beginning in TSM server levels 6.3.4.200 and 7.1.0. Many of the recommendations throughout the document assume you are running one of these levels or newer.

    Following are key points regarding TSM deduplication:

    TSM deduplication is an effective tool for reducing overall cost of a backup solution

    Additional resources (DB capacity, CPU, and memory) must be configured for a TSM server that is enabled with TSM deduplication. However, when properly configured, the benefit of storage pool capacity reduction will result in a significant cost reduction benefit.

Cost reduction is the result of data reduction. Deduplication is just one of several methods that TSM provides for data reduction (others include progressive incremental backup). The goal is overall data reduction when all of the techniques are combined, rather than focusing on the deduplication ratio alone.

    TSM deduplication can operate on backup, archive, and HSM data. This includes data which is stored via the TSM API.

    TSM deduplication is an appropriate data reduction method for many situations. It can also be used as a cost effective option for backing up a subset of an environment that uses a deduplication appliance for the remaining backups.

    This document is intended to provide guidance specific to the use of TSM deduplication. The document does not provide comprehensive instruction and guidance for the administration of TSM, and should be used in addition to the TSM product documentation.

    1.1 Overview

1.1.1 Description of deduplication technology

Deduplication technology detects patterns within data that appear multiple times within the scope of a collection of data. For the purposes of this document, the collection of data consists of TSM backup, archive, and HSM data (all of these types of data will be referred to as backup data throughout this document). The patterns that are detected are represented as a hash value that is much smaller than the original pattern, specifically 20 bytes. Except for the original instance of the pattern, subsequent instances of the chunk are referenced by the hash value. As a result, for a pattern that appears many times throughout a given collection of data, significant reduction in storage can be achieved.

    Unlike compression, deduplication can take advantage of a pattern that occurs multiple times within a collection of data. With compression, a single instance of a pattern is represented by a smaller amount of data that is used to algorithmically recreate the original data pattern. Compression cannot take advantage of common data patterns that reoccur throughout the collection of data, and this significantly reduces the potential reduction capability. However, compression can be combined with deduplication to take advantage of both techniques and further reduce the required amount of data storage beyond just one technique or the other.


1.1.1.1 TSM deduplication use compared with other deduplication approaches

Deduplication technology of any sort requires CPU and memory resources to detect and replace duplicate chunks of data, as described throughout this section. Software-based technologies such as TSM deduplication create similar outcomes to hardware-based or appliance technologies.

By using a software-based solution, the need to procure specialized, and therefore comparatively expensive, dedicated hardware is negated. This means that with TSM-based deduplication, standard hardware components such as servers and storage can be used. Because TSM has significant data efficiencies compared to other software-based deduplication technologies (see section 1.1.2 of this document), there is less duplicate data to detect, process, and remove. Therefore, all other things being equal, TSM requires less of this standard hardware resource to function compared to other software-based deduplication technologies.

Care should still be taken in planning and implementing this technology, but under the majority of use cases TSM provides a viable, proven technical platform where available. Where not available, such as when performing backups over the storage area network (SAN), alternate technologies such as a VTL provide an appropriate architectural solution.

The diagram below outlines the reference architectures for these use cases and highlights some key considerations.

1.1.1.2 How does TSM perform deduplication?

TSM uses an algorithm to analyze variable-sized, contiguous segments of data, called chunks, for patterns that are likely to be duplicated within the same TSM storage pool. This process is explained in more detail in a later section in this document. As described above, the repeated identical chunks of data are removed and replaced with a smaller pointer.


    The implementation of TSM deduplication only applies to the FILE device class (sequential-access disk) storage pools, and can be used with primary, copy, or active-data pools.
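As a brief illustration, the storage pool setup described above might look like the following (the device class name, directory paths, and pool name are hypothetical examples, and sizes should be adjusted to your environment):

    define devclass dedupfile devtype=file mountlimit=64 maxcapacity=50G directory=/tsmfile/fs01,/tsmfile/fs02
    define stgpool deduppool dedupfile maxscratch=200 deduplicate=yes identifyprocess=4

The DEDUPLICATE=YES setting is what enables deduplication for the storage pool; the remaining parameters are shown only to make the sketch complete.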

1.1.2 Data reduction and data deduplication

When using data deduplication to substantially reduce storage capacity requirements, it is important to consider other data reduction techniques that are available. The deduplication ratio, or percentage of reduction, is often treated as the ultimate measurement of deduplication effectiveness. Unlike other backup products, TSM provides a substantial advantage in data reduction through its native capability to back up data only once (rather than create duplicate data by repeatedly backing up unchanged files and other data). TSM provides a genuine measure of success rather than claiming efficiencies that are in fact just removing self-created inefficiencies.

Inherent efficiency combined with deduplication, compression, exclusion of specified objects, and appropriate retention policies enables TSM to provide highly effective data reduction. If reduction of storage and infrastructure costs is the goal, the focus will be on overall data reduction effectiveness, with data deduplication effectiveness as one component. The following table provides a summary of the data reduction technologies that TSM offers:

| Client compression | Incremental forever | Subfile backup | Deduplication
How data reduction is achieved | Client compresses files | Client only sends changed files | Client only sends changed regions of a file | Eliminates redundant data chunks
Conserves network bandwidth? | Yes | Yes | Yes | When client-side deduplication is used
Data supported | Backup, archive, HSM, API | Backup | Backup (Windows only) | Backup, archive, HSM, API (HSM supported only for server-side deduplication)
Scope of data reduction | Redundant data within same file on client node | Files that do not change between backups | Unchanged regions within previously backed up files | Redundant data from any data in storage pool
Avoids storing identical files renamed, copied, or relocated on client node? | No | No | No | Yes
Removes redundant data for files from different client nodes? | No | No | No | Yes
Can be used with any type of storage pool configuration? | Yes | Yes | Yes | No


1.1.3 Server-side and client-side deduplication

TSM provides two options for performing deduplication: client-side and server-side deduplication. Both methods use the same algorithm to identify redundant data; however, the when and where of the deduplication processing differ.

1.1.3.1 Server-side deduplication

With server-side deduplication, all of the processing of redundant data occurs on the TSM server, after the data has been backed up. Server-side deduplication is also called target-side deduplication. The key characteristics of server-side deduplication are:

    Duplicate data is identified after backup data has been transferred to the storage pool volume.

The duplicate identification processing must run regularly on the server, and will consume TSM server memory, CPU, and TSM database resources (a sample command follows this list).

    Storage pool data reduction is not realized until data from the deduplication storage pool is moved to another storage pool volume, usually through a reclamation process, but can also occur during a TSM MOVE DATA process.
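For example, duplicate identification can be run or scheduled with the IDENTIFY DUPLICATES command (the storage pool name, process count, and duration below are hypothetical and should be sized to the server):

    identify duplicates deduppool numprocess=4 duration=240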

1.1.3.2 Client-side deduplication

Client-side deduplication processes the redundant data during the backup process on the host system where the source data is located. The net results of deduplication are virtually the same as with server-side deduplication, except that the storage savings are realized immediately, since only the unique data needs to be sent to the server in its entirety. Data that is duplicated requires only a small signature to be sent to the TSM server. Client-side deduplication is especially effective when it is important to conserve bandwidth between the TSM client and server. In some cases, client-side deduplication has the potential to be more scalable than server-side deduplication, because redundant data is removed before it is sent to the TSM server, reducing the I/O demands on the server. A number of conditions must exist for this to be the case:

    Sufficient client CPU resource to perform the duplicate identification processing that occurs in-line during backup.

    The ability to drive parallel client sessions where the number of client sessions exceeds the number of identify duplicates processes the server is capable of running.

    The combination of the TSM database running on fast disk, and a high bandwidth low latency network between the clients and server.

1.1.3.2.1 Client deduplication cache

Although it is necessary for the backup client to check in with the server to determine whether a chunk is unique or a duplicate, the amount of data transfer is small. The client must query the server for each chunk of data that is processed. The overhead associated with this query process can be reduced substantially by configuring a cache on the client, which allows previously discovered chunks on the client (during the backup session) to be identified without a query to the TSM server. For the backup-archive client (including VMware backup), it is recommended to always configure a cache when using client-side deduplication. For applications that use the TSM API, the deduplication cache should not be used due to the potential for backup failures caused by the cache being out of sync with the TSM server. If multiple, concurrent TSM client sessions are configured (such as with a TSM for VMware vStorage backup server), there must be a separate cache configured for each session. There are also conditions where faster performance will be possible when the deduplication cache is disabled. When the network between the clients and server has high bandwidth and low latency and the TSM server database is on fast storage, the deduplication queries directly to the TSM server can outperform queries to the local cache.
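A minimal client options sketch for enabling client-side deduplication with a local cache is shown below (the cache path and size are hypothetical examples; as noted above, the cache should not be used with TSM API applications):

    deduplication yes
    enablededupcache yes
    dedupcachepath /tsm/dedupcache
    dedupcachesize 2048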

1.1.4 Pre-requisites for configuring TSM deduplication

This section provides a general description of pre-requisites when using TSM deduplication. For a complete list of pre-requisites refer to the TSM administrator documentation.

1.1.4.1 Pre-requisites common to client and server-side deduplication

The destination storage pool must be of type FILE (sequential disk)

    The target storage pool must have the deduplication setting enabled

    The TSM database must be configured according to best practices for high performance

1.1.4.2 Pre-requisites specific to client-side deduplication

When configuring client-side TSM deduplication, the following requirements must be met (a combined example follows the lists below):

    The client and server must be at version 6.2.0 or later. The latest maintenance version should always be used.

    The client must have the client-side deduplication option enabled (DEDUPLICATION YES).

    The server must enable the node for client-side deduplication with the DEDUP=CLIENTORSERVER parameter using either the REGISTER NODE or UPDATE NODE commands.

    Files must be bound to a management class with the destination parameter pointing to a storage pool that is enabled for deduplication.

By default, all client files that are at least 2KB and smaller than the value specified by the server CLIENTDEDUPTXNLIMIT option are processed with deduplication. The exclude.dedup client option provides a feature to selectively exclude certain files from client-side deduplication processing.

    The following TSM features are incompatible with TSM client-side deduplication:

    Client encryption

    LAN-free/storage agent

    UNIX HSM client

    Subfile backup

    Simultaneous storage pool write
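As a combined illustration of the requirements above, assuming a hypothetical node NODE1 and a deduplicated destination pool already in place, the enablement might look like the following (server command first, then client options):

    update node node1 deduplication=clientorserver

    deduplication yes
    exclude.dedup /.../*.iso

The exclude.dedup entry is shown only as a hypothetical example of excluding a file type that is unlikely to deduplicate well.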

1.1.5 Comparing TSM deduplication and appliance deduplication

TSM's deduplication provides the most cost-effective solution for reducing backup storage costs, since there is no additional software license charge for it, and it does not require special purpose deduplicating hardware appliances. Deduplication of backup data can also be accomplished by using a deduplicating storage device in the TSM storage pool hierarchy. Deduplication appliances such as IBM's ProtecTIER and EMC's Data Domain provide deduplication capability at the storage device level. NAS devices are also available that provide NFS or CIFS mounted storage that removes redundant data through deduplication.


An optimal balance can be struck between TSM deduplication and storage appliance deduplication. Both techniques can be used in the same environment for separate storage hierarchies or in separate TSM server instances. For example, TSM client-side deduplication is an ideal choice for backing up remote environments, either to a local TSM server or to a central datacenter. TSM node replication can then take advantage of the deduplicated storage pools to reduce data transfer requirements between TSM servers, for disaster recovery purposes. Alternatively, within a large datacenter, a separate TSM server may be designated for backing up a critical subset of all hosts using TSM deduplication. The remaining hosts would back up to a separate TSM server instance that uses a deduplicating appliance such as ProtecTIER for its primary storage pool and also supports replication of the deduplicated data.

TSM deduplication should not be used in the same storage hierarchy as a deduplicating appliance. For a deduplicating VTL, the TSM storage pool data would need to be rehydrated before moving to the VTL (as with any tape device), and there would be no data reduction as a result of the TSM deduplication; rather, the data would be re-deduplicated by the VTL. For a deduplicating NAS device, a FILE device type could be created on the NAS. However, since the data is already deduplicated by TSM, there would be little to no additional data reduction possible by the NAS device.

1.1.5.1 Factors to consider when comparing TSM and appliance deduplication

There are three major factors to consider when deciding which deduplication technology to use:

    Scale

    Scope

    Cost

1.1.5.1.1 Scale

TSM deduplication is a scalable software solution that makes heavy use of TSM database transactions. The deduplication processing has an impact on daily server processes such as reclamation and storage pool backup. For a specific TSM server hardware configuration (for example, TSM database disk speed, processor and memory capability, and storage pool device speeds), there is a practical limit to the amount of data that can be backed up using deduplication.

The two primary points of scalability to consider are the daily amount of new data that is ingested and the total amount of data that will be protected over time. The practical limits described are not hard limits in the product, and will vary based on the capabilities of the hardware which is used. The limit on the amount of protected data is presented as a guideline with the purpose of keeping the size of the TSM database below the recommended limit of 4TB. A 4TB database corresponds roughly to 400TB of protected data (original data plus all retained versions). There is no harm in occasionally exceeding the limit for daily ingest, which is prescribed with the goal of allowing enough time each day for the TSM server's maintenance tasks to run efficiently. Regularly exceeding the practical limit on daily ingest for your specific hardware may have an impact on the ability to achieve the maximum possible amount of data reduction, or cause backup durations to run longer than desired.

    Deduplication appliances have dedicated resources for deduplication processing and do not have a direct impact on TSM server performance and scalability. If it is desired to scale up a single TSM server instance as much as possible, beyond approximately 400TB of protected data (original data plus all retained versions), then appliance deduplication may be considered. However, often a more cost-effective approach is to scale out with additional TSM server instances. Using additional TSM server instances can provide the ability to manage many multiples of 400TB protected data.

In addition to the scale of data stored, the scale of the daily amount of data backed up will also have a practical limit with TSM. The daily ingest is established by the capabilities of system resources as well as the inclusion of secondary processes such as replication and storage pool backup. Since deduplicating appliances are single-purpose devices, there is the potential for greater throughput due to the use of dedicated resources. A cost/benefit analysis should be considered to determine the appropriate choice or mix of deduplication technologies. The following table provides some general guidelines for daily ingest ranges for each TSM server relative to hardware configuration choices.

Ingest range | Server requirements | Storage requirements
Up to 4TB per day | 12 CPU cores, 64 GB RAM | Database and active log on SAS/FC 15K rpm; storage pool on NL-SAS/SATA or SAS
4 - 8TB per day | 24 CPU cores, 128 GB RAM | Database and active log on SAS/FC 15K rpm; storage pool on NL-SAS/SATA or SAS
8 - 20TB per day (up to 30TB per day with client-side deduplication) | 32 CPU cores, 192 GB RAM | Database and active log on SSD/flash storage; storage pool on SAS

1.1.5.1.2 Scope

The scope of TSM deduplication is limited to a single TSM server instance and more precisely within a TSM storage pool. A single, shared deduplication appliance can provide deduplication across multiple TSM servers.

    When TSM node replication is used in a many-to-one architecture, such as with branch offices, the deduplicated storage pool on the replication target can deduplicate across the data incoming from the multiple source servers.

1.1.5.1.3 Cost

TSM deduplication functionality is embedded in the product without an additional software license cost; in fact, TSM software license costs are reduced when capacity-based licensing is in force, because the licensed capacity is calculated after deduplication has occurred. It is important to consider that hardware resources must be appropriately sized and configured, and additional expense should be anticipated when planning a TSM server configuration that will be used with deduplication. However, these additional costs can easily be offset by the savings in disk storage.

    Deduplication appliances are priced for the performance and capability that they provide, and generally are considered more expensive per GB than the hardware requirements for TSM native deduplication. A detailed cost comparison should be done to determine the most cost-effective solution.

1.2 Conditions for effective use of TSM deduplication

Although TSM deduplication provides a cost-effective and convenient method for reducing the amount of disk storage required for backups, there are specific conditions under which it provides the most benefit. Conversely, there are conditions where TSM deduplication will not be effective and in fact may reduce the efficiency of a backup operation.

Conditions that lead to effective use of TSM deduplication include the following:


    Need for reduction of the disk space required for backup storage.

    Need for remote backups over limited bandwidth connections.

    Use of TSM node replication for disaster recovery across geographically dispersed locations.

    Total amount of backup data and data backed up per day are within the recommended limits of less than 400TB total and 30TB per day for each TSM server instance.

    Either a disk-to-disk backup should be configured (where the final destination of backup data is on a deduplicating disk storage pool), or data should reside in the FILE storage pool for a significant time (e.g., 30 days), or until expiration. The deduplication storage pools should not be used as a temporary staging pool before moving to tape or another non-deduplicating storage pool since this can be highly inefficient.

    Backup data should be a good candidate for data reduction through deduplication. This topic is covered in greater detail in later sections.

    High performance disk must be used for the TSM database to provide acceptable TSM deduplication performance.

    1.2.1 Traditional TSM architectures compared with deduplication architectures

    A traditional TSM architecture ingests data into disk storage pools, and moves this data to tape on a frequent basis to maintain adequate free space on disk for continued ingestion. An architecture that includes deduplication changes this model to store the primary copy of data in a sequential file storage pool for its entire life cycle. Deduplication provides enough storage savings to make keeping the primary copy on disk an affordable possibility.

Tape storage pools still have a place in this architecture for maintaining a secondary storage pool backup copy for disaster recovery purposes, or for data with very long retention periods, for example 7 years or forever. Other architectures are possible where data remains in deduplicated storage pools for only a portion of its life cycle, but this requires reconstructing the deduplicated objects and can defeat the purpose of spending the processing resources that are required to deduplicate the data.

    Tip: Avoid architectures where data is moved from a deduplicated storage pool to a non-deduplicated storage pool, which will force the deduplicated data to be reconstructed and lose the storage savings that were previously gained.

1.2.2 Examples of appropriate use of TSM deduplication

This section contains examples of TSM architectures that can make the most effective use of TSM deduplication.

    1.2.2.1 Deduplication with a secondary storage pool backup architecture


    In this example the primary storage pool is a file-sequential disk storage pool configured for TSM deduplication. The deduplication storage pool is backed up to a tape library copy storage pool using the storage pool backup capability. The use of a secondary copy storage pool is an optional feature which provides an extra level of protection against disk failure in your primary storage pool. Here are some general considerations when a copy storage pool will be used:

    Having a second copy storage pool using disk (which can also be deduplicated) is another option.

Server-side deduplication is a two-step process which includes duplicate identification followed by removal of the excess data during a subsequent data movement process such as reclamation or migration. The second step can be deferred until after a storage pool backup copy is created. See the description of the deduprequiresbackup option in a later section for additional considerations, and the configuration sketch after this list.

    When using server-side deduplication, schedule the storage pool backup process prior to the reclamation processing to ensure that there is minimal overhead when copying the data. After identify duplicates has run, the data is not deduplicated but it is redefined such that it can be reconstructed and dehydrated during the subsequent data movement operation. Sufficient time must be allotted for the scheduled storage pool backup to complete before the start of the schedule for reclamation.

    When using client-side deduplication, the storage pool backup processing will always occur after data has been deduplicated. This requires deduplicated data to be reconstructed during the copy (if the copy storage pool is not also deduplicated). The reconstruction processing can result in storage pool backup processing which is slower when compared with storage pool backup processing of data which has not been deduplicated. For planning purposes, estimate that the duration of storage pool backup will be doubled for data which is already deduplicated.
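As an illustration of the ordering described above, the server option DEDUPREQUIRESBACKUP can be set to YES so that duplicate chunks are not removed before a storage pool backup copy exists, and administrative schedules can be sequenced so that BACKUP STGPOOL completes before reclamation starts (the schedule names, pool names, start times, and threshold below are hypothetical):

    deduprequiresbackup yes

    define schedule stgbkup type=administrative cmd="backup stgpool deduppool copypool maxprocess=4" active=yes starttime=06:00
    define schedule reclaim type=administrative cmd="reclaim stgpool deduppool threshold=40" active=yes starttime=14:00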


1.2.2.2 Deduplication with node replication copy

The TSM 6.3 release provides a node replication capability, which allows for an alternative architecture where deduplicated data is replicated to a second server in an incremental fashion that takes advantage of deduplication by only replicating unique data not previously replicated. The reconstruction penalty described in a previous section for storage pool backup of deduplicated data is also avoided.

1.2.2.3 Disk-to-disk backup

Disk-to-disk backup refers to the scenario where the preferred backup storage device is disk-based, as opposed to tape or a virtual tape library (VTL). Disk-based backup has become more popular as the unit cost of disk storage has fallen. It has also become more common as companies distinguish between backup data, which is kept for a relatively short amount of time, and archive data, which has long term retention.

    Disk-to-disk backup still requires a backup of the storage pool data, and the backup or copy destination may be tape or disk. However, with disk-to-disk backup, the primary storage pool data remains on disk until it expires. A significant reduction of disk storage can be achieved if the primary storage pool is configured for deduplication.

1.2.3 Data characteristics for effective deduplication

When considering the use of TSM deduplication, you should assess whether the characteristics of the backup data are appropriate for deduplication. A more detailed description of data characteristics for deduplication is provided in the section on estimating deduplication efficiency. General types of structured and unstructured data are good candidates for deduplication, but if your backup data consists mostly of unique binary images or encrypted data, you may wish to exclude these data types from a management class that uses a deduplicated storage pool.

1.3 When is it not appropriate to use TSM deduplication?

TSM deduplication can provide significant benefits and cost savings, but it does not apply to all situations. The following situations are not appropriate for using TSM deduplication:

1.3.1 Primary storage of backup data is on VTL or physical tape

Movement to tape requires rehydration of the deduplicated data. This takes extra time and requires processing resources. If regular migration to tape is required, the benefits of using TSM deduplication may be reduced, since the goal is to reduce disk storage as the primary location of the backup data.

1.3.2 No flexibility with the backup processing window

TSM deduplication processing requires additional resources, which can extend backup windows or server processing times for daily backup activities. For example, a duplicate identification process must run for server-side deduplication. Additional reclamation activity is required to remove the duplicate data from a storage pool after the duplicate identification processing completes. For client-side deduplication, the client backup speed will generally be reduced for local clients (remote clients may not be impacted if there is a bandwidth constraint).

    If the backup window has already reached the limit for service level agreements, TSM deduplication could possibly impact the backup window further unless careful planning is done.


1.3.3 Restore performance considerations

Restore performance from deduplicated storage pools is slower than from a comparable disk storage pool that does not use deduplication. However, restore from a deduplicated storage pool can compare favorably to restore from tape devices for certain workloads.

    If fastest restore performance from disk is a high priority, then restore performance benchmarking should be done to determine whether the effects of deduplication can be accommodated. The following table compares the restore performance of small and large object workloads across several storage scenarios.

Storage pool type | Small object workload | Large object workload
Tape | Typically slower due to tape mounts and seeks | Typically faster due to streaming capabilities of modern tape drives
Non-deduplicated disk | Typically faster due to absence of tape mounts and quick seek times | Comparable to or slightly slower than tape
Deduplicated disk | Faster than tape, slower than non-deduplicated disk | Slowest, since data must be rehydrated, compared to tape, which is fast for streaming large objects that are not spread across many tapes


2 Resource requirements for TSM deduplication

TSM deduplication provides significant benefits as a result of its data reduction technology, particularly when combined with other data reduction techniques available with TSM. However, the use of deduplication in TSM adds additional requirements for hardware and database/log storage, which are essential for a successful implementation. When configuring TSM to use deduplication, you must ensure that proper resources have been allocated to support the use of the technology. The resources include hardware requirements necessary to meet the additional processing performed during deduplication, additional storage requirements for handling the TSM database records used to store the deduplication catalog, and additional storage requirements for the TSM server database logs.

    The TSM internal database plays a central role in enabling the deduplication technology. Deduplication requires additional database capacity to be available. In addition, there is a significant increase in the frequency of references to records in the database during many TSM operations including backup, restore, duplicate identification, reclamation, and expiration. These demands on the database require that the database disk storage be capable of sustaining higher rates of I/O operations than would be required without the use of deduplication.

    As a result, planning for resources used by the TSM database is critical for a successful deduplication deployment. This section guides you through the estimation of resource requirements to support TSM deduplication.

    2.1 Database and log size requirements

2.1.1 TSM database capacity estimation

Use of TSM deduplication significantly increases the capacity requirements of the TSM database. This section provides some guidelines for estimating the capacity requirements of the database. It is important to plan ahead for the database capacity so an adequate amount of higher-performing disk can be reserved for the database (refer to the next section for performance requirements).

    The estimation guidelines are approximate, since actual requirements will depend on many factors including ones that cannot be predicted ahead of time (for example, a change in the data backup rate, the exact amount of backup data, and other factors).

2.1.1.1 Planning database space requirements

The use of deduplication in TSM requires more storage space in the TSM server database than without the use of deduplication. One important point to note is that when using deduplication, the TSM database grows proportionally to the amount of data that is stored in deduplicated storage pools. This is because each chunk of data that is stored in a deduplicated storage pool is referenced by an entry in the database.

    Without deduplication, each backed-up object (typically a file) is referenced by a database entry, and the database grows proportionally to the number of objects that are stored. With deduplication, the database grows proportionally to the total amount of data backed up. The following table provides an example to illustrate this point:


| Number of objects stored | Amount of data being managed | Storage requirements
Without deduplication | 500 million | 200 TB | 475 GB *
With deduplication | 500 million | 200 TB | 2000 GB **

* Using rule-of-thumb of 1KB of database space per object stored
** Using rule-of-thumb of 100GB of database space per 10TB of data managed

    The document Determining the impact of deduplication on TSM server database and storage pools provides detailed information for estimating the amount of disk storage that will be required for your TSM database. The document provides formulas for estimating database size based on the volume of data to be stored.

    As a simplified rule-of-thumb for taking a rough estimate, you can plan for 100GB of database storage for every 10TB of data that will be protected in deduplicated storage pools.

2.1.1.2 Database reorganization

The TSM server uses a process called reorganization to remove fragmentation that can accumulate in the database over time. When deduplication is used, the number of database records increases significantly to store information for data chunks, and as data is expired on the TSM server, significant amounts of deletion occur within the database, increasing the need for reorganization. Reorganization can be processed on-line while the TSM server is running, or off-line while the server is halted. Depending on your server workloads, you might need to disable both table and index reorganization to maintain server stability and to reliably complete daily server activities. With reorganization disabled, if you experience unacceptable database growth or server performance degradation, you will need to plan offline reorganization for those tables.

    The TSM database sizing guidelines given in the previous section include additional space to accommodate database fragmentation that can grow the database used space between reorganizations.

    For additional information on best practices related to reorganization and data deduplication, see the technote titled Database size, database reorganization, and performance considerations for Tivoli Storage Manager V6 and V7 servers.

2.1.2 TSM database log size estimation

The use of deduplication adds additional requirements for the TSM server database, active log, and archive log storage. Properly sizing the storage capacity for these components is essential for a successful implementation of deduplication.

2.1.2.1 Planning active log space requirements

The database active log stores information about database transactions that are in progress. With deduplication, transactions can run longer, requiring more space to store the active transactions.

Tip: Use the maximum allowed size for the active log, which is 128GB.
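For example, the active log size is controlled through the ACTIVELOGSIZE server option, which is specified in MB, so the recommended maximum corresponds to the following entry in the server options file:

    activelogsize 131072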


2.1.2.2 Planning archive log space requirements

The archive log stores older log files for completed transactions until they are cleaned up as part of the TSM server database backup processing. The file system holding the archive log must be given sufficient capacity to avoid running out of space, which can cause the TSM server to be halted. Space is freed in the archive log every time a full backup is performed of the TSM server's database.

    See the document on Sizing the TSM archive log for detailed information on how to carefully calculate the space requirements for the TSM server archive log.

Tip: A file system with 500GB of free space has proven to be more than adequate for a large-scale TSM server that ingests several terabytes a day of new data into deduplicated storage pools and performs a full TSM database backup once a day. For servers that will ingest more than 4TB of new data each day, an archive log with 1TB of free space is recommended.

2.2 Estimating capacity for deduplicated storage pools

TSM deduplication ratios typically range from 2:1 (50% reduction) to 15:1 (93% reduction), and are data dependent. Lower ratios are associated with backups of unique data (e.g., progressive incremental data), and higher ratios are associated with backups that are repeated, such as repeated full backups of databases or virtual machine images. Mixtures of unique and repeated data will result in ratios within that range. If you aren't sure of what type of data you have and how well it will reduce, use 3:1 for planning purposes when comparing with non-deduplicated TSM storage pool occupancy. This ratio corresponds to an overall data reduction ratio of over 15:1 when factoring in the data reduction benefits of progressive incremental backups.

2.2.1 Estimating storage pool capacity requirements

2.2.1.1 Delayed release of storage pool data

Due to the latency for deletion of data chunks with multiple references, there is a need for transient storage associated with data chunks that must remain in a storage pool volume even though their associated file or object is deleted or expired. As a result of this behavior, storage pool capacity sizing must account for some percentage of data that is retained because of references by other objects. This latency can result in the delayed deletion of storage pool volumes.

2.2.1.2 Delayed effect of post-identification processing

Storage reduction does not always occur immediately with TSM deduplication. In the case of server-side deduplication, sufficient storage pool capacity is required to ingest the full amount of daily backup data. With server-side deduplication, removal of redundant data does not occur until after storage pool reclamation completes, which in turn may not complete until after a storage pool backup is done. If client-side deduplication is used, this delay will not apply. Sufficient storage pool free capacity must be maintained to accommodate continued backup ingestion.

    2.2.1.3 Estimating storage pool capacity requirements

    You can roughly estimate storage pool capacity requirements for a deduplicated storage pool using the following technique:

    Estimate the base size of the source data

    Estimate the daily backup size, using an estimated change and growth rate


    Determine retention requirements

    Estimate the total amount of source data by factoring in the base size, daily backup size, and retention requirements.

    Apply the deduplication ratio factor

    Uplift the estimate to consider transient storage pool usage

    The following example illustrates the estimation method:

Parameter | Value | Notes
Base size of the source data | 40TB | Data from all clients that will be backed up to the deduplicated storage pool
Estimated daily change rate | 2% | Includes new and changed data
Retention requirement | 30 days |
Estimated deduplication ratio | 3:1 | 3:1 assumes compression is used with client-side deduplication
Uplift for transient storage pool volumes | 30% |

Computed Values:

Parameter | Computation | Result
Base source data | 40TB | 40TB
Estimated daily backup amount | 40TB * 0.02 change rate | 0.8TB
Total changed data retained | 30 * 0.8TB daily backup | 24TB
Total data retained | 40TB base data + 24TB retained | 64TB
Retained data after deduplication (3:1 ratio) | 64TB / 3 | 21.3TB
Uplift for delays in chunk deletion (30%) | 21.3TB * 1.3 | 27.69TB
Add full daily backup amount | 27.69TB + 0.8TB | 28.49TB
Round up: Storage pool capacity requirement | | 29TB

2.3 Hardware recommendations and requirements

The use of deduplication requires additional processing, which increases the TSM server hardware requirements beyond what is required without the use of deduplication. The most critical hardware requirement when using deduplication is the I/O capability of the disk system that is used for the TSM database. You should begin by understanding the base hardware recommendations for the TSM server, which are described in the following documents: AIX, HPUX, Linux x86, Linux on Power, Linux on System z, Solaris, Windows.


    Additional hardware recommendations are made in the TSM Version 6 deployment guide: TSM V6 Deployment Recommendations

    The Optimizing Performance guide also provides configuration best practices for the use of deduplication.

2.3.1 Database I/O requirements

For optimal performance, fast disk storage is always recommended for the TSM database as measured in terms of Input/Output Operations Per Second (IOPS). Due to the random access I/O patterns of the TSM database, minimizing the latency of operations that access the database volumes is critical for optimizing the performance of the TSM server. The large tables used for storing deduplication information in the TSM database bring about an even more significant demand for disk storage that can handle a large number of IOPS.

    In general, systems based on solid-state disk technology and SAS/FC provide the best capabilities in terms of increased IOPS. Because the claims of disk manufacturers are not always reliable, we recommend measuring actual IOPS of a disk system before implementing a new TSM database.

    Details about how to configure high performing disk storage are beyond the scope of this document. The following key points should be considered when configuring disk storage for the TSM database:

    The disk used for the TSM database should be configured according to best practices for a transactional database.

    Low-latency, high-performance disk devices or storage subsystems should be used for the TSM database storage volumes and the active log. Slower disk technology is acceptable for the archive log.

Disk devices or storage systems that are capable of a minimum of approximately 3000 IOPS are suggested for the TSM database disk device. An additional 1000 IOPS per TB of daily ingested data (pre-deduplication) should be considered. Lower-performing disk devices can be used, but performance may not be optimal. Refer to the Deduplication FAQs for an example configuration.

    Disk I/O should be distributed over as many disk devices and controllers as possible.

TSM database and logs should be configured on separate disk volumes (LUNs), and should not share disk volumes with the TSM storage pool or any other application or file system (a sketch follows this list).
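A sketch of an initial server format that follows this separation, with hypothetical directory paths and multiple database directories to help distribute database I/O, might look like the following:

    dsmserv format dbdir=/tsmdb01,/tsmdb02,/tsmdb03,/tsmdb04 activelogsize=131072 activelogdirectory=/tsmlog archlogdirectory=/tsmarchlog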

2.3.1.1 Using flash storage for the TSM database

Lab testing has demonstrated a significant benefit to deduplication and node replication scalability when using flash storage for the TSM database. There are many choices available when moving to flash technology. Large ingest deduplication testing has been performed with the following classes of flash-based storage:

    Flash acceleration using in-server PCIe adapters. For example, the High IOPS MLC (Multi Level Cell) and Enterprise Value Flash adapters for IBM SystemX Servers, available in capacities from 365 GB to 3.2 TB. These adapters appear as block storage in the operating system, and provide persistent, low-latency storage.

    Solid state drive modules (SSDs) as part of a disk array. For example, the SSD options available with the IBM Storwize family of disk arrays are currently available with capacities of 400GB and 800GB each and can be used to build arrays of larger capacity.

    Flash memory appliances which provide a solution where flash storage can be shared across more than one TSM server. For example, the IBM FlashSystem family of products that are currently available in sizes ranging from 5 TB to 20TB.

    Document: Effective Planning and use of TSM V6 and V7 DeduplicationDate: 12/09/2013

    Version: 2.0

    Page 21 of 50

  • Effective Planning and Use of TSM V6 and V7 Deduplication

    The following are some general guidelines to consider when implementing the TSM database using solid-state storage technologies:

    Solid state storage provides the most significant benefit for the database containers and active log. Testing has demonstrated a substantial improvement from moving the active log to solid state storage versus moving only the database containers.

    There is no substantial benefit to placing the archive log on solid state storage.

    Although a costly design decision, testing has demonstrated a 5-10% improvement to daily ingest capabilities when using RAID10 for the database container arrays rather than RAID5.

    When using solid state technology for the database, faster storage pool disk such as SAS 10K may be required to gain the full benefit of the faster database storage. This is particularly true when using server-side deduplication.

Faster database access from using solid state technology allows pushing the parallelism to the limit with tuning parameters for tasks such as backup sessions, identify duplicates processes, reclamation processes, and expire inventory resources (see the example commands after this list).
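For example, the degree of parallelism referred to above is controlled through storage pool parameters and command options such as the following (the pool name and values are hypothetical and should be sized to the hardware):

    update stgpool deduppool identifyprocess=12 reclaimprocess=8
    expire inventory resource=8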

2.3.2 CPU

The use of deduplication requires additional CPU resources on the TSM server, particularly for performing the task of duplicate identification. You should consider using a minimum of 8 (2.2GHz or equivalent) processor cores in any TSM server that is configured for deduplication. The following table provides CPU recommendations for different ranges of daily ingest.

Daily ingest | Recommended CPU cores
Up to 4TB | 12
4TB to 8TB | 16
8TB to 30TB | 32

2.3.3 Memory

For the highest performance of a large-scale TSM server using deduplication, additional memory is recommended. The memory is used to optimize the frequent lookup of deduplication chunk information stored in the TSM database.

    A minimum of 64GB of system memory should be considered for TSM servers using deduplication. If the retained capacity of backup data grows, the memory requirement may need to be as high as 192GB. It is beneficial to monitor memory utilization on a regular basis to determine if additional memory is required. The following table provides system memory guidance for different ranges of daily ingest.


Daily ingest | Recommended system memory
Up to 4TB | 64GB
4TB to 8TB | 128GB
8TB to 30TB | 192GB

2.3.4 Considerations for the storage pool disk

The speed of the disk technology used for the deduplicated storage pool also has significant implications for the overall performance of a deduplication solution. In general, using cheaper disk such as SATA is desirable for the storage pool to keep the overall cost down. To prevent the use of slower disk technology from impacting performance, it is important to distribute the storage pool I/O across a very large number of disks. This can be accomplished by:

    1. Creating a large number of volumes within the storage array. Although the optimal number of volumes is dependent upon the environment, testing has shown that 32 volumes can provide an effective configuration.

    2. Presenting all of these volumes as file systems to the TSM server's device class definition, so that I/O from activities such as backup ingestion is distributed across all of the volumes in parallel.

    3. Pushing the parallelism of tasks such as duplicate identification and reclamation to the upper limits by using the options that control the number of processes for those tasks, which drives I/O across all of the disks in the storage pool. More information on this topic follows in later sections.

    For systems that will handle very large daily ingests beyond 8TB per day, faster SAS or FC 10K disk is recommended for the storage pool disk. This is particularly true when using server-side deduplication to accommodate the additional I/O required for identify duplicates and reclamation processing.

    2.3.5 Hardware requirements for TSM client deduplication

    Client-side deduplication (and compression, if used with deduplication) requires resources on the client system for processing. Before deciding to use client-side deduplication, verify that client systems have adequate resources available during the backup window to perform the deduplication processing. A suggested minimum CPU requirement is the equivalent of one 2.2 GHz CPU core per backup process with client-side deduplication. As an example, a system with a single-socket, quad-core, 2.2 GHz processor that is utilized 75% or less during the backup window would be a good candidate for client-side deduplication. One CPU core should also be planned for each parallel backup stream within a process for client types that support this, such as TSM for Virtual Environments. Testing has demonstrated that adding CPU sockets and adding more CPU cores per socket provide a similar benefit in lowering CPU usage during client-side deduplication.

    There is no significant additional memory requirement for client systems that use client-side deduplication.

    3 Implementation guidelines

    A successful implementation of TSM deduplication requires careful planning in the following areas:

    Implementing an appropriate architecture suitable for using deduplication

    Properly sizing your TSM server hardware and storage

    Configuring TSM following best practices for separating data ingestion and data maintenance tasks

    3.1 Deciding between client and server deduplication

    After you decide on an architecture using deduplication for your TSM server, you need to decide whether you will perform deduplication on the TSM clients, on the TSM server, or using a combination of the two. The TSM deduplication implementation allows storage pools to manage deduplication performed by both clients and the TSM server. The server is optimized to only perform deduplication on data that has not been deduplicated by the TSM clients. Furthermore, duplicate data can be identified across objects regardless of whether the deduplication is performed on the client or server. These benefits allow for hybrid configurations that efficiently apply client-side deduplication to a subset of clients and use server-side deduplication for the remaining clients.
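
    Whether a given node is allowed to deduplicate data on the client is controlled at the node level on the server. The following is a minimal sketch of a hybrid configuration; the node names NODE_A and NODE_B are placeholders used only for illustration:

    > update node NODE_A deduplication=clientorserver
    > update node NODE_B deduplication=serveronly

    With these settings, NODE_A may deduplicate data at the client when its option file enables it, while data from NODE_B is deduplicated only by the server.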

    Typically a combination of both client-side and server-side data deduplication is the most appropriate. Here are some further points to consider:

    Server-side deduplication is a two-step process of duplicate data identification followed by reclamation to remove the duplicate data. Client-side deduplication stores the data directly in a deduplicated format, eliminating the need for the extra reclamation processing.

    Deduplication on the client can be combined with compression to provide the largest possible storage savings.

    Client-side deduplication processing can increase backup durations. If network bandwidth is not the limiting factor, expect longer backups; a doubling of backup duration is a reasonable estimate when using client-side deduplication in an environment that is not constrained by the network. In addition, if you will be creating a secondary copy using storage pool backup where the copy storage pool is not using deduplication, the data movement will take longer due to the extra processing required to reconstruct the deduplicated data.

    Client-side deduplication can outperform server-side deduplication with a high-performing TSM server configuration and a low-latency network connection between the clients and server. In addition, when combining deduplication with node replication, client-side deduplication stores data on the TSM server in a deduplicated state that is ready for immediate replication that will take advantage of the node replication ability to conserve bandwidth by not sending data chunks that have previously been replicated.

    Client-side deduplication can place a significant load on the TSM server in cases where a large number of clients are simultaneously driving deduplication processing. The load is a result of the TSM server processing duplicate chunk queries from the clients. Server-side deduplication, on the other hand, typically has a relatively small number of identification processes running in a controlled fashion.

    Client-side deduplication cannot be combined with LAN-free data movement using the Tivoli Storage Manager for SAN feature. If you are implementing one of TSM's supported LAN-free to disk solutions, you can still consider using server-side deduplication.

    Tips: Perform deduplication at the client in combination with compression in the following circumstances (a brief option sketch follows the list):

    1. Your backup network speed is a bottleneck.

    2. Increased backup durations can be tolerated, and the maximum storage savings is more important than having the fastest possible backup elapsed times.

    3. V6 servers only: the client does not typically send objects larger than 500GB in size, or client configuration options can be used to break up large objects into smaller objects. These options are discussed in a later section.
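
    As a reference, here is a minimal sketch of the client options involved. The option file location (dsm.sys on UNIX and Linux clients, dsm.opt on Windows clients) and any surrounding stanza are environment specific and shown only for illustration:

    * Illustrative client option settings - adjust for your environment
    DEDUPLICATION YES
    COMPRESSION   YES

    The node must also be permitted to perform client-side deduplication on the server, as shown in the earlier update node sketch.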

    3.2 TSM Deduplication configuration recommendations

    3.2.1 Recommendations for deduplicated storage pools

    The TSM deduplication feature is enabled at the storage pool level. The TSM server can be configured with more than one deduplicated storage pool, but duplicate data will not be identified across different storage pools. In most cases, using a single large deduplicated storage pool is recommended.

    The following commands provide an example of setting up a deduplicated storage pool on the TSM server. Some parameters are explained in further detail to give the rationale behind the values used, and later sections build upon those settings.

    3.2.1.1 Device class

    A device class is used to define the storage that will be used for sequential file volumes by the deduplicated storage pool. Each of the directories specified should be backed by a separate file system, which corresponds to a distinct logical volume on the disk storage subsystem. By using multiple directories backed by different storage elements on the subsystem, the TSM round-robin implementation for volume allocation is able to achieve more throughput by spreading I/O across a large pool of physical disks.

    Here are some considerations for parameters with the DEFINE DEVCLASS command:

    The mountlimit parameter limits the number of volumes that can be simultaneously mounted by all storage pools that use this device class. Typically client sessions sending data to the server use the most mount points, so you will want to set this parameter high enough to handle the expected number of simultaneous client sessions.

    This parameter needs to be set very high for deduplicated storage pools to avoid having client session and server processes waiting for available mount points. The setting is influenced by the numopenvolsallowed option, which is discussed in a later section. To estimate the setting of this option, use the following formula where numprocs is the largest number of processes used for a data copy/movement task such as reclamation and migration:

    mountlimit = (numprocs * numopenvolsallowed) + max_backup_sessions + (restore_sessions * numopenvolsallowed) + buffer
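
    As an illustration only, assume numprocs=32, numopenvolsallowed=20, 500 expected backup sessions, 10 concurrent restore sessions, and a buffer of 100. The formula then yields:

    mountlimit = (32 * 20) + 500 + (10 * 20) + 100 = 1440

    The values for each term depend entirely on your environment, and a larger setting (as in the device class example below) may be appropriate.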

    The maxcapacity parameter controls the size of each file volume that will be created for your storage pool. This parameter takes some planning. The goal is to avoid volumes that are so small that they cause frequent end-of-volume processing and force larger objects to span multiple volumes, and to avoid volumes that are so large that not enough writable volumes are available to handle your expected number of client backup sessions. The following example shows a volume size of 50GB, which has proven to be optimal in many environments.

    > define devclass dedupfile devtype=file mountlimit=4000 maxcapacity=50G directory=/tsmdedup1,/tsmdedup2,/tsmdedup3,/tsmdedup4,...,/tsmdedup32

    3.2.1.2 Storage pools

    The storage pool is the repository for deduplicated storage and uses the device class previously defined. An example command for defining a deduplicated storage pool is given below, with explanations for parameters that vary from the defaults. There are two methods for allocating volumes in a file-based storage pool. With the first method, volumes are pre-allocated and remain assigned to the same storage pool after they are reclaimed. The second method uses scratch volumes, which are allocated as needed and return to the scratch pool once they are reclaimed. The example below sets up a storage pool using scratch volumes, because this approach is more convenient and has been shown in testing to distribute the load more efficiently across multiple storage containers within a disk subsystem.

    The deduplicate parameter is required to enable deduplication for the storage pool.

    The maxscratch parameter defines the maximum number of volumes that can be created for the storage pool. This parameter applies when using the scratch method of volume allocation and should be set to 0 when using pre-allocated volumes. Each volume will have a size determined by the maxcapacity parameter of the device class. In this example, 200 volumes at 50GB per volume require that 10TB of free space be available across the 32 file systems used by the device class.

    The identifyprocess parameter is set to 0 to prevent duplicate identification processes from starting automatically. This supports scheduling when duplicate identification runs, which is described in more detail in a later section.

    The reclaim parameter is set to 100 to prevent automatic storage pool reclamation from running. This supports the best practice of scheduling when reclamation runs, which is described in more detail in a later section. The actual threshold used for reclamation is defined as part of the scheduled reclamation command which is defined in a later section.

    The reclaimprocess parameter is set to a value higher than the default of 1, because a deduplicated storage pool requires a large volume of reclamation processing to keep up with the daily ingestion of new backups. As a rule of thumb, allow one process for every file system defined to the device class. The example value of 32 is likely to be sufficient for large-scale implementations, but you may need to tune this setting after monitoring system usage during reclamation.

    > define stgpool deduppool dedupfile maxscratch=200 deduplicate=yes identifyprocess=0 reclaim=100 reclaimprocess=32

    3.2.1.3 Policy settings

    The final configuration step involves defining policy settings on the TSM server that allow data to be ingested directly into the deduplicated storage pool that was created. Policy requirements vary for each customer, but the following example shows a policy that retains extra backup versions for 30 days.

    > define domain DEDUPDISK
    > define policyset DEDUPDISK POLICY1
    > define mgmtclass DEDUPDISK POLICY1 STANDARD
    > assign defmgmtclass DEDUPDISK POLICY1 STANDARD
    > define copygroup DEDUPDISK POLICY1 STANDARD type=backup destination=DEDUPPOOL VEREXISTS=nolimit VERDELETED=10 RETEXTRA=30 RETONLY=80
    > define copygroup DEDUPDISK POLICY1 STANDARD type=archive destination=DEDUPPOOL RETVER=30
    > activate policyset DEDUPDISK POLICY1

    3.2.2 Recommended options for deduplication

    The server has several tuning options that control deduplication processing. The following list summarizes these options, with an explanation of each and a note on those for which we recommend overriding the default value.

    Option: DedupRequiresBackup
    Allowed values: Yes | No (default: Yes)
    Recommended value: Default
    Explanation: This option delays the completion of server-side deduplication processing until after a secondary copy of the data has been made with storage pool backup. This option does not influence whether client-side deduplication is performed. The TSM server offers many levels of protection, including the ability to create a secondary copy of your data. Creating a secondary copy is optional, but is always a best practice for any storage pool, regardless of whether it is deduplicated. Note: See the section that follows for additional information regarding this option.

    Option: ClientDedupTxnLimit
    Allowed values: Min: 32, Max: 2048 (default: 300)
    Recommended value: Default
    Explanation: Specifies the largest object size, in gigabytes, that can be processed using client-side deduplication. This can be increased up to 2TB, but this does not guarantee that the TSM server will be able to process objects of this size in all environments.

    Option: ServerDedupTxnLimit
    Allowed values: Min: 32, Max: 2048 (default: 300)
    Recommended value: Default
    Explanation: Specifies the largest object size, in gigabytes, that can be processed using server-side deduplication. This can be increased up to 2TB, but this does not guarantee that the TSM server will be able to process objects of this size in all environments.

    Option: DedupTier2FileSize
    Allowed values: Min: 20, Max: 9999 (default: 100)
    Recommended value: Default
    Explanation: Changing the default tier settings is not recommended. Small changes may be tolerated, but avoid frequent changes to these settings, because changes will prevent matches between previously ingested backups and future backups.

    Option: DedupTier3FileSize
    Allowed values: Min: 90, Max: 9999 (default: 400)
    Recommended value: Default
    Explanation: Changing the default tier settings is not recommended. Small changes may be tolerated, but avoid frequent changes to these settings, because changes will prevent matches between previously ingested backups and future backups.

    Option: NumOpenVolsAllowed
    Allowed values: Min: 3, Max: 999 (default: 10)
    Recommended value: 20
    Explanation: This option controls the number of volumes that a process, such as reclamation, or a client session can hold open at the same time. A small increase to this option is recommended, and some trial and error may be needed. Note: The device class mountlimit parameter may need to be increased if this option is increased.

    Option: EnableNasDedup
    Allowed values: Yes | No (default: No)
    Recommended value: Default
    Explanation: If you are using NDMP backup of NetApp file servers in your environment, change this option to Yes.
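
    As a reference, the following is a minimal sketch of how these options might appear in the server options file (dsmserv.opt). The values shown simply follow the recommendations above; adjust them for your environment, and note that options file changes generally take effect when the server is restarted.

    * Illustrative values only - adjust for your environment
    NUMOPENVOLSALLOWED  20
    DEDUPREQUIRESBACKUP YES
    ENABLENASDEDUP      NO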

    3.2.2.1 Additional information regarding the deduprequiresbackup option

    Backup of the TSM primary storage pool is optional, as determined by environment-specific risk mitigation requirements. The summary that follows shows the appropriate value for the deduprequiresbackup option in different situations. In the case of a non-deduplicated copy storage pool, the storage pool backup should be performed prior to running the reclamation processing. If storage pool backup is performed after the reclamation processing (or with client-side deduplication), the copy process will take longer, because it requires the deduplicated data to be reassembled into full objects.

    Architecture for the secondary copy of backup data, and the appropriate setting for deduprequiresbackup:

    A secondary copy is created using the storage pool backup capability to a non-deduplicated copy pool, such as a copy pool using tape: Yes

    A secondary copy is created using the storage pool backup capability to a deduplicated copy pool: No

    No secondary copy is created: No

    A secondary copy is created on another TSM server using the node replication feature: No

    3.2.3 Best practices for ordering backup ingestion and data maintenance tasks

    A successful implementation of deduplication with TSM requires separating the tasks of ingesting client data and performing server data maintenance tasks into separate time windows. Furthermore, the server data maintenance tasks have an optimal ordering, and in some cases need to be performed without overlap to avoid resource contention problems.

    TSM can schedule all of these activities to follow these best practices. When storage pool backup is used in combination with server-side deduplication, there is a variation on the recommended ordering that delays duplicate identification; this allows the fastest possible backup ingestion on systems that are I/O constrained and cannot handle overlapping backup ingestion with duplicate identification. The same variation can be followed any time server-side deduplication is used and the fastest possible backup ingestion is desired. A later section provides sample commands that implement the ordering with delayed duplicate identification through command scripts and schedules. Two suggested task sequences, A and B, are described; use the summary below to determine the preferred sequence.

    Client-side deduplication, with or without node replication, with or without storage pool backup to a non-deduplicated copy pool: sequence A

    Server-side deduplication, no node replication, no storage pool backup to a non-deduplicated copy pool: sequence A

    Server-side deduplication, with node replication, no storage pool backup to a non-deduplicated copy pool: sequence A

    Server-side deduplication, no node replication, with storage pool backup to a non-deduplicated copy pool: sequence B, if the fastest possible backup ingest is required

    Note that this list focuses on tasks pertinent to deduplication. Consult the product documentation for additional commands that you may also need to include in the daily maintenance tasks.

    3.2.3.1 Suggested task sequence A

    1. The following tasks can run in parallel:

    a. Client data ingestion.

    b. Perform server-side duplicate identification by running the IDENTIFY DUPLICATES command. This processes data that was not already deduplicated on the clients.

    2. Optional: Create the secondary disaster recovery (DR) copy using the REPLICATE NODE command or the BACKUP STGPOOL command.

    3. Create a DR copy of the TSM database by running the BACKUP DATABASE command. Following the completion of the database backup, the DELETE VOLHISTORY command can be used to remove older versions of database backups which are no longer required.

    4. Remove objects that have exceeded their allowed retention using the EXPIRE INVENTORY command.

    5. Reclaim unused space that has been released from storage pool volumes through deduplication and inventory expiration by using the RECLAIM STGPOOL command.

    6. Back up the volume history and device configuration using the BACKUP VOLHISTORY and BACKUP DEVCONFIG commands.

    3.2.3.2 Suggested task sequence B

    1. Client data ingestion.

    2. Create the secondary disaster recovery (DR) copy using the BACKUP STGPOOL command.

    3. Create a DR copy of the TSM database by running the BACKUP DATABASE command. Following the completion of the database backup, the DELETE VOLHISTORY command can be used to remove older versions of database backups which are no longer required.

    4. Perform server-side duplicate identification by running the IDENTIFY DUPLICATES command. This processes data that was not already deduplicated on the clients.

    5. Remove objects that have exceeded their allowed retention using the EXPIRE INVENTORY command.

    6. Reclaim unused space that has been released from storage pool volumes through deduplication and inventory expiration by using the RECLAIM STGPOOL command.

    7. Back up the volume history and device configuration using the BACKUP VOLHISTORY and BACKUP DEVCONFIG commands.

    3.2.3.3 Define scripts that run each required maintenance task

    The following scripts, once defined, can be called by scheduled administrative commands. Here are a few points to note regarding these scripts:

    The storage pool backup script assumes you have already defined a copy storage pool named copypool, which uses tape storage. NOTE: Storage pool backup is optional, as determined by environment-specific risk mitigation requirements.

    The database backup script requires a device class that typically also uses tape storage.

    The script for reclamation gives an example of how the parallel command can be used to simultaneously process more than one storage pool.

    The number of processes to use for identifying duplicates should not exceed the number of CPU cores available on your TSM server. This command also does not have a wait=yes parameter, so it is necessary to define a duration limit.

    If you have a large TSM database, you can further optimize the BACKUP DATABASE command by using multiple streams with TSM 6.3 and later.

    A deduplicated storage pool is typically reclaimed to a threshold lower than the default of 60 to allow more of the identified duplicate chunks to be removed. Some experimenting will be needed to find a value that can be completed within the available time. Tip: A reclamation setting of 40 or less is usually sufficient.

    define script STGBACKUP "/* Run stg pool backups */"
    update script STGBACKUP "backup stgpool DEDUPPOOL copypool maxprocess=10 wait=yes" line=020

    define script DEDUP "/* Run identify duplicate processes */"

    update script DEDUP "identify duplicates DEDUPPOOL numprocess=12 duration=660" line=010

    set dbrecovery TAPEDEVC numstreams=3

    define script DBBACKUP "/* Run DB backups */"
    update script DBBACKUP "backup db devclass=TAPEDEVC type=full numstreams=3 wait=yes" line=010
    update script DBBACKUP "if(error) goto done" line=020
    update script DBBACKUP "backup volhistory" line=030
    update script DBBACKUP "backup devconfig" line=040
    update script DBBACKUP "delete volhistory type=dbbackup todate=today-7 totime=now" line=050
    update script DBBACKUP "done:exit" line=060

    define script RECLAIM "/* Run stg pool reclamation */"
    update script RECLAIM "parallel" line=010
    update script RECLAIM "reclaim stgpool DEDUPPOOL threshold=40 wait=yes" line=020
    update script RECLAIM "reclaim stgpool COPYPOOL threshold=60 wait=yes" line=030

    define script EXPIRE "/* Run expiration processes. */"
    update script EXPIRE "expire inventory resources=8 wait=yes" line=010

    3.2.3.4 Define schedules to run the data maintenance tasks

    The TSM server has the ability to schedule commands to run, where the scheduled action is to run the various scripts that were defined in the previous sections. The examples below give specific start times that have proven to be successful in environments where backups run from midnight until 07:00 AM on the same day. You will need to change the start times to appropriate values for your environment.

    NOTE: Storage pool backup is optional, as determined by environment-specific risk mitigation requirements.

    define schedule STGBACKUP type=admin cmd="run STGBACKUP" active=yes \
      desc="Run all stg pool backups." startdate=today starttime=08:00:00 \
      duration=15 durunits=minutes period=1 perunits=day

    define schedule DEDUP type=admin cmd="run DEDUP" active=yes \
      desc="Run identify duplicates." startdate=today starttime=00:00:00 \
      duration=15 durunits=minutes period=1 perunits=day

    define schedule EXPIRATION type=admin cmd="run expire" active=yes \
      desc="Run expiration." startdate=today starttime=14:00:00 \
      duration=15 durunits=minutes period=1 perunits=day

    define schedule DBBACKUP type=admin cmd="run DBBACKUP" active=yes \
      desc="Run database backup." startdate=today starttime=12:00:00 \
      duration=15 durunits=minutes period=1 perunits=day

    define schedule RECLAIM type=admin cmd="run RECLAIM" active=yes \
      desc="Reclaim space from storage pools." startdate=today starttime=16:00 \
      duration=15 durunits=minutes period=1 perunits=day
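
    To confirm that these maintenance tasks start and complete as scheduled, you can review the administrative schedule events. The following is a minimal sketch of such a check; the date range is illustrative:

    > query event * type=administrative begindate=today-1 enddate=today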

    4 Estimating deduplication savings

    If you ask someone in the data deduplication business to give you an estimate of the amount of savings to expect for your specific data, the answer will often be "it depends." The reality is that TSM, like every other data protection product, cannot guarantee a certain level of deduplication, because there are a variety of factors unique to your data that influence the results.

    Since deduplication requires computational resources, it is important to consider which environments and circumstances can benefit most from deduplication, and when other data reduction techniques may be more appropriate. What we can do is provide an understanding of the factors that influence deduplication effectiveness when using TSM, and provide some examples of observed behaviors for specific types of data, which can be used as a reference for planning purposes.
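
    For an existing deduplicated storage pool, the savings actually achieved can be checked directly on the server. The following is a minimal sketch, assuming the storage pool name used in the earlier examples:

    > query stgpool DEDUPPOOL format=detailed

    The detailed output includes fields that report the deduplication savings for the pool, which can be compared against planning estimates such as those discussed below.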

    4.1 Factors that influence the effectiveness of deduplication

    The following factors influence how effectively TSM reduces the amount of data to be stored when using deduplication.

    4.1.1 Characteristics of the data

    4.1.1.1 Uniqueness of the data

    The first factor to consider is the uniqueness of the data. Much of the deduplication savings comes from repeated backups of the same objects. Some savings, however, result from having data in common with backups of other objects or even within the same object. The uniqueness of the data is the portion of an object that has never been stored by a previous backup. Duplicate data can be found within the same object, across different objects stored by the same client, and in objects stored by different clients.

    4.1.1.2 Response to fingerprinting

    The next factor is how data responds to the deduplication fingerprinting processing used by TSM. During deduplication, TSM breaks objects into chunks, which are examined to determine whether they have been previously stored. These chunks are variable in size and are identified using a process called fingerprinting. The purpose of fingerprinting is to ensure that the same chunk will always be identified, regardless of whether it shifts to different positions within the object between successive backups.

    The TSM fingerprinting implementation uses a probability-based algorithm for identifying chunk boundaries within an object. The algorithm strives to have the chunks created for an object average out to a target chunk size. The actual size of each chunk is variable, within the constraints that it must be larger than the minimum chunk size and cannot be larger than the object itself. The fingerprinting implementation results in average chunk sizes that vary for different kinds of data. For data that fingerprints to average chunk sizes significantly larger than the target average, deduplication efficiency is more sensitive to changes. More details are given in the later section that discusses tiering.

    4.1.1.3 Volatility of the data

    The final factor is the volatility of the data. A significant amount of deduplication savings is a result of the fact that similar objects are backed up repeatedly over time. Objects that undergo only minor changes between backups will end up having a significant percentage of chunks that are unchanged since the last backup, and hence do not need to be stored again. Conversely, an object can undergo a pattern of change that alters a large percentage of the chunks in the object; in these cases, there is very little savings realized by deduplication. It is important to note that this effect does not necessarily relate to the amount of data being written to an object. Instead, it is a factor of how pervasively the changes are scattered throughout the object.

    Some change patterns, such as appending new data at the end of an object, respond very favorably to deduplication.

    4.1.1.4 Examples of workloads that respond well to deduplication

    The following are general examples of backup workloads that respond well to deduplication:

    Backup of workstations with multiple copies or versions of the same file.

    Backup of objects with regions that repeat the same chunks of data (for example, regions with zeros).

    Multiple full backups of different versions of the same database.

    Operating system files across multiple systems. For example, Windows systemstate backup is a common source of duplicate data. Another example is virtual machine image backups with TSM for Virtual Environments.

    Backup of workstations with versions or copies of the same application data (for example, documents, presentations, or images).

    Periodic full backups of systems taken under a new node name for the purpose of creating an out-of-cycle backup with special retention criteria.

    4.1.1.5 Deduplication efficiency of some data types

    The following table shows some common data types along with their expected deduplication efficiency.

    Data type                                                              Deduplication efficiency
    Audio (mp3, wma), video (mp4), images (jpeg)                           Poor
    Human-generated/consumer data: text documents, source code            Good
    Office documents: spreadsheets, presentations                          Poor
    Common operating system files                                          Good
    Large repeated backups of databases (Oracle, DB2, etc.)               Good
    Objects with embedded control structures                               Poor
    TSM data stored in non-native storage pools (for example, NDMP data)  None

    4.1.2 Impacts from backup strategy decisions

    The gains realized from deduplication are also influenced by