understanding deduplication
TRANSCRIPT 7/29/2019
Understanding the HP Data Deduplication Strategy
Why one size doesn't fit everyone
Table of contents
Executive Summary
Introduction
  A word of caution
Customer Benefits of Data Deduplication
  A word of caution
Understanding Customer Needs for Data Deduplication
HP Accelerated Deduplication for the Large Enterprise Customer
  Issues Associated with Object-Level Differencing
  What Makes HP Accelerated Deduplication unique?
HP Dynamic Deduplication for Small and Medium IT Environments
  How Dynamic Deduplication Works
  Issues Associated with Hash-Based Chunking
Low-Bandwidth Replication Usage Models
Why HP for Deduplication?
Deduplication Technologies Aligned with HP Virtual Library Products
Summary
Appendix A: Glossary of Terminology
Appendix B: Deduplication compared to other data reduction technologies
For more information
Executive Summary
Data deduplication technology represents one of the most significant storage enhancements in recent years, promising to reshape future data protection and disaster recovery solutions. Data deduplication offers the ability to store more on a given amount of storage and to replicate data over lower-bandwidth links at a significantly reduced cost.
HP offers two complementary deduplication technologies that meet very different customer needs:

Accelerated deduplication (object-level differencing) for the high-end enterprise customer who requires:
- Fastest possible backup performance
- Fastest restores
- Most scalable solution possible in terms of performance and capacity
- Multi-node low-bandwidth replication
- High deduplication ratios
- Wide range of replication models

Dynamic deduplication (hash-based chunking) for the midsize enterprise and remote office customers who require:
- Lower cost device through a smaller RAM footprint and optimized disk usage
- A fully integrated deduplication appliance with lights-out operation
- Backup application and data type independence for maximum flexibility
- Wide range of replication models
This whitepaper explains how HP deduplication technologies work in practice, the pros and cons of each approach, when to choose a particular type, and the types of low-bandwidth replication models HP plans to support.
Why HP for Deduplication?

The HP Virtual Library System (VLS) incorporates Accelerated deduplication technology that delivers high-performance deduplication for enterprise customers. HP is one of the few vendors to date with an object-level differencing architecture that combines the virtual tape library and the deduplication engine in the same appliance. Our competitors with object-level differencing use a separate deduplication engine and VTL, which tends to be inefficient, as data is shunted between the two appliances, as well as expensive.

HP D2D (Disk to Disk) Backup Systems use Dynamic deduplication technology that provides a significant price advantage over our competitors. The combination of HP patents allows optimal RAM and disk usage, intelligent chunking, and minimal paging. This, together with the cost benefits of using HP industry-standard ProLiant servers, sets a new price point for deduplication appliances.
HP D2D Backup Systems and VLS virtual libraries provide deduplication ratio monitoring, as can be seen in the following screenshots.
Figure 1. Deduplication ratio screens on HP VLS and D2D devices
Introduction
Over recent years, virtual tape libraries have become the backbone of a modern data protection strategy because they offer:
- Disk-based backup at a reasonable cost
- Improved backup performance in a SAN environment, because new resources (virtual tape drives) are easier to provision
- Faster single-file restores than physical tape
- Seamless integration into an existing backup strategy, making it low risk
- The ability to offload or migrate the data to physical tape for off-site disaster recovery or for long-term archiving
Because virtual tape libraries are disk-based backup devices with a virtual file system, and the backup process itself tends to have a great deal of repetitive data, virtual tape libraries lend themselves particularly well to data deduplication. In storage technology, deduplication essentially refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, an index of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored.
The amount of duplicate data that can typically be removed from a particular data type is estimated to be as follows:

PACS: 5%
Web and Microsoft Office data: 30%
Engineering data directories: 35%
Software code archive: 45%
Technical publications: 52%
Database backup: 70% or higher
In the above example, PACS is Picture Archiving and Communication Systems, a type of data used in X-rays and medical imaging. These have very little duplicate data. At the other end of the spectrum, databases contain a lot of redundant data: their structure means that there will be many records with empty fields or the same data in the same fields.
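As a rough illustration of how these estimates translate into capacity, the expected stored size for a data set can be computed from its typical duplicate fraction (a minimal sketch; the percentages are the estimates quoted above, and the function is illustrative, not an HP tool):

```python
# Typical removable duplicate fraction per data type (estimates from the text above).
DUPLICATE_FRACTION = {
    "PACS": 0.05,
    "Web and Microsoft Office data": 0.30,
    "Engineering data directories": 0.35,
    "Software code archive": 0.45,
    "Technical publications": 0.52,
    "Database backup": 0.70,  # "or higher"
}

def deduplicated_size_gb(data_type: str, size_gb: float) -> float:
    """Estimated on-disk size after removing the typical duplicate fraction."""
    return size_gb * (1.0 - DUPLICATE_FRACTION[data_type])
```

For example, a 1,000 GB database backup would be expected to occupy roughly 300 GB after deduplication alone (before any compression is applied).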
With a virtual tape library that has deduplication, the net effect is that, over time, a given amount of disk storage capacity can hold more data than is actually sent to it. To work, deduplication needs the random-access capability offered by disk-based backup. This is not to say physical tape is dead; tape is still required for archiving and disaster recovery, and both disk and tape have their own unique attributes in a comprehensive data protection solution.

The capacity optimization offered by deduplication is dependent on:
- Backup policy (full, incremental)
- Retention periods
- Data change rate
Figure 2. A visual explanation of deduplication ("Why use Deduplication?"): a chart of TBs stored versus months, comparing data stored on the device with data sent to it over a 12-month period.
A word of caution

Some people view deduplication with the approach of "That is great! I can now buy less storage," but it does not work like that. Deduplication is a cumulative process that can take several months to yield impressive deduplication ratios. Initially, the amount of storage you buy has to be sized to reflect your existing backup tape rotation strategy and the expected data change rate within your environment. HP has developed deduplication sizing tools to assist with deciding the amount of storage capacity with deduplication that is required. However, these tools do rely on the customer having a degree of knowledge of the data change rate in their systems.
HP Backup Sizer Tool
Deduplication has become popular because, as data growth soars, the cost of storing data also increases, especially backup data on disk. Deduplication reduces the cost of storing multiple backups on disk. Deduplication is the latest in a series of technologies that offer space saving to a greater or lesser degree. To compare deduplication with other data reduction, or space saving, technologies please look at Appendix B.

A worked example of deduplication is illustrated as follows:
http://h30144.www3.hp.com/SWDSizerWeb/default.htm
Figure 3. A worked example of deduplication for file system data over time

Retention policy:
- 1 week, daily incrementals (5)
- 6 months, weekly fulls (25)

Data parameters:
- Data compression rate = 2:1
- Daily change rate = 1% (10% of data in 10% of files)

Example: 1 TB file server backup

Backup                          Data sent from backup host    Data stored with deduplication
1st daily full backup           1,000 GB                      500 GB
1st daily incremental backup    100 GB                        5 GB
2nd daily incremental backup    100 GB                        5 GB
3rd daily incremental backup    100 GB                        5 GB
4th daily incremental backup    100 GB                        5 GB
5th daily incremental backup    100 GB                        5 GB
2nd weekly full backup          1,000 GB                      25 GB
3rd weekly full backup          1,000 GB                      25 GB
... (weekly fulls continue) ...
25th weekly full backup         1,000 GB                      25 GB
TOTAL                           25,500 GB                     1,125 GB

The result is a ~23:1 reduction in data stored; 2.5 TB of disk backup would normally hold only two weeks of data retention.
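The arithmetic behind Figure 3 can be reproduced with a short sketch (illustrative only, using the figure's assumptions: 2:1 compression, 1% daily change, incrementals that send ~10% of the data set, 5 retained daily incrementals and 25 retained weekly fulls; the function and parameter names are ours):

```python
def dedup_worked_example(full_gb=1000.0, incrementals=5, fulls=25,
                         compression=2.0, daily_change=0.01, days_per_week=5):
    """Totals for the Figure 3 file server example.

    Data sent: every retained backup crosses the wire in full (each daily
    incremental sends ~10% of the data set). Data stored with dedup: the
    first full is stored compressed; every later backup stores only its
    new (changed) data, compressed.
    """
    incr_sent = full_gb * 0.10
    sent = fulls * full_gb + incrementals * incr_sent

    first_full = full_gb / compression
    incr_new = full_gb * daily_change / compression                     # one day of change
    weekly_new = full_gb * daily_change * days_per_week / compression   # one week of change
    stored = first_full + incrementals * incr_new + (fulls - 1) * weekly_new
    return sent, stored
```

With the defaults this gives 25,500 GB sent against 1,125 GB stored, matching the figure's ~23:1 reduction.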
Customer Benefits of Data Deduplication

What data deduplication offers to customers is:
- The ability to store dramatically more data online (by online we mean disk based)
- An increase in the range of Recovery Point Objectives (RPOs) available: data can be recovered from further back in time from the backup to better meet Service Level Agreements (SLAs). Disk recovery of a single file is always faster than tape
- A reduction of investment in physical tape by restricting its use more to a deep archiving and disaster recovery usage model
- Deduplication can automate the disaster recovery process by providing the ability to perform site-to-site replication at a lower cost. Because deduplication knows what data has changed at a block or byte level, replication becomes more intelligent and transfers only the changed data as opposed to the complete data set. This saves time and replication bandwidth, and is one of the most attractive propositions that deduplication offers. Customers who do not use disk-based replication across sites today will embrace low-bandwidth replication, as it enables better disaster tolerance without the need and operational costs associated with transporting data off-site on physical tape. Replication is performed at a tape cartridge level
Figure 4. Remote site data protection BEFORE low-bandwidth replication

At each site, data is staged to disk and then copied to tape; tapes are made nightly and sent to an offsite vault for DR; and the process is replicated on each site, requiring local operators for managing tape. Restores beyond two weeks come from tape. Risk and operational cost impact:
- Slow restores (from tape) beyond 2 weeks
- Loss of control of tapes when given to an offsite service
- Excessive cost for offsite vaulting services
- Frequent backup failures during off hours
- Tedious daily onsite media management of tapes, labels and offsite shipment coordination
Figure 5. Remote site data protection AFTER low-bandwidth replication

Data on disk is extended to 4 months and all restores come from disk; no tapes are created locally and no operators are required at local sites for tape operations. Data is automatically replicated to remote sites across a WAN, with copies made to tape on a monthly basis for archive. Risk and operational cost impact:
- Improved RTO SLA: all restores are from disk
- No outside vaulting service required
- No administrative media management requirements at local sites
- Reliable backup process
- Copy-to-tape less frequently; consolidate tape usage to a single site, reducing the number of tapes
To show how much of an impact deduplication can have on replication times, take a look at the following Figure 6. This model also takes into account a certain overhead of control information that has to be sent site to site, as well as the data deltas themselves. Currently, without deduplication, the full amount of data has to be transferred between sites, and in general this requires high-bandwidth links such as GbE or Fibre Channel. With deduplication, only the delta changes are transferred between sites, and this reduction allows lower-bandwidth links such as T3 or OC12 to be used at lower cost. The following example illustrates the estimated replication times for varying amounts of change. Most customers would be happy with a replication time of, say, 2 hours between sites using, say, a T3 link. The feed from HP D2D backup systems or HP Virtual Library systems to the replication link is one or more GbE pipes.
Figure 6. Replication times with and without deduplication

Estimated time to replicate data for a 1 TB backup environment @ 2:1. Link rates (66% efficient): T1 = 1.5 Mb/s, T3 = 44.7 Mb/s, OC12 = 622.1 Mb/s.

Without dedupe:
Backup type    Data sent    T1           T3         OC12
Incremental    50 GB        4.5 days     3.8 hrs    16 min
Full           500 GB       45.4 days    1.6 days   2.7 hrs

With dedupe:
Change rate    Data sent    T1        T3         OC12
0.5%           13.1 GB      29 hrs    59 min     4.3 min
1.0%           16.3 GB      35 hrs    73 min     5.3 min
2.0%           22.5 GB      49 hrs    102 min    7.3 min
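The times in Figure 6 follow from simple arithmetic: the data sent, in bits, divided by the effective link rate (66% of nominal). A sketch under those assumptions (decimal GB and the table's nominal link rates; the figure's own model also includes some control-information overhead, so results match the table only to within rounding):

```python
NOMINAL_MBPS = {"T1": 1.5, "T3": 44.7, "OC12": 622.1}  # link rates from Figure 6
EFFICIENCY = 0.66                                       # links assumed 66% efficient

def replication_hours(data_gb: float, link: str) -> float:
    """Hours to push data_gb across the given link at 66% efficiency."""
    bits = data_gb * 1e9 * 8
    bits_per_second = NOMINAL_MBPS[link] * 1e6 * EFFICIENCY
    return bits / bits_per_second / 3600.0
```

For example, the 0.5% change row (13.1 GB) works out to roughly 59 minutes over T3 and about 4.3 minutes over OC12, in line with the table.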
A word of caution

An initial synchronization of the backup device at the primary site and the one at the secondary site must be performed. Because the volume of data that requires synchronizing at this stage is high, a low-bandwidth link will not suffice. Synchronization can be achieved in three different ways:
- Provision the two devices on the same site and use a feature such as local replication over high-bandwidth Fibre Channel links to synchronize the data. Then ship one of the libraries to the remote site
- Install the two separate devices at separate sites and perform the initial backup at Site A. Copy the backup from Site A to physical tape, then transfer the physical tapes to Site B and import them. When the systems at both sites are synchronized, start low-bandwidth replication between the two
- After the initial backup at Site A, allow a multi-day window for initial synchronization, allowing the two devices to copy the initial backup data over a low-bandwidth link
Understanding Customer Needs for Data Deduplication

Both large and small organizations have remarkably similar concerns when it comes to data protection. What differs is the priority of their issues.
Figure 7. Common challenges with data protection amongst remote offices, SMEs and large customers. Across these environments the common challenges and needs include:
- Overcoming a lack of dedicated IT resources
- Managing data growth
- Maintaining backup application, file and OS independence
- Spending less time managing backups
- Handling explosive data growth
- Meeting and maintaining backup windows
- Achieving greater backup reliability
- Accelerating restore from tape (including virtual tape)
- Managing remote site data protection
Different priorities are what have led HP to develop two distinct approaches to data deduplication. For example:
- Large enterprises have issues meeting backup windows, so any deduplication technology that could slow down the backup process is of no use to them. Medium and small enterprises are concerned about backup windows as well, but to a lesser degree
- Most large enterprise customers have Service Level Agreements (SLAs) pertaining to restore times; any deduplication technology that slows down restore times is not welcome either
- Many large customers back up hundreds of terabytes per night, and their backup solution with deduplication needs to scale up to these capacities without degrading performance. Fragmenting the approach by having to use several smaller deduplication stores would also make the whole backup process harder to manage
- Conversely, remote offices and smaller organizations generally need an easy approach: a dedicated appliance that is self-contained, at a reasonable cost
- Remote offices and SMEs do not want or need a system that is infinitely scalable, nor the cost associated with linearly scalable capacity and performance. They need a single-engine approach that can work transparently in any of their environments
HP Accelerated Deduplication for the Large Enterprise Customer

HP Accelerated deduplication technology is designed for large enterprise data centers. It is the technology HP has chosen for the HP StorageWorks Virtual Library Systems. Accelerated deduplication has the following features and benefits:
- Utilizes object-level differencing technology with a design centered on performance and scalability
- Delivers the fastest possible backup performance: it leverages post-processing technology to process data deduplication as backup jobs complete, deduplicating previous backups whilst other backups are still completing
- Delivers the fastest restore from recently backed up data: it maintains a complete copy of the most recent backup data and eliminates duplicate data in previous backups
- Scalable deduplication performance: it uses a distributed architecture where performance can be increased by adding additional nodes
- Flexible replication options to protect your investment
Figure 8. Object-level differencing compares only current and previous backups from the same hosts and eliminates duplicate data by means of pointers. The latest backup is always held intact. The stages shown for the current and previous backups are: Data Grooming (identifies similar data objects); Data Discrimination/Data Comparison (identifies differences at byte level, ensuring data integrity); an optional Second Integrity Check (compares deduplicated data to original data objects); Space Reclamation (deletes duplicated data and reallocates unused space, with new data stored and duplicated data replaced by a pointer to existing data); and Reassembly.
How Accelerated Deduplication Works

When the backup runs, the data stream is processed as it is stored to disk, assembling a content database on the fly by interrogating the meta data attached by the backup application. This process has minimal performance impact.

1. After the first backup job completes, tasks are scheduled to begin the deduplication processing. The content database is used to identify subsequent backups from the same data sources. This is essential, since the way object-level differencing works is to compare the current backup from a host to the previous backup from that same host.
Figure 9. Identifying duplicated data by stripping away the meta data associated with backup formats, files and databases. Object-level differencing strips away the meta data to reveal real duplication: the same actual file A, written in two backup sessions, looks different at the backup object level because of the different backup application meta data, but at a logical level the two copies are identical. Object-level differencing deduplication strips away the backup meta data to reveal the real duplicated data.
2. A data comparison is performed between the current backup and the previous backup from the same host. There are different levels of comparison. For example, some backup sessions are compared at an entire session level: here, data is compared byte-for-byte between the two versions and common streams of data are identified. Other backup sessions compare versions of files within the backup sessions. Note that within Accelerated deduplication's object-level differencing, the comparison is done AFTER the backup meta data and file system meta data has been stripped away (see the example in the following Figure 10). This makes the deduplication process much more efficient, but relies on an intimate knowledge of both the backup application meta data types and the data type meta data (file system file, database file, and so on).

3. When duplicate data is found in the comparison process, the duplicate data streams in the oldest backup are replaced by a set of pointers to a more recent copy of the same data. This ensures that the latest backup is always fully contiguous, and a restore from the latest backup will always take place at maximum speed.
Figure 10. With object-level differencing the last backup is always fully intact. Duplicated objects in previous backups are replaced with pointers (to the current version) plus byte-level differences. In the diagram, A, B, C and D are files within a backup session, shown across sessions 1-3 on days 1-3.
In the preceding diagram, backup session 1 had files A and B. When backup session 2 completed and was compared with backup session 1, file A was found and a byte-level difference was calculated for the older version. So in the older backup (session 1), file A was replaced by pointers plus difference deltas to the file A data in backup session 2. Subsequently, when backup session 3 completes, it is compared with backup session 2 and file C is found to be duplicated. Hence a difference and a pointer are placed in backup session 2 pointing to the file C data in backup session 3. At the same time, the original pointer to file A in session 1 is readjusted to point to the new location of file A; this prevents multiple hops through pointers when restoring older data. So the process continues, every time comparing the current backup with the previous backup. Each time a difference plus pointer is written, storage capacity is saved. This process allows the deduplication to track even a byte-level change between files.
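The pointer bookkeeping described above can be modelled in miniature. This is a simplified sketch, not HP's implementation: files are whole values rather than byte streams, difference deltas are omitted, and each session's store maps a file name either to literal data or to a direct pointer at a newer session:

```python
def ingest(store, session, files, prev=None):
    """Add a backup session, dedupe the previous session against it, and
    rebase older pointers so restores never hop through pointer chains."""
    store[session] = {name: ("data", data) for name, data in files.items()}
    if prev is not None:
        for name, (kind, data) in list(store[prev].items()):
            if kind == "data" and files.get(name) == data:
                store[prev][name] = ("ptr", session)   # older copy becomes a pointer
    # Rebase: if a pointer's target is itself a pointer, point straight at the data.
    for contents in store.values():
        for name, (kind, target) in list(contents.items()):
            if kind == "ptr" and store[target][name][0] == "ptr":
                contents[name] = ("ptr", store[target][name][1])

def restore(store, session, name):
    """Follow at most one pointer to reach the data."""
    kind, value = store[session][name]
    return value if kind == "data" else restore(store, value, name)
```

After three nightly sessions, the newest session holds only literal data (the latest backup is fully intact), while older copies of unchanged files collapse to single-hop pointers at the most recent copy.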
4. Secondary integrity check: before a backup tape is replaced by a deduplicated version with pointers to a more recent occurrence of that data, a byte-for-byte comparison can take place, comparing the original backup with the reconstructed backup (including pointers) to ensure that the two are identical. Only when the compare succeeds will the original backup tape be replaced by a version including pointers. This step is optional. See the optional Second Integrity Check in Figure 8.

5. Space reclamation occurs once the work of replacing duplicate data with pointers to a single instance of the data is complete. This can take some time, and results in used capacity being returned to a free pool on the device.
Replication can take place from Step 3 because the changed data is available to be replicated even
before the space has been reclaimed.
HP Accelerated deduplication:
- Will scale up to hundreds of TB
- Has no impact on backup performance, since the comparison is done after the backup job completes (post process)
- Allows more deduplication compute nodes to be added to increase deduplication performance and ensure the post processing is complete before the backup cycle starts again
- Yields high deduplication ratios because it strips away meta data to reveal true duplication, and does not rely on data chunking
- Provides fast bulk data restore and tape cloning for recently backed up data: it maintains the complete most recent copy of the backup data but eliminates duplicate data in previous backups
Issues Associated with Object-Level Differencing

The major issue with object-level differencing is that the device has to be knowledgeable in terms of backup formats and data types in order to understand the meta data. HP Accelerated deduplication will support a subset of backup applications and data types at launch.

Additionally, object-level differencing compares only backups from the same host against each other, so there is no deduplication across hosts; however, the amount of common data across different hosts can be quite low.
What Makes HP Accelerated Deduplication unique?

The object-level differencing in HP Accelerated deduplication is unique in the marketplace. Unlike hash-based techniques, which are an all-or-nothing method of deduplication, object-level differencing applies intelligence to the process, giving users the ability to decide what data types are deduplicated and allowing flexibility to reduce the deduplication load if it is not yielding the expected or desired results. HP object-level differencing technology is also the only deduplication technology that can scale to hundreds of terabytes with no impact on backup performance, because the architecture does not depend on managing ever-increasing index tables, as is the case with hash-based chunking. It is also well suited to larger scalable systems, since it is able to distribute the deduplication workload across all the available processing resources and can even have dedicated nodes purely for deduplication activities.
HP Accelerated deduplication will be supported on a range of backup applications:
- HP Data Protector
- Symantec NetBackup
- Tivoli Storage Manager
- Legato NetWorker

HP Accelerated deduplication will support a wide range of file types:
- Windows 2003
- Windows Vista
- HP-UX 11.x
- Solaris standard file backups
- Linux (Red Hat, SuSE)
- AIX file backups
- Tru64 file backups
HP Accelerated deduplication will support database backups over time:
- Oracle RMAN
- Hot SQL backups
- Online Exchange MAPI mailbox backups

For the latest details on what backup software and data types are supported with HP Accelerated Deduplication, please look at the HP Enterprise Backup Solutions compatibility guide at http://www.hp.com/go/ebs

HP Accelerated deduplication technology is available by license on HP StorageWorks Virtual Library Systems (models 6000, 9000, and 12000). The license fee is per TB of user storage (before compression or deduplication takes effect).
Figure 11. Pros and cons of HP Accelerated Deduplication

Pros:
- Does not restrict backup rate, since data is processed after the backup has completed
- Faster restore rate: forward-referencing pointers allow rapid access to data
- Can handle datasets > 100 TB without having to partition backups; no hashing table dependencies
- Can selectively compare data likely to match, increasing performance further and yielding higher deduplication ratios
- Best suited to large enterprise VTLs

Cons:
- Has to be ISV format aware and data type aware; content coverage will grow over time
- May need additional compute nodes to speed up post-processing deduplication in scenarios with long backup windows
- Needs to cache 2 backups in order to perform the post-process comparison, so additional disk capacity equal to the size of the largest backup needs to be sized into the solution
At ingest time, when the tape content database is generated, there is a small performance overhead
(< 0.5%), and a small amount of disk space is required to hold this database (much less than the
hash tables in hash-based chunking deduplication technology). Even if this content database were
completely destroyed, it would still be possible to maintain access to the data, because the pointers
remain fully intact and held within the re-written tape formats.
HP object-level differencing also has the ability to provide selective deduplication by content type,
and in the future it could be used to index content, providing content-addressable archive searches.
The question often arises: "What happens if deduplication is not complete by the time the same
backup from the same host arrives?" Typically the deduplication process takes about twice as long as
the backup process for a given backup, so as long as a single backup job does not take more than 8 hours
this will not occur. In addition, the multi-node architecture ensures that each node is load-balanced to
provide 33% of its processing capability to deduplication while still maintaining the necessary
performance for backup and restore. Finally, additional dedicated 100% deduplication compute
nodes can be added if necessary.
Let us now analyze HP's second type of deduplication technology, Dynamic deduplication, which
uses hash-based chunking.
HP Dynamic Deduplication for Small and Medium IT Environments
HP Dynamic deduplication is designed for customers with smaller IT environments. Its main features
and benefits include:
- Hash-based chunking technology with a design center around compatibility and cost
- Low cost and a small RAM footprint
- Independence from backup applications
- Systems with built-in data deduplication
- Flexible replication options for increased investment protection.

Hash-based chunking techniques for data reduction have been around for years. Hashing consists of
applying an algorithm to a specific chunk of data, yielding a unique fingerprint of that data. The
backup stream is simply broken down into a series of chunks. For example, a 4K chunk in a data
stream can be hashed so that it is uniquely represented by a 20-byte hash code. See Figure 12.
Figure 12. Hashing technology

[Diagram: three sample inputs ("HP invent", "HP StorageWorks", "HP Nearline Storage") each pass through the hashing function to yield a short hash output (DFCD3453, 785C3D92, 4673FD74B). Key: in-line = deduplication on the fly, as data is ingested, using hashing techniques; hashing = a reproducible method of turning some kind of data into a (relatively) small number that may serve as a digital "fingerprint" of the data.]
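The chunk-and-hash operation in the figure can be sketched in a few lines (a minimal illustration using Python's standard hashlib, with the 4K chunk size and 20-byte SHA-1 values quoted in the text; it is not HP's implementation):

```python
import hashlib

CHUNK_SIZE = 4 * 1024  # nominal 4K chunks, as described in the text

def chunk_fingerprints(stream: bytes):
    """Split a backup stream into fixed-size chunks and yield the
    20-byte SHA-1 fingerprint of each chunk."""
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        yield hashlib.sha1(chunk).digest()  # 20 bytes per chunk

data = b"A" * (2 * CHUNK_SIZE)  # two identical 4K chunks
prints = list(chunk_fingerprints(data))
print(len(prints), prints[0] == prints[1])  # 2 True: identical chunks hash identically
```

Because the fingerprint is reproducible, two identical chunks always collide on purpose, which is exactly what the deduplication index exploits.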
The larger the chunks, the less chance there is of finding an identical chunk that generates the same
hash code; thus, the deduplication ratio will not be as high. The smaller the chunk size, the more
efficient the data deduplication process, but then a larger number of indexes is created, which leads
to problems storing enormous numbers of indexes (see the following example and Glossary).
Figure 13. How hash-based chunking works

[Diagram: Backup 1 is split into chunks (#33, #13, #1, #65, #9, #245, #21, #127) and the hashing function is applied; each new hash is entered into a RAM index mapping hash values to disk block numbers, and the chunks are stored on disk. For Backup 2 (#33, #13, #222, #75, #9, #245, #86, #127), a look-up is performed for each generated hash: hashes already in the index are not stored again, while new hashes (#222, #75, #86) are added to the index and their chunks written to disk.]
How Dynamic Deduplication Works

1. As the backup data stream enters the target device (in this case the HP D2D2500 or D2D4000 Backup System), it is chunked into nominal 4K chunks against which the SHA-1 hashing algorithm is run. The resulting hash values are placed in an index stored in RAM in the target D2D device. Each hash value is also stored as an entry in a recipe file, which represents the backup stream and points to the location in the deduplication store where the original 4K chunk is stored. This happens in real time as the backup is taking place, and continues for the whole backup data stream.
2. When another 4K chunk generates the same hash index as a previous chunk, no index is added to the index list and the data is not written to the deduplication store. An entry with the hash value is simply added to the recipe file for that backup stream, pointing to the previously stored data, so space is saved. As you scale this up over many backups, there are many instances of the same hash value being generated, but the actual data is only stored once, so the space savings increase.
3. Now let us consider backup 2 in Figure 13. As the data stream is run through the hashing algorithm again, much of the data will generate the same hash index codes as in backup 1; hence, there is no need to add indexes to the table or use storage in the deduplication store. In this backup, some of the data has changed. In some cases (#222, #75, and #86), the data is unique and generates new indexes for the index store and new data entries in the deduplication store.
4. And so the hashing process continues until, as backups are overwritten by the tape rotation strategy, certain hash indexes are no longer required, and they are removed in a housekeeping operation.
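The four steps above can be condensed into a toy in-line deduplicator (a minimal sketch only; `index`, `store`, and `ingest` are illustrative names, not HP's internals):

```python
import hashlib

CHUNK_SIZE = 4 * 1024

def ingest(stream: bytes, index: dict, store: dict):
    """In-line dedupe of one backup stream.

    index: hash -> disk block number (the RAM index)
    store: disk block number -> chunk bytes (the deduplication store)
    Returns the recipe file: the ordered list of hashes that
    reconstructs this backup stream."""
    recipe = []
    for off in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[off:off + CHUNK_SIZE]
        h = hashlib.sha1(chunk).digest()
        if h not in index:          # new data: index it and store the chunk
            index[h] = len(store)
            store[index[h]] = chunk
        recipe.append(h)            # always record the hash in the recipe
    return recipe

index, store = {}, {}
backup1 = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
backup2 = b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE   # one chunk changed
r1 = ingest(backup1, index, store)
r2 = ingest(backup2, index, store)
print(len(store))  # 3: four chunks ingested, but only three unique chunks stored
```

The duplicate "A" chunk in backup 2 costs only a recipe entry, not a second copy on disk, which is where the space savings come from.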
Figure 14. How hash-based chunking performs restores

[Diagram: a restore of Backup 1 commences and its recipe file (#33, #13, #1, #65, #9, #245, #21, #127) is referenced in the dedupe store. The recipe file, stored in the dedupe store, is used to reconstruct the tape blocks that constitute the backup: each recipe entry refers to the RAM index, which maps the hash value to a disk block number, and the corresponding chunk (e.g. #33) is read from disk and restored.]
5. On receiving a restore command from the backup system, the D2D device selects the correct recipe file and starts sequentially reassembling the file to restore:
a. Read the recipe file.
b. Look up the hash in the index to get the disk pointer.
c. Get the original chunk from disk.
d. Return the data to the restore stream.
e. Repeat for every hash entry in the recipe file.
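Steps a through e amount to a simple reassembly loop. The sketch below (illustrative names again, paired with a minimal in-line ingest to build the structures) shows the round trip:

```python
import hashlib

CHUNK_SIZE = 4 * 1024

def restore(recipe, index, store) -> bytes:
    """Reassemble a backup from its recipe file."""
    out = bytearray()
    for h in recipe:          # a. read recipe file / e. repeat per entry
        block_no = index[h]   # b. look up hash in index -> disk pointer
        out += store[block_no]  # c. get original chunk / d. return to stream
    return bytes(out)

# Build a tiny two-chunk "backup" the same way the ingest path would.
index, store, recipe = {}, {}, []
data = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
for off in range(0, len(data), CHUNK_SIZE):
    chunk = data[off:off + CHUNK_SIZE]
    h = hashlib.sha1(chunk).digest()
    if h not in index:
        index[h] = len(store)
        store[index[h]] = chunk
    recipe.append(h)

print(restore(recipe, index, store) == data)  # True: restore matches the original
```

Note that every restored byte goes through the index look-up and chunk fetch, which is why the text warns that hash-based restores can be slower than the original backup.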
Issues Associated with Hash-Based Chunking

The main issue with hash-based chunking technology is the growth of indexes and the limited amount
of RAM available to store them. Let us take a simple example: a 1TB backup data stream using 4K
chunks, where every 4K chunk produces a unique hash value. This equates to 250 million 20-byte
hash values, or 5GB of storage.

If we performed no other optimization (for example, paging of indexes onto and off disk), then the
appliance would need 5GB of RAM for every TB of deduplicated unique data. Most server systems
cannot support much more than 16GB of RAM. For this reason, hash-based chunking cannot easily
scale to hundreds of terabytes.
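The arithmetic behind the 250-million-hash / 5 GB figure can be checked directly (using decimal 1 TB = 10^12 bytes; the text's numbers are rounded):

```python
TB = 10**12            # 1 TB, decimal
chunk_size = 4 * 1024  # 4K chunks
hash_size = 20         # 20-byte SHA-1 values

chunks_per_tb = TB // chunk_size           # ~244 million unique chunks
index_bytes = chunks_per_tb * hash_size    # raw index: hash values alone, no overhead
print(chunks_per_tb, index_bytes / 10**9)  # ~4.9 GB of index per TB of unique data
```

In practice the index also carries the disk pointers and bookkeeping per entry, so the 5 GB per TB quoted in the text is if anything optimistic.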
Most lower-end to mid-range deduplication technologies use variations on hash-based chunking, with
additional techniques to reduce the size of the indexes generated. These reduce the amount of RAM
required, but generally at the expense of some deduplication efficiency or performance. If the index
management is not efficient, it will slow the backup down to unacceptable levels or miss many
instances of duplicate data. The other option is to use larger chunk sizes to reduce the size of the
index. As mentioned earlier, the downside is that deduplication will be less efficient. These
algorithms can also be adversely affected by non-repeating data patterns that occur in some backup
software tape formats. This becomes a bigger issue with larger chunk sizes.
HP has developed a unique, innovative technology leveraging work from HP Labs that dramatically
reduces the amount of memory required for managing the index without sacrificing performance or
deduplication efficiency. Not only does this technology enable low-cost, high-performance disk
backup systems, but it also allows the use of much smaller chunk sizes to provide more effective data
deduplication that is more robust to variations in backup stream formats or data types.
Restore times can be slow with hash-based chunking. As you can see from Figure 14, recovering a 4K
piece of data from a hash-based deduplication store requires a reconstruction process. The restore
can take longer than the backup did.
Finally, you may hear the term hashing collision; this means that two different chunks of data
produce the same hash value, which obviously undermines data integrity. The chances of this
happening are remote, to say the least. HP Labs calculated that, using a TWENTY-BYTE (160-bit)
hash such as SHA-1, the time required for a hashing collision to occur is 100,000,000,000,000
years, based on backing up 1TB of data per working day.
Even so, HP Dynamic deduplication adds a further Cyclic Redundancy Check (CRC) at the tape
record level that would catch the highly unlikely event of a hash collision.
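The order of magnitude of that figure can be reproduced with the standard birthday approximation (a sketch under stated assumptions, not HP Labs' exact calculation: 4 KB chunks, 1 TB per working day, 250 working days per year, and collision odds reaching ~50% once the number of stored hashes approaches 2^80):

```python
# Birthday-bound sanity check for 160-bit (SHA-1) hash collisions.
hash_bits = 160
chunks_per_day = 10**12 // (4 * 1024)  # ~244 million new hashes per working day
n_half = 2 ** (hash_bits / 2)          # hash count at ~50% collision probability
years = n_half / (chunks_per_day * 250)
print(f"{years:.3g} years")            # on the order of 10^13 years
```

The result lands in the same astronomical range as the figure quoted above; either way, the CRC check is belt-and-braces rather than a practical necessity.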
Despite the above limitations, deduplication using hash-based chunking is a well-proven technology
and serves remote offices and medium-sized businesses very well. The biggest benefit of hash-based
chunking is that it is totally data-format-independent; it does not have to be engineered to work
with specific backup applications and data types. Products using hash-based deduplication
technology still have to be tested with the various backup applications, but the design approach is
generic.
HP is deploying Dynamic deduplication technology on its latest D2D Backup Systems, which are
designed for remote offices and small to medium organizations.

HP D2D2500 and 4000 Backup Systems come with deduplication as standard, with no additional
licensing costs.
Figure 15. Pros and cons of hash-based chunking deduplication

Pros & Cons of HP Dynamic Deduplication

PRO:
- Deduplication is performed at backup time.
- Can instantly handle any data format.
- Fast search; algorithms already proven to aid hash detection.
- Low storage overhead: does not have to hold complete backups (TBs) for post analysis.
- Best suited to smaller-size VTLs.

CON:
- Can restrict the ingest rate (backup rate) if not done efficiently, and could slow backups down.
- Restore time may be longer than object-level differencing because of the data regeneration process.
- Significant processing overhead, though deduplication is keeping pace with processor developments.
- Concerns over scalability when using very large hash indexes; for data sets > 100 TB, backups may have to be partitioned to ensure better hash index management.

What makes HP Dynamic Deduplication technology unique are algorithms developed with HP Labs
that dramatically reduce the amount of memory required for managing the index, without
sacrificing performance or deduplication effectiveness. Specifically, this technology:
- Uses far less memory, by implementing algorithms that determine which are the most optimal indexes to hold in RAM for a given backup data stream
- Allows the use of much smaller chunk sizes to provide more effective data deduplication that is more robust to variations in backup stream formats or data types
- Provides intelligent storage of chunks and recipe files to limit disk I/O and paging
- Works well in a broad range of environments, since it is independent of backup software formats and data types
Low-Bandwidth Replication Usage Models

The second main benefit of deduplication is the ability to replicate the changes in data on site A to a
remote site B at a fraction of the cost, because high-bandwidth links are no longer required. A general
guideline is that a T1 link is about 10% of the cost of a 4Gb FC link over the same distance. Low-
bandwidth replication will be available on both D2D and VLS products. Up to two GbE ports will be
available for replication on D2D devices, and one GbE port per node will be available on the VLS
products.
HP will support three topologies for low-bandwidth replication:
- Box-to-Box
- Active-Active
- Many-to-One

The unit of replication is a cartridge. On VLS, it will be possible to partition slots in a virtual library
replication target device to be associated with specific source replication cartridges.
Figure 16. Active-Active replication on HP VLS and D2D systems with deduplication

[Diagram: Accelerated Deduplication replication, example use case Active/Active. Two sites, each with a backup server and a VLS (VLS1 and VLS2), connected over TCP/IP. Each VLS hosts its own local virtual library (VLib1 or VLib2) plus a replica of the other site's library. This is generally datacenter-to-datacenter replication, with each device performing local backups and also acting as the replication store for the other datacenter.]
Figure 17. Many-to-one replication on HP VLS and D2D systems with deduplication

[Diagram: Accelerated Deduplication replication, example use case Many-to-One. Three source sites (VLS1, VLS2, VLS3), each with a backup server and its own VLib1, replicate over TCP/IP into a central VLS4. A single destination target can be divided into multiple slot ranges to allow many-to-one replication without needing a separate replication library for each source.]
Initially it will not be possible for D2D devices to replicate into the much larger VLS devices, since their
deduplication technologies are so different, but HP plans to offer this feature in the near future.

What will be possible is to replicate multiple HP D2D2500s into a central D2D4000, or to replicate
smaller VLS6200 models into a central VLS12000 (see Figure 18).
Deduplication technology is leading us to the point where many remote sites can replicate data
back to a central data center at a reasonable cost, removing the need for tedious off-site vaulting of
tapes and fully automating the process, saving even more costs.

This ensures:
- The most cost-effective solution is deployed at each specific site
- The costs and issues associated with off-site vaulting of physical tape are removed
- The whole disaster recovery process is automated
- The solution is scalable at all sites
Figure 18. Enterprise deployment with replication across remote and branch offices back to data centers

[Diagram: an enterprise deployment with small and large remote/branch offices (ROBOs). Small remote offices (mobile/desktop clients, a backup server, 1-4 servers, > 200 GB storage) back up to D2D Appliances over the LAN. A large remote office or regional site/small datacenter uses a D2D Appliance with backup/media servers and disk storage. The large datacenter and a secondary datacenter each run Virtual Library Systems and tape library systems on the SAN, with backup servers and disk storage, and the remote sites replicate back to the datacenters.]
Why HP for Deduplication?

Deduplication is a powerful technology and there are many different ways to implement it, but most
vendors offer only one method and, as we have seen, no one method is best in all circumstances. HP
offers a choice of deduplication technologies depending on your needs. HP does not pretend that
one size fits all.
Choose HP Dynamic deduplication for small and mid-size IT environments because it offers the best
technology footprint for deduplication at an affordable price point. Flexible replication options
further enhance the solution.

Choose HP Accelerated deduplication for Enterprise data centers where scalability and backup
performance are paramount. Flexible replication options further enhance the solution.
The scalability issues associated with hash-based chunking are addressed by some competitors by
creating multiple separate deduplication stores behind a single management interface. But this
creates islands of deduplication, so the customer sees reduced benefits and excessive costs
because the solution is not inherently scalable.

At the data center level, HP's major competitors using object-level differencing have bolted
deduplication engines onto existing virtual tape library architectures rather than integrating the
deduplication engine within the VTL itself. This leads to data being moved back and forth between the
virtual library and the deduplication engine, which is very inefficient.
Deduplication Technologies Aligned with HP Virtual Library
Products

HP has a range of disk-based backup products with deduplication, starting with the entry-level
D2D2500 with 2.25TB of user capacity for small businesses and remote offices, right up to the
VLS12000 EVA Gateway with capacities over 1PB for the high-end enterprise data center customer.
They emulate a range of HP physical tape autoloaders and libraries.
Figure 19. HP disk-based backup portfolio with deduplication

[Diagram: the HP StorageWorks disk-to-disk and virtual library portfolio with deduplication, arranged by capacity. Entry-level: D2D2500 (iSCSI), manageable and reliable, for midsized businesses or IT with remote branch offices. Mid-range: D2D4000 (iSCSI & FC), for midsized businesses or IT with small data centres, and the VLS6000 family, a scalable, manageable, reliable appliance for medium to large data centers and medium to large FC SANs. Enterprise: VLS9000 and VLS12000 EVA Gateway, high-capacity, high-performance multi-node systems, available and scalable, for enterprise data centers and large FC SANs. The D2D models use Dynamic Deduplication (hash-based chunking); the VLS models use Accelerated Deduplication (object-level differencing).]
The HP StorageWorks D2D2500 and D2D4000 Backup Systems support HP Dynamic deduplication.
These range in size from 2.25TB to 7.5TB and are aimed at remote offices or small enterprise
customers. The D2D2500 has an iSCSI interface to reduce the cost of implementation at remote
offices, while the D2D4000 offers a choice of iSCSI or 4Gb FC.

The HP StorageWorks Virtual Library Systems are all 4Gb SAN-attached devices, which range in
native user capacity from 4.4TB to over a petabyte with the VLS9000 and VLS12000 EVA Gateway.
Hardware compression is available on the VLS6000, 9000, and 12000 models, achieving even
higher capacities. The VLS9000 and VLS12000 use a multi-node architecture that allows
performance to scale in a linear fashion. With eight nodes, these devices can sustain a throughput of
up to 4800MB/sec at 2:1 data compression, provided the SAN hosts can supply data at this rate.

HP Virtual Library Systems will deploy the HP Accelerated deduplication technology.
Summary
Data deduplication technology represents one of the most significant storage enhancements in recent
years, promising to reshape future data protection and disaster recovery solutions. Deduplication
offers the ability to store more on a given amount of storage and enables replication using low-
bandwidth links, both of which improve cost effectiveness.
HP offers two complementary deduplication technologies for different customer needs:

Accelerated deduplication (with object-level differencing) for high-end enterprise customers who require:
- Fastest possible backup performance
- Fastest restore
- Most scalable solution in terms of performance and capacity
- Multi-node low-bandwidth replication
- Highest deduplication ratios
- Wide range of replication models

Dynamic deduplication (with hash-based chunking) for mid-size organizations and remote offices that require:
- Lower cost and a smaller footprint
- An integrated deduplication appliance with lights-out operation
- Backup application and data type independence for maximum flexibility
- Wide range of replication models
This whitepaper explained how HP's deduplication technologies work in practice, the pros and cons
of each approach, when to choose a particular type, and the types of low-bandwidth replication
models HP plans to support.
The HP Virtual Library System (VLS) incorporates Accelerated deduplication technology that scales for
large multi-node systems and delivers high-performance deduplication for enterprise customers.

HP D2D (Disk-to-Disk) Backup Systems use Dynamic deduplication technology that provides a
significant price advantage over competitors, thanks to a combination of HP patents allowing optimal
RAM usage (RAM footprint) with minimal new hash values being generated on similar backup
streams. HP D2D backup systems with integrated deduplication set a new price point for
deduplication devices.
Appendix A: Glossary of Terminology

Source-based Deduplication
Where data is deduplicated in the host(s) prior to transmission over the storage network. This
generally tends to be a proprietary approach.

Target-based Deduplication
Where the data is deduplicated in a target device, such as a virtual tape library, and is
available to all hosts using that target device.
Hashing
A reproducible method of turning some kind of data into a (relatively) small number that may
serve as a digital "fingerprint" of the data.
Chunks
A method of breaking down a data stream into segments (chunks); the hashing algorithm is run
on each chunk.
SHA-1
Secure Hashing Algorithm 1. For example, SHA-1 can enable a 4K chunk of data to be uniquely
represented by a 20-byte hash value.
Object-Level Differencing
A general IT term for a process that has an intimate knowledge of the data it is handling, down
to the logical format level. Object-level differencing deduplication means the deduplication
process has an intimate knowledge of the backup application format and the file types being
backed up (for example, Windows file systems, Exchange files, and SQL files). This intimate
knowledge allows file comparisons at a byte level to remove duplicated data.
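A toy sketch of the byte-level comparison this entry describes (illustrative only; `byte_diff` is an invented name and this is not HP's algorithm): given the stored version of an object and a new version, only the byte runs that changed need to be kept.

```python
def byte_diff(old: bytes, new: bytes):
    """Return (offset, replacement) runs where new differs from old.

    A deliberately naive position-by-position compare; real object-level
    differencing engines understand the logical format of the data."""
    diffs, i = [], 0
    while i < len(new):
        if i >= len(old) or old[i] != new[i]:
            j = i
            while j < len(new) and (j >= len(old) or old[j] != new[j]):
                j += 1
            diffs.append((i, new[i:j]))  # store only the changed run
            i = j
        else:
            i += 1
    return diffs

old = b"hello world, version one"
new = b"hello world, version two"
print(byte_diff(old, new))  # [(21, b'two')]: only the changed tail is stored
```

The payoff is that a second backup of a mostly unchanged object costs only the size of its diffs, not another full copy.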
Box-to-Box
Replication from a source to a destination in one direction.

Active-Active
Replication from a source device on site A to a target device on site B, and vice versa.

Many-to-One
Replication from multiple sources to a single destination device.
Deduplication Ratio
The reduction in storage required for a backup (after several other backups have taken place).
Figures between 10:1 and 300:1 have been quoted by different vendors. The ratio is highly
dependent on:
- Rate of change of the data (for example, 10% of the data in 10% of the files)
- Retention period of backups
- Efficiency of the deduplication technology implementation

Space Reclamation
With all deduplication devices, time is required to free up space that was used by duplicated data
and return it to a free pool. Because this can be quite time-consuming, it tends to occur in off-peak
periods.
Post-Processing
Where deduplication is done AFTER the backup completes, ensuring there is no way the
deduplication process can slow down the backup and increase the backup window required.
In-Line
Where the deduplication process takes place in REAL TIME, as the backup is actually taking place.
Depending on the implementation, this may or may not slow the backup process down.
Multi-thread
Within HP object-level differencing, the compare and space reclamation processes are run with
multiple paths simultaneously to ensure faster execution times.
Multi-node
HP VLS9000 and VLS12000 products scale to offer very high performance levels: up to eight nodes
can run in parallel, giving throughput capabilities up to 4800MB/sec at a 2:1 compression ratio. This
multi-node architecture is fundamental to HP's Accelerated deduplication technology because it allows
maximum processing power to be applied to the deduplication process.
Appendix B: Deduplication compared to other data reduction technologies

Deduplication
Description: Advanced technique for efficiently storing data by referencing existing blocks of data that have been previously stored, and only storing new data that is unique.
Pro: Two-fold benefits: space savings of between 10:1 and 100:1 being quoted, with the further benefit of low-bandwidth replication.
Con: Can slow backup down if not implemented efficiently. Hash-based technologies may not scale to 100s of TB. Object-level differencing technologies need to be multi-format aware, which takes time to engineer.
Comments: Deduplication is by far the most impressive disk storage reduction technology to emerge in recent years. Implementation varies by vendor; benchmarking is highly recommended.

Single Instancing
Description: Really deduplication at a file level. Available as part of the Microsoft file system and as a feature of the file system of a NetApp filer.
Pro: System-based approach to space savings.
Con: Will not eliminate redundancy within a file, only if two files are exactly the same (for example, adding files to a PST file, or adding a slide to a presentation).
Comments: Limited use.

Array-based snapshots
Description: Capture changed blocks on a disk LUN.
Pro: Used primarily for fast roll-back to a consistent state using image recovery; not really focused on storage efficiency.
Con: Does not eliminate redundant data within the changed blocks. Captures any change made by the file system; for example, it does not distinguish between real data and deleted/free space on disk.
Comments: Well established. Generally used for quick recovery to a known point in time.

Incremental Forever backups
Description: Recreate a full restore image from just one full backup and lots of incrementals.
Pro: Minimizes the need for frequent full backups and hence allows for smaller backup windows.
Con: More focused on time savings than on space savings. Generally only works with file system backups, not database-based backups.

Compression (software or hardware)
Description: Fast (if done in hardware), slower if done in software.
Pro: Well established and understood.
Con: Maximum space savings are generally 2:1.
Comments: Can be used in addition to deduplication.
For more information

www.hp.com/go/tape
www.hp.com/go/D2D
www.hp.com/go/VLS
www.hp.com/go/deduplication
HP StorageWorks customer success stories
Copyright 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

Linux is a U.S. registered trademark of Linus Torvalds. Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation. UNIX is a registered