understanding deduplication
TRANSCRIPT 7/29/2019
Understanding the HP Data Deduplication Strategy
Why one size doesn't fit everyone
Table of contents
Executive Summary
Introduction
  A word of caution
Customer Benefits of Data Deduplication
  A word of caution
Understanding Customer Needs for Data Deduplication
HP Accelerated Deduplication for the Large Enterprise Customer
  Issues Associated with Object-Level Differencing
  What Makes HP Accelerated Deduplication unique?
HP Dynamic Deduplication for Small and Medium IT Environments
  How Dynamic Deduplication Works
  Issues Associated with Hash-Based Chunking
Low-Bandwidth Replication Usage Models
Why HP for Deduplication?
Deduplication Technologies Aligned with HP Virtual Library Products
Summary
Appendix A: Glossary of Terminology
Appendix B: Deduplication compared to other data reduction technologies
For more information
Executive Summary
Data deduplication technology represents one of the most significant storage enhancements in recent years, promising to reshape future data protection and disaster recovery solutions. Data deduplication offers the ability to store more on a given amount of storage and to replicate data over lower-bandwidth links at a significantly reduced cost.
HP offers two complementary deduplication technologies that meet very different customer needs:

Accelerated deduplication (object-level differencing) for the high-end enterprise customer who requires:
- Fastest possible backup performance
- Fastest restores
- Most scalable solution possible in terms of performance and capacity
- Multi-node low-bandwidth replication
- High deduplication ratios
- Wide range of replication models

Dynamic deduplication (hash-based chunking) for the midsize enterprise and remote office customers who require:
- Lower cost device through a smaller RAM footprint and optimized disk usage
- A fully integrated deduplication appliance with lights-out operation
- Backup application and data type independence for maximum flexibility
- Wide range of replication models
This whitepaper explains how HP deduplication technologies work in practice, the pros and cons of each approach, when to choose a particular type, and the types of low-bandwidth replication models HP plans to support.
Why HP for Deduplication?

The HP Virtual Library System (VLS) incorporates Accelerated deduplication technology that delivers high-performance deduplication for enterprise customers. HP is one of the few vendors to date with an object-level differencing architecture that combines the virtual tape library and the deduplication engine in the same appliance. Our competitors with object-level differencing use a separate deduplication engine and VTL, which tends to be inefficient, as data is shunted between the two appliances, as well as expensive.

HP D2D (Disk to Disk) Backup Systems use Dynamic deduplication technology that provides a significant price advantage over our competitors. The combination of HP patents allows optimal RAM and disk usage, intelligent chunking, and minimal paging. This, together with the cost benefits of using HP industry-standard ProLiant servers, sets a new price point for deduplication appliances.
HP D2D Backup Systems and VLS virtual libraries provide deduplication ratio monitoring, as can be seen in the following screenshots.
Figure 1. Deduplication ratio screens on HP VLS and D2D devices
Introduction
Over recent years, virtual tape libraries have become the backbone of a modern data protection strategy because they offer:
- Disk-based backup at a reasonable cost
- Improved backup performance in a SAN environment, because new resources (virtual tape drives) are easier to provision
- Faster single-file restores than physical tape
- Seamless integration into an existing backup strategy, making it low risk
- The ability to offload or migrate the data to physical tape for off-site disaster recovery or for long-term archiving
Because virtual tape libraries are disk-based backup devices with a virtual file system, and the backup process itself tends to have a great deal of repetitive data, virtual tape libraries lend themselves particularly well to data deduplication. In storage technology, deduplication essentially refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, an index of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored.
The amount of duplicate data that can typically be removed from a particular data type is estimated to be as follows:

PACS: 5%
Web and Microsoft Office data: 30%
Engineering data directories: 35%
Software code archive: 45%
Technical publications: 52%
Database backup: 70% or higher
In the above example, PACS is Picture Archiving and Communication Systems, a type of data used in X-rays and medical imaging. These have very little duplicate data. At the other end of the spectrum, databases contain a lot of redundant data: their structure means that there will be many records with empty fields or the same data in the same fields.
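As a rough illustration of how these estimates translate into capacity, the expected stored size for a data set can be computed from its typical duplicate fraction (a minimal sketch; the percentages are the estimates quoted above, and the function is illustrative, not an HP tool):

```python
# Typical removable duplicate fraction per data type (estimates from the text above).
DUPLICATE_FRACTION = {
    "PACS": 0.05,
    "Web and Microsoft Office data": 0.30,
    "Engineering data directories": 0.35,
    "Software code archive": 0.45,
    "Technical publications": 0.52,
    "Database backup": 0.70,  # "or higher"
}

def deduplicated_size_gb(data_type: str, size_gb: float) -> float:
    """Estimated on-disk size after removing the typical duplicate fraction."""
    return size_gb * (1.0 - DUPLICATE_FRACTION[data_type])
```

For example, a 1,000 GB database backup would be expected to occupy roughly 300 GB after deduplication alone (before any compression is applied).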
With a virtual tape library that has deduplication, the net effect is that, over time, a given amount of disk storage capacity can hold more data than is actually sent to it. To work, deduplication needs the random-access capability offered by disk-based backup. This is not to say physical tape is dead; tape is still required for archiving and disaster recovery, and both disk and tape have their own unique attributes in a comprehensive data protection solution.

The capacity optimization offered by deduplication is dependent on:
- Backup policy (full, incremental)
- Retention periods
- Data change rate
Figure 2. A visual explanation of deduplication ("Why use Deduplication?"): a chart of TBs stored versus months, comparing data stored on the device with data sent to it over a 12-month period.
A word of caution

Some people view deduplication with the approach of "That is great! I can now buy less storage," but it does not work like that. Deduplication is a cumulative process that can take several months to yield impressive deduplication ratios. Initially, the amount of storage you buy has to be sized to reflect your existing backup tape rotation strategy and the expected data change rate within your environment. HP has developed deduplication sizing tools to assist with deciding the amount of storage capacity with deduplication that is required. However, these tools do rely on the customer having a degree of knowledge of the data change rate in their systems.
HP Backup Sizer Tool
Deduplication has become popular because, as data growth soars, the cost of storing data also increases, especially backup data on disk. Deduplication reduces the cost of storing multiple backups on disk. Deduplication is the latest in a series of technologies that offer space saving to a greater or lesser degree. To compare deduplication with other data reduction, or space saving, technologies please look at Appendix B.

A worked example of deduplication is illustrated as follows:
http://h30144.www3.hp.com/SWDSizerWeb/default.htm
Figure 3. A worked example of deduplication for file system data over time

Retention policy:
- 1 week, daily incrementals (5)
- 6 months, weekly fulls (25)

Data parameters:
- Data compression rate = 2:1
- Daily change rate = 1% (10% of data in 10% of files)

Example: 1 TB file server backup

Backup                          Data sent from backup host    Data stored with deduplication
1st daily full backup           1,000 GB                      500 GB
1st daily incremental backup    100 GB                        5 GB
2nd daily incremental backup    100 GB                        5 GB
3rd daily incremental backup    100 GB                        5 GB
4th daily incremental backup    100 GB                        5 GB
5th daily incremental backup    100 GB                        5 GB
2nd weekly full backup          1,000 GB                      25 GB
3rd weekly full backup          1,000 GB                      25 GB
... (weekly fulls continue) ...
25th weekly full backup         1,000 GB                      25 GB
TOTAL                           25,500 GB                     1,125 GB

The result is a ~23:1 reduction in data stored; 2.5 TB of disk backup would normally hold only two weeks of data retention.
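The arithmetic behind Figure 3 can be reproduced with a short sketch (illustrative only, using the figure's assumptions: 2:1 compression, 1% daily change, incrementals that send ~10% of the data set, 5 retained daily incrementals and 25 retained weekly fulls; the function and parameter names are ours):

```python
def dedup_worked_example(full_gb=1000.0, incrementals=5, fulls=25,
                         compression=2.0, daily_change=0.01, days_per_week=5):
    """Totals for the Figure 3 file server example.

    Data sent: every retained backup crosses the wire in full (each daily
    incremental sends ~10% of the data set). Data stored with dedup: the
    first full is stored compressed; every later backup stores only its
    new (changed) data, compressed.
    """
    incr_sent = full_gb * 0.10
    sent = fulls * full_gb + incrementals * incr_sent

    first_full = full_gb / compression
    incr_new = full_gb * daily_change / compression                     # one day of change
    weekly_new = full_gb * daily_change * days_per_week / compression   # one week of change
    stored = first_full + incrementals * incr_new + (fulls - 1) * weekly_new
    return sent, stored
```

With the defaults this gives 25,500 GB sent against 1,125 GB stored, matching the figure's ~23:1 reduction.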
Customer Benefits of Data Deduplication

What data deduplication offers to customers is:
- The ability to store dramatically more data online (by online we mean disk based)
- An increase in the range of Recovery Point Objectives (RPOs) available: data can be recovered from further back in time from the backup to better meet Service Level Agreements (SLAs). Disk recovery of a single file is always faster than tape
- A reduction of investment in physical tape by restricting its use more to a deep archiving and disaster recovery usage model
- Deduplication can automate the disaster recovery process by providing the ability to perform site-to-site replication at a lower cost. Because deduplication knows what data has changed at a block or byte level, replication becomes more intelligent and transfers only the changed data as opposed to the complete data set. This saves time and replication bandwidth, and is one of the most attractive propositions that deduplication offers. Customers who do not use disk-based replication across sites today will embrace low-bandwidth replication, as it enables better disaster tolerance without the need and operational costs associated with transporting data off-site on physical tape. Replication is performed at a tape cartridge level
Figure 4. Remote site data protection BEFORE low-bandwidth replication

At each site, data is staged to disk and then copied to tape; tapes are made nightly and sent to an offsite vault for DR; and the process is replicated on each site, requiring local operators for managing tape. Restores beyond two weeks come from tape. Risk and operational cost impact:
- Slow restores (from tape) beyond 2 weeks
- Loss of control of tapes when given to an offsite service
- Excessive cost for offsite vaulting services
- Frequent backup failures during off hours
- Tedious daily onsite media management of tapes, labels and offsite shipment coordination
Figure 5. Remote site data protection AFTER low-bandwidth replication

Data on disk is extended to 4 months and all restores come from disk; no tapes are created locally and no operators are required at local sites for tape operations. Data is automatically replicated to remote sites across a WAN, with copies made to tape on a monthly basis for archive. Risk and operational cost impact:
- Improved RTO SLA: all restores are from disk
- No outside vaulting service required
- No administrative media management requirements at local sites
- Reliable backup process
- Copy-to-tape less frequently; consolidate tape usage to a single site, reducing the number of tapes
To show how much of an impact deduplication can have on replication times, take a look at the following Figure 6. This model also takes into account a certain overhead of control information that has to be sent site to site, as well as the data deltas themselves. Currently, without deduplication, the full amount of data has to be transferred between sites, and in general this requires high-bandwidth links such as GbE or Fibre Channel. With deduplication, only the delta changes are transferred between sites, and this reduction allows lower-bandwidth links such as T3 or OC12 to be used at lower cost. The following example illustrates the estimated replication times for varying amounts of change. Most customers would be happy with a replication time of, say, 2 hours between sites using, say, a T3 link. The feed from HP D2D backup systems or HP Virtual Library systems to the replication link is one or more GbE pipes.
Figure 6. Replication times with and without deduplication

Estimated time to replicate data for a 1 TB backup environment @ 2:1. Link rates (66% efficient): T1 = 1.5 Mb/s, T3 = 44.7 Mb/s, OC12 = 622.1 Mb/s.

Without dedupe:
Backup type    Data sent    T1           T3         OC12
Incremental    50 GB        4.5 days     3.8 hrs    16 min
Full           500 GB       45.4 days    1.6 days   2.7 hrs

With dedupe:
Change rate    Data sent    T1        T3         OC12
0.5%           13.1 GB      29 hrs    59 min     4.3 min
1.0%           16.3 GB      35 hrs    73 min     5.3 min
2.0%           22.5 GB      49 hrs    102 min    7.3 min
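The times in Figure 6 follow from simple arithmetic: the data sent, in bits, divided by the effective link rate (66% of nominal). A sketch under those assumptions (decimal GB and the table's nominal link rates; the figure's own model also includes some control-information overhead, so results match the table only to within rounding):

```python
NOMINAL_MBPS = {"T1": 1.5, "T3": 44.7, "OC12": 622.1}  # link rates from Figure 6
EFFICIENCY = 0.66                                       # links assumed 66% efficient

def replication_hours(data_gb: float, link: str) -> float:
    """Hours to push data_gb across the given link at 66% efficiency."""
    bits = data_gb * 1e9 * 8
    bits_per_second = NOMINAL_MBPS[link] * 1e6 * EFFICIENCY
    return bits / bits_per_second / 3600.0
```

For example, the 0.5% change row (13.1 GB) works out to roughly 59 minutes over T3 and about 4.3 minutes over OC12, in line with the table.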
A word of caution

An initial synchronization of the backup device at the primary site and the one at the secondary site must be performed. Because the volume of data that requires synchronizing at this stage is high, a low-bandwidth link will not suffice. Synchronization can be achieved in three different ways:
- Provision the two devices on the same site and use a feature such as local replication over high-bandwidth Fibre Channel links to synchronize the data. Then ship one of the libraries to the remote site
- Install the two separate devices at separate sites and perform the initial backup at Site A. Copy the backup from Site A to physical tape, then transfer the physical tapes to Site B and import them. When the systems at both sites are synchronized, start low-bandwidth replication between the two
- After the initial backup at Site A, allow a multi-day window for initial synchronization, allowing the two devices to copy the initial backup data over a low-bandwidth link
Understanding Customer Needs for Data Deduplication

Both large and small organizations have remarkably similar concerns when it comes to data protection. What differs is the priority of their issues.
Figure 7. Common challenges with data protection amongst remote offices, SMEs and large customers. Across these environments the common challenges and needs include:
- Overcoming a lack of dedicated IT resources
- Managing data growth
- Maintaining backup application, file and OS independence
- Spending less time managing backups
- Handling explosive data growth
- Meeting and maintaining backup windows
- Achieving greater backup reliability
- Accelerating restore from tape (including virtual tape)
- Managing remote site data protection
Different priorities are what have led HP to develop two distinct approaches to data deduplication. For example:
- Large enterprises have issues meeting backup windows, so any deduplication technology that could slow down the backup process is of no use to them. Medium and small enterprises are concerned about backup windows as well, but to a lesser degree
- Most large enterprise customers have Service Level Agreements (SLAs) pertaining to restore times; any deduplication technology that slows down restore times is not welcome either
- Many large customers back up hundreds of terabytes per night, and their backup solution with deduplication needs to scale up to these capacities without degrading performance. Fragmenting the approach by having to use several smaller deduplication stores would also make the whole backup process harder to manage
- Conversely, remote offices and smaller organizations generally need an easy approach: a dedicated appliance that is self-contained, at a reasonable cost
- Remote offices and SMEs do not want or need a system that is infinitely scalable, nor the cost associated with linearly scalable capacity and performance. They need a single-engine approach that can work transparently in any of their environments
HP Accelerated Deduplication for the Large Enterprise Customer

HP Accelerated deduplication technology is designed for large enterprise data centers. It is the technology HP has chosen for the HP StorageWorks Virtual Library Systems. Accelerated deduplication has the following features and benefits:
- Utilizes object-level differencing technology with a design centered on performance and scalability
- Delivers the fastest possible backup performance: it leverages post-processing technology to process data deduplication as backup jobs complete, deduplicating previous backups whilst other backups are still completing
- Delivers the fastest restore from recently backed up data: it maintains a complete copy of the most recent backup data and eliminates duplicate data in previous backups
- Scalable deduplication performance: it uses a distributed architecture where performance can be increased by adding additional nodes
- Flexible replication options to protect your investment
Figure 8. Object-level differencing compares only current and previous backups from the same hosts and eliminates duplicate data by means of pointers. The latest backup is always held intact. The stages shown for the current and previous backups are: Data Grooming (identifies similar data objects); Data Discrimination/Data Comparison (identifies differences at byte level, ensuring data integrity); an optional Second Integrity Check (compares deduplicated data to original data objects); Space Reclamation (deletes duplicated data and reallocates unused space, with new data stored and duplicated data replaced by a pointer to existing data); and Reassembly.
How Accelerated Deduplication Works

When the backup runs, the data stream is processed as it is stored to disk, assembling a content database on the fly by interrogating the meta data attached by the backup application. This process has minimal performance impact.

1. After the first backup job completes, tasks are scheduled to begin the deduplication processing. The content database is used to identify subsequent backups from the same data sources. This is essential, since the way object-level differencing works is to compare the current backup from a host to the previous backup from that same host.
Figure 9. Identifying duplicated data by stripping away the meta data associated with backup formats, files and databases. Object-level differencing strips away the meta data to reveal real duplication: the same actual file A, written in two backup sessions, looks different at the backup object level because of the different backup application meta data, but at a logical level the two copies are identical. Object-level differencing deduplication strips away the backup meta data to reveal the real duplicated data.
2. A data comparison is performed between the current backup and the previous backup from the same host. There are different levels of comparison. For example, some backup sessions are compared at an entire session level: here, data is compared byte-for-byte between the two versions and common streams of data are identified. Other backup sessions compare versions of files within the backup sessions. Note that within Accelerated deduplication's object-level differencing, the comparison is done AFTER the backup meta data and file system meta data has been stripped away (see the example in the following Figure 10). This makes the deduplication process much more efficient, but relies on an intimate knowledge of both the backup application meta data types and the data type meta data (file system file, database file, and so on).

3. When duplicate data is found in the comparison process, the duplicate data streams in the oldest backup are replaced by a set of pointers to a more recent copy of the same data. This ensures that the latest backup is always fully contiguous, and a restore from the latest backup will always take place at maximum speed.
Figure 10. With object-level differencing the last backup is always fully intact. Duplicated objects in previous backups are replaced with pointers (to the current version) plus byte-level differences. In the diagram, A, B, C and D are files within a backup session, shown across sessions 1-3 on days 1-3.
In the preceding diagram, backup session 1 had files A and B. When backup session 2 completed and was compared with backup session 1, file A was found and a byte-level difference was calculated for the older version. So in the older backup (session 1), file A was replaced by pointers plus difference deltas to the file A data in backup session 2. Subsequently, when backup session 3 completes, it is compared with backup session 2 and file C is found to be duplicated. Hence a difference and a pointer are placed in backup session 2 pointing to the file C data in backup session 3. At the same time, the original pointer to file A in session 1 is readjusted to point to the new location of file A; this prevents multiple hops through pointers when restoring older data. So the process continues, every time comparing the current backup with the previous backup. Each time a difference plus pointer is written, storage capacity is saved. This process allows the deduplication to track even a byte-level change between files.
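The pointer bookkeeping described above can be modelled in miniature. This is a simplified sketch, not HP's implementation: files are whole values rather than byte streams, difference deltas are omitted, and each session's store maps a file name either to literal data or to a direct pointer at a newer session:

```python
def ingest(store, session, files, prev=None):
    """Add a backup session, dedupe the previous session against it, and
    rebase older pointers so restores never hop through pointer chains."""
    store[session] = {name: ("data", data) for name, data in files.items()}
    if prev is not None:
        for name, (kind, data) in list(store[prev].items()):
            if kind == "data" and files.get(name) == data:
                store[prev][name] = ("ptr", session)   # older copy becomes a pointer
    # Rebase: if a pointer's target is itself a pointer, point straight at the data.
    for contents in store.values():
        for name, (kind, target) in list(contents.items()):
            if kind == "ptr" and store[target][name][0] == "ptr":
                contents[name] = ("ptr", store[target][name][1])

def restore(store, session, name):
    """Follow at most one pointer to reach the data."""
    kind, value = store[session][name]
    return value if kind == "data" else restore(store, value, name)
```

After three nightly sessions, the newest session holds only literal data (the latest backup is fully intact), while older copies of unchanged files collapse to single-hop pointers at the most recent copy.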
4. Secondary integrity check: before a backup tape is replaced by a deduplicated version with pointers to a more recent occurrence of that data, a byte-for-byte comparison can take place, comparing the original backup with the reconstructed backup (including pointers) to ensure that the two are identical. Only when the compare succeeds will the original backup tape be replaced by a version including pointers. This step is optional. See the optional Second Integrity Check in Figure 8.

5. Space reclamation occurs once the work of replacing duplicate data with pointers to a single instance of the data is complete. This can take some time, and results in used capacity being returned to a free pool on the device.
Replication can take place from Step 3 because the changed data is available to be replicated even
before the space has been reclaimed.
HP Accelerated deduplication:
- Will scale up to hundreds of TB
- Has no impact on backup performance, since the comparison is done after the backup job completes (post process)
- Allows more deduplication compute nodes to be added to increase deduplication performance and ensure the post processing is complete before the backup cycle starts again
- Yields high deduplication ratios because it strips away meta data to reveal true duplication, and does not rely on data chunking
- Provides fast bulk data restore and tape cloning for recently backed up data: it maintains the complete most recent copy of the backup data but eliminates duplicate data in previous backups
Issues Associated with Object-Level Differencing

The major issue with object-level differencing is that the device has to be knowledgeable in terms of backup formats and data types in order to understand the meta data. HP Accelerated deduplication will support a subset of backup applications and data types at launch.

Additionally, object-level differencing compares only backups from the same host against each other, so there is no deduplication across hosts; however, the amount of common data across different hosts can be quite low.
What Makes HP Accelerated Deduplication unique?

The object-level differencing in HP Accelerated deduplication is unique in the marketplace. Unlike hash-based techniques, which are an all-or-nothing method of deduplication, object-level differencing applies intelligence to the process, giving users the ability to decide what data types are deduplicated and allowing flexibility to reduce the deduplication load if it is not yielding the expected or desired results. HP object-level differencing technology is also the only deduplication technology that can scale to hundreds of terabytes with no impact on backup performance, because the architecture does not depend on managing ever-increasing index tables, as is the case with hash-based chunking. It is also well suited to larger scalable systems, since it is able to distribute the deduplication workload across all the available processing resources and can even have dedicated nodes purely for deduplication activities.
HP Accelerated deduplication will be supported on a range of backup applications:
- HP Data Protector
- Symantec NetBackup
- Tivoli Storage Manager
- Legato NetWorker

HP Accelerated deduplication will support a wide range of file types:
- Windows 2003
- Windows Vista
- HP-UX 11.x
- Solaris standard file backups
- Linux (Red Hat, SuSE)
- AIX file backups
- Tru64 file backups
HP Accelerated deduplication will support database backups over time:
- Oracle RMAN
- Hot SQL backups
- Online Exchange MAPI mailbox backups

For the latest details on what backup software and data types are supported with HP Accelerated Deduplication, please look at the HP Enterprise Backup Solutions compatibility guide at http://www.hp.com/go/ebs

HP Accelerated deduplication technology is available by license on HP StorageWorks Virtual Library Systems (models 6000, 9000, and 12000). The license fee is per TB of user storage (before compression or deduplication takes effect).
Figure 11. Pros and cons of HP Accelerated Deduplication

Pros:
- Does not restrict backup rate, since data is processed after the backup has completed
- Faster restore rate: forward-referencing pointers allow rapid access to data
- Can handle datasets > 100 TB without having to partition backups; no hashing table dependencies
- Can selectively compare data likely to match, increasing performance further and yielding higher deduplication ratios
- Best suited to large enterprise VTLs

Cons:
- Has to be ISV format aware and data type aware; content coverage will grow over time
- May need additional compute nodes to speed up post-processing deduplication in scenarios with long backup windows
- Needs to cache 2 backups in order to perform the post-process comparison, so additional disk capacity equal to the size of the largest backup needs to be sized into the solution
At ingest time, when the tape content database is generated, there is a small performance overhead
(< 0.5%), and a small amount of disk space is required to hold this database (much less than the
hash tables in hash-based chunking deduplication technology). Even if this content database were
completely destroyed, it would still be possible to maintain access to the data, because the pointers
remain fully intact and held within the re-written tape formats.
HP object-level differencing also has the ability to provide selective deduplication by content type,
and in the future it could be used to index content, providing content-addressable archive searches.
The question often arises: "What happens if deduplication is not complete by the time the same
backup from the same host arrives?" Typically the deduplication process takes about twice as long as
the backup process for a given backup, so as long as a single backup job does not take more than 8 hours
this will not occur. In addition, the multi-node architecture ensures that each node is load-balanced to
provide 33% of its processing capability to deduplication while still maintaining the necessary
performance for backup and restore. Finally, additional dedicated 100% deduplication compute
nodes can be added if necessary.
Let us now analyze HP's second type of deduplication technology, Dynamic deduplication, which
uses hash-based chunking.
HP Dynamic Deduplication for Small and Medium IT Environments
HP Dynamic deduplication is designed for customers with smaller IT environments. Its main features
and benefits include:
- Hash-based chunking technology with a design center around compatibility and cost
- Low cost and a small RAM footprint
- Independence from backup applications
- Systems with built-in data deduplication
- Flexible replication options for increased investment protection.

Hash-based chunking techniques for data reduction have been around for years. Hashing consists of
applying an algorithm to a specific chunk of data, yielding a unique fingerprint of that data. The
backup stream is simply broken down into a series of chunks. For example, a 4K chunk in a data
stream can be hashed so that it is uniquely represented by a 20-byte hash code. See Figure 12.
Figure 12. Hashing technology

[Diagram: three sample inputs ("HP invent", "HP StorageWorks", "HP Nearline Storage") each pass through the hashing function to yield a short hash output (DFCD3453, 785C3D92, 4673FD74B). Key: in-line = deduplication on the fly, as data is ingested, using hashing techniques; hashing = a reproducible method of turning some kind of data into a (relatively) small number that may serve as a digital "fingerprint" of the data.]
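The chunk-and-hash operation in the figure can be sketched in a few lines (a minimal illustration using Python's standard hashlib, with the 4K chunk size and 20-byte SHA-1 values quoted in the text; it is not HP's implementation):

```python
import hashlib

CHUNK_SIZE = 4 * 1024  # nominal 4K chunks, as described in the text

def chunk_fingerprints(stream: bytes):
    """Split a backup stream into fixed-size chunks and yield the
    20-byte SHA-1 fingerprint of each chunk."""
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        yield hashlib.sha1(chunk).digest()  # 20 bytes per chunk

data = b"A" * (2 * CHUNK_SIZE)  # two identical 4K chunks
prints = list(chunk_fingerprints(data))
print(len(prints), prints[0] == prints[1])  # 2 True: identical chunks hash identically
```

Because the fingerprint is reproducible, two identical chunks always collide on purpose, which is exactly what the deduplication index exploits.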
The larger the chunks, the less chance there is of finding an identical chunk that generates the same
hash code; thus, the deduplication ratio will not be as high. The smaller the chunk size, the more
efficient the data deduplication process, but then a larger number of indexes is created, which leads
to problems storing enormous numbers of indexes (see the following example and Glossary).
Figure 13. How hash-based chunking works

[Diagram: Backup 1 is split into chunks (#33, #13, #1, #65, #9, #245, #21, #127) and the hashing function is applied; each new hash is entered into a RAM index mapping hash values to disk block numbers, and the chunks are stored on disk. For Backup 2 (#33, #13, #222, #75, #9, #245, #86, #127), a look-up is performed for each generated hash: hashes already in the index are not stored again, while new hashes (#222, #75, #86) are added to the index and their chunks written to disk.]
How Dynamic Deduplication Works

1. As the backup data stream enters the target device (in this case the HP D2D2500 or D2D4000 Backup System), it is chunked into nominal 4K chunks against which the SHA-1 hashing algorithm is run. The resulting hash values are placed in an index stored in RAM in the target D2D device. Each hash value is also stored as an entry in a recipe file, which represents the backup stream and points to the location in the deduplication store where the original 4K chunk is stored. This happens in real time as the backup is taking place, and continues for the whole backup data stream.
2. When another 4K chunk generates the same hash index as a previous chunk, no index is added to the index list and the data is not written to the deduplication store. An entry with the hash value is simply added to the recipe file for that backup stream, pointing to the previously stored data, so space is saved. As you scale this up over many backups, there are many instances of the same hash value being generated, but the actual data is only stored once, so the space savings increase.
3. Now let us consider backup 2 in Figure 13. As the data stream is run through the hashing algorithm again, much of the data will generate the same hash index codes as in backup 1; hence, there is no need to add indexes to the table or use storage in the deduplication store. In this backup, some of the data has changed. In some cases (#222, #75, and #86), the data is unique and generates new indexes for the index store and new data entries in the deduplication store.
4. And so the hashing process continues until, as backups are overwritten by the tape rotation strategy, certain hash indexes are no longer required, and they are removed in a housekeeping operation.
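The four steps above can be condensed into a toy in-line deduplicator (a minimal sketch only; `index`, `store`, and `ingest` are illustrative names, not HP's internals):

```python
import hashlib

CHUNK_SIZE = 4 * 1024

def ingest(stream: bytes, index: dict, store: dict):
    """In-line dedupe of one backup stream.

    index: hash -> disk block number (the RAM index)
    store: disk block number -> chunk bytes (the deduplication store)
    Returns the recipe file: the ordered list of hashes that
    reconstructs this backup stream."""
    recipe = []
    for off in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[off:off + CHUNK_SIZE]
        h = hashlib.sha1(chunk).digest()
        if h not in index:          # new data: index it and store the chunk
            index[h] = len(store)
            store[index[h]] = chunk
        recipe.append(h)            # always record the hash in the recipe
    return recipe

index, store = {}, {}
backup1 = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
backup2 = b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE   # one chunk changed
r1 = ingest(backup1, index, store)
r2 = ingest(backup2, index, store)
print(len(store))  # 3: four chunks ingested, but only three unique chunks stored
```

The duplicate "A" chunk in backup 2 costs only a recipe entry, not a second copy on disk, which is where the space savings come from.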
Figure 14. How hash-based chunking performs restores

[Diagram: a restore of Backup 1 commences and its recipe file (#33, #13, #1, #65, #9, #245, #21, #127) is referenced in the dedupe store. The recipe file, stored in the dedupe store, is used to reconstruct the tape blocks that constitute the backup: each recipe entry refers to the RAM index, which maps the hash value to a disk block number, and the corresponding chunk (e.g. #33) is read from disk and restored.]
5. On receiving a restore command from the backup system, the D2D device selects the correct recipe file and starts sequentially reassembling the file to restore:
a. Read the recipe file.
b. Look up the hash in the index to get the disk pointer.
c. Get the original chunk from disk.
d. Return the data to the restore stream.
e. Repeat for every hash entry in the recipe file.
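Steps a through e amount to a simple reassembly loop. The sketch below (illustrative names again, paired with a minimal in-line ingest to build the structures) shows the round trip:

```python
import hashlib

CHUNK_SIZE = 4 * 1024

def restore(recipe, index, store) -> bytes:
    """Reassemble a backup from its recipe file."""
    out = bytearray()
    for h in recipe:          # a. read recipe file / e. repeat per entry
        block_no = index[h]   # b. look up hash in index -> disk pointer
        out += store[block_no]  # c. get original chunk / d. return to stream
    return bytes(out)

# Build a tiny two-chunk "backup" the same way the ingest path would.
index, store, recipe = {}, {}, []
data = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
for off in range(0, len(data), CHUNK_SIZE):
    chunk = data[off:off + CHUNK_SIZE]
    h = hashlib.sha1(chunk).digest()
    if h not in index:
        index[h] = len(store)
        store[index[h]] = chunk
    recipe.append(h)

print(restore(recipe, index, store) == data)  # True: restore matches the original
```

Note that every restored byte goes through the index look-up and chunk fetch, which is why the text warns that hash-based restores can be slower than the original backup.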
Issues Associated with Hash-Based Chunking

The main issue with hash-based chunking technology is the growth of indexes and the limited amount
of RAM available to store them. Let us take a simple example: a 1TB backup data stream using 4K
chunks, where every 4K chunk produces a unique hash value. This equates to 250 million 20-byte
hash values, or 5GB of storage.

If we performed no other optimization (for example, paging of indexes onto and off disk), then the
appliance would need 5GB of RAM for every TB of deduplicated unique data. Most server systems
cannot support much more than 16GB of RAM. For this reason, hash-based chunking cannot easily
scale to hundreds of terabytes.
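The arithmetic behind the 250-million-hash / 5 GB figure can be checked directly (using decimal 1 TB = 10^12 bytes; the text's numbers are rounded):

```python
TB = 10**12            # 1 TB, decimal
chunk_size = 4 * 1024  # 4K chunks
hash_size = 20         # 20-byte SHA-1 values

chunks_per_tb = TB // chunk_size           # ~244 million unique chunks
index_bytes = chunks_per_tb * hash_size    # raw index: hash values alone, no overhead
print(chunks_per_tb, index_bytes / 10**9)  # ~4.9 GB of index per TB of unique data
```

In practice the index also carries the disk pointers and bookkeeping per entry, so the 5 GB per TB quoted in the text is if anything optimistic.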
Most lower-end to mid-range deduplication technologies use variations on hash-based chunking, with
additional techniques to reduce the size of the indexes generated. These reduce the amount of RAM
required, but generally at the expense of some deduplication efficiency or performance. If the index
management is not efficient, it will slow the backup down to unacceptable levels or miss many
instances of duplicate data. The other option is to use larger chunk sizes to reduce the size of the
index. As mentioned earlier, the downside is that deduplication will be less efficient. These
algorithms can also be adversely affected by non-repeating data patterns that occur in some backup
software tape formats. This becomes a bigger issue with larger chunk sizes.
HP has developed a unique, innovative technology leveraging work from HP Labs that dramatically
reduces the amount of memory required for managing the index without sacrificing performance or
deduplication efficiency. Not only does this technology enable low-cost, high-performance disk
backup systems, but it also allows the use of much smaller chunk sizes to provide more effective data
deduplication that is more robust to variations in backup stream formats or data types.
Restore times can be slow with hash-based chunking. As you can see from Figure 14, recovering a 4K
piece of data from a hash-based deduplication store requires a reconstruction process. The restore
can take longer than the backup did.
Finally, you may hear the term hashing collision; this means that two different chunks of data
produce the same hash value, which obviously undermines data integrity. The chances of this
happening are remote, to say the least. HP Labs calculated that, using a TWENTY-BYTE (160-bit)
hash such as SHA-1, the time required for a hashing collision to occur is 100,000,000,000,000
years, based on backing up 1TB of data per working day.
Even so, HP Dynamic deduplication adds a further Cyclic Redundancy Check (CRC) at the tape
record level that would catch the highly unlikely event of a hash collision.
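The order of magnitude of that figure can be reproduced with the standard birthday approximation (a sketch under stated assumptions, not HP Labs' exact calculation: 4 KB chunks, 1 TB per working day, 250 working days per year, and collision odds reaching ~50% once the number of stored hashes approaches 2^80):

```python
# Birthday-bound sanity check for 160-bit (SHA-1) hash collisions.
hash_bits = 160
chunks_per_day = 10**12 // (4 * 1024)  # ~244 million new hashes per working day
n_half = 2 ** (hash_bits / 2)          # hash count at ~50% collision probability
years = n_half / (chunks_per_day * 250)
print(f"{years:.3g} years")            # on the order of 10^13 years
```

The result lands in the same astronomical range as the figure quoted above; either way, the CRC check is belt-and-braces rather than a practical necessity.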
Despite the above limitations, deduplication using hash-based chunking is a well-proven technology
and serves remote offices and medium-sized businesses very well. The biggest benefit of hash-based
chunking is that it is totally data-format-independent; it does not have to be engineered to work
with specific backup applications and data types. Products using hash-based deduplication
technology still have to be tested with the various backup applications, but the design approach is
generic.
HP is deploying Dynamic deduplication technology on its latest D2D Backup Systems, which are
designed for remote offices and small to medium organizations.

HP D2D2500 and 4000 Backup Systems come with deduplication as standard, with no additional
licensing costs.
Figure 15. Pros and cons of hash-based chunking deduplication

Pros & Cons of HP Dynamic Deduplication

PRO:
- Deduplication is performed at backup time.
- Can instantly handle any data format.
- Fast search; algorithms already proven to aid hash detection.
- Low storage overhead: does not have to hold complete backups (TBs) for post analysis.
- Best suited to smaller-size VTLs.

CON:
- Can restrict the ingest rate (backup rate) if not done efficiently, and could slow backups down.
- Restore time may be longer than object-level differencing because of the data regeneration process.
- Significant processing overhead, though deduplication is keeping pace with processor developments.
- Concerns over scalability when using very large hash indexes; for data sets > 100 TB, backups may have to be partitioned to ensure better hash index management.

What makes HP Dynamic Deduplication technology unique are algorithms developed with HP Labs
that dramatically reduce the amount of memory required for managing the index, without
sacrificing performance or deduplication effectiveness. Specifically, this technology:
- Uses far less memory, by implementing algorithms that determine which are the most optimal indexes to hold in RAM for a given backup data stream
- Allows the use of much smaller chunk sizes to provide more effective data deduplication that is more robust to variations in backup stream formats or data types
- Provides intelligent storage of chunks and recipe files to limit disk I/O and paging
- Works well in a broad range of environments, since it is independent of backup software formats and data types
Low-Bandwidth Replication Usage Models

The second main benefit of deduplication is the ability to replicate the changes in data on site A to a
remote site B at a fraction of the cost, because high-bandwidth links are no longer required. A general
guideline is that a T1 link is about 10% of the cost of a 4Gb FC link over the same distance. Low-
bandwidth replication will be available on both D2D and VLS products. Up to two GbE ports will be
available for replication on D2D devices, and one GbE port per node will be available on the VLS
products.
HP will support three topologies for low-bandwidth replication:
- Box-to-Box
- Active-Active
- Many-to-One

The unit of replication is a cartridge. On VLS, it will be possible to partition slots in a virtual library
replication target device to be associated with specific source replication cartridges.
Figure 16. Active-Active replication on HP VLS and D2D systems with deduplication

[Diagram: Accelerated Deduplication replication, example use case Active/Active. Two sites, each with a backup server and a VLS (VLS1 and VLS2), connected over TCP/IP. Each VLS hosts its own local virtual library (VLib1 or VLib2) plus a replica of the other site's library. This is generally datacenter-to-datacenter replication, with each device performing local backups and also acting as the replication store for the other datacenter.]
Figure 17. Many-to-one replication on HP VLS and D2D systems with deduplication

[Diagram: Accelerated Deduplication replication, example use case Many-to-One. Three source sites (VLS1, VLS2, VLS3), each with a backup server and its own VLib1, replicate over TCP/IP into a central VLS4. A single destination target can be divided into multiple slot ranges to allow many-to-one replication without needing a separate replication library for each source.]
Initially it will not be possible for D2D devices to replicate into the much larger VLS devices, since their
deduplication technologies are so different, but HP plans to offer this feature in the near future.

What will be possible is to replicate multiple HP D2D2500s into a central D2D4000, or to replicate
smaller VLS6200 models into a central VLS12000 (see Figure 18).
Deduplication technology is leading us to the point where many remote sites can replicate data
back to a central data center at a reasonable cost, removing the need for tedious off-site vaulting of
tapes and fully automating the process, saving even more costs.

This ensures:
- The most cost-effective solution is deployed at each specific site
- The costs and issues associated with off-site vaulting of physical tape are removed
- The whole disaster recovery process is automated
- The solution is scalable at all sites
Figure 18. Enterprise deployment with replication across remote and branch offices back to data centers

[Diagram: an enterprise deployment with small and large remote/branch offices (ROBOs). Small remote offices (mobile/desktop clients, a backup server, 1-4 servers, > 200 GB storage) back up to D2D Appliances over the LAN. A large remote office or regional site/small datacenter uses a D2D Appliance with backup/media servers and disk storage. The large datacenter and a secondary datacenter each run Virtual Library Systems and tape library systems on the SAN, with backup servers and disk storage, and the remote sites replicate back to the datacenters.]
Why HP for Deduplication?

Deduplication is a powerful technology and there are many different ways to implement it, but most
vendors offer only one method and, as we have seen, no one method is best in all circumstances. HP
offers a choice of deduplication technologies depending on your needs. HP does not pretend that
one size fits all.
Choose HP Dynamic deduplication for small and mid-size IT environments because it offers the best
technology footprint for deduplication at an affordable price point. Flexible replication options
further enhance the solution.

Choose HP Accelerated deduplication for Enterprise data centers where scalability and backup
performance are paramount. Flexible replication options further enhance the solution.
The scalability issues associated with hash-based chunking are addressed by some competitors by
creating multiple separate deduplication stores behind a single management interface. But this
creates islands of deduplication, so the customer sees reduced benefits and excessive costs
because the solution is not inherently scalable.

At the data center level, HP's major competitors using object-level differencing have bolted
deduplication engines onto existing virtual tape library architectures rather than integrating the
deduplication engine within the VTL itself. This leads to data being moved back and forth between the
virtual library and the deduplication engine, which is very inefficient.
Deduplication Technologies Aligned with HP Virtual Library
Products

HP has a range of disk-based backup products with deduplication, starting with the entry-level
D2D2500 with 2.25TB of user capacity for small businesses and remote offices, right up to the
VLS12000 EVA Gateway with capacities over 1PB for the high-end enterprise data center customer.
They emulate a range of HP physical tape autoloaders and libraries.
Figure 19. HP disk-based backup portfolio with deduplication

[Diagram: the HP StorageWorks disk-to-disk and virtual library portfolio with deduplication, arranged by capacity. Entry-level: D2D2500 (iSCSI), manageable and reliable, for midsized businesses or IT with remote branch offices. Mid-range: D2D4000 (iSCSI & FC), for midsized businesses or IT with small data centres, and the VLS6000 family, a scalable, manageable, reliable appliance for medium to large data centers and medium to large FC SANs. Enterprise: VLS9000 and VLS12000 EVA Gateway, high-capacity, high-performance multi-node systems, available and scalable, for enterprise data centers and large FC SANs. The D2D models use Dynamic Deduplication (hash-based chunking); the VLS models use Accelerated Deduplication (object-level differencing).]
The HP StorageWorks D2D2500 and D2D4000 Backup Systems support HP Dynamic deduplication.
These range in size from 2.25TB to 7.5TB and are aimed at remote offices or small enterprise
customers. The D2D2500 has an iSCSI interface to reduce the cost of implementation at remote
offices, while the D2D4000 offers a choice of iSCSI or 4Gb FC.

The HP StorageWorks Virtual Library Systems are all 4Gb SAN-attached devices, which range in
native user capacity from 4.4TB to over a petabyte with the VLS9000 and VLS12000 EVA Gateway.
Hardware compression is available on the VLS6000, 9000, and 12000 models, achieving even
higher capacities. The VLS9000 and VLS12000 use a multi-node architecture that allows
performance to scale in a linear fashion. With eight nodes, these devices can sustain a throughput of
up to 4800MB/sec at 2:1 data compression, provided the SAN hosts can supply data at this rate.

HP Virtual Library Systems will deploy the HP Accelerated deduplication technology.
Summary
Data deduplication technology represents one of the most significant storage enhancements in recent
years, promising to reshape future data protection and disaster recovery solutions. Deduplication
offers the ability to store more on a given amount of storage and enables replication using low-
bandwidth links, both of which improve cost effectiveness.
HP offers two complementary deduplication technologies for different customer needs:

Accelerated deduplication (with object-level differencing) for high-end enterprise customers who require:
- Fastest possible backup performance
- Fastest restore
- Most scalable solution in terms of performance and capacity
- Multi-node low-bandwidth replication
- Highest deduplication ratios
- Wide range of replication models

Dynamic deduplication (with hash-based chunking) for mid-size organizations and remote offices that require:
- Lower cost and a smaller footprint
- An integrated deduplication appliance with lights-out operation
- Backup application and data type independence for maximum flexibility
- Wide range of replication models
This whitepaper explained how HP's deduplication technologies work in practice, the pros and cons
of each approach, when to choose a particular type, and the types of low-bandwidth replication
models HP plans to support.
The HP Virtual Library System (VLS) incorporates Accelerated deduplication technology that scales for
large multi-node systems and delivers high-performance deduplication for enterprise customers.

HP D2D (Disk-to-Disk) Backup Systems use Dynamic deduplication technology that provides a
significant price advantage over competitors, thanks to a combination of HP patents allowing optimal
RAM usage (RAM footprint) with minimal new hash values being generated on similar backup
streams. HP D2D backup systems with integrated deduplication set a new price point for
deduplication devices.
Appendix A: Glossary of Terminology

Source-based Deduplication
Where data is deduplicated in the host(s) prior to transmission over the storage network. This
generally tends to be a proprietary approach.

Target-based Deduplication
Where the data is deduplicated in a target device, such as a virtual tape library, and is
available to all hosts using that target device.
Hashing
A reproducible method of turning some kind of data into a (relatively) small number that may
serve as a digital "fingerprint" of the data.
Chunks
A method of breaking down a data stream into segments (chunks); the hashing algorithm is run
on each chunk.
SHA-1
Secure Hashing Algorithm 1. For example, SHA-1 can enable a 4K chunk of data to be uniquely
represented by a 20-byte hash value.
Object-Level Differencing
A general IT term for a process that has an intimate knowledge of the data it is handling, down
to the logical format level. Object-level differencing deduplication means the deduplication
process has an intimate knowledge of the backup application format and the file types being
backed up (for example, Windows file systems, Exchange files, and SQL files). This intimate
knowledge allows file comparisons at a byte level to remove duplicated data.
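A toy sketch of the byte-level comparison this entry describes (illustrative only; `byte_diff` is an invented name and this is not HP's algorithm): given the stored version of an object and a new version, only the byte runs that changed need to be kept.

```python
def byte_diff(old: bytes, new: bytes):
    """Return (offset, replacement) runs where new differs from old.

    A deliberately naive position-by-position compare; real object-level
    differencing engines understand the logical format of the data."""
    diffs, i = [], 0
    while i < len(new):
        if i >= len(old) or old[i] != new[i]:
            j = i
            while j < len(new) and (j >= len(old) or old[j] != new[j]):
                j += 1
            diffs.append((i, new[i:j]))  # store only the changed run
            i = j
        else:
            i += 1
    return diffs

old = b"hello world, version one"
new = b"hello world, version two"
print(byte_diff(old, new))  # [(21, b'two')]: only the changed tail is stored
```

The payoff is that a second backup of a mostly unchanged object costs only the size of its diffs, not another full copy.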
Box-to-Box
Replication from a source to a destination in one direction.

Active-Active
Replication from a source device on site A to a target device on site B, and vice versa.

Many-to-One
Replication from multiple sources to a single destination device.
Deduplication Ratio
The reduction in storage required for a backup (after several other backups have taken place).
Figures between 10:1 and 300:1 have been quoted by different vendors. The ratio is highly
dependent on:
- Rate of change of the data (for example, 10% of the data in 10% of the files)
- Retention period of backups
- Efficiency of the deduplication technology implementation

Space Reclamation
With all deduplication devices, time is required to free up space that was used by duplicated data
and return it to a free pool. Because this can be quite time-consuming, it tends to occur in off-peak
periods.
Post-Processing
Where deduplication is done AFTER the backup completes, ensuring there is no way the
deduplication process can slow down the backup and increase the backup window required.
In-Line
Where the deduplication process takes place in REAL TIME, as the backup is actually taking place.
Depending on the implementation, this may or may not slow the backup process down.
Multi-thread
Within HP object-level differencing, the compare and space reclamation processes are run with
multiple paths simultaneously to ensure faster execution times.
Multi-node
HP VLS9000 and VLS12000 products scale to offer very high performance levels: up to eight nodes
can run in parallel, giving throughput capabilities up to 4800MB/sec at a 2:1 compression ratio. This
multi-node architecture is fundamental to HP's Accelerated deduplication technology because it allows
maximum processing power to be applied to the deduplication process.
Appendix B: Deduplication compared to other data reduction technologies

Deduplication
Description: Advanced technique for efficiently storing data by referencing existing blocks of data that have been previously stored, and only storing new data that is unique.
Pro: Two-fold benefits: space savings of between 10:1 and 100:1 being quoted, with the further benefit of low-bandwidth replication.
Con: Can slow backup down if not implemented efficiently. Hash-based technologies may not scale to 100s of TB. Object-level differencing technologies need to be multi-format aware, which takes time to engineer.
Comments: Deduplication is by far the most impressive disk storage reduction technology to emerge in recent years. Implementation varies by vendor; benchmarking is highly recommended.

Single Instancing
Description: Really deduplication at a file level. Available as part of the Microsoft file system and as a feature of the file system of a NetApp filer.
Pro: System-based approach to space savings.
Con: Will not eliminate redundancy within a file, only if two files are exactly the same (for example, adding files to a PST file, or adding a slide to a presentation).
Comments: Limited use.

Array-based snapshots
Description: Capture changed blocks on a disk LUN.
Pro: Used primarily for fast roll-back to a consistent state using image recovery; not really focused on storage efficiency.
Con: Does not eliminate redundant data within the changed blocks. Captures any change made by the file system; for example, it does not distinguish between real data and deleted/free space on disk.
Comments: Well established. Generally used for quick recovery to a known point in time.

Incremental Forever backups
Description: Recreate a full restore image from just one full backup and lots of incrementals.
Pro: Minimizes the need for frequent full backups and hence allows for smaller backup windows.
Con: More focused on time savings than on space savings. Generally only works with file system backups, not database-based backups.

Compression (software or hardware)
Description: Fast (if done in hardware), slower if done in software.
Pro: Well established and understood.
Con: Maximum space savings are generally 2:1.
Comments: Can be used in addition to deduplication.
For more information

www.hp.com/go/tape
www.hp.com/go/D2D
www.hp.com/go/VLS
www.hp.com/go/deduplication
HP StorageWorks customer success stories
Copyright 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

Linux is a U.S. registered trademark of Linus Torvalds. Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation. UNIX is a registered