Deduplication School

Upload: harshvyas

Post on 07-Apr-2018


  • 8/6/2019 Deduplication School

    1/61

    Sponsored by:

    Deduplication School 2010 Presentation Download

    Since 2009, there have been a number of changes and advancements in the storage environment. Data deduplication was the #1 storage technology being evaluated by storage professionals last year.

    This presentation download will explain how to leverage data deduplication technology to benefit your organization.

    Download to learn the answers to the following questions:

    How do recent major acquisitions affect the options in the dedupe marketplace?

    Is everyone doing dedupe now?

    Are all dedupe products roughly equivalent, or are there advantages to certain approaches?

    These questions and more will be answered by storage expert W. Curtis Preston in this Dedupe School seminar presentation download.

    http://www.emc.com/
  • 8/6/2019 Deduplication School

    2/61

    Deduplication School 201

    http://searchstorage.techtarget.com/Ded

    W. Curtis Preston

    Executive Editor, TechTarget

    Founder/CEO, Truth in IT, Inc.

    http://searchstorage.techtarget.com/DedupeSchool
  • 8/6/2019 Deduplication School

    3/61

    A Little About Me

    When I started as backup guy at $35B com

    1993:

    Tape Drive: QIC 80 (80 MB capacity)

    Tape Drive: Exabyte 8200 (2.5 GB & 256 KB/s)

    Biggest Server: 4 GB ('93), 100 GB ('96)

    Entire Data Center: 200 GB ('93), 400 GB ('96)

    My TiVo now has 5 times the storage my data cente

    Consulting in backup & recovery since '96

    Author of O'Reilly's Backup & Recovery &

    Using SANs and NAS

    W b t f B k C t l

  • 8/6/2019 Deduplication School

    4/61

    A Little Bit About Truth in IT, Inc.

    Inspired by Consumer Reports, but designe

    No advertising, no partners = no need to

    No huge consulting fees just to find out whic

    work and which ones don't work (such fees

    at $10K and go all the way to $100K!)

    Funded instead by $999 annual subscription

    Private online community with written resea

    results, podcasts of interviews with users of

    and direct communication with real custome

  • 8/6/2019 Deduplication School

    5/61

    Agenda

    Understanding Deduplication

    Using Deduplication in Backup System

    Using Data Reduction in Primary Syst

    Recent Backup Software Advancemen

    Backing up Virtual Servers

    Backups on a Budget

  • 8/6/2019 Deduplication School

    6/61

    Session 1

    Understanding Deduplication

  • 8/6/2019 Deduplication School

    7/61

    Why Disk?

    First a little history

  • 8/6/2019 Deduplication School

    8/61

    History of My World Part I

    When I joined the industry (1993)

    Disks were 4 MB/s, tapes were 256 KB/s

    Networks were 10 Mb shared

    Seventeen years later (2010)

    Disks are 70 MB/s, tapes are 120 MB/s

    Networks are 10 Gb switched

    Changes in 17 years

    17x increase in disk speed (luckily, RAID

    E

  • 8/6/2019 Deduplication School

    9/61

    More History

    Plan A: Stage to disk, spool to tape

    Pioneered by IBM in '90s, widely adopted in lat

    Large, very fast virtual disk as caching mecha

    Only need enough disk to hold one night's bac

    Helps backups; does not help restores

    Plan B: Backup to disk, leave on disk

    AKA the early VTL craze

    Helps backups and restores

    Di k till t i t k thi

  • 8/6/2019 Deduplication School

    10/61

    Plan C: Dedupe

    It's perfect for traditional backup

    Fulls back up the same data every day/week/m

    Incrementals back up entire file when only one

    Both back up file 100 times if it's in 100 locatio

    Databases are often backed up full every day

    Tons of duplicate blocks!

    Average actual reduction of 10:1 and higher

    It's not perfect for everything

    P d t d d t

  • 8/6/2019 Deduplication School

    11/61

    Naysayers

    Eliminate all but one copy?

    No, just eliminate duplicates per location

    What about hash collisions?

    More on this later, but this is nothing but FUD

    If you're unconvinced, use a delta differential

    Doesn't this have immutability conce

    Everything that changes the format of the data

    immutability concerns (e.g. sector-based stora

  • 8/6/2019 Deduplication School

    12/61

    Is There a Plan D?

    Some pundits/analysts think dedupe

    target dedupe) is a band-aid, and wil

    eventually be done away with via bac

    software-based dedupe, delta-backup

    Maybe this will happen in a 3-5 year

    maybe it won't. (In fact, some backu

    companies will tell you they don't ne

    stinking dedupe appliances.)

  • 8/6/2019 Deduplication School

    13/61

    How Dedupe Works

  • 8/6/2019 Deduplication School

    14/61

    Your Mileage WILL Vary

    You really can get 10x to 400x

    It depends on

    Frequency of full backups (more fulls = more

    How much of a given incremental backup cont

    of other files (multimedia generally doesn't ha

    Length of retention (longer retention = more d

    Redundancy in single full backup (if your prod

    Things that confuse dedupe

  • 8/6/2019 Deduplication School

    15/61

    How Do They Identify Duplicate D

    Two very different methods

    Chunking/hashing

    Asigra, EMC Avamar, Symantec PureDisk, C

    Simpana

    EMC Data Domain, Greenbytes, FalconStor V

    NEC, Quantum DXi

    Delta differential

    Exagrid, IBM Protectier, Ocarina, SEPATON

    Some systems may use a hybrid approach

  • 8/6/2019 Deduplication School

    16/61

    Chunking/Hashing Method

    Slice all data into segments or chunk

    Run chunk through hashing algorithm

    Check hash value against all other ha

    Chunk with identical hash value is dis

    Will find redundant blocks between f

    different file systems, even different
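
The chunking/hashing steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fixed 8 KB chunks, SHA-1 as the hash, and an in-memory dictionary as the index. `ChunkStore`, `write`, `read`, and `stored_bytes` are hypothetical names, not any vendor's API, and real products typically use more elaborate chunking and on-disk indexes.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # assumption: fixed 8 KB chunks


class ChunkStore:
    """Toy dedupe store: one copy of each unique chunk, keyed by its hash."""

    def __init__(self):
        self.chunks = {}  # hex digest -> chunk bytes

    def write(self, data):
        """Store data; return the ordered list of chunk hashes (the "recipe")."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha1(chunk).hexdigest()
            # A chunk whose hash is already in the index is not stored again.
            if digest not in self.chunks:
                self.chunks[digest] = chunk
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its recipe."""
        return b"".join(self.chunks[d] for d in recipe)

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore()
monday = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
tuesday = b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE  # only the second chunk changed
monday_recipe = store.write(monday)
tuesday_recipe = store.write(tuesday)
print(len(monday) + len(tuesday), "bytes ingested,", store.stored_bytes(), "bytes stored")
```

Two 16 KB backups are ingested, but only three unique 8 KB chunks end up stored, and either backup can still be rebuilt from its recipe.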

  • 8/6/2019 Deduplication School

    17/61

    Delta Differential Method

    Correlate backups

    Mathematical methods

    Using metadata

    Compare similar backups byte-by-byt

    Example

    Tonight's backup of Exchange instance Elvis is

    similar to last night's backup of Elvis

    Tonight's backup of Elvis is compared byte-by
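
As a rough illustration of the byte-by-byte comparison in the Elvis example, the sketch below describes tonight's backup as copies from last night's backup plus whatever literal bytes are new. It uses Python's standard `difflib.SequenceMatcher` as a stand-in; the vendors listed under delta differential use their own methods, which the slides do not describe. `make_delta`, `apply_delta`, `literal_bytes`, and the sample data are all hypothetical.

```python
import difflib
import hashlib


def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal bytes not found in `old`."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))         # bytes already held in the old backup
        elif tag in ("replace", "insert"):
            ops.append(("literal", new[j1:j2]))  # genuinely new bytes
        # "delete": bytes dropped from the old backup, nothing to store
    return ops


def apply_delta(old, ops):
    """Rebuild the new backup from the old backup and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(ops):
    """Number of bytes that actually have to be stored for the new backup."""
    return sum(len(op[1]) for op in ops if op[0] == "literal")


# Hypothetical data: 1 KB of non-repeating bytes stands in for last night's backup.
last_night = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))
tonight = last_night[:500] + b"NEW MAIL" + last_night[500:]

delta = make_delta(last_night, tonight)
print(len(tonight), "bytes in tonight's backup,", literal_bytes(delta), "new bytes stored")
```

Because only like backups are compared, this approach never looks for matches in unrelated datasets, which is the trade-off the next slide describes.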

  • 8/6/2019 Deduplication School

    18/61

    Hashing & Delta Differential

    Hashing

    Most used method with most mileage

    Some concerned about hash collisions (more

    Compares everything to everything, therefor

    dedupe out of similar data in dissimilar datas

    production and test copy of same data)

    Delta Differentials

    Faster than hashing

    No concern about hash collisions

    Only compares like backups, so will get no d

    similar data in dissimilar datasets, but does dedupe on same data

  • 8/6/2019 Deduplication School

    19/61

    Hash Collisions: The real numb

    Hash Size           Number of Hashes & Amount of Dat

                        Desired Probability (Assuming 8k c

                        10^-15                      10^-5

    128 bits (MD5)      8.2 × 10^11    6.6 PB       8.2 × 10^16

    160 bits (SHA-1)    5.4 × 10^16    432.5 EB     5.4 × 10^21

    10^-15: Odds of single disk writing incorrect data a

    knowing it (Undetectable Bit Error Rate or UBER

    With SHA-1, we have to write 6.6 PB to get those

    10^-5: Worst odds of a double-disk RAID5 failure

    We have to write 1 371 181 YB to reach those od
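
The hash counts in the table rows are consistent with the standard birthday approximation: for a b-bit hash, the chance of any collision among n hashes is roughly n² / (2 · 2^b) when that chance is small, so n ≈ sqrt(2 · p · 2^b). The sketch below recomputes the table's 10^-15 column under that approximation. Treating "8k" as 8,000 bytes per chunk is an assumption, chosen because it reproduces the table's data amounts; the function names are hypothetical.

```python
import math

CHUNK_BYTES = 8000  # assumption: "8k" read as 8,000 bytes per chunk


def hashes_for_probability(bits, p):
    """Approximate number of hashes needed before the collision chance reaches p.

    Birthday approximation, valid for small p: p ~= n**2 / (2 * 2**bits).
    """
    return math.sqrt(2.0 * p * 2.0 ** bits)


def data_for_probability(bits, p):
    """Bytes of unique chunk data corresponding to that many hashes."""
    return hashes_for_probability(bits, p) * CHUNK_BYTES


for name, bits in (("MD5", 128), ("SHA-1", 160)):
    n = hashes_for_probability(bits, 1e-15)
    print(f"{name}: {n:.1e} hashes, {data_for_probability(bits, 1e-15):.3e} bytes")
```

For MD5 this gives about 8.2 × 10^11 hashes and about 6.6 PB; for SHA-1, about 5.4 × 10^16 hashes and about 432.5 EB, matching the two table rows.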

  • 8/6/2019 Deduplication School

    20/61

    Where Is the Data Deduped?

    Target Dedupe

    Data is sent unmodified across LAN & deduped at ta

    No LAN/WAN benefits until you replicate target to t

    Cannot compress or encrypt before sending to targe

    Source Dedupe

    Redundant data is identified at backup client

    Only new, unique data sent across LAN/WAN

    LAN/WAN benefits, can back up remote/mobile data

    Allows for compression, encryption at source

    Hybrid
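
The difference between target and source dedupe is mostly about what crosses the wire. The following sketch counts bytes sent under each model. It is a simplification under stated assumptions: fixed 8 KB chunks, SHA-1 hashes, and the small cost of sending the hashes themselves is ignored. `DedupeTarget`, `missing`, and both backup functions are hypothetical names.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # assumption: fixed-size chunks


def split(data):
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


class DedupeTarget:
    """Hypothetical backup target holding unique chunks keyed by hash."""

    def __init__(self):
        self.chunks = {}

    def missing(self, digests):
        """Return the digests the target does not hold yet."""
        return {d for d in digests if d not in self.chunks}

    def store(self, chunk):
        self.chunks.setdefault(hashlib.sha1(chunk).hexdigest(), chunk)


def target_dedupe_backup(data, target):
    """Client sends everything; the target dedupes on arrival. Returns bytes sent."""
    for chunk in split(data):
        target.store(chunk)
    return len(data)


def source_dedupe_backup(data, target):
    """Client hashes locally and sends only chunks the target lacks. Returns bytes sent."""
    by_digest = {hashlib.sha1(c).hexdigest(): c for c in split(data)}
    sent = 0
    for digest in target.missing(by_digest):
        target.store(by_digest[digest])
        sent += len(by_digest[digest])
    return sent


data = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
target = DedupeTarget()
first_run = source_dedupe_backup(data, target)   # every chunk is new
second_run = source_dedupe_backup(data, target)  # nothing new to send
target_run = target_dedupe_backup(data, target)  # full data crosses the LAN anyway
print(first_run, second_run, target_run)
```

Both models end up storing the same two chunks; only the source model avoids resending unchanged data on the second run.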

  • 8/6/2019 Deduplication School

    21/61

    Let's Make It More Complicated

    Standalone Target Dedupe

    Dedupe appliance separate from backup softwa

    Integrated Target Dedupe

    Target dedupe from b/u s/w vendor that backs

    Standalone Source Dedupe

    Full dedupe solution that only does source ded

    Integrated Source Dedupe

    B k ft th t d d t li t (

  • 8/6/2019 Deduplication School

    22/61

    Name That Dedupe

    Standalone Target Dedupe

    Data Domain, Exagrid, Greenbytes, IBM, NEC,

    SEPATON

    Integrated Target Dedupe

    Symantec NetBackup

    Integrated Source Dedupe

    Asigra, Symantec NetBackup

    Standalone Source Dedupe

  • 8/6/2019 Deduplication School

    23/61

    Multi-node Deduplication

    AKA Global Deduplication

    AKA Clustered Deduplication

  • 8/6/2019 Deduplication School

    24/61

    What We're Not Talking About

    Remember hashing vs. delta differen

    Delta compares like to like

    Hashing compares everything to ever

    Some sales reps from some companie

    don't have multi-node/global dedupe

    calling the latter global dedupe. It's n

    At a minimum this is honest confusio

  • 8/6/2019 Deduplication School

    25/61

    Single-node/Local vs. Multi-node

    Assume a customer buys multiple no

    dedupe system

    Suppose, then, that they back up exa

    same client to each of those multiple

    If the vendor fails to recognize the d

    data and stores it multiple times, it h

    node/local dedupe
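
The test described on this slide can be modelled directly: back up the same client to two nodes and count how many chunks get stored. In the sketch below the only difference between the two systems is whether the nodes consult one shared index or each keep their own. `Cluster` and its fields are hypothetical, and fixed 8 KB chunks with SHA-1 are assumptions.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # assumption: fixed-size chunks


def chunk_hashes(data):
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]


class Cluster:
    """Toy multi-node dedupe system with either per-node or shared hash indexes."""

    def __init__(self, node_count, global_index):
        self.global_index = global_index
        self.node_indexes = [set() for _ in range(node_count)]
        self.shared_index = set()
        self.chunks_stored = 0

    def backup(self, node, data):
        # Local dedupe: each node only knows the chunks it has seen itself.
        # Global dedupe: every node checks the same index.
        index = self.shared_index if self.global_index else self.node_indexes[node]
        for digest in chunk_hashes(data):
            if digest not in index:
                index.add(digest)
                self.chunks_stored += 1


client = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE

local = Cluster(node_count=2, global_index=False)
local.backup(0, client)
local.backup(1, client)

multi = Cluster(node_count=2, global_index=True)
multi.backup(0, client)
multi.backup(1, client)

print("local dedupe stored", local.chunks_stored, "chunks; global stored", multi.chunks_stored)
```

The local system stores the client's two chunks twice (four chunks in total), while the global system stores them once.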

  • 8/6/2019 Deduplication School

    26/61

    Doctor, It Hurts When I Do This

    Single-node/local dedupe vendors say then

    that. Why would you do that?

    They tell you to split up your datasets and se

    dataset to only one appliance

    Easy to do if

    Your dataset sizes never change

    A given dataset never outgrows a node

    Some single-node sales reps will point out th

  • 8/6/2019 Deduplication School

    27/61

    Multi-node Is the Way to Go

    Especially for larger environments &

    conscious environments that buy as t

    With multi-node dedupe you can load

    treat same as you would a large tape

    Single-node dedupe pushes the vend

    the crest of the CPU/RAM wave

    Multi-node vendors can ride behind t

  • 8/6/2019 Deduplication School

    28/61

    Multi/Single Node Dedupe Vendo

    Multi-node/global

    EMC Avamar (12 nodes)

    Exagrid (10 nodes)

    NEC (55 nodes)

    SEPATON (8 nodes)

    Symantec PureDisk, NetBackup & Backup Exec

    Diligent (2 nodes)

    Single-node/local (as of Mar 2010)

    EMC Data Domain

  • 8/6/2019 Deduplication School

    29/61

    When Is It Deduped?

    AKA Inline or Post Process?

  • 8/6/2019 Deduplication School

    30/61

    Get Out the Swords

    We'd have just as much luck trying to

    these arguments

    Apple vs Windows

    Linux vs either of them

    Linux vs FreeBSD

    VMware vs the mainframe (the original hyperv

    Cable modem vs DSL

    Initial common sense leans to inline,

  • 8/6/2019 Deduplication School

    31/61

    What's the Difference?

    This only applies to target dedupe

    Inline is synchronous dedupe

    Post-process is asynchronous dedupe

    Both are deduping as the data is com

    the device (with most products and c

    The question is really where the dedu

    reads the native data from

    If it read

  • 8/6/2019 Deduplication School

    32/61

    Inline & Post-process: An I/O Walkthr

    Step              IL Hash       IL Delta      PP Hash

    Ingest (100%)     RAM write     RAM write     Disk write

    New segment       RAM read      RAM read      Disk read

    Old segment       RAM read      Disk read     RAM read

    Match (90%)                                   Disk delete

    No match (10%)    Disk write    Disk write

    For every 100 GB an inline hash system writes 10 GB to d

    For every 100 GB an inline delta system writes 10 GB, re

    from disk

    For every 100 GB a post process hash system writes 100

    GB, and deletes 90 GB from disk
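
The per-100 GB figures follow from the match rate. The sketch below works them out for the two hash-based columns only, since the inline delta read volume is cut off in the transcript. It assumes, as the walkthrough table suggests, that a post-process system reads every ingested segment back from disk once; that read figure and the function name `disk_io_gb` are assumptions, not numbers taken from the slide.

```python
def disk_io_gb(ingest_gb, match_rate):
    """Disk traffic for one ingest, given the fraction of segments that already exist."""
    duplicate_gb = ingest_gb * match_rate
    unique_gb = ingest_gb - duplicate_gb
    return {
        # Inline hash: duplicates are detected in RAM, so only new segments hit disk.
        "inline_hash": {"write": unique_gb, "read": 0.0, "delete": 0.0},
        # Post-process hash: everything lands on disk first, is read back to be
        # hashed (assumption: exactly once), then duplicates are deleted.
        "post_process_hash": {"write": ingest_gb, "read": ingest_gb, "delete": duplicate_gb},
    }


io = disk_io_gb(ingest_gb=100, match_rate=0.9)
for system, traffic in io.items():
    print(system, traffic)
```

With a 90% match rate this gives 10 GB written for inline hash, and 100 GB written plus 90 GB deleted for post-process hash, in line with the summary lines on the slide.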

  • 8/6/2019 Deduplication School

    33/61

    The Chair Recognizes Inline

    When you're done with backups, you

    with dedupe

    Backups begin replicating as soon as

    The post-process vendors need a sta

    The post-process vendors don't start

    until a backup is done; that will make

    take longer

  • 8/6/2019 Deduplication School

    34/61

    The Chair Recognizes Post-proces

    When backups are done, dedupe is almost d

    Replication begins as soon as the first backu

    We wait until a backup is done, not until all

    are done (unless you tell us to)

    The staging area allows

    Initial backups to be faster

    Allows copies and recent restores to come from native data

    Allows for staggered implementation of dedupe

    Selectively dedupe only what makes sense

  • 8/6/2019 Deduplication School

    35/61

    Inline & Post-process Vendors

    Inline

    EMC Data Domain

    IBM Protectier

    NEC HydraStor

    Post-process

    Exagrid

    Greenbytes

    Quantum DXi

    SEPATON D lt t

  • 8/6/2019 Deduplication School

    36/61

    How Does Replication Work?

    Does replication use dedupe?

    Can I replicate many-to-one, one-to-

    cascading replication?

    If deduping many to one, will it dedu

    across those appliances?

    Can I control what gets replicated an

    (e.g. production vs development)

  • 8/6/2019 Deduplication School

    37/61

    Is There an Index?

    What happens if the index is destroy

    How do you protect against that?

    Does it need its index to read the dat

    What do you do to verify data integrity?

    What about malicious people?

    Some dedupe vendors aren't very go

    answering these questions, partially

    they don't get them enough

  • 8/6/2019 Deduplication School

    38/61

    Truth in IT Backup Concierge Ser

    Community of verified but anonymous end-use

    (no vendors)

    Included in base service:

    Billable product & strategy-related questions

    Learn from other customers' questions & answ

    Much less expensive than traditional consulting

    Talk to real people using the products you are

    Podcast interviews with end-users and though

    Unbiased product briefings written by experts

  • 8/6/2019 Deduplication School

    39/61

    Session Two

    Using Deduplication in Backup Systems

    U i D R d i i P i

  • 8/6/2019 Deduplication School

    40/61

    The Dedupe Tax

    AKA Rehydration

    Essentially a read from very fragmen

    Not all dedupe systems are equally a

    reassembling Humpty Dumpty

    Especially visible during tape copies &

    of large systems (single stream perfo

    Recent POC of three major vendors s

    difference in performance!

  • 8/6/2019 Deduplication School

    41/61

    Isn't It Cheaper Just to

    Buy tape?

    Tape is cheaper than ever & keeps getting che

    Must encrypt if you're using tape

    Must use D2D2T to stream modern tape drives

    Must constantly tweak to ensure you're doing

    Take all that away and use dedupe

    May not be cheaper but definitely better

    Buy JBOD/RAID

    E if it f till h t it

  • 8/6/2019 Deduplication School

    42/61

    Let's Talk About What Matters

    What are the risks of their approach?

    Data integrity questions

    How big is it?

    What's my dedupe ratio?

    How big can it grow (local vs global)

    How fast is it?

    How fast can it backup/restore/copy my data?

    How fast is replication?

    How much does it cost?

    Pricing schemes are all over the board

    Try to get them on even playing field

    Also consider operational costs

  • 8/6/2019 Deduplication School

    43/61

    Advanced Uses of Deduplicatio

  • 8/6/2019 Deduplication School

    44/61

    Eliminate Tape Shipping

    Offsite backups w/o shipping tapes

    Backups with no human hands on them

    Make tapes offsite from replicated copy and never move them

    No tapes shipped = No need to encrypt tapes

  • 8/6/2019 Deduplication School

    45/61

    Shorter Recovery Point Objective

    Most compan

    backups once

    Even though their transact

    throughout th

    they're only s

    once per day

    Dedupe and r

    could get the

    immediately

  • 8/6/2019 Deduplication School

    46/61

    VMware Backup

    One of the challenges with typical VMware backup is the I/O load it places on the server

    Source dedupe can perform an incremental-forever backup with a much lower I/O load

    Could allow you to

  • 8/6/2019 Deduplication School

    47/61

    ROBO & Laptop Backups

    Dedupe software can protect even the largest laptops over the Internet

    It can also protect relatively large remote sites without installing hardware

    Restores can be done remotely (for slower RTOs) or locally using a local recovery server (for quicker RTOs)

  • 8/6/2019 Deduplication School

    48/61

    Where to Use Target/Source Ded

    Laptops, VMware, Hyper-V are easy: it's got

    Small, remote sets of data also an easy deci

    do target w/ remote backup server, but cost

    pushes people to source.

    A medium-sized (

  • 8/6/2019 Deduplication School

    49/61

    Source Dedupe: Remote Backup

    If using source dedupe to back up a r

    office, should you back up directly to

    centralized backup server or backup

    backup server that replicates to a cen

    server?

    It's all about the RTO you need.

    Decide on RTO, test totally remote

    d if it t it

  • 8/6/2019 Deduplication School

    50/61

    How Big is Too Big to Replicate B

    Remote office replicating to a CO, or

    replicating its backups to a DR site, t

    limit to how much you can replicate

    Make sure you've done all you can to

    deduplication ratio. A 10:1 site will n

    as much bandwidth as a 20:1 site.

    Depends on daily deduplicated chang

    hi h i f t f d t t d d
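
A back-of-the-envelope calculation makes the bandwidth point concrete. The sketch below assumes that the data to replicate each day is simply the daily backup volume divided by the dedupe ratio, uses decimal gigabytes, and ignores protocol overhead and compression. The function names and the 1,000 GB / 100 Mbps figures are hypothetical examples, not numbers from the slides.

```python
def replicated_gb(daily_backup_gb, dedupe_ratio):
    """Data left to replicate after dedupe (assumption: volume / ratio)."""
    return daily_backup_gb / dedupe_ratio


def replication_hours(daily_backup_gb, dedupe_ratio, link_mbps):
    """Hours needed to push one day's deduplicated change over the link."""
    bits = replicated_gb(daily_backup_gb, dedupe_ratio) * 1e9 * 8  # decimal GB to bits
    return bits / (link_mbps * 1e6) / 3600


for ratio in (10, 20):
    hours = replication_hours(daily_backup_gb=1000, dedupe_ratio=ratio, link_mbps=100)
    print(f"{ratio}:1 -> {replicated_gb(1000, ratio):.0f} GB/day, {hours:.2f} hours on 100 Mbps")
```

Under these assumptions a 10:1 site has twice as much data to move each day as a 20:1 site, so it needs either twice the bandwidth or twice the replication window.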

  • 8/6/2019 Deduplication School

    51/61

    Test, Test, Test!!!

  • 8/6/2019 Deduplication School

    52/61

    Test Everything

    Installation and configuration, including adding additional capacity

    Support: call and ask stupid questions

    Dedupe ratio

    Must use your data

    Must use your retention settings

    Must fill up the system

    All speeds

    Backup speed

    Copy speed: extremely important to test

    Restore speed

    Aggregate performance

    With all your data types

    Especially true if using local dedupe

    Single stream performance

    Backup speed

    Restore and copy speed (especially if going to tape)

    Replication Performance

  • 8/6/2019 Deduplication School

    53/61

    Testing Methods: Source Dedupe

    Must install on all data types you pla

    up

    Must task the system to the point tha

    to use it VMware anyone?

    OK to back up many redundant syste

    kind of the point

    Remember to test speed of copy to ta

  • 8/6/2019 Deduplication School

    54/61

    Testing Methods: Target Dedupe

    Copy production backups into IDT/VT

    your backup software's built-in

    cloning/migration/dupe features

    Use dedicated drives if possible and s

    run 24x7

    You must fill up the system, expire so

    then add more data to see steady sta

  • 8/6/2019 Deduplication School

    55/61

    Data Reduction in Primary Sto

  • 8/6/2019 Deduplication School

    56/61

    A Whole New Ball Game

    In primary space, we use the term data redu

    more inclusive than dedupe

    A very different access pattern; latency is m

    important

    The standard in backup world is tape: just d

    slower than that and you're OK

    The standard in primary world is disk: anyth

    slow it down will kill the project

  • 8/6/2019 Deduplication School

    57/61

    Options

    Compression

    File-level dedupe

    Sub-file-level dedupe

    Some files compress, but don't dedup

    Some files dedupe but don't compres
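
A small experiment shows why the two techniques are complementary. The sketch below measures a chunk-level dedupe ratio and a zlib compression ratio on two synthetic inputs: log-like text whose records never repeat exactly, and a block of random bytes stored twice. The 4 KB chunk size, the helper names, and the sample data are all assumptions made for illustration.

```python
import hashlib
import random
import zlib

CHUNK_SIZE = 4096  # assumption: fixed 4 KB chunks


def dedupe_ratio(data):
    """Total chunks divided by unique chunks (1.0 means nothing deduped)."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    unique = {hashlib.sha1(c).digest() for c in chunks}
    return len(chunks) / len(unique)


def compression_ratio(data):
    """Original size divided by zlib-compressed size (about 1.0 means no gain)."""
    return len(data) / len(zlib.compress(data))


# Repetitive in structure, but no two 4 KB chunks are identical.
log_like = b"".join(b"record %08d status=OK\n" % i for i in range(20000))

# Random bytes do not compress, but an exact second copy dedupes perfectly.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
copied = block * 2

for name, data in (("log_like", log_like), ("copied", copied)):
    print(f"{name}: dedupe {dedupe_ratio(data):.2f}x, compression {compression_ratio(data):.2f}x")
```

The log-like data compresses well but has a dedupe ratio of 1.0, while the duplicated random block dedupes 2:1 and gains essentially nothing from compression.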

  • 8/6/2019 Deduplication School

    58/61

    Vendors

    Compression

    Storwize, Ocarina

    File-level dedupe

    EMC Celerra

    Sub-file-level dedupe

    NetApp ASIS, Ocarina, Greenbytes, Exar/Hifn,

    Usually you get compression or dedu

  • 8/6/2019 Deduplication School

    59/61

    Pros/Cons of Primary Data Reduc

    Saves disk space, power/cooling

    Can have positive or negative impact

    performance must test to see whic

    Does not usually help backups: data

    before being read by any app, includi

    Exception to above rule is NetApp Sn

    tape

  • 8/6/2019 Deduplication School

    60/61

    Contact Me

    Email [email protected]

    Websites to which I contribute:

    http://www.backupcentral.com

    http://www.searchstorage.com

    http://www.searchdatabackup.com

    Follow me on Twitter @wcpreston

    My upcoming venture:

    http://www.truthinit.com/
  • 8/6/2019 Deduplication School

    61/61

    RESOURCES FROM OUR SPONSOR

    The ROI of Backup Redesign Using Deduplication: An EMC Data Domain User Research Study H19

    IDC Executive Guide: Assess the Value of Deduplication for your Storage Consolidation Initiatives

    Why CIOs Should Look To Data Deduplication

    http://www.bitpipe.com/detail/RES/1268767835_717.html

    http://www.bitpipe.com/detail/RES/1267815600_941.html

    http://www.bitpipe.com/detail/RES/1254259141_188.html

    http://www.emc.com/