data domain fundamental

Upload: haribabu6502

Post on 11-Apr-2015

TRANSCRIPT

Page 1: Data Domain Fundamental

EMC Data Domain - 1

Copyright © 2010 EMC Corporation. Do not Copy - All Rights Reserved.

The objectives for this module are shown here. Please take a moment to read them.

Page 2: Data Domain Fundamental

The objectives for this lesson are shown here. Please take a moment to review them.

Page 3: Data Domain Fundamental

Shown in the slide is a Data Domain deployment. A Data Domain system is a storage system that deduplicates data on arrival. It consists of a controller and shelves of disks. It is optimized first for backup and second for archive applications, and it supports most of the industry-leading applications.

Data Domain integrates easily with an existing backup or archive environment. This includes not only EMC’s own NetWorker but also offerings from Symantec, CommVault, and others.

Data can be transferred into the Data Domain storage system over either Ethernet or Fibre Channel. Over Ethernet it can use NAS protocols such as NFS or CIFS; it can also use optimized protocols such as OpenStorage, a custom API used with Symantec NetBackup.

Data is deduplicated as it is stored. It can then be replicated for disaster recovery, sending only the compressed, deduplicated unique data segments that were filtered out by the write process on the target tier.

Page 4: Data Domain Fundamental

A typical backup environment without Data Domain involves writing backup data to tape. In

order to protect against disasters, the tapes must be shipped offsite. This is an expensive and

labor intensive task.

Page 5: Data Domain Fundamental

When Data Domain is implemented in a backup environment, data is written to disk instead of tape. Disk provides faster performance than tape and has other characteristics that provide protection. Data Domain also deduplicates data, which reduces the size of the data footprint. Instead of physically shipping tapes to remote warehouses, data can be transferred across the network to a remote Data Domain system.

Page 6: Data Domain Fundamental

A Data Domain appliance is a controller with its own disk array. The controller handles deduplication and the other processing the system requires. It runs its own operating system, the Data Domain operating system (DD OS). Dual controllers are available to provide redundancy.

Page 7: Data Domain Fundamental

Shown in the slide is the Data Domain family and details on their specifications.

Refer to the following link for the latest information on Data Domain models:

http://www.datadomain.com/images/products/Appliances-Table.jpg

Page 8: Data Domain Fundamental

Components under high mechanical or electrical stress are protected by N+1 redundancy. This means that each such component has at least one extra, independent backup component that can take over should a primary component fail. As shown in the picture, extra fans and power supplies are included. RAID 6 protects against dual disk drive failures.

Page 9: Data Domain Fundamental

One of the most conventional approaches to deduplication competing with Data Domain is known as post-process deduplication. In this architecture, data is stored to disk before deduplication; after it is stored, it is read back internally, deduplicated, and written again to a different area.

Although this approach may sound appealing, seeming as if it would allow faster backups and use fewer resources, it actually creates problems.

First, more disk is needed to store both the raw data temporarily and the deduplicated data. Post-process deduplication also affects speed, because post-process systems are usually spindle-bound. A post-process configuration typically has three or four times more disks than a Data Domain deployment.

An inline approach is also much simpler. If all data is filtered before it is stored to disk, the system behaves like a regular storage system: it just writes data and reads data. There is no separate administration for managing multiple pools, some with deduplication and some with regular storage, or for managing the boundary conditions between them. Less administration in a storage system is always better. Because it is simpler and smaller to provision, an inline approach, and especially a CPU-centric inline approach, will always be more attractive.

Page 10: Data Domain Fundamental

Within a Data Domain system, there are several levels of logical data abstraction above the physical disk storage. Protocol namespaces, such as virtual tape libraries, EMC Data Domain Boost, and CIFS/NFS shares, act as the external interface to applications. A single Data Domain system may use any combination of these for storing and accessing data.

Files and directories for the namespaces are stored in the Data Domain file system. Non-CIFS/NFS data is stored under special directories.

The unique segment collection is a collection of deduplicated data. It is here that sub-file objects of about 8 KB are identified and deduplicated. Identical segments are stored only once.

The last layer is the physical disk. Deduplicated data is stored on SATA disk drives and is protected by RAID 6.

Page 11: Data Domain Fundamental

Stream Informed Segment Layout (SISL) is Data Domain’s approach to deduplication. It provides deduplication in a highly efficient manner. Instead of being disk-based, SISL uses a CPU-centric method: it reduces the number of times that disks need to be accessed. To identify segments quickly, data is stored along with a “fingerprint” that represents the data segment.

The Summary Vector is a data structure held in RAM. It is used to identify unique segments of data. Almost all segments are identified through the Summary Vector, which saves the system from doing a lookup in the on-disk index.

The Data Domain system stores neighboring segments of data together in units called segment localities. These are held close together on disk, so consecutive data segments can be read in a single disk access.
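
The idea of an in-RAM structure that answers "have we seen this fingerprint?" without touching disk can be sketched in a few lines. The sketch below assumes the Summary Vector behaves like a Bloom filter; the slide text only says it is a RAM-resident data structure, so the class name, the bit count, the hash count, and the use of SHA-1 and SHA-256 are all illustrative assumptions rather than the product's actual design.

```python
import hashlib


class SummaryVector:
    """Toy Bloom filter over segment fingerprints (illustrative assumption)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits          # must be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive several bit positions from one fingerprint by salting the hash.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(bytes([salt]) + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint: bytes) -> None:
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint: bytes) -> bool:
        # False: definitely never stored, so no index lookup is needed.
        # True: possibly stored, so the on-disk index must confirm it.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(fingerprint)
        )


def fingerprint(segment: bytes) -> bytes:
    """Hypothetical fingerprint function: a SHA-1 digest of the segment."""
    return hashlib.sha1(segment).digest()
```

A "no" from such a filter is always correct, which is what lets most new segments skip the on-disk index; a "yes" can occasionally be a false positive and still needs confirmation.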

Page 12: Data Domain Fundamental

This slide shows how data is written to the Data Domain system using the SISL process. First, data is stored in non-volatile RAM. Here it is broken into segments, and a fingerprint is created for each segment. The fingerprint for each segment is compared to the Summary Vector. If there are no matches, the segment and consecutive segments are compared to multiple segments on disk. If the segment is unique, it is stored on disk.
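
The write path described above can be approximated with a small in-memory model. This is a simplified sketch, not the SISL implementation: it assumes fixed-size 8 KB segments and SHA-1 fingerprints, and it uses a Python set and dict as stand-ins for the Summary Vector and the disk. The names DedupStore and SEGMENT_SIZE are hypothetical.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024  # "about 8 KB" per the text; fixed size is a simplification


class DedupStore:
    def __init__(self) -> None:
        self.summary = set()   # stand-in for the in-RAM Summary Vector
        self.segments = {}     # stand-in for disk: fingerprint -> segment bytes
        self.files = {}        # file name -> ordered list of fingerprints

    def write(self, name: str, data: bytes) -> int:
        """Store data and return how many new segments were written."""
        new_segments = 0
        recipe = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fp = hashlib.sha1(segment).digest()
            if fp not in self.summary:
                # Unique segment: record its fingerprint and store it once.
                self.summary.add(fp)
                self.segments[fp] = segment
                new_segments += 1
            recipe.append(fp)
        self.files[name] = recipe
        return new_segments

    def read(self, name: str) -> bytes:
        # Rebuild the file from its list of fingerprints.
        return b"".join(self.segments[fp] for fp in self.files[name])
```

Writing two backups that share a segment stores the shared segment only once, while both files can still be read back in full.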

Page 13: Data Domain Fundamental

Data is compressed in order to further reduce the capacity needed. This is done during the write

process. Compression options are shown on the slide.

Page 14: Data Domain Fundamental

Data Domain is designed using Data Invulnerability Architecture (DIA). DIA provides data

integrity and recoverability within the Data Domain system. Since data is deduplicated, a single

segment of data may be used across multiple files. If this segment were to become corrupted,

multiple files could become corrupt. This makes it crucial to ensure that data is intact.

There are four aspects of DIA. End to end verification is the process of ensuring that data has

been written correctly. After data is written to the system, it is checked against the original data

to make sure it was written correctly.

Fault avoidance and containment ensures that data already on disk is not overwritten or corrupted. This is accomplished using a special file system that does not overwrite old data.

Continuous fault detection and healing is a proactive process that continuously watches for failures. RAID 6 and checksums are used to implement this.

Snapshots are used to provide file system recoverability. This protects against software and

hardware failure.
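
Two of the ideas above, verifying a write against the original data and later detecting corruption with checksums, can be illustrated with a toy store. This is only a sketch of the general technique under stated assumptions: the class VerifyingStore, its method names, and the choice of SHA-256 are hypothetical and say nothing about how DIA is actually implemented.

```python
import hashlib


class VerifyingStore:
    def __init__(self) -> None:
        self.blocks = {}  # block id -> (payload, checksum stored alongside it)

    def write(self, block_id: str, payload: bytes) -> None:
        checksum = hashlib.sha256(payload).hexdigest()
        self.blocks[block_id] = (bytes(payload), checksum)
        # End-to-end check: read back what was stored and compare it
        # with the checksum of the data the caller handed in.
        stored, _ = self.blocks[block_id]
        if hashlib.sha256(stored).hexdigest() != checksum:
            raise IOError(f"verification failed for {block_id}")

    def scrub(self) -> list:
        """Return ids of blocks whose payload no longer matches its checksum."""
        return [
            block_id
            for block_id, (payload, checksum) in self.blocks.items()
            if hashlib.sha256(payload).hexdigest() != checksum
        ]
```

In a real system the scrub step would run continuously in the background and a failed block would be rebuilt from redundancy such as RAID 6; here it simply reports the damaged block ids.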

Page 15: Data Domain Fundamental

A snapshot is a read-only copy of backup data. A snapshot is useful for saving a copy of a directory at a specific point in time, which can later be used as a restore point. The snapshot feature creates an image of the Data Domain file system. This protects against both human and system errors.

Page 16: Data Domain Fundamental

As an appliance, the Data Domain system automates all routine maintenance tasks. One of the

most important automated processes is the filesystem cleaning operation that must be scheduled

to reclaim physical storage occupied by deleted objects.

When application software expires backup or archive images, they are deleted in the sense that

they are no longer accessible or available for recovery from the application. However, the

images still occupy physical storage. Only a clean operation reclaims the segments used by files

that are deleted and are no longer referenced.

Cleaning can require a lot of system resources while it is occurring. Mechanisms are in place to automatically adjust the priority assigned to cleaning tasks in favor of more time-critical processing tasks. Cleaning schedules are adjustable. By default, cleaning is scheduled to start every Tuesday at 6:00 a.m.

Page 17: Data Domain Fundamental

Cleaning provides the opportunity to reorganize the data to improve the speed and efficiency of

deduplication.

Data invulnerability requires that data is only ever written into new containers, and this requirement also applies to the cleaning process. Copy-forward segments are segments that, for read efficiency, should be stored adjacent to each other, so they are copied forward together into a single container.

Dead segments are dead because the files that referred to them have all been deleted and the pointers have been removed. Dead segments are not allowed to be rewritten with new data, since this could put valid data at risk of corruption. Instead, valid segments are copied forward into free containers to group the remaining valid segments together. When the data is safe and reorganized, the original containers are returned to the available disk space.
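
The copy-forward behaviour can be sketched as a function over lists. This is a minimal model under stated assumptions: containers are plain Python lists of (fingerprint, data) pairs, the set of live fingerprints is given, and containers with no dead segments are left in place. The function name clean and its capacity parameter are hypothetical.

```python
def clean(containers, live, capacity=4):
    """Copy live segments forward; return (resulting containers, containers freed).

    containers: list of containers, each a list of (fingerprint, data) pairs.
    live: set of fingerprints still referenced by at least one file.
    """
    kept, copied, current, freed = [], [], [], 0
    for container in containers:
        survivors = [(fp, data) for fp, data in container if fp in live]
        if len(survivors) == len(container):
            kept.append(container)  # nothing dead here, so leave it in place
            continue
        freed += 1  # the old container is released once its live data is copied
        for item in survivors:
            # Live segments are written only into new containers,
            # never over the dead segments of an old one.
            current.append(item)
            if len(current) == capacity:
                copied.append(current)
                current = []
    if current:
        copied.append(current)
    return kept + copied, freed
```

For example, three two-segment containers of which two are half dead collapse into two full containers, and the two partly dead originals are returned to free space.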

Page 18: Data Domain Fundamental

Administrators need to understand how to configure and monitor the reports and logs for error

conditions. Data Domain systems provide access to the following types of reports and logs that

provide information about error conditions:

Autosupports and Alerts can be sent by email. Autosupport sends a daily email to Data Domain

Support containing various log files and other system information. This allows Data Domain

Support to quickly be informed of any issues that may arise in the Data Domain system.

Syslog can be configured to publish logs, alerts, and messages. SNMP can also be configured to

send a subset of alerts as traps to third-party SNMP managers.

Page 19: Data Domain Fundamental

The autosupport email list is used in two ways: to send a daily detailed report on a specified schedule, and to send a daily alerts summary about non-critical hardware situations and disk space usage numbers that should be addressed soon.

The autosupport command can also be used to send the output of a specific command or the

contents of a file to the distribution list.

By default, Data Domain systems send daily autosupport reports to Data Domain tech-support

via email using SMTP. The autosupport report contains system configuration information, alerts

summaries, performance statistics and system messages.

By default, Data Domain systems are also configured to send daily alerts to the autosupport list

that notify Data Domain tech-support about non-critical error messages or warnings about

problems on the system that should be fixed as soon as possible.

Customers have the option to configure who receives autosupports and alerts and the time they

are sent.

For how to configure autosupports and alerts, see the “autosupport” and “alerts” command

descriptions and options in the DD OS Command Reference Guide.

Page 20: Data Domain Fundamental

Alerts are sent with either a Warning or Critical severity. For example, a Warning alert is sent

when a fan fails. When alerted, customer support contacts the owner to arrange a replacement.

Warning alerts are sent when a non-critical system problem is detected. This type of problem

should be fixed as soon as possible. The warning is sent to the autosupport email list as soon as

the problem occurs. Warnings are also included in the Daily Alert Summary and with the

Autosupport Summary.

Critical alerts are sent when a severe problem occurs that should be fixed immediately. They are sent to the alerts email list as soon as the problem occurs.

Page 21: Data Domain Fundamental

The objective for this lesson is shown here. Please take a moment to review it.

Page 22: Data Domain Fundamental

Replication is used to protect against disaster. This is accomplished by sending data from one

Data Domain to another over the network. In a Data Domain system, only unique data is

replicated. This is made possible because of the deduplication process. This saves enormous

amounts of bandwidth since only a small portion of data stored will be changed. Since not as

much data is transferred, the replication window is reduced.
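
Why deduplication shrinks replication traffic can be seen in a small sketch: if both sides identify segments by fingerprint, only the segments missing at the destination have to cross the network. The model below is an assumption-laden illustration, with each system represented as a dict from fingerprint to segment bytes; the function names and the SHA-1 fingerprint are hypothetical.

```python
import hashlib


def fingerprint(segment: bytes) -> str:
    """Hypothetical fingerprint: hex SHA-1 digest of the segment."""
    return hashlib.sha1(segment).hexdigest()


def replicate(source: dict, destination: dict) -> int:
    """Copy segments the destination lacks; return the payload bytes sent."""
    missing = source.keys() - destination.keys()
    sent = 0
    for fp in missing:
        destination[fp] = source[fp]
        sent += len(source[fp])
    return sent
```

Running replicate a second time with no new data on the source sends nothing, which mirrors the point that only changed data consumes bandwidth.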

There are three types of replication. These will be discussed on the following slides.

Page 23: Data Domain Fundamental

Collection replication is the transfer of all backup data. It can replicate alongside all backup and recovery functions. Data at the target is accessible immediately. In addition to data, user accounts and passwords are replicated, as are snapshots. Only a one-to-one configuration is allowed.

Page 24: Data Domain Fundamental

Directory replication is the transfer of individual directories on the Data Domain system. A Data Domain system can be a source or destination for multiple directories and can also be a source for some directories and a destination for others. Many topologies are supported with directory replication. Normal backup and restore operations can still be performed during replication.

Page 25: Data Domain Fundamental

Pool replication is a type of directory replication that replicates directories containing VTL tape cartridges. Virtual tape libraries use a structure called storage pools within the Data Domain system. Data that is sent to virtual tape can be replicated. Only one VTL license is required for the source.

Page 26: Data Domain Fundamental

One way to send data to the Data Domain system is through CIFS or NFS shares. CIFS can be used by Windows clients, while NFS is used by UNIX-based operating systems. A directory within the /backup directory is shared out to the client. When data is sent to the shared directory, it is deduplicated and stored automatically.

Page 27: Data Domain Fundamental

OpenStorage (OST) server software, which is a feature of Symantec’s Veritas NetBackup, integrates NetBackup with Data Domain disk backup devices. It allows NetBackup media servers to communicate with disk devices without emulating tape. To enable OST, a plug-in must be installed on the NetBackup media server. The Data Domain system then creates Logical Storage Units, which are used as NetBackup storage servers.

Page 28: Data Domain Fundamental

Using the Data Domain VTL feature, backup applications can connect to and manage a Data Domain system as if it were a tape library. In this configuration, the Data Domain system presents virtual tape drives that behave like real SCSI tape drives. Tapes and pools can be replicated to other Data Domain systems for disaster recovery. Tapes can also be retention-locked to prevent premature deletion. The VTL feature can be used simultaneously with the other interfaces.

Page 29: Data Domain Fundamental

Data Domain Boost is an option that distributes part of the deduplication process out of the Data Domain system and onto the backup server. This makes the backup network more efficient, makes Data Domain systems 50% faster, and makes the aggregate system more manageable. It works across the entire Data Domain product line.

As shown in the diagram on the slide, segmentation, identification, and compression are handled on the backup server instead of on the Data Domain system. This means that only the unique segments are sent over the network.
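
The division of labour described for Boost can be modelled as a client that segments, fingerprints, and compresses data, and a server that only says which fingerprints it lacks. This is a hedged sketch, not the Boost protocol: the Server class, the client_backup function, fixed 8 KB segments, SHA-1 fingerprints, and zlib compression are all assumptions made for illustration.

```python
import hashlib
import zlib

SEGMENT_SIZE = 8 * 1024  # assumed fixed segment size for the sketch


class Server:
    """Stand-in for the storage system: keeps compressed unique segments."""

    def __init__(self) -> None:
        self.segments = {}  # fingerprint -> compressed segment

    def missing(self, fingerprints):
        return [fp for fp in fingerprints if fp not in self.segments]

    def store(self, fp, compressed) -> None:
        self.segments[fp] = compressed


def client_backup(server: Server, data: bytes) -> int:
    """Run on the backup server; return the payload bytes sent over the network."""
    # Segmentation and identification happen on the client side.
    segments = {}
    for offset in range(0, len(data), SEGMENT_SIZE):
        seg = data[offset:offset + SEGMENT_SIZE]
        segments[hashlib.sha1(seg).hexdigest()] = seg
    sent = 0
    # Only segments the server does not already hold are compressed and sent.
    for fp in server.missing(list(segments)):
        compressed = zlib.compress(segments[fp])
        server.store(fp, compressed)
        sent += len(compressed)
    return sent
```

Backing up the same data twice sends payload bytes only the first time; the second run exchanges fingerprints but no segment data.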

Page 30: Data Domain Fundamental

The Data Domain Retention Lock licensed software feature enables organizations to protect records in non-writeable and non-erasable formats for a specified length of time, up to 70 years. Although the protected data can be read, it cannot be modified or deleted until the retention period has expired. This protects against accidents and user errors as well as malicious activity. For example, a Data Domain system may be used to store email records. A malicious person may attempt to delete some incriminating emails but would be unable to do so if the retention period has not expired.

Retention minimums and maximums can be set globally for the Data Domain system. For example, it can be configured so that all files must have a retention period of at least 5 years. Retention values can be set on a file-by-file basis.
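
The retention rules above (global minimum and maximum, a per-file retention value, reads always allowed, no change or delete before expiry) can be captured in a short model. The class RetentionLockedStore and its method signatures are hypothetical and exist only to illustrate the policy; the 5-year and 70-year figures in the usage note come from the text.

```python
from datetime import datetime, timedelta


class RetentionLockedStore:
    def __init__(self, min_retention: timedelta, max_retention: timedelta) -> None:
        self.min_retention = min_retention  # global lower limit
        self.max_retention = max_retention  # global upper limit
        self.files = {}                     # name -> (data, retain_until)

    def _locked(self, name: str, now: datetime) -> bool:
        return name in self.files and now < self.files[name][1]

    def write(self, name: str, data: bytes, retention: timedelta, now: datetime) -> None:
        if self._locked(name, now):
            raise PermissionError(f"{name} is locked and cannot be modified")
        if not self.min_retention <= retention <= self.max_retention:
            raise ValueError("retention outside the configured global limits")
        self.files[name] = (data, now + retention)

    def read(self, name: str) -> bytes:
        # Reads are always permitted, even while the file is locked.
        return self.files[name][0]

    def delete(self, name: str, now: datetime) -> None:
        if self._locked(name, now):
            raise PermissionError(f"{name} is locked and cannot be deleted")
        del self.files[name]
```

With a store created as RetentionLockedStore(timedelta(days=5 * 365), timedelta(days=70 * 365)), a file written with a six-year retention can be read at any time but raises PermissionError on delete until six years have passed.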

Page 31: Data Domain Fundamental

With the sanitize function, deleted files can be overwritten using a DoD/NIST-compliant algorithm and procedures.

No complex setup or disruption is needed. Sanitizing is the electronic equivalent of data shredding; it removes any trace of deleted files.

This feature is designed primarily to support the needs of organizations that are required to

remove and destroy confidential data if it was accidentally written to an unapproved system or

to delete data that is no longer required.

See the Electronic Data Shredding Technical brief at

http://www.datadomain.com/pdf/TechBrief-ElectronicShredding.pdf

Page 32: Data Domain Fundamental

With the Data Domain Encryption licensed software option enabled, all incoming data is encrypted inline before it is written to disk. This is also referred to as “encryption at rest.” It improves security by preventing data from being read directly from disk without first being decrypted by the system. Data Domain implements software-based encryption, so no additional hardware is required. Encryption is transparent to the access protocols, so no change is needed in the rest of the environment to deploy encryption at rest.

Page 33: Data Domain Fundamental

The objective for this lesson is shown here. Please take a moment to review it.

Page 34: Data Domain Fundamental

Data Domain systems can replace both the large staging disk and the tape system. Replication

across the WAN is built into the Data Domain systems instead of requiring a separately managed

function of the primary storage. Configuration of the backup software such as the Oracle

Recovery Manager (RMAN) does not need to be changed; simply point the backup application

at the Data Domain storage as a replacement for the previous NFS, CIFS, or VTL device.

Copies of the data needed for longer term archive or compliance can continue to be written to

tape either onsite or at the offsite disaster recovery site.

Page 35: Data Domain Fundamental

Backup and recovery for a Microsoft Exchange Server environment is a mission critical function

that benefits from all of the advantages of replacing tape based systems with Data Domain

appliances. In addition to being storage for the typical Exchange backups, Data Domain systems

can also be used as an efficient storage repository for email archiving applications. Instead of

email archives being stored on a separate system, the archives can be written to the same Data

Domain system that is storing the Exchange database backups.

The significant amount of duplicate data found in both the Exchange backups and in email

archive files is deduplicated across both data sets, to reduce the storage footprint even more.

Without Data Domain, different interface or file protocol support needs of the Exchange backup

server and the email archive server may have prevented these from backing up to the same

device. Being able to use CIFS, NFS, and VTL simultaneously to access a single Data Domain

system opens up many new possibilities for combining data from different sources to take

advantage of the savings from deduplication.

Page 36: Data Domain Fundamental

VMware sites tend to create more data to manage and protect than their physical counterparts.

Making it simpler to multiply servers tends to increase the storage footprints. The operational

flexibility offered by being able to have multiple copies and variants of a virtual image with

various configurations comes at the expense of needing to buy more storage to back up and

protect these images. Since many of the elements are the same between virtual images, they tend

to deduplicate very well when stored on Data Domain systems. Deploying a system at the

disaster recovery site allows for replication of critical VM images that can be kept up to date and

ready to assume operation immediately in a disaster.

Data Domain systems are attached to the high capacity backbone network used for storing and

moving the VM images. Installation and configuration is similar whether the system is being

used with VMware infrastructure, third party enterprise backup software, or specialized

VMware backup applications.

Page 37: Data Domain Fundamental

In this example, a nearline implementation is used to handle some version control software that

is using a Data Domain system as storage. The software tracks changes to documents as they are

being updated. Since file differences are usually minor, the opportunity for deduplication is

large. Data does not need to be accessed frequently, but needs to be immediately available for

the times that it is accessed.

Page 38: Data Domain Fundamental

Data Domain is also useful in an archive situation. This example stores mostly static files. Files

are not read back frequently but access to files needs to be immediate. This example uses a CIFS

share to implement the solution.

Page 39: Data Domain Fundamental

These are the key points covered in this module. Please take a moment to review them.