Gururaj Kulkarni, Consultant QA, [email protected]
Ravikumar Madhappan, Principal Software QA, [email protected]
Yogesh BS, Principal Software QA, [email protected]
ENTERPRISE RELIABILITY TEST FOR
BACKUP APPLICATION
2015 EMC Proven Professional Knowledge Sharing 2
Table of Contents

1. Abstract
2. Background Information
3. Introduction to EMC NetWorker
4. Reliability Metrics
5. Enterprise Reliability Measurement for Backup (NetWorker as case study)
6. NetWorker setup details
7. Test Results
8. Conclusion
Disclaimer: The views, processes or methodologies published in this article are those of the
authors. They do not necessarily reflect EMC Corporation’s views, processes or methodologies.
1. Abstract

Data growth is one of the biggest challenges data protection software faces in meeting its customers' needs. Data protection is one of the least glamorous yet most important disciplines in the data center. Customers increasingly demand product reliability at massive scale to keep pace with massive data growth. As organizations seek new ways to drive efficiency and reduce costs, data protection becomes ever more important. Customers want their data protected wherever it resides, whether in their data center or in the cloud, and they want it easily accessible regardless of where it is located. Protecting a huge environment with legacy backup methods is not a solution; rather, it challenges engineering teams to find efficient ways to improve product reliability. Additionally, recovery point objective (RPO) challenges become more critical as environments age, with data protection software running for long periods without interruption. The application must be consistent in meeting these reliability challenges.
Customer demands are changing. In the early days (1990-2000), nonfunctional requirements such as performance and reliability were 'good to have' in a product but not necessarily a 'must have'. Now, with exponential year-on-year data growth, customers want a more reliable and faster backup application that meets their demands.

The annual data volume is growing exponentially year on year. Per CSC analysis, data volume will have increased by 4,300% by 2020.
[Figure: projected annual data growth through 2020. Courtesy: Computer Sciences Corporation]
Backup application reliability can be defined as "meeting the data protection (backup) and data management (recovery) needs consistently over a period of time with a 99.9999% success rate". Each additional "9" of reliability requires more investment and cost. The backup application must be robust enough to deliver such reliability, and it is always a challenge for backup software to meet 100% reliability requirements. The following components pose the key challenges:

- Application catalog size, i.e. the metadata processed by the backup application during its operations
- Environment component sizing
- Application sizing that accounts for future growth
- Disaster recovery as the catalog grows
- Fault tolerance, for both the environment and the application

With massive data growth, scalability of the protection environment, proper application and environment sizing, and fault tolerance are the key factors for achieving a higher reliability quotient. The more robust the backup application, the better it can guarantee reliability by meeting these requirements.
To meet the above challenges, internal tests were performed to assess backup application reliability in a massively scaled environment. In this case study, EMC NetWorker® is used to protect such an environment. A series of operations was performed with the backup application in uniform and random order to validate product behavior over a period of time.

A test was designed to prove the ability of the NetWorker backup application under different workload scenarios. The server's behavior was validated with a longevity test, keeping the same load pattern on a daily basis for one week. The backend infrastructure (2,000 clients) was configured on EMC XtremIO®.
Tests were performed with the following scenarios:

- Scalability test with uniform workload pattern
  - Backups were performed with 500, 1,000, and 2,000 clients against EMC Data Domain® DD880 and DD990 systems.
  - Maintenance activities were performed on a daily basis; routines such as nsrck, nsrim, and nsrinfo were run to ensure data integrity.
  - Clone and recover operations were run on a random basis.
- Uninterrupted test
  - One week of uninterrupted tests was performed for backup, recover, clone, and daily health-check routines, overlapping them to evaluate application reliability and availability.
While performing the tests, the following attributes were monitored:

- Throughput
- Resource utilization
- IOPS behavior of the catalog disk
- TCP socket usage
- NetWorker save streams (concurrent sessions)
- Data Domain active connections
- nsrd port usage
- NetWorker daemon response
This article details the following for EMC's data protection software, NetWorker:

1. Sizing the NetWorker server to meet data protection and management needs
2. Test setup challenges with NetWorker backup testing on a single datazone server with 2,000 clients
3. Incremental test results for 2,000 clients
4. Information-gathering challenges
5. Test results: datazone throughput, memory/CPU utilization, I/O utilization, and daemon response
6. Best practices derived from the test results
2. Background Information

According to the EMC Global Data Protection Index, which surveyed 3,300 IT decision makers from mid-size to enterprise customers across multiple countries, data loss and downtime cost businesses $1.7 trillion. Seventy-one percent of IT professionals are not fully confident in their ability to recover information following an incident.
Impact of Data Loss and Downtime

The good news is that the number of data loss incidents is decreasing overall. However, the volume of data lost per incident is growing exponentially:

- 64% of enterprises surveyed experienced data loss or downtime in the last 12 months
- The average business experienced more than three working days (25 hours) of unexpected downtime in the last 12 months
- Other commercial consequences of disruptions were loss of revenue (36%) and delays to product development (34%)

These figures clearly indicate that IT departments and CIOs are looking for high reliability and availability in their data protection applications.
3. Introduction to EMC NetWorker
NetWorker backup and recovery software helps organizations meet demanding data protection
requirements while lowering costs through centralized management. By enabling a single point
of control for multiple data protection technologies including backup-to-disk and next-generation
options such as deduplication, NetWorker helps IT departments deliver higher levels of backup
and recovery services and keep pace with round-the-clock business operations.
NetWorker is designed as a 4-tier architecture.

NetWorker Management Console (NMC) server
- Monitors and configures the NetWorker server
- Collects reporting data and generates reports (optional)
- Can connect to one or more NetWorker servers

NetWorker Server
- Single point of control; can be managed by one or more NMC servers
- Hosts all NetWorker internal databases
- Starts scheduled jobs and sends on-demand notifications
- Controls all target devices
NetWorker Storage Node
- Directs backup data to the target device
- Can accept local data, avoiding TCP network transfers, or data over the TCP network

NetWorker Client
- Runs the agent and starts jobs as requested by the NetWorker server
- Jobs can also be started manually
- Application modules are installed on top of the NetWorker client

The NMC server, NetWorker server, storage node, and client do not have to be on the same version:
- It is recommended that the storage node run the same version as the server
- Long-term compatibility is maintained for obsolete clients
4. Reliability Metrics

Application reliability also depends on the reliability of the underlying environment. The hardware and operating system need to be error-proof before application reliability can be meaningfully measured.

The diagram below shows the dependency factors for application behavior: the application sits on the system (operating system) and hardware layers.

If the application depends on different components in the subsystem, the application's probability of failure can be expressed (to a first-order approximation, for small, independent failure probabilities) as

P(Application failure) = P(Hardware failure) + P(System failure) + P(Operator failure)
[Diagram: dependency layers — the User on top of the Application, which runs on the System (OS) and Hardware layers]
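The additive form above is a first-order approximation that holds when the individual failure probabilities are small and independent; the exact combination is 1 minus the product of the component success probabilities. A minimal sketch, with hypothetical per-component probabilities chosen for illustration only:

```python
# Combine per-component failure probabilities into an overall
# application failure probability. The figures below are
# hypothetical, for illustration only.
def p_failure_exact(probs):
    """Exact: 1 - product of component success probabilities."""
    p_ok = 1.0
    for p in probs:
        p_ok *= (1.0 - p)
    return 1.0 - p_ok

def p_failure_additive(probs):
    """First-order approximation used in the text: simple sum."""
    return sum(probs)

components = {"hardware": 0.001, "system": 0.002, "operator": 0.0005}
exact = p_failure_exact(components.values())
approx = p_failure_additive(components.values())
# For small probabilities the two agree closely; the sum slightly
# overcounts the (rare) simultaneous failures.
```

The additive form is convenient for back-of-the-envelope sizing; the exact form matters only when component failure probabilities become large.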
Application failures are often design failures in code, whereas hardware failures are not necessarily design flaws but result from component failures.

In this article, EMC NetWorker is used as the case study for application reliability.

Reliability metrics are units of measure for application reliability. For a backup application such as NetWorker, the reliability metrics are defined as follows.

Time vs. stability (i.e. % backup success rate over a period of time)

Stability is a constant backup success rate. The application success rate should be measured by running a set of operations from a predefined operational profile on a daily basis and then measuring stability over a period of time.
E.g. at any given time, the application should meet the predefined backup window if nothing in the system has changed. If the backup window time changes without any deviation in the system, the probability of success, i.e. of meeting the backup SLA, slips from 100%.
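The stability metric above can be tracked by recording the daily success rate and flagging any day that falls below the SLA. A minimal sketch, with hypothetical daily job counts:

```python
# Track daily backup success rate and flag SLA slips.
# Job counts are hypothetical, for illustration only.
SLA = 0.999999  # the "six nines" target from the definition above

def success_rate(succeeded, total):
    """Fraction of jobs that succeeded; 1.0 when nothing ran."""
    return succeeded / total if total else 1.0

# (day, succeeded_jobs, total_jobs)
daily_runs = [
    ("day1", 2000, 2000),
    ("day2", 2000, 2000),
    ("day3", 1999, 2000),  # one failed save set
]

slipped = [day for day, ok, total in daily_runs
           if success_rate(ok, total) < SLA]
# Any day in `slipped` breaks the stability requirement.
```

Note that at 2,000 jobs a day, a single failure drops the daily rate to 99.95%, well below six nines — which is why the target is evaluated over a period of time rather than per day.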
Time vs. failure rate

As stated above, application failure depends on various factors, including the underlying subsystem. Application failures can also result from improper tuning of the application's own parameters; e.g. excessive queuing of backup sessions can overload a particular component within the application. It is therefore very important to apply proper tunings to the application based on the defined SLA. Once these parameters are defined correctly, failures should be measured by running a set of operations from a predefined operational profile on a daily basis. This helps measure the impact of application failures that cause backup windows to be missed. These errors should be classified by severity, criticality, and warning level for every component or subsystem. Application behavior should then be monitored against these error patterns over a period of time.

E.g. the application logs X errors from subsystem Y on a daily basis for a particular operation in the operational profile, then on the Nth day hangs because of that subsystem. The above metrics help identify the culprit subsystem or component within the application.

Another example: if the application fails once in 1,000 similar operations, its probability of failure is 1/1000 = 0.001.
System resource utilization over a period of time

The underlying subsystem plays a critical role in backup application behavior. The resource utilization pattern should be measured on a daily basis for predefined operations from the operational profile, and then tracked over a period of time. The application's resource utilization pattern at any given time on any day should remain the same unless something else is affecting the system (e.g. sharing of resources).
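One way to apply this metric is to compare each day's utilization samples against a baseline day taken under the same operational profile and flag drift beyond a tolerance. A minimal sketch with hypothetical hourly CPU samples:

```python
# Compare a day's resource-utilization samples against a baseline
# day taken under the same operational profile. Sample values and
# the tolerance are hypothetical, for illustration only.
def max_drift(baseline, today):
    """Largest absolute difference between matching samples."""
    return max(abs(b - t) for b, t in zip(baseline, today))

baseline_cpu = [20, 55, 80, 75, 30]   # % CPU at fixed times of day
today_cpu    = [22, 57, 79, 74, 31]   # normal day: small jitter
drifted_cpu  = [21, 54, 95, 76, 30]   # spike: something changed

TOLERANCE = 5.0  # percentage points
normal = max_drift(baseline_cpu, today_cpu) <= TOLERANCE
# A drift beyond tolerance signals resource sharing or a regression
# and warrants investigation before it affects the backup window.
```

The same comparison applies to memory, IOPS, or socket counts; what matters is that the samples are taken at the same points in the daily operational profile.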
5. Enterprise Reliability Measurement for Backup (NetWorker as case study)

The subsequent sections discuss EMC's backup application reliability tests and results in detail, based on a practical example; these tests were conducted in a lab.

- The entire infrastructure was dedicated and isolated from outside inhibitors
- EMC XtremIO was used to host the 2,000 NetWorker clients (virtual machines)
- 16 high-end blades hosted these NetWorker clients
- The NetWorker server ran on a dedicated Cisco rack server
- Data from the NetWorker clients was backed up directly to EMC Data Domain (DD990)
- The entire infrastructure used a 10G network and an 8Gbps SAN
Infrastructure summary:
- Self-service portal and service catalog: vCloud Automation Center
- Compute servers: 16 (CPU: 512 GHz, RAM: 3,136 GB)
- Network: Ethernet 10G x 4 with LACP; FC 8G x 8 with PowerPath
- Storage: XtremIO, 2 bricks, 400GB x 50, 8G x 8 FE
6. NetWorker setup details

- Data from the clients was read over the SAN and backed up to the DDR over the LAN
- Data was backed up directly from the EMC NetWorker client to the Data Domain device
- The number of concurrent sessions was adjusted based on the clients' backup streams. The maximum number of concurrent sessions was 512 with 500 clients; with more than 500 clients, the concurrent backup streams on the NetWorker server were set to 1024, the maximum NetWorker supports.
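The stream sizing above can be expressed as a small helper: desired server parallelism is driven by the expected concurrent client streams, clamped to the 1024-session limit the text cites. A sketch; the per-client stream count is a hypothetical workload assumption:

```python
# Size NetWorker server parallelism from the client count.
# The 1024-session cap follows the limit cited in the text;
# streams_per_client is a hypothetical workload assumption.
MAX_SERVER_PARALLELISM = 1024

def server_parallelism(n_clients, streams_per_client=1):
    """Desired concurrent streams, clamped to the server maximum."""
    wanted = n_clients * streams_per_client
    return min(wanted, MAX_SERVER_PARALLELISM)

# Mirrors the setup in the text: 500 clients fit under the cap,
# while 2,000 clients are clamped to 1024 concurrent streams and
# the remainder queue until a stream frees up.
small = server_parallelism(500)    # under the cap
large = server_parallelism(2000)   # clamped to 1024
```

Anything beyond the clamp queues inside the application, which is exactly the contention behavior examined in the test results that follow.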
7. Test Results

1) Backup throughput summary
[Chart: throughput (in MBps) and backup duration (in mins) for six configurations — 500 clients (DD880), 1000 clients (1 DD880), 2000 clients default parallelism (DD880), 2000 clients (DD990), 2000 clients (2 DD990), and 2000 clients (2 DD990, 20 devices)]

Analysis

- Backup application throughput scales linearly as resources are added. With a single low-end DDR (DD880), the application queues heavily, resulting in low throughput and slower backups. This clearly shows that a slower external hardware component, the DDR in this case, can hurt the backup success rate through excessive queuing.
- The general rule of thumb in backup is: "maximum throughput is limited by the slowest component in the backup chain". So, in the above case, the more contention for resources, the more queuing in the backup application and, therefore, the lower the throughput.
- After the slowest component (DD880) was replaced with a faster Data Domain box, throughput doubled. With an additional DDR (another DD990), throughput doubled again.
- Contention during data protection (too few resources) causes more queuing in the application, which in turn causes sessions to be terminated on timeout. This affects the overall reliability of the application.
- Consider the 2,000-client backup to the DD880 versus the DD990. Overall, 5TB of data was protected from these clients. Adding a faster component to the backup chain (DD990) improved backup speed by 74%. Removing contention in the backup chain should therefore always be a top priority for achieving higher efficiency.
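The rule of thumb above — throughput is limited by the slowest component in the backup chain — can be sketched directly. The component ratings below are hypothetical MB/s values for illustration only:

```python
# Effective backup throughput is bounded by the slowest link in
# the chain (client read, network, target device). Ratings are
# hypothetical MB/s values, for illustration only.
def effective_throughput(chain):
    """Return the bottleneck component name and its throughput."""
    name, mbps = min(chain.items(), key=lambda kv: kv[1])
    return name, mbps

chain = {"client_read": 900, "lan": 1100, "ddr_target": 340}
bottleneck, mbps = effective_throughput(chain)
# Upgrading any component other than the bottleneck does not raise
# end-to-end throughput, which is why replacing the DD880 (and not,
# say, the network) is what doubled throughput in the test above.
```

This is why the recommendations below focus on removing contention at the bottleneck rather than adding capacity elsewhere in the chain.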
Recommendations

- The application environment should be tuned and properly sized to meet protection needs.
- Avoid as much contention as possible in the backup chain to reduce session terminations.
- Avoid sharing resources (backup targets). Sharing a target can cause more queuing in both the application and the backup target, resulting in frequent hangs or session timeouts during data protection.
2) Environmental impact summary during backup

Legend: X-axis: time; Y-axis: number of TCP connections
Analysis

- Environmental factors, such as the underlying operating system on which the backup software runs, play a key role during data protection. With default OS settings, the graph shows the impact of the network stack of the OS hosting the backup server.
- With more queuing (as seen here for the DD880), more socket connections are established for longer durations, consuming more resources on the server. If sessions are held for too long, the OS network stack will, depending on its settings, start terminating TCP sessions.
- Higher session queuing results in higher socket utilization. The DD880 used around 4,000 TCP connections before the backup completed, whereas the DD990 used only about 3,000, for a shorter duration, with less queuing.
[Chart: established TCP sockets over time, 2000-client backup with DD880 vs DD990 — series: No_of_sock_ESTD (DD880), No_of_sock_ESTD (DD990), No_of_sock_ESTD (2DD990)]
Recommendations

- Tuning the underlying OS stack on which the backup software runs is a key factor in achieving better reliability. Failure to do so will result in unreliable backups.
3) Impact of application tuning: load distribution

Legend: X-axis: time; Y-axis: number of concurrent sessions

Analysis

- Improper application settings can affect the backup window during data protection. Allow as many concurrent streams as possible through the application's default settings; changes to the parallelism in the backup policy (EMC NetWorker policy/savegroup) need to be designed carefully.
[Chart: application concurrency, default vs. tuned — active sessions with savegroup default parallelism vs. savegroup parallelism of 150; with the tuned setting, sessions took 20 minutes to climb back to 500]
4) Impact of system resource utilization by the application

Analysis

- The graph clearly depicts the NetWorker application's memory requirements for protecting a linearly scaled number of clients. Not meeting these memory requirements negatively affects application behavior.
- Queued sessions increase memory utilization. Increasing the concurrent sessions (last run) from 600 to 1,000 also increases memory utilization considerably. What matters for application reliability here is the application's resource demand on the underlying server: if the application is not designed to throttle its memory requirements, it tends to use whatever resources are available, and not meeting those memory demands during concurrent operations can affect its operations.
Recommendations

- If the application has no intelligence for throttling its operations based on available resources, it is mandatory to size memory at least 15-20% above the observed requirement to meet reliability targets.
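The 15-20% headroom recommended above can be computed from the observed peak utilization. A sketch; the peak figure is hypothetical, for illustration only:

```python
# Size server memory from the observed peak application usage plus
# the 15-20% headroom recommended in the text. The peak value is
# hypothetical, for illustration only.
def sized_memory_gb(peak_gb, headroom=0.20):
    """Recommended memory: observed peak plus a headroom fraction."""
    return peak_gb * (1.0 + headroom)

peak = 40.0                            # observed peak, in GB
recommended = sized_memory_gb(peak)        # with 20% headroom
low_end = sized_memory_gb(peak, 0.15)      # with 15% headroom
```

The peak should be taken from the heaviest concurrent workload (full backup overlapping maintenance), not from an average day, since that is when an unthrottled application demands the most memory.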
5) Impact of application sub-system resource utilization with catalog growth

Legend: X-axis: time; Y-axis: memory usage

Analysis

- In NetWorker, the jobs daemon plays a key role in handling sessions, keeping session records, and monitoring them during backup. Every operation in NetWorker is stored in the JobsDB, which can grow over a period of time.
- The size of the JobsDB impacts backup operations. As its size increases linearly, the memory requirements of the jobs daemon also increase. Understanding such patterns in application sub-systems is very important, and sizing them properly improves application reliability.
Recommendations

- Purge the catalog within the application based on retention time, so that the application removes older records from the JobsDB and keeps the catalog size constant for better efficiency.
- It is mandatory to size the application's memory requirement at least 10-15% higher, so that any sub-system demanding more memory at any given point in time can get it.
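The purge recommendation above amounts to a retention filter over job records. The record layout and the 7-day retention below are assumptions for illustration; in NetWorker itself, jobsdb retention is controlled through server configuration rather than code like this:

```python
# Keep the jobs catalog bounded by dropping records older than the
# retention period. The record structure and the 7-day retention
# are hypothetical, for illustration only.
from datetime import datetime, timedelta

def purge_jobsdb(records, now, retention_days=7):
    """Return only records newer than the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return [r for r in records if r["finished"] >= cutoff]

now = datetime(2015, 3, 1)
records = [
    {"job": "backup-001", "finished": datetime(2015, 2, 27)},
    {"job": "backup-002", "finished": datetime(2015, 2, 10)},  # stale
]
kept = purge_jobsdb(records, now)
# Only the record inside the 7-day window survives, keeping the
# catalog (and the jobs daemon's memory footprint) roughly constant.
```

With a fixed retention window and a steady daily workload, the JobsDB reaches a stable size instead of growing without bound.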
6) Impact of overall and component-level CPU utilization

Analysis

- As with memory, the first graph clearly depicts the NetWorker application's CPU requirements for protecting a linearly scaled number of clients. Not meeting these CPU requirements negatively affects application behavior.
- The second graph shows the individual component (daemon/process) level CPU usage. What matters for application reliability is the application's resource demand on the underlying server: if the application is not designed to throttle its CPU requirements, it tends to use whatever resources are available, and not meeting those CPU demands during concurrent operations can affect its operations.
7) Impact of the underlying storage sub-system on the backup application server

[Chart: IOPS during NetWorker operations — IOPS stats and service-time stats over time; X-axis: time, Y-axis: number of IOPS]
Analysis

- EMC NetWorker performs many I/O operations on its catalog during concurrent backup, recovery, maintenance, and record-purging operations.
- The IOPS and I/O service time for these operations typically spiked to a higher range when backups were initiated.
- A huge number of IOPS was observed during application catalog backup once the catalog held a scaled number of records.
- During NetWorker maintenance operations, such as the consistency check of its catalog, the IOPS and the service time to process them are always high.
Recommendations

- Ensure these IOPS requirements are met during these key operations.
- Each operation performs a certain number of IOPS on the NetWorker catalog; if operations overlap, IOPS can increase significantly. Size IOPS such that overall IOPS = sum of the individual operation-level IOPS.
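The sizing rule above — overall IOPS equals the sum of the individual operation-level IOPS — can be written out directly. The per-operation figures are hypothetical, for illustration only:

```python
# Size catalog-disk IOPS for the worst case, where NetWorker
# operations overlap: total demand is the sum of each operation's
# individual IOPS. Figures are hypothetical, for illustration only.
operation_iops = {
    "backup": 1200,
    "clone": 400,
    "recover": 300,
    "catalog_consistency_check": 800,
}

def required_iops(ops):
    """Overall IOPS = sum of the individual operation-level IOPS."""
    return sum(ops.values())

total = required_iops(operation_iops)
# `total` is what the catalog disk must sustain when all four
# operations run at once; sizing for a single operation's IOPS
# leaves the disk saturated whenever operations overlap.
```

Measuring each operation's IOPS in isolation (as in the graph above) gives the inputs; the sum gives the worst-case provisioning target.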
8) Impact of longevity testing and overlapping operations

The graph below shows the backup application's memory requirements over a period of time. During these tests, a series of operations was run with a similar load pattern on a daily basis, and application behavior was measured.
Analysis

- The memory spikes in the graph correspond to the full backup and consecutive full backups.
- During regular backup operations, a consistent memory pattern was observed on the server.
- The memory pattern changes drastically when NetWorker operations overlap (highlighted with a circle), such as backup overlapping with maintenance operations.
Recommendations

- Schedule maintenance operations at another time, so that they do not overlap with regular backup/clone/recover operations.
9) Impact of the underlying storage sub-system on the backup application server over a period of time

Analysis

- The graph gives a clear picture of the I/O pattern over a period of time, with sequential and overlapping operations.
- During sequential NetWorker operations, the IOPS pattern on the catalog disk was consistent. However, overlapping NetWorker operations put significant load on the catalog disk, so the IOPS and the service time to process these I/Os increase significantly.
- Reliability issues, such as intermittent hangs, were noticed when the underlying storage subsystem did not deliver the overall IOPS NetWorker required.
- During NetWorker maintenance operations, such as the catalog consistency check, the IOPS and service time are always in a higher range (as highlighted).
Recommendations

- If overall IOPS increases significantly, host the catalog on faster disks.
- If overlapping operations cannot be avoided, always provision the overall IOPS for the application catalog to avoid reliability issues, where overall IOPS = sum of the individual NetWorker operation IOPS.
8. Conclusion

For any data protection software, achieving application reliability is the key task. Achieving software reliability is difficult because it depends on the complexity of the software as well as on the underlying subsystem. Application reliability depends on high software quality and on a design that can adjust and auto-tune to changes in the underlying sub-system. Sizing the underlying sub-system and taking preventive steps when failure rates rise will therefore improve backup application reliability. It is important that backup application reliability be measured during the requirements, design and coding, and testing phases.
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION
MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO
THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.