Gururaj Kulkarni, Consultant QA, [email protected]
Ravikumar Madhappan, Principal Software QA, [email protected]
Yogesh BS, Principal Software QA, [email protected]
ENTERPRISE RELIABILITY TEST FOR
BACKUP APPLICATION
2015 EMC Proven Professional Knowledge Sharing 2
Table of Contents

1. Abstract
2. Background Information
3. Introduction to EMC NetWorker
4. Reliability Metrics
5. Enterprise Reliability Measurement for Backup (NetWorker as case study)
6. NetWorker setup details
7. Test Results
8. Conclusion
Disclaimer: The views, processes or methodologies published in this article are those of the
authors. They do not necessarily reflect EMC Corporation’s views, processes or methodologies.
1. Abstract

Data growth is one of the biggest challenges data protection software faces in meeting its customers' needs. Data protection is one of the least glamorous yet most important disciplines in the data center. Customers increasingly demand product reliability at massive scale to keep pace with massive data growth. As organizations seek new ways to drive efficiency and reduce costs, data protection becomes ever more important. Customers want their data protected wherever it resides, whether in their data center or in the cloud, and they want it easily accessible regardless of where it is located. Protecting a huge environment with legacy backup methods is not a solution; rather, it challenges engineering teams to find efficient ways to improve product reliability. Additionally, recovery point objective (RPO) challenges become more critical as environments age, with data protection software running for long periods without interruption. The application must be consistent in meeting these reliability challenges.
Customer demands are changing. In the early days (1990-2000), nonfunctional requirements such as performance and reliability were 'good to have' in a product but not necessarily a 'must have'. Now, with exponential year-on-year data growth, customers want a more reliable and faster backup application that meets their demands.

The annual data volume is growing exponentially year on year. Per CSC analysis, data volume will have increased by 4,300% by 2020.
[Figure: projected annual data growth through 2020. Courtesy: Computer Sciences Corporation]
Backup application reliability can be defined as "meeting the data protection (backup) and data management (recovery) needs consistently over a period of time with a 99.9999% success rate". Each additional "9" of reliability requires more investment and cost. The backup application must be robust enough to deliver such reliability, and it is always a challenge for backup software to meet 100% reliability requirements. The following components pose the key challenges:

- Application catalog size, i.e. the metadata processed by the backup application during its operations
- Environment component sizing
- Application sizing that accounts for future growth
- Disaster recovery as the catalog grows
- Fault tolerance, for both the environment and the application

With massive data growth, scalability of the protection environment, proper application and environment sizing, and fault tolerance are the key factors for achieving a higher reliability quotient. The more robust the backup application, the better it can guarantee reliability by meeting these requirements.
To meet the above challenges, internal tests were performed to assess backup application reliability in a massively scaled environment. In this case study, EMC NetWorker® is used to protect such an environment. A series of operations was performed with the backup application in uniform and random order to validate product behavior over a period of time.

A test was designed to prove the ability of the NetWorker backup application under different workload scenarios. The server's behavior was validated with a longevity test, keeping the same load pattern on a daily basis for one week. The backend infrastructure (2,000 clients) was configured on EMC XtremIO®.
Tests were performed with the following scenarios:

- Scalability test with uniform workload pattern
  - Backups were performed with 500, 1,000, and 2,000 clients against EMC Data Domain® DD880 and DD990 systems.
  - Maintenance activities were performed on a daily basis; routines such as nsrck, nsrim, and nsrinfo were run to ensure data integrity.
  - Clone and recover operations were run on a random basis.
- Uninterrupted test
  - One week of uninterrupted tests was performed for backup, recover, clone, and daily health-check routines, overlapping them to evaluate application reliability and availability.
While performing the tests, the following attributes were monitored:

- Throughput
- Resource utilization
- IOPS behavior of the catalog disk
- TCP socket usage
- NetWorker save streams (concurrent sessions)
- Data Domain active connections
- nsrd port usage
- NetWorker daemon response
This article details the following for EMC's data protection software, NetWorker:

1. Sizing the NetWorker server to meet data protection and management needs
2. Test setup challenges with NetWorker backup testing on a single datazone server with 2,000 clients
3. Incremental test results for 2,000 clients
4. Information-gathering challenges
5. Test results: datazone throughput, memory/CPU utilization, I/O utilization, and daemon response
6. Best practices derived from the test results
2. Background Information

According to the EMC Global Data Protection Index, which surveyed 3,300 IT decision makers from mid-size to enterprise customers across multiple countries, data loss and downtime cost businesses $1.7 trillion. Seventy-one percent of IT professionals are not fully confident in their ability to recover information following an incident.
Impact of Data Loss and Downtime

The good news is that the number of data loss incidents is decreasing overall. However, the volume of data lost per incident is growing exponentially:

- 64% of enterprises surveyed experienced data loss or downtime in the last 12 months
- The average business experienced more than three working days (25 hours) of unexpected downtime in the last 12 months
- Other commercial consequences of disruptions were loss of revenue (36%) and delays to product development (34%)

These figures clearly indicate that IT departments and CIOs are looking for high reliability and availability in their data protection applications.
3. Introduction to EMC NetWorker
NetWorker backup and recovery software helps organizations meet demanding data protection
requirements while lowering costs through centralized management. By enabling a single point
of control for multiple data protection technologies including backup-to-disk and next-generation
options such as deduplication, NetWorker helps IT departments deliver higher levels of backup
and recovery services and keep pace with round-the-clock business operations.
NetWorker is designed as a 4-tier architecture.

NetWorker Management Console (NMC) server
- Monitors and configures the NetWorker server
- Collects reporting data and generates reports (optional)
- Can connect to one or more NetWorker servers

NetWorker Server
- Single point of control; can be managed by one or more NMC servers
- Hosts all NetWorker internal databases
- Starts scheduled jobs and sends on-demand notifications
- Controls all target devices
NetWorker Storage Node
- Directs backup data to the target device
- Can accept local data, avoiding TCP network transfers, or data over the TCP network

NetWorker Client
- Runs the agent and starts jobs as requested by the NetWorker server
- Jobs can also be started manually
- Application modules are installed on top of the NetWorker client

The NMC server, NetWorker server, storage node, and client do not have to be on the same version:
- It is recommended that the storage node run the same version as the server
- Long-term compatibility is maintained for obsolete clients
4. Reliability Metrics

Application reliability also depends on the reliability of the underlying environment. The hardware and operating system need to be error-proof before application reliability can be meaningfully measured.

The diagram below shows the dependency factors for application behavior: the application sits on the system (operating system) and hardware layers.

If the application depends on different components in the subsystem, the application's probability of failure can be expressed (to a first-order approximation, for small, independent failure probabilities) as

P(Application failure) = P(Hardware failure) + P(System failure) + P(Operator failure)
[Diagram: dependency layers — the User on top of the Application, which runs on the System (OS) and Hardware layers]
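The additive form above is a first-order approximation that holds when the individual failure probabilities are small and independent; the exact combination is 1 minus the product of the component success probabilities. A minimal sketch, with hypothetical per-component probabilities chosen for illustration only:

```python
# Combine per-component failure probabilities into an overall
# application failure probability. The figures below are
# hypothetical, for illustration only.
def p_failure_exact(probs):
    """Exact: 1 - product of component success probabilities."""
    p_ok = 1.0
    for p in probs:
        p_ok *= (1.0 - p)
    return 1.0 - p_ok

def p_failure_additive(probs):
    """First-order approximation used in the text: simple sum."""
    return sum(probs)

components = {"hardware": 0.001, "system": 0.002, "operator": 0.0005}
exact = p_failure_exact(components.values())
approx = p_failure_additive(components.values())
# For small probabilities the two agree closely; the sum slightly
# overcounts the (rare) simultaneous failures.
```

The additive form is convenient for back-of-the-envelope sizing; the exact form matters only when component failure probabilities become large.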
Application failures are often design failures in code, whereas hardware failures are not necessarily design flaws but result from component failures.

In this article, EMC NetWorker is used as the case study for application reliability.

Reliability metrics are units of measure for application reliability. For a backup application such as NetWorker, the reliability metrics are defined as follows.

Time vs. stability (i.e. % backup success rate over a period of time)

Stability is a constant backup success rate. The application success rate should be measured by running a set of operations from a predefined operational profile on a daily basis and then measuring stability over a period of time.
E.g. at any given time, the application should meet the predefined backup window if nothing in the system has changed. If the backup window time changes without any deviation in the system, the probability of success, i.e. of meeting the backup SLA, slips from 100%.
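The stability metric above can be tracked by recording the daily success rate and flagging any day that falls below the SLA. A minimal sketch, with hypothetical daily job counts:

```python
# Track daily backup success rate and flag SLA slips.
# Job counts are hypothetical, for illustration only.
SLA = 0.999999  # the "six nines" target from the definition above

def success_rate(succeeded, total):
    """Fraction of jobs that succeeded; 1.0 when nothing ran."""
    return succeeded / total if total else 1.0

# (day, succeeded_jobs, total_jobs)
daily_runs = [
    ("day1", 2000, 2000),
    ("day2", 2000, 2000),
    ("day3", 1999, 2000),  # one failed save set
]

slipped = [day for day, ok, total in daily_runs
           if success_rate(ok, total) < SLA]
# Any day in `slipped` breaks the stability requirement.
```

Note that at 2,000 jobs a day, a single failure drops the daily rate to 99.95%, well below six nines — which is why the target is evaluated over a period of time rather than per day.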
Time vs. failure rate

As stated above, application failure depends on various factors, including the underlying subsystem. Application failures can also result from improper tuning of the application's own parameters; e.g. excessive queuing of backup sessions can overload a particular component within the application. It is therefore very important to apply proper tunings to the application based on the defined SLA. Once these parameters are defined correctly, failures should be measured by running a set of operations from a predefined operational profile on a daily basis. This helps measure the impact of application failures that cause backup windows to be missed. These errors should be classified by severity, criticality, and warning level for every component or subsystem. Application behavior should then be monitored against these error patterns over a period of time.

E.g. the application logs X errors from subsystem Y on a daily basis for a particular operation in the operational profile, then on the Nth day hangs because of that subsystem. The above metrics help identify the culprit subsystem or component within the application.

Another example: if the application fails once in 1,000 similar operations, its probability of failure is 1/1000 = 0.001.
System resource utilization over a period of time

The underlying subsystem plays a critical role in backup application behavior. The resource utilization pattern should be measured on a daily basis for predefined operations from the operational profile, and then tracked over a period of time. The application's resource utilization pattern at any given time on any day should remain the same unless something else is affecting the system (e.g. sharing of resources).
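One way to apply this metric is to compare each day's utilization samples against a baseline day taken under the same operational profile and flag drift beyond a tolerance. A minimal sketch with hypothetical hourly CPU samples:

```python
# Compare a day's resource-utilization samples against a baseline
# day taken under the same operational profile. Sample values and
# the tolerance are hypothetical, for illustration only.
def max_drift(baseline, today):
    """Largest absolute difference between matching samples."""
    return max(abs(b - t) for b, t in zip(baseline, today))

baseline_cpu = [20, 55, 80, 75, 30]   # % CPU at fixed times of day
today_cpu    = [22, 57, 79, 74, 31]   # normal day: small jitter
drifted_cpu  = [21, 54, 95, 76, 30]   # spike: something changed

TOLERANCE = 5.0  # percentage points
normal = max_drift(baseline_cpu, today_cpu) <= TOLERANCE
# A drift beyond tolerance signals resource sharing or a regression
# and warrants investigation before it affects the backup window.
```

The same comparison applies to memory, IOPS, or socket counts; what matters is that the samples are taken at the same points in the daily operational profile.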
5. Enterprise Reliability Measurement for Backup (NetWorker as case study)

The subsequent sections discuss EMC's backup application reliability tests and results in detail, based on a practical example; these tests were conducted in a lab.

- The entire infrastructure was dedicated and isolated from outside inhibitors
- EMC XtremIO was used to host the 2,000 NetWorker clients (virtual machines)
- 16 high-end blades hosted these NetWorker clients
- The NetWorker server ran on a dedicated Cisco rack server
- Data from the NetWorker clients was backed up directly to EMC Data Domain (DD990)
- The entire infrastructure used a 10G network and an 8Gbps SAN
Infrastructure summary:
- Self-service portal and service catalog: vCloud Automation Center
- Compute servers: 16 (CPU: 512 GHz, RAM: 3,136 GB)
- Network: Ethernet 10G x 4 with LACP; FC 8G x 8 with PowerPath
- Storage: XtremIO, 2 bricks, 400GB x 50, 8G x 8 FE
6. NetWorker setup details

- Data from the clients was read over the SAN and backed up to the DDR over the LAN
- Data was backed up directly from the EMC NetWorker client to the Data Domain device
- The number of concurrent sessions was adjusted based on the clients' backup streams. The maximum number of concurrent sessions was 512 with 500 clients; with more than 500 clients, the concurrent backup streams on the NetWorker server were set to 1024, the maximum NetWorker supports.
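The stream sizing above can be expressed as a small helper: desired server parallelism is driven by the expected concurrent client streams, clamped to the 1024-session limit the text cites. A sketch; the per-client stream count is a hypothetical workload assumption:

```python
# Size NetWorker server parallelism from the client count.
# The 1024-session cap follows the limit cited in the text;
# streams_per_client is a hypothetical workload assumption.
MAX_SERVER_PARALLELISM = 1024

def server_parallelism(n_clients, streams_per_client=1):
    """Desired concurrent streams, clamped to the server maximum."""
    wanted = n_clients * streams_per_client
    return min(wanted, MAX_SERVER_PARALLELISM)

# Mirrors the setup in the text: 500 clients fit under the cap,
# while 2,000 clients are clamped to 1024 concurrent streams and
# the remainder queue until a stream frees up.
small = server_parallelism(500)    # under the cap
large = server_parallelism(2000)   # clamped to 1024
```

Anything beyond the clamp queues inside the application, which is exactly the contention behavior examined in the test results that follow.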
7. Test Results

1) Backup throughput summary
[Chart: throughput (in MBps) and backup duration (in mins) for six configurations — 500 clients (DD880), 1000 clients (1 DD880), 2000 clients default parallelism (DD880), 2000 clients (DD990), 2000 clients (2 DD990), and 2000 clients (2 DD990, 20 devices)]

Analysis

- Backup application throughput scales linearly as resources are added. With a single low-end DDR (DD880), the application queues heavily, resulting in low throughput and slower backups. This clearly shows that a slower external hardware component, the DDR in this case, can hurt the backup success rate through excessive queuing.
- The general rule of thumb in backup is: "maximum throughput is limited by the slowest component in the backup chain". So, in the above case, the more contention for resources, the more queuing in the backup application and, therefore, the lower the throughput.
- After the slowest component (DD880) was replaced with a faster Data Domain box, throughput doubled. With an additional DDR (another DD990), throughput doubled again.
- Contention during data protection (too few resources) causes more queuing in the application, which in turn causes sessions to be terminated on timeout. This affects the overall reliability of the application.
- Consider the 2,000-client backup to the DD880 versus the DD990. Overall, 5TB of data was protected from these clients. Adding a faster component to the backup chain (DD990) improved backup speed by 74%. Removing contention in the backup chain should therefore always be a top priority for achieving higher efficiency.
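The rule of thumb above — throughput is limited by the slowest component in the backup chain — can be sketched directly. The component ratings below are hypothetical MB/s values for illustration only:

```python
# Effective backup throughput is bounded by the slowest link in
# the chain (client read, network, target device). Ratings are
# hypothetical MB/s values, for illustration only.
def effective_throughput(chain):
    """Return the bottleneck component name and its throughput."""
    name, mbps = min(chain.items(), key=lambda kv: kv[1])
    return name, mbps

chain = {"client_read": 900, "lan": 1100, "ddr_target": 340}
bottleneck, mbps = effective_throughput(chain)
# Upgrading any component other than the bottleneck does not raise
# end-to-end throughput, which is why replacing the DD880 (and not,
# say, the network) is what doubled throughput in the test above.
```

This is why the recommendations below focus on removing contention at the bottleneck rather than adding capacity elsewhere in the chain.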
Recommendations

- The application environment should be tuned and properly sized to meet protection needs.
- Avoid as much contention as possible in the backup chain to reduce session terminations.
- Avoid sharing resources (backup targets). Sharing a target can cause more queuing in both the application and the backup target, resulting in frequent hangs or session timeouts during data protection.
2) Environmental impact summary during backup

Legend: X-axis: time; Y-axis: number of TCP connections
Analysis

- Environmental factors, such as the underlying operating system on which the backup software runs, play a key role during data protection. With default OS settings, the graph shows the impact of the network stack of the OS hosting the backup server.
- With more queuing (as seen here for the DD880), more socket connections are established for longer durations, consuming more resources on the server. If sessions are held for too long, the OS network stack will, depending on its settings, start terminating TCP sessions.
- Higher session queuing results in higher socket utilization. The DD880 used around 4,000 TCP connections before the backup completed, whereas the DD990 used only about 3,000, for a shorter duration, with less queuing.
[Chart: established TCP sockets over time, 2000-client backup with DD880 vs DD990 — series: No_of_sock_ESTD (DD880), No_of_sock_ESTD (DD990), No_of_sock_ESTD (2DD990)]
Recommendations

- Tuning the underlying OS stack on which the backup software runs is a key factor in achieving better reliability. Failure to do so will result in unreliable backups.
3) Impact of application tuning: load distribution

Legend: X-axis: time; Y-axis: number of concurrent sessions

Analysis

- Improper application settings can affect the backup window during data protection. Allow as many concurrent streams as possible through the application's default settings; changes to the parallelism in the backup policy (EMC NetWorker policy/savegroup) need to be designed carefully.
[Chart: application concurrency, default vs. tuned — active sessions with savegroup default parallelism vs. savegroup parallelism of 150; with the tuned setting, sessions took 20 minutes to climb back to 500]
4) Impact of system resource utilization by the application

Analysis

- The graph clearly depicts the NetWorker application's memory requirements for protecting a linearly scaled number of clients. Not meeting these memory requirements negatively affects application behavior.
- Queued sessions increase memory utilization. Increasing the concurrent sessions (last run) from 600 to 1,000 also increases memory utilization considerably. What matters for application reliability here is the application's resource demand on the underlying server: if the application is not designed to throttle its memory requirements, it tends to use whatever resources are available, and not meeting those memory demands during concurrent operations can affect its operations.
Recommendations

- If the application has no intelligence for throttling its operations based on available resources, it is mandatory to size memory at least 15-20% above the observed requirement to meet reliability targets.
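The 15-20% headroom recommended above can be computed from the observed peak utilization. A sketch; the peak figure is hypothetical, for illustration only:

```python
# Size server memory from the observed peak application usage plus
# the 15-20% headroom recommended in the text. The peak value is
# hypothetical, for illustration only.
def sized_memory_gb(peak_gb, headroom=0.20):
    """Recommended memory: observed peak plus a headroom fraction."""
    return peak_gb * (1.0 + headroom)

peak = 40.0                            # observed peak, in GB
recommended = sized_memory_gb(peak)        # with 20% headroom
low_end = sized_memory_gb(peak, 0.15)      # with 15% headroom
```

The peak should be taken from the heaviest concurrent workload (full backup overlapping maintenance), not from an average day, since that is when an unthrottled application demands the most memory.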
5) Impact of application sub-system resource utilization with catalog growth

Legend: X-axis: time; Y-axis: memory usage

Analysis

- In NetWorker, the jobs daemon plays a key role in handling sessions, keeping session records, and monitoring them during backup. Every operation in NetWorker is stored in the JobsDB, which can grow over a period of time.
- The size of the JobsDB impacts backup operations. As its size increases linearly, the memory requirements of the jobs daemon also increase. Understanding such patterns in application sub-systems is very important, and sizing them properly improves application reliability.
Recommendations

- Purge the catalog within the application based on retention time, so that the application removes older records from the JobsDB and keeps the catalog size constant for better efficiency.
- It is mandatory to size the application's memory requirement at least 10-15% higher, so that any sub-system demanding more memory at any given point in time can get it.
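The purge recommendation above amounts to a retention filter over job records. The record layout and the 7-day retention below are assumptions for illustration; in NetWorker itself, jobsdb retention is controlled through server configuration rather than code like this:

```python
# Keep the jobs catalog bounded by dropping records older than the
# retention period. The record structure and the 7-day retention
# are hypothetical, for illustration only.
from datetime import datetime, timedelta

def purge_jobsdb(records, now, retention_days=7):
    """Return only records newer than the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return [r for r in records if r["finished"] >= cutoff]

now = datetime(2015, 3, 1)
records = [
    {"job": "backup-001", "finished": datetime(2015, 2, 27)},
    {"job": "backup-002", "finished": datetime(2015, 2, 10)},  # stale
]
kept = purge_jobsdb(records, now)
# Only the record inside the 7-day window survives, keeping the
# catalog (and the jobs daemon's memory footprint) roughly constant.
```

With a fixed retention window and a steady daily workload, the JobsDB reaches a stable size instead of growing without bound.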
6) Impact of overall and component-level CPU utilization

Analysis

- As with memory, the first graph clearly depicts the NetWorker application's CPU requirements for protecting a linearly scaled number of clients. Not meeting these CPU requirements negatively affects application behavior.
- The second graph shows the individual component (daemon/process) level CPU usage. What matters for application reliability is the application's resource demand on the underlying server: if the application is not designed to throttle its CPU requirements, it tends to use whatever resources are available, and not meeting those CPU demands during concurrent operations can affect its operations.
7) Impact of the underlying storage sub-system on the backup application server

[Chart: IOPS during NetWorker operations — IOPS stats and service-time stats over time; X-axis: time, Y-axis: number of IOPS]
Analysis

- EMC NetWorker performs many I/O operations on its catalog during concurrent backup, recovery, maintenance, and record-purging operations.
- The IOPS and I/O service time for these operations typically spiked to a higher range when backups were initiated.
- A huge number of IOPS was observed during application catalog backup once the catalog held a scaled number of records.
- During NetWorker maintenance operations, such as the consistency check of its catalog, the IOPS and the service time to process them are always high.
Recommendations

- Ensure these IOPS requirements are met during these key operations.
- Each operation performs a certain number of IOPS on the NetWorker catalog; if operations overlap, IOPS can increase significantly. Size IOPS such that overall IOPS = sum of the individual operation-level IOPS.
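The sizing rule above — overall IOPS equals the sum of the individual operation-level IOPS — can be written out directly. The per-operation figures are hypothetical, for illustration only:

```python
# Size catalog-disk IOPS for the worst case, where NetWorker
# operations overlap: total demand is the sum of each operation's
# individual IOPS. Figures are hypothetical, for illustration only.
operation_iops = {
    "backup": 1200,
    "clone": 400,
    "recover": 300,
    "catalog_consistency_check": 800,
}

def required_iops(ops):
    """Overall IOPS = sum of the individual operation-level IOPS."""
    return sum(ops.values())

total = required_iops(operation_iops)
# `total` is what the catalog disk must sustain when all four
# operations run at once; sizing for a single operation's IOPS
# leaves the disk saturated whenever operations overlap.
```

Measuring each operation's IOPS in isolation (as in the graph above) gives the inputs; the sum gives the worst-case provisioning target.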
8) Impact of longevity testing and overlapping operations

The graph below shows the backup application's memory requirements over a period of time. During these tests, a series of operations was run with a similar load pattern on a daily basis, and application behavior was measured.
Analysis

- The memory spikes in the graph correspond to the full backup and consecutive full backups.
- During regular backup operations, a consistent memory pattern was observed on the server.
- The memory pattern changes drastically when NetWorker operations overlap (highlighted with a circle), such as backup overlapping with maintenance operations.
Recommendations

- Schedule maintenance operations at another time, so that they do not overlap with regular backup/clone/recover operations.
9) Impact of the underlying storage sub-system on the backup application server over a period of time

Analysis

- The graph gives a clear picture of the I/O pattern over a period of time, with sequential and overlapping operations.
- During sequential NetWorker operations, the IOPS pattern on the catalog disk was consistent. However, overlapping NetWorker operations put significant load on the catalog disk, so the IOPS and the service time to process these I/Os increase significantly.
- Reliability issues, such as intermittent hangs, were noticed when the underlying storage subsystem did not deliver the overall IOPS NetWorker required.
- During NetWorker maintenance operations, such as the catalog consistency check, the IOPS and service time are always in a higher range (as highlighted).
Recommendations

- If overall IOPS increases significantly, host the catalog on faster disks.
- If overlapping operations cannot be avoided, always provision the overall IOPS for the application catalog to avoid reliability issues, where overall IOPS = sum of the individual NetWorker operation IOPS.
8. Conclusion

For any data protection software, achieving application reliability is the key task. Achieving software reliability is difficult because it depends on the complexity of the software as well as on the underlying subsystem. Application reliability depends on high software quality and on a design that can adjust and auto-tune to changes in the underlying sub-system. Sizing the underlying sub-system and taking preventive steps when failure rates rise will therefore improve backup application reliability. It is important that backup application reliability be measured during the requirements, design and coding, and testing phases.
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION
MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO
THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.