Recovering From the Ground Up Case Study


Upload: dell-enterprise

Post on 15-May-2015


Page 1: Recovering From the Ground Up Case Study

RECOVERING FROM THE GROUND UP

Dell builds a blueprint for recovery, leading to a service that can help customers minimize risk when disaster strikes

CUSTOMER PROFILE

Country: United States

Industry: Technology

Founded: 1984

Number of Employees: 80,000

Web Address: www.dell.com

CHALLENGE

Dell’s failover plans for disaster recovery needed to be refined to minimize downtime for essential applications and processes and help ensure business continuity in the event of a disaster.

SOLUTION

The Dell disaster recovery team assessed the potential for loss, prioritized applications and processes, and established thorough and repeatable best practices, ultimately evolving its methodology into Dell Disaster Recovery Consulting Services.

BENEFITS

Run IT Better

• Dedication to documentation, testing, and improvement has helped Dell speed failover from eight hours to mere minutes in some cases

• Working with auditors to improve processes and create an accountable methodology for fixes has accelerated the audit process and minimized the impact on daily workload

• Planning risk avoidance alongside Dell’s insurance carrier allowed Dell to reduce its monthly premium after implementing a secondary data center and failover procedures

The ability to stay up and running in the face of a hurricane, fire, malicious attack, or even a blackout can be crucial to the long-term survival of every business. Without effective disaster recovery (DR) and business continuity plans, businesses of all sizes risk lost income, productivity, or worse. As a result, more organizations are developing plans to resume business as usual after an unplanned outage while balancing the risks and costs of disaster recovery.

SOLUTION

• BACKUP/RECOVERY/ARCHIVING


HOW IT WORKS

HARDWARE

• Dell™ PowerEdge™ R900 and Dell PowerEdge 2950 servers with Intel® Xeon® processors

SOFTWARE

• Microsoft® SharePoint® Server

• Oracle® Database 10g

• Oracle Data Guard software

• Oracle Enterprise Manager 10g Grid Control

• Oracle Real Application Clusters 10g

SERVICES

• Dell Disaster Recovery Consulting Services

“At Dell, the Disaster Recovery Policy requires that every class 1 application conducts a disaster recovery test every year. Now our staff knows the procedure well enough to fail over quickly because they have tested year after year.”
Debi Higdon, practice lead for DR Services, Dell

At Dell, disaster recovery and business continuity are top priorities, and have been for years. A successful disaster recovery plan is most often proven via audits that assess clear and concise plans for recovery procedures, particularly in regard to mission-critical applications. Auditors also look for proof of successful disaster recovery testing. Since launching its first disaster recovery plan into production in September of 2002, Dell has successfully passed every annual internal and external audit.

Over the years, Dell has refined and improved those plans, developing real-world experience that can help other companies in need of effective disaster recovery. Those efforts have today evolved into Dell Disaster Recovery Consulting Services, a division dedicated to sharing the disaster recovery knowledge assembled over the last eight years. “We started with a blank whiteboard and an absence of preconceived notions, and worked on our own to develop a recovery plan from the ground up,” says Debi Higdon, practice lead for DR Services at Dell, and DR Test Manager at Dell from 2001 to 2008. “As a result, we quickly developed a set of core best practices around assessing needs and provisioning a plan to meet those needs.”

ALIGNING IT WITH BUSINESS UNITS

An effective disaster recovery and business continuity plan depends on an enterprise’s ability to identify critical processes and technologies and balance risks with the costs of continuity efforts. In order to achieve the critical assessment necessary for success, Dell recommends first closely aligning IT and business staff to make decisions as a team. “We asked ourselves, who from the business would call Event Management the quickest if an application is down?” says Higdon. “While direct, that usually provides a great impression of how much downtime your business units can withstand. Once you have a starting point, IT and business staff can refine that feedback into clearly articulated business goals.”


DEFINING PRIORITIES IN A BUSINESS CONTEXT

The Dell disaster recovery team worked with Dell business units to conduct a rigorous analysis, identifying the applications and business processes that were most critical and thus needed to be online first after a disaster. “At Dell, our customers come first, so we classified the most critical applications as those that touch sales, manufacturing, shipping, or service,” explains Higdon. “But it’s not always about revenue—for example, health care and government organizations will have very different priorities in the event of a disaster. To establish rules for criticality, each organization has to determine what could happen in a disaster and what the impact would be.”

Prioritizing processes, applications, and data according to their business impact helps to ensure that the appropriate investment is made to recover the most crucial systems first. In the tiered system at Dell, class 1 applications fail over to a secondary data center within four hours, while class 2 applications have a recovery time objective (RTO) of 4 to 48 hours after an incident. Class 3 applications are recovered as a “best effort” whenever resources become available. If the mission-critical systems are not set up to be active/active across the entire application stack, those applications will need to be prioritized if there are not enough resources to support the recovery at the time of disaster.
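The tiers above can be expressed as a small lookup, which makes it easy to check a test result against its objective. The following is a minimal Python sketch, not Dell's tooling: the application names and the `meets_rto` and `recovery_queue` helpers are hypothetical, and only the four-hour limit, the 48-hour limit, and the best-effort class come from the text.

```python
from datetime import timedelta

# RTO ceilings per application class, taken from the tiers described above.
# Class 3 is "best effort", so it has no fixed ceiling (None).
RTO_LIMIT = {
    1: timedelta(hours=4),
    2: timedelta(hours=48),
    3: None,
}


def meets_rto(app_class: int, recovery_time: timedelta) -> bool:
    """Return True if a measured recovery time satisfies the class objective."""
    limit = RTO_LIMIT[app_class]
    return limit is None or recovery_time <= limit


def recovery_queue(apps: dict[str, int]) -> list[str]:
    """Order applications by class (class 1 first), then by name for stable output."""
    return sorted(apps, key=lambda name: (apps[name], name))


# Hypothetical inventory: application name -> class.
apps = {"reporting": 3, "order-management": 1, "hr-portal": 2}
print(recovery_queue(apps))
print(meets_rto(1, timedelta(minutes=15)))
```

When resources are short at the time of a disaster, a queue like this gives the order in which to spend them.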

ESTABLISHING RECOVERY OBJECTIVES

Dell’s business units and recovery IT staff also worked together to establish RTOs and recovery point objectives (RPOs). Like most organizations, Dell had grown rapidly and in an ad hoc manner, making a complete application inventory a necessary first step. “We went through each application one at a time and identified how much downtime the application could withstand,” says Higdon. Dell’s mission-critical applications had much less leeway when considering RPO. “RPO is really about how much data you can afford to lose,” explains Higdon. “Since our mission-critical applications are centered on customer interactions, we want to establish a low RPO.”
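To make the distinction concrete: RTO bounds how long an application may be down, while RPO bounds how much recent data may be lost. A minimal Python sketch of an RPO check follows. The five-minute RPO, the timestamps, and the function names are illustrative assumptions, not figures from Dell.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated: datetime, incident: datetime) -> timedelta:
    """Data written after the last replicated point is what a failover would lose."""
    return max(incident - last_replicated, timedelta(0))


def within_rpo(last_replicated: datetime, incident: datetime, rpo: timedelta) -> bool:
    """Return True if the data-loss window stays inside the recovery point objective."""
    return data_loss_window(last_replicated, incident) <= rpo


# Hypothetical scenario: an incident at noon, checked against a 5-minute RPO.
incident = datetime(2009, 8, 1, 12, 0)
rpo = timedelta(minutes=5)

# A standby that replicated two minutes before the incident is inside the RPO.
print(within_rpo(datetime(2009, 8, 1, 11, 58), incident, rpo))

# A nightly backup taken at midnight would lose twelve hours of data.
print(within_rpo(datetime(2009, 8, 1, 0, 0), incident, rpo))
```

The second case shows why a low RPO generally rules out infrequent backups as the only protection for customer-facing applications.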

THINKING OUTSIDE THE DATA CENTER BOX

In many cases, enterprises build a disaster recovery plan that assumes communications modes will be operational and the network will be available. Dell also learned that a company of its size couldn’t depend on every employee to be available should service be interrupted. “You definitely have to consider logistics when establishing a plan or building a disaster recovery site,” says Higdon. “How quickly can people get to this site? How much of the IT staff can work remotely? How can you ensure power can be supplied 24/7? These are essential considerations.”

BALANCING RECOVERY AND BUSINESS NEEDS WITH ORACLE RAC ON DELL POWEREDGE SERVERS

Once RTOs and RPOs were determined, Dell began exploring disaster recovery solutions. By assessing the potential financial losses of a disaster as well as the risks to its data center, the company could better balance business needs with the cost of recovery. “To determine an appropriate budget for disaster recovery, we calculated all of the potential financial risks associated with a worst-case scenario,” says Higdon. “Working with our financial teams, insurance carriers, and even a local meteorologist helped us establish a realistic budget for a disaster recovery plan.”

Dell settled on an active/active approach for some of its mission-critical applications that would provide rapid failover to a secondary facility in the event of a disaster, helping to ensure as short an RTO and RPO as possible. Dell data centers run Oracle® Real Application Clusters (RAC) 10g technology on Dell™ PowerEdge™ R900 and Dell PowerEdge 2950 servers with Intel® Xeon® processors. Dell data centers also use Oracle Data Guard software, which helps manage standby databases, and Oracle Enterprise Manager 10g Grid Control, which provides a single point of management for all of the Dell global production databases.

Oracle RAC technology helps to ensure high application availability. If one system in a cluster fails or is taken down for maintenance, the others can pick up its workload instantly. “About 72 percent of the Oracle databases we have in production are Oracle RAC 10g,” says Logan McLeod, IT strategist at Dell. “Oracle RAC provides high availability and scalability, and it enables us to dynamically respond to ever-changing workloads in our environment.”

While supporting extremely short RTOs and RPOs was critical to helping Dell reduce the impact of downtime on sales and customer-oriented processes, other organizations may have different priorities. “As a general rule, shorter recovery times mean more expensive solutions,” says Higdon. “If data isn’t needed to keep the business operational in the near term, tape backups stored in a secure off-site location may be appropriate. Virtualization can also present a cost-effective option for many businesses, providing flexible disaster recovery in a very secure one-to-many relationship.”

REDUCING RTO TO WITHIN MINUTES WITH DISASTER RECOVERY BEST PRACTICES

Over time, Dell’s continued dedication to improving its disaster recovery plan, processes, and the use of new technologies has led to a drastic improvement in its RTO. “In building a disaster recovery practice from the ground up, we began with what appeared to be a very complex set of systems and then simplified them through process analysis, automation, and rigorous testing,” says Higdon. “While Dell’s initial recovery efforts resulted in six to eight hours to fully transition to the failover environment, persistence and experience have helped us shorten failover of some mission-critical systems to minutes.”

REGULARLY UPDATING DOCUMENTATION ENCOURAGES RELIABLE, REPEATABLE PROCESSES

To develop and enforce a reliable, repeatable disaster recovery plan, Dell documents the recovery process for each of its mission-critical applications and infrastructure elements. “Our rigorous dedication to step-by-step documentation has been a secret to our success,” says Higdon. “Over time, we’ve developed a solid template for all applications that includes step-by-step instructions for failover and information about where the servers are located, IP addresses, upstream and downstream dependencies, schematics, and more.”

Those documents, as well as related information like application failover plans, master recovery plans, classification information, test plans, and test requirements and scripts, are stored on a central Microsoft® SharePoint® server dedicated to disaster recovery and also kept on CDs in three separate locations. Even if a disaster cripples phone service, network availability, or transportation, Dell can still begin the recovery process. “If documentation is not updated at least once every six months, a red flag is raised at the executive level through the disaster recovery scorecard,” says Higdon. “By keeping the step-by-step recovery process front of mind, Dell ensures that its plan is ready when the company needs it most. Dell is moving forward with virtualization within its own environment.”
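The six-month freshness rule lends itself to an automated check. The sketch below is a hypothetical illustration in Python, not a description of Dell's scorecard: the document names and dates are invented, and "six months" is approximated as 183 days.

```python
from datetime import date, timedelta

# Assumption: "six months" is approximated as 183 days for this sketch.
MAX_AGE = timedelta(days=183)


def stale_documents(last_updated: dict[str, date], today: date) -> list[str]:
    """Return the names of recovery documents not updated within MAX_AGE."""
    return sorted(
        name for name, updated in last_updated.items()
        if today - updated > MAX_AGE
    )


# Hypothetical document register: name -> date of last update.
register = {
    "order-management failover plan": date(2009, 6, 15),
    "shipping failover plan": date(2008, 12, 1),
}

# Anything returned here would be a candidate for a red flag on the scorecard.
print(stale_documents(register, today=date(2009, 8, 1)))
```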

ASSESSING APPLICATION AND PROCESS INTERACTION PREVENTS DOMINO-EFFECT DOWNTIME

Over time, Dell Disaster Recovery Consulting Services has learned to assess the interactions of applications and business processes both before and after an outage. “Good planning requires looking ahead a step,” says Higdon. “If a class 2 application could cause a domino effect in your mission-critical applications by going down, it should be reclassified as class 1. Likewise, chronological priorities should be considered when planning recovery—for example, shipping shouldn’t come back online before order management. You have to be able to prioritize applications so that if you don’t have enough resources to bring everything back up at the same time, you can bring them up in an order that makes sense.”
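Both ideas in this passage, promoting dependencies of class 1 applications and bringing systems up in dependency order, map naturally onto a dependency graph. The following Python sketch uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later). The application names, the `inventory-db` dependency, and the helper functions are hypothetical. Only the rule that shipping follows order management comes from the text.

```python
from graphlib import TopologicalSorter


def reclassify(classes: dict[str, int], depends_on: dict[str, set[str]]) -> dict[str, int]:
    """Promote to class 1 anything a class 1 application depends on."""
    result = dict(classes)
    changed = True
    while changed:  # repeat until promotions stop propagating down the chain
        changed = False
        for app, deps in depends_on.items():
            if result[app] == 1:
                for dep in deps:
                    if result[dep] != 1:
                        result[dep] = 1
                        changed = True
    return result


def recovery_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which every application follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical map: application -> applications it needs to be running first.
depends_on = {
    "shipping": {"order-management"},
    "order-management": {"inventory-db"},
    "inventory-db": set(),
}
classes = {"shipping": 1, "order-management": 1, "inventory-db": 2}

print(reclassify(classes, depends_on))
print(recovery_order(depends_on))
```

A cycle in the dependency map makes `static_order` raise `graphlib.CycleError`, which is itself a useful finding during planning.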

RIGOROUS TESTING REDUCES FAILOVER TIME AND PREPARES FOR COMPLEX RECOVERY

Dell Disaster Recovery Consulting Services is quick to point out that disaster recovery should not end with the failover plan. “Our number-one best practice is to test our disaster recovery plan, again and again,” says Higdon. “At Dell, the Disaster Recovery Policy requires that every class 1 application conducts a disaster recovery test every year. Now our staff knows the procedure well enough to fail over quickly because they have tested year after year. And once we’re running from the secondary data center, we process real transactions from some applications to make sure everything’s running smoothly.”

By adding integration testing to its course of failover testing, Dell gets a clearer idea of what could happen during a real recovery process. “Failing over a single application may be easy,” says Higdon. “But when that application talks to 20 other applications and they all go down at the same time, that’s when you really know what’s going to happen in a real disaster. By simulating catastrophic outages, Dell learned how to account for application interdependencies as it restores service.”

CROSS-TRAINING IMPROVES DISASTER PREPAREDNESS

By ensuring that multiple staff members are prepared to recover any given system, Dell has improved disaster preparedness. “Realistically, we had to assume a certain degree of chaos in the event of a real disaster,” explains Higdon. “To counter that, we’ve instituted a large degree of cross-pollination when it comes to recovery assignments in IT. During a DR test, we rely on a well-trained networking team halfway around the world that is far removed from any disaster at our headquarters. Database administrators rotate the support of the databases during a disaster recovery test. If something catastrophic happens, I feel confident because we don’t have just a single employee who knows any given application.”

RECOVERY PLANNING EASES AUDITING COMPLIANCE PROCESSES

In addition to yearly internal audits, Dell’s extensive disaster recovery planning has drastically eased compliance with internal and external audits as well as yearly Sarbanes-Oxley compliance audits. “Instead of forming an adversarial relationship with auditors, we’ve learned to work closely with them and incorporate their feedback,” says Higdon. “As a result, action items are assigned a case handler and a management action plan until they are completed. Our documentation and processes have helped accelerate the auditing process and minimize the amount of time it takes to conduct an audit.”

END-TO-END PLANNING PREVENTS SINGLE POINTS OF FAILURE

“There can be no gaps,” says Higdon. “A single point of failure could be a particular application or database server, a lone backup generator in a data center, or the long-haul network itself. Organizations should perform a specific and detailed single-point-of-failure analysis across the entire infrastructure. That kind of gap analysis can help prevent a major outage if a relatively minor component fails.”
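A first pass at such an analysis can be as simple as counting how many independent instances back each infrastructure role. The Python sketch below is a deliberately simplified illustration: the inventory, the instance names, and the `single_points_of_failure` helper are hypothetical. A real analysis would also need to examine shared dependencies between instances, which a plain count cannot see.

```python
def single_points_of_failure(inventory: dict[str, list[str]]) -> list[str]:
    """Return the roles that are served by exactly one instance."""
    return sorted(
        role for role, instances in inventory.items()
        if len(instances) == 1
    )


# Hypothetical inventory: infrastructure role -> instances able to fill it.
inventory = {
    "database server": ["db-a", "db-b"],
    "backup generator": ["gen-1"],
    "long-haul network link": ["carrier-x"],
}

# Roles listed here have no redundancy and deserve a closer look.
print(single_points_of_failure(inventory))
```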

The Dell disaster recovery team also determined that its disaster recovery plans had to look past keeping the data center up and running. Effective business continuity planning must support all vital business functions, such as shipping and manufacturing. “It’s not just about applications—it’s about buildings, infrastructure, and people,” says Higdon. “By creating a plan to reroute manufacturing orders from a destroyed manufacturing facility or rerouting calls to a call center that has been taken off the grid, we can improve business continuity for processes outside of our data centers.”

PROTECTING SYSTEMS HELPS LOWER INSURANCE PREMIUMS

Dell also discovered a hidden cost benefit as it completed the first phase of its disaster recovery plan. By implementing a secondary data center and failover procedures, the company saved on its insurance premium. “From day one, we worked with our insurance company—who better to teach you about risk than someone that deals with insurance claims every day?” says Higdon. “Once a disaster recovery solution is in place and working, Dell Disaster Recovery Consulting Services recommends contacting your insurance carrier to ask about a reduction in premium.”

“By keeping the step-by-step recovery process front of mind, Dell ensures that its plan is ready when the company needs it most.”
Debi Higdon, practice lead for DR Services, Dell

LEVERAGING RECOVERY EXPERIENCE TO BENEFIT CUSTOMERS

Dell now has tested disaster recovery plans that are ready to help the organization recover mission-critical applications rapidly and continue to serve customers even in the event of disasters—as well as the experience that comes from developing those plans from the ground up. “Everything that Dell has developed—templates, documentation, best practices, and methodology—is now being shared with our customers through the DR Service Offering,” says Higdon. “Dell Disaster Recovery Consulting Services helps customers design solutions that fit their unique needs while also better protecting their business should the unthinkable happen.”

For more information on this case study or to read additional case studies, go to dell.com/Casestudies.

“Dell Disaster Recovery Consulting Services helps customers design solutions that fit their unique needs while also better protecting their business should the unthinkable happen.”
Debi Higdon, practice lead for DR Services, Dell

Simplify your total solution at dell.com/Simplify

August 2009. © 2009 Dell, Inc. Dell is a trademark of Dell Inc. Intel, the Intel logo, and Intel Xeon are registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. Microsoft and the Microsoft logo are registered trademarks of Microsoft Corporation in the United States and/or other countries. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. This case study is for informational purposes only. DELL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS CASE STUDY. G910009224