Disaster Recovery Planning (DRP)

DRP is the process of regaining access to the data, hardware and software necessary to resume critical business operations after a natural or human-induced disaster.

A disaster recovery plan (DRP) should also include plans for coping with the unexpected or sudden loss of key personnel, although this is not covered here; the focus is data protection.

DRP is part of a larger process known as business continuity planning (BCP).

What is the difference between DRP and BCP (1/2)

Disaster recovery is the process by which you resume business after a disruptive event.

The event might be something huge, like an earthquake or the terrorist attacks on the World Trade Center, or something small, like malfunctioning software caused by a computer virus.

Given the human tendency to look on the bright side, many business executives are prone to ignoring "disaster recovery" because disaster seems an unlikely event.

What is the difference between DRP and BCP (2/2)

"Business continuity planning" suggests a more comprehensive approach to making sure you can keep making money.

Often, the two terms are married under the acronym BC/DR.

At any rate, DR and/or BC determines how a company will keep functioning after a disruptive event until its normal facilities are restored.

What do these plans include (1/2)

All BC/DR plans need to encompass how employees will communicate, where they will go, and how they will keep doing their jobs.

The details can vary greatly, depending on the size and scope of a company and the way it does business.

What do these plans include (2/2)

For example, the plan at one global manufacturing company would:

restore critical mainframes with vital data at a backup site within four to six days of a disruptive event
obtain a mobile PBX unit with 3000 telephones within two days
recover the company's 1000-plus LANs in order of business need
set up a temporary call center for 100 agents at a nearby training facility

Events that necessitate disaster recovery

Natural disasters
Fire
Power failure
Terrorist attacks
Organized or deliberate disruptions
Theft
System and/or equipment failures
Human error
Computer viruses
Testing

Prevention against data loss (1/2)

Backups sent off-site at regular intervals
  Includes software as well as all data information, to facilitate recovery
Create an insurance copy on microfilm or similar and store the records off-site
Use a remote backup facility if possible to minimize data loss
Storage Area Networks (SANs) over multiple sites make data immediately available without the need to recover or synchronize it

Prevention against data loss (2/2)

Surge protectors: to minimize the effect of power surges on delicate electronic equipment
Uninterruptible Power Supply (UPS) and/or backup generator
Fire prevention: more alarms, accessible extinguishers
Anti-virus software and other security measures

Techniques and technology

Mirroring
  Disk mirroring: Redundant arrays of inexpensive disks 1 (RAID1)
  Server mirroring: web / ftp / email
RAID: RAID0 to RAID6 and combinations
On-site data storage
  Backup: tape / optical disk
Off-site data storage (backup site)
  Cold sites
  Warm sites
  Hot sites

Mirroring

Mirroring can occur locally or remotely.

Locally means that a server has a second hard drive that stores data.

A remote mirror means that a remote server contains an exact duplicate of the data.

The second drive is called a mirrored drive. Data is written to the original drive when a write request is issued. Data is then copied to the mirrored drive, providing a mirror image of the primary drive.

If one of the hard drives fails, all data is protected from loss.
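The write-then-copy behaviour described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: the class name `MirroredVolume`, its methods, and the use of in-memory dictionaries as "drives" are all hypothetical and do not correspond to any real mirroring product.

```python
class MirroredVolume:
    """Toy two-drive mirror; each drive is modelled as a dict of blocks."""

    def __init__(self):
        # Index 0 is the original drive, index 1 is the mirrored drive.
        self.drives = [{}, {}]
        self.failed = [False, False]

    def write(self, block_no, data):
        # Write to the original drive, then copy to the mirrored drive.
        for i, drive in enumerate(self.drives):
            if not self.failed[i]:
                drive[block_no] = data

    def read(self, block_no):
        # Serve the read from the first drive that is still healthy.
        for i, drive in enumerate(self.drives):
            if not self.failed[i]:
                return drive[block_no]
        raise IOError("both drives have failed")

    def fail(self, index):
        # Simulate the loss of one drive.
        self.failed[index] = True


volume = MirroredVolume()
volume.write(0, b"payroll")
volume.fail(0)               # the original drive is lost
recovered = volume.read(0)   # the mirrored drive still holds the block
```

The sketch only protects against the loss of a single drive; if both fail, the read raises an error.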

Disk mirroring (RAID1)

The replication of logical disk volumes onto separate physical hard disks in real time to ensure continuous availability, currency and accuracy.

A mirrored volume is a complete logical representation of separate volume copies.

Server mirroring

Mirror sites are most commonly used to provide multiple sources of the same information, and are of particular value as a way of providing reliable access to large downloads.

Mirroring is a type of file synchronization.

Web server
  To preserve a website or page, especially when it is closed or is about to be closed
  To counteract censorship and promote freedom of information
Email server
  To protect against loss of email information
FTP server
  To allow faster downloads for users at a specific geographical location
  Load balancing

Redundant arrays of inexpensive disks (RAID)

The organization distributes the data across multiple smaller disks, offering protection from a crash that could wipe out all data on a single, shared disk.

Benefits of RAID include the following:
  Increased storage capacity per logical disk volume
  High data transfer or I/O rates that improve information throughput
  Lower cost per megabyte of storage
  Improved use of data center floor space

RAID0

RAID Level 0 (also known as a stripe set or striped volume) splits data evenly across two or more disks (striped) with no parity information for redundancy.

It is important to note that RAID 0 provides zero data redundancy.

RAID 0 is normally used to increase performance.

A RAID 0 can be created with disks of differing sizes, but the storage space added to the array by each disk is limited to the size of the smallest disk.
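A minimal Python sketch of the two RAID 0 properties above: round-robin striping and the capacity limit set by the smallest disk. The function names `stripe` and `raid0_capacity`, the 4-byte block size, and the disk sizes are assumptions chosen for illustration, not part of any real RAID implementation.

```python
def stripe(data: bytes, n_disks: int, block_size: int = 4):
    """Split data into blocks and deal them round-robin across the disks."""
    disks = [[] for _ in range(n_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % n_disks].append(data[offset:offset + block_size])
    return disks


def raid0_capacity(disk_sizes):
    """Each disk contributes only as much space as the smallest member."""
    return min(disk_sizes) * len(disk_sizes)


disks = stripe(b"ABCDEFGHIJKL", n_disks=2)
capacity = raid0_capacity([120, 100])   # sizes in GB, illustrative only
```

Here `disks` is `[[b"ABCD", b"IJKL"], [b"EFGH"]]` and `capacity` is 200. Because no block is stored twice, losing either disk loses part of every large file.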

RAID1

A RAID 1 creates an exact copy (or mirror) of a set of data on two or more disks.

This is useful when read performance or reliability are more important than data storage capacity.

Such an array can only be as big as the smallest member disk.

A classic RAID1 mirrored pair contains two disks (see diagram), which increases reliability.

RAID2

A RAID2 stripes data at the bit (rather than block) level, and uses a Hamming code for error correction.

Extremely high data transfer rates are possible.

RAID 2 is the only standard RAID level which can automatically recover accurate data from single-bit corruption in data.

At the moment, there are no commercial implementations of RAID 2.

RAID3

RAID Level 3 uses byte-level striping with a dedicated parity disk.

RAID3 is very rare in practice.

One of the side effects of RAID3 is that it generally cannot service multiple requests simultaneously.

This comes about because any single block of data will, by definition, be spread across all members of the set and will reside in the same location. So, any I/O operation requires activity on every disk.

RAID4

RAID Level 4 uses block-level striping with a dedicated parity disk.

This allows each member of the set to act independently when only a single block is requested.

RAID 4 looks similar to RAID 3 except that it stripes at the block level, rather than the byte level.

In the example, a read request for block "A1" would be serviced by disk 0. A simultaneous read request for block B1 would have to wait, but a read request for B2 could be serviced concurrently by disk 1.
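The read example can be expressed as a short Python sketch. The layout is an assumption inferred from the example, since the diagram is not reproduced here: stripe A holds blocks A1, A2, A3 on disks 0, 1, 2, stripe B holds B1, B2, B3 on the same disks, and parity sits on a separate disk. The helper names `data_disk` and `can_read_concurrently` are hypothetical.

```python
def data_disk(block_label: str) -> int:
    """Map a label such as "B2" to the data disk assumed to hold it."""
    position_in_stripe = int(block_label[1:])   # "B2" -> 2
    return position_in_stripe - 1               # disks are numbered from 0


def can_read_concurrently(first: str, second: str) -> bool:
    """Two single-block reads can overlap only if they hit different disks."""
    return data_disk(first) != data_disk(second)


a1_and_b1 = can_read_concurrently("A1", "B1")   # both on disk 0
a1_and_b2 = can_read_concurrently("A1", "B2")   # disk 0 and disk 1
```

Under this assumed layout, `a1_and_b1` is `False` (B1 must wait for disk 0) and `a1_and_b2` is `True`.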

RAID5

A RAID5 uses block-level striping with parity data distributed across all member disks.

RAID5 has achieved popularity due to its low cost of redundancy.

A minimum of 3 disks is generally required for a complete RAID5 configuration.

In the example, a read request for block "A1" would be serviced by disk 0. A simultaneous read request for block B1 would have to wait, but a read request for B2 could be serviced concurrently by disk 1.
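The slides do not say how the parity is computed. A common choice is a bytewise XOR of the data blocks in a stripe, and the Python sketch below assumes that. The function name `xor_blocks` and the sample block contents are illustrative only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on an assumed four-disk array: three data blocks plus parity.
data_blocks = [b"A1A1", b"A2A2", b"A3A3"]
parity = xor_blocks(data_blocks)

# Simulate losing the disk that held the second data block,
# then rebuild it from the surviving blocks and the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

XOR-ing the survivors with the parity cancels every block except the missing one, so `rebuilt` equals the lost block `b"A2A2"`. This works for exactly one missing block per stripe.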

RAID6

A RAID6 extends RAID5 by adding an additional parity block; thus it uses block-level striping with two parity blocks distributed across all member disks.

The second parity block improves reliability.

Like RAID 5, the parity is distributed in stripes, with the parity blocks in a different place in each stripe.

Nested RAID

Storage Model

Storage Area Network

The Storage Networking Industry Association (SNIA) defines the SAN as a network whose primary purpose is the transfer of data between computer systems and storage elements.

A SAN consists of a communication infrastructure, which provides physical connections, and a management layer, which organizes the connections, storage elements, and computer systems so that data transfer is secure and robust.

SAN's definition

Put in simple terms, a SAN is a specialized, high-speed network attaching servers and storage devices.

It is sometimes referred to as "the network behind the servers."

A SAN introduces the flexibility of networking to enable one server or many heterogeneous servers to share a common storage utility, which may comprise many storage devices, including disk, tape, and optical storage.

SAN Component

SAN Connectivity
  The connectivity of storage and server components, typically using Fibre Channel (FC)
SAN Storage
  Tape / RAID / ESS (Enterprise Storage System) / JBOD (Just a Bunch of Disks) / SSA (Serial Storage Architecture)
SAN Server
  Windows / Unix / Linux, etc.

Switched Fabric

An infrastructure specially designed to handle storage communications is called a fabric.

A typical Fibre Channel SAN fabric is made up of a number of Fibre Channel switches.

Today, all major SAN equipment vendors also offer some form of Fibre Channel routing solution, and these bring substantial scalability benefits to the SAN architecture by allowing data to cross between different fabrics without merging them.

Fibre Channel protocol

Fibre Channel is a layered protocol. It consists of 5 layers, namely:

FC0: The physical layer, which includes cables, fiber optics, connectors, pinouts, etc.
FC1: The data link layer, which implements the 8b/10b encoding and decoding of signals.
FC2: The network layer, defined by the FC-PI-2 standard, consists of the core of Fibre Channel, and defines the main protocols.
FC3: The common services layer, a thin layer that could eventually implement functions like encryption or RAID.
FC4: The protocol mapping layer, in which other protocols, such as SCSI, are encapsulated into an information unit for delivery to FC2.

IP Storage Networking

FCIP (Fibre Channel over IP)
  A method for allowing the transmission of Fibre Channel information to be tunneled through the IP network.
iFCP (Internet Fibre Channel Protocol)
  A mechanism for transmitting data to and from Fibre Channel storage devices in a SAN, or on the Internet, using TCP/IP.
Internet SCSI (iSCSI)
  A transport protocol that carries SCSI commands from an initiator to a target.

FCIP (Fibre Channel over IP)

FCIP encapsulates FC frames within TCP/IP, allowing islands of FC SANs to be interconnected over an IP-based network
TCP/IP is used as the underlying transport to provide congestion control and in-order delivery of FC frames
All classes of FC frames are treated the same, as datagrams
End-station addressing, address resolution, message routing, and other elements of the FC network architecture remain unchanged

iFCP

iFCP is a gateway-to-gateway protocol for implementing a Fibre Channel fabric over a TCP/IP network
Traffic between Fibre Channel devices is routed and switched by the TCP/IP network
The iFCP layer maps Fibre Channel frames to a predetermined TCP connection for transport
FC messaging and routing services are terminated at the gateways so the fabrics are not merged with one another

iSCSI

iSCSI is a SCSI transport protocol for mapping block-oriented storage data over TCP/IP networks.

The iSCSI protocol enables universal access to storage devices and Storage Area Networks (SANs) over standard TCP/IP networks.

Backup site

A backup site is a location where a business can easily relocate following a disaster, such as fire, flood, or terrorist threat. This is an integral part of the disaster recovery plan of a business.

A backup site can be another location operated by the business, or contracted via a company that specializes in disaster recovery services.

In some cases, a business will have an agreement with a second business to operate a joint disaster recovery facility.

Cold Sites

A cold site is the most inexpensive type of backup site for a business to operate.

It provides office space in which to operate.

It does not include backed-up copies of data and information from the original location of the business, nor does it include hardware already set up.

The lack of hardware contributes to the minimal startup costs of the cold site, but requires additional time following the disaster to have the operation running at a capacity close to that prior to the disaster.

Warm Sites

A warm site is a location where the business can relocate to after the disaster that is already stocked with computer hardware similar to that of the original site, but does not contain backed-up copies of data and information.

Hot Sites

A hot site is a duplicate of the original site of the business, with full computer systems as well as near-complete backups of user data.

Ideally, a hot site will be up and running within a matter of hours. This type of backup site is the most expensive to operate.

Hot sites are popular with stock exchanges and other financial institutions who may need to evacuate due to potential bomb threats and must resume normal operations as soon as possible.

How to choose

Choosing the type is mainly decided by a company's cost vs. benefit strategy.

Hot sites are traditionally more expensive than cold sites since much of the equipment the company needs has already been purchased and thus the operational costs are higher.

However, if the same company loses a substantial amount of revenue for each day it is inactive, then a hot site may be worth the cost.

The advantage of a cold site is simply cost. It requires far fewer resources to operate a cold site because no equipment has been bought prior to the disaster.

The downside of a cold site is the potential cost that must be incurred in order to make the cold site effective. The costs of purchasing equipment on very short notice may be higher, and the disaster may make the equipment difficult to obtain.
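The cost vs. benefit trade-off can be made concrete with a rough expected-cost comparison. The Python sketch below is only one way to frame it. Every figure is invented for illustration, and the names `SITES`, `expected_annual_cost` and `cheaper_site`, as well as the simple model (annual site cost plus expected revenue lost to downtime), are assumptions rather than a method prescribed by the slides.

```python
# All figures are invented for illustration only.
SITES = {
    "hot":  {"annual_cost": 500_000, "downtime_days": 1},
    "cold": {"annual_cost": 50_000,  "downtime_days": 14},
}


def expected_annual_cost(site, revenue_loss_per_day, disasters_per_year):
    """Annual site cost plus the expected revenue lost while the site comes up."""
    s = SITES[site]
    expected_loss = disasters_per_year * s["downtime_days"] * revenue_loss_per_day
    return s["annual_cost"] + expected_loss


def cheaper_site(revenue_loss_per_day, disasters_per_year=0.1):
    """Return the site type with the lower expected annual cost."""
    return min(
        SITES,
        key=lambda site: expected_annual_cost(
            site, revenue_loss_per_day, disasters_per_year
        ),
    )


low_loss_choice = cheaper_site(revenue_loss_per_day=100_000)
high_loss_choice = cheaper_site(revenue_loss_per_day=1_000_000)
```

With these made-up numbers the cold site wins when a day of downtime costs 100,000 and the hot site wins when it costs 1,000,000, which mirrors the argument above. The model ignores the extra cost of buying equipment on short notice for a cold site.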

Discovery Planning steps (1)

I. Information Gathering

Step One – Organize the Project
  Appoint coordinator/project leader, if the leader is not the dean or chairperson
  Determine most appropriate plan organization for the unit (e.g., single plan at college level or individual plans at unit level)
  Set project timetable
  Draft project plan, including assignment of task responsibilities


Step Two – Conduct Business Impact Analysis

In order to complete the business impact analysis, most units will perform the following steps:
  Identify functions, processes and systems
  Interview information systems support personnel
  Interview business unit personnel
  Analyze results to determine critical systems, applications and business processes
  Prepare impact analysis of interruption on critical systems


Step Three – Conduct Risk Assessment

The risk assessment will assist in determining the probability of a critical system becoming severely disrupted and documenting the acceptability of these risks to a unit.

  Review physical security (e.g., secure office, building access off hours, etc.)
  Review backup systems
  Review data security
  Review policies on personnel termination and transfer
  Identify systems supporting mission-critical functions
  Identify vulnerabilities (such as flood, tornado, physical attacks, etc.)
  Assess probability of system failure or disruption
  Prepare risk and security analysis


Step Four – Develop Strategic Outline for Recovery

1 Assemble groups as appropriate for:
  Hardware and operating systems
  Communications
  Applications
  Facilities
  Other critical functions and business processes as identified in the Business Impact Analysis


2 For each system/process above quantify the following processing requirements:
  Light, normal and heavy processing days
  Transaction volumes
  Dollar volume (if any)
  Estimated processing time
  Allowable delay (days, hours, minutes, etc.)


3 Detail all the steps in your workflow for each critical business function (e.g., for student payroll processing, each step that must be completed and the order in which to complete them).

4 Identify systems and applications
  Component name and technical id (if any)
  Type (online, batch process, script)
  Frequency
  Run time
  Allowable delay (days, hours, minutes, etc.)


5 Identify vital records (e.g., libraries, processing schedules, procedures, research, advising records, etc.)
  Name and description
  Type (e.g., backup, original, master, history, etc.)
  Where are they stored
  Source of item or record
  Can the record be easily replaced from another source (e.g., reference materials)
  Backup
    Backup generation frequency
    Number of backup generations available onsite
    Number of backup generations available off-site
    Location of backups
    Media type
    Retention period
    Rotation cycle
    Who is authorized to retrieve the backups?


6 Identify, if a severe disruption occurred, what would be the minimum requirements/replacement needs to perform the critical function during the disruption.
  Type (e.g., server hardware, software, research materials, etc.)
  Item name and description
  Quantity required
  Location of inventory, alternative, or offsite storage
  Vendor/supplier


7 Identify if alternate methods of processing either exist or could be developed, quantifying where possible, impact on processing. (Include manual processes.)

8 Identify person(s) who supports the system or application

9 Identify primary person to contact if system or application cannot function as normal

10 Identify secondary person to contact if system or application cannot function as normal


11 Identify all vendors associated with the system or application

12 Document unit strategy during recovery (conceptually how will the unit function?)

13 Quantify resources required for recovery, by time frame (e.g., 1 pc per day, 3 people per hour, etc.)

14 Develop and document recovery strategy, including:
  Priorities for recovering system/function components
  Recovery schedule
  Form – critical system processing requirement for recovery
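One simple way to turn the "allowable delay" figures gathered earlier in Step Four into recovery priorities and a schedule is to recover the least delay-tolerant systems first. The Python sketch below assumes that ordering rule and assumes systems are recovered one after another. The class `SystemRequirement`, the function `recovery_schedule`, and the sample systems and hours are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SystemRequirement:
    name: str
    allowable_delay_hours: float      # from the processing requirements
    estimated_recovery_hours: float   # effort to bring the system back


def recovery_schedule(systems):
    """Order systems by allowable delay and flag any that would miss it."""
    ordered = sorted(systems, key=lambda s: s.allowable_delay_hours)
    schedule, clock = [], 0.0
    for system in ordered:
        clock += system.estimated_recovery_hours   # sequential recovery assumed
        on_time = clock <= system.allowable_delay_hours
        schedule.append((system.name, clock, on_time))
    return schedule


# Sample values are invented for illustration.
systems = [
    SystemRequirement("student payroll", 48.0, 6.0),
    SystemRequirement("email", 8.0, 2.0),
    SystemRequirement("advising records", 24.0, 10.0),
]
schedule = recovery_schedule(systems)
```

With these sample values the order is email, advising records, student payroll, finishing at hours 2, 12 and 18, all within their allowable delays. A real plan would also account for dependencies between systems and for teams working in parallel.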


Step Five – Review Onsite and Offsite Backup and Recovery Procedures

The planning team as identified in Step 1 Task 3 would normally perform this task.

Review current records (OS, Code, System Instructions, documented processes, etc.) requiring protection

Review current offsite storage facility or arrange for one

Review backup and offsite storage policy or create one

Present to unit leader for approval


Step Six – Select Alternate Facility

ALTERNATE SITE: A location, other than the normal facility, used to process data and/or conduct critical business functions in the event of a disaster.

  Determine resource requirements
  Assess platform uniqueness of unit systems (e.g., Macintosh, IBM Compatible, Oracle database, Windows 3.1, etc.)
  Identify alternative facilities
  Review cost/benefit
  Evaluate and make recommendation


II. Plan Development and Testing

Step Seven – Develop Recovery Plan

This step would ordinarily be completed by the coordinator/project manager working with the planning team.

Sample Plan Outline


1 Objective
2 Plan Assumptions
3 Criteria for invoking the plan
  Document emergency response procedures to occur during and after an emergency
  Document procedures for assessment and declaring a state of emergency
  Document notification procedures for alerting unit and university officials
  Document notification procedures for alerting vendors
  Document notification procedures for alerting unit staff and notifying of alternate work procedures or locations


4 Roles, Responsibilities and Authority
  Identify unit personnel
  Recovery team description and charge
  Recovery team staffing
  Transportation schedules for media and teams


5 Procedures for operating in contingency mode
  Process descriptions
  Minimum processing requirements
  Determine categories for vital records
  Identify location of vital records
  Identify forms requirements
  Document critical forms
  Establish equipment descriptions
  Document equipment - in the recovery site
  Document equipment - in the unit
  Software descriptions
  Software used in recovery
  Software used in production
  Produce logical drawings of communication and data networks in the unit
  Produce logical drawings of communication and data networks during recovery
  Vendor list
  Review vendor restrictions
  Miscellaneous inventory
  Communication needs - production
  Communication needs - in the recovery site


6 Resource plan for operating in contingency mode
7 Criteria for returning to normal operating mode
8 Procedures for returning to normal operating mode
9 Procedures for recovering lost or damaged data
10 Testing and Training
  Document testing dates
  Complete disaster/disruption scenarios
  Develop action plans for each scenario

Sample Testing Diagram


11 Plan Maintenance
  Document Maintenance Review Schedule (yearly, quarterly, etc.)
  Maintenance Review action plans
  Maintenance Review recovery teams
  Maintenance Review team activities
  Maintenance Review/revise tasks
  Maintenance Review/revise documentation


12 Appendices for Inclusion
  inventory and report forms
  maintenance forms
  hardware lists and serial numbers
  software lists and license numbers
  contact list for vendors
  contact list for staff with home and work numbers
  contact list for other interfacing departments
  network schematic diagrams
  equipment room floor grid diagrams
  contract and maintenance agreements
  special operating instructions for sensitive equipment
  cellular telephone inventory and agreements


Step Eight – Test the Plan

1 Develop test strategy
2 Develop test plans
3 Conduct tests
4 Modify the plan as necessary

Samples
  Test Plan Strategy
  Test Plan Scenario
  Test Results/Test Evaluation


III. Ongoing Maintenance

Step Nine – Maintain the Plan

The Dean/Director/Unit Administrator will be responsible for overseeing this.

1 Review changes in the environment, technology, and procedures
2 Develop maintenance triggers and procedures
3 Submit changes for systems development procedures
4 Modify unit change management procedures
5 Produce plan updates and distribute


Step Ten – Perform Periodic Audit

1 Establish periodic review and update procedures

Important factors (1/3)

Communication
  Personnel: notify all key personnel of the problem and assign them tasks focused toward the recovery plan.
  Customers: notifying clients about the problem minimizes panic.
Recall backups
  If backup tapes are taken offsite, these need to be recalled. If using remote backup services, a network connection to the remote backup location (or the Internet) will be required.

Important factors (2/3)

Facilities
  Having backup hot sites or cold sites for larger companies. Mobile recovery facilities are also available from many suppliers.
Prepare your employees
  During a disaster, employees are required to work longer, more stressful hours, and a support system should be in place to alleviate some of the stress. Prepare them ahead of time to ensure that work runs smoothly.

Important factors (3/3)

Business information
  Backups should be stored in a completely separate location from the company.
Testing the plan
  Provisions, directions, and frequency for testing the plan should be stipulated.

Things to do in DRP (1/4)

Here are 10 absolute basics your plan should cover:

1. Develop and practice a contingency plan that includes a succession plan for your CEO.

2. Train backup employees to perform emergency tasks. The employees you count on to lead in an emergency will not always be available.

3. Determine offsite crisis meeting places for top executives.

Things to do in DRP (2/4)

4. Make sure that all employees, as well as executives, are involved in the exercises so that they get practice in responding to an emergency.

5. Make exercises realistic enough to tap into employees' emotions so that you can see how they'll react when the situation gets stressful.

6. Practice crisis communication with employees, customers and the outside world.

Things to do in DRP (3/4)

7. Invest in an alternate means of communication in case the phone networks go down.

8. Form partnerships with local emergency response groups (firefighters, police) to establish a good working relationship. Let them become familiar with your company and site.

Things to do in DRP (4/4)

9. Evaluate your company's performance during each test, and work toward constant improvement. Continuity exercises should reveal weaknesses.

10. Test your continuity plan regularly to reveal and accommodate changes. Technology, personnel and facilities are in a constant state of flux at any company.

Top mistakes in disaster recovery (1/3)

1. Inadequate planning: Have you identified all critical systems, and do you have detailed plans to recover them to the current day?

Everybody thinks they know what they have on their networks, but most people don't really know how many servers they have, how they're configured, what applications reside on them, what services were running, or what version of software or operating systems they were using.


2. Failure to bring the business into the planning and testing of your recovery efforts.

3. Failure to gain support from senior-level managers. The largest problems here are:
  Not demonstrating the level of effort required for full recovery.
  Not conducting a business impact analysis and addressing all gaps in your recovery model.


  Not building adequate recovery plans that outline your recovery time objective, critical systems and applications, vital documents needed by the business, and business functions by building plans for operational activities to be continued after a disaster.
  Not having proper funding that will allow for a minimum of semiannual testing.
