Disaster Recovery Planning (DRP)
DRP is the process of regaining access to the data, hardware, and software necessary to resume critical business operations after a natural or human-induced disaster.
A disaster recovery plan (DRP) should also include plans for coping with the unexpected or sudden loss of key personnel, although that is not covered in this article, which focuses on data protection.
DRP is part of a larger process known as business continuity planning (BCP).
What is the difference between DRP and BCP? (1/2)
Disaster recovery is the process by which you resume business after a disruptive event.
The event might be something huge, like an earthquake or the terrorist attacks on the World Trade Center, or something small, like malfunctioning software caused by a computer virus.
Given the human tendency to look on the bright side, many business executives are prone to ignoring "disaster recovery" because disaster seems an unlikely event.
What is the difference between DRP and BCP? (2/2)
"Business continuity planning" suggests a more comprehensive approach to making sure you can keep making money.
Often, the two terms are married under the acronym BC/DR.
At any rate, DR and/or BC determines how a company will keep functioning after a disruptive event until its normal facilities are restored.
What do these plans include? (1/2)
All BC/DR plans need to encompass how employees will communicate, where they will go, and how they will keep doing their jobs.
The details can vary greatly, depending on the size and scope of a company and the way it does business.
What do these plans include? (2/2)
For example, the plan at one global manufacturing company would:
- restore critical mainframes with vital data at a backup site within four to six days of a disruptive event,
- obtain a mobile PBX unit with 3000 telephones within two days,
- recover the company's 1000-plus LANs in order of business need,
- set up a temporary call center for 100 agents at a nearby training facility.
Events that necessitate disaster recovery
- Natural disasters
- Fire
- Power failure
- Terrorist attacks
- Organized or deliberate disruptions
- Theft
- System and/or equipment failures
- Human error
- Computer viruses
- Testing
Prevention against data loss (1/2)
- Send backups off site at regular intervals. Include software as well as all data, to facilitate recovery.
- Create an insurance copy on microfilm or similar and store the records off site.
- Use a remote backup facility if possible to minimize data loss.
- Storage Area Networks (SANs) over multiple sites make data immediately available without the need to recover or synchronize it.
Prevention against data loss (2/2)
- Surge protectors, to minimize the effect of power surges on delicate electronic equipment
- Uninterruptible Power Supply (UPS) and/or backup generator
- Fire prevention: more alarms, accessible extinguishers
- Anti-virus software and other security measures
Techniques and technology
Mirroring
- Disk mirroring: redundant arrays of inexpensive disks, level 1 (RAID 1)
- Server mirroring: web / FTP / email
RAID: RAID 0 – 6 and combinations
On-site data storage
- Backup: tape / optical disk
Off-site data storage (backup site)
- Cold sites
- Warm sites
- Hot sites
Mirroring
Mirroring can occur locally or remotely.
Locally means that a server has a second hard drive that stores the data. The second drive is called a mirrored drive.
A remote mirror means that a remote server contains an exact duplicate of the data.
Data is written to the original drive when a write request is issued. Data is then copied to the mirrored drive, providing a mirror image of the primary drive.
If one of the hard drives fails, all data is protected from loss.
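The write path described above can be sketched as a toy model. This is only an illustration: the class name `MirroredVolume` and its methods are hypothetical, the "drives" are plain dictionaries, and no real disk or driver API is involved.

```python
class MirroredVolume:
    """Toy model of a locally mirrored drive pair (all names are illustrative)."""

    def __init__(self):
        self.primary = {}     # block number -> data on the original drive
        self.mirror = {}      # block number -> data on the mirrored drive
        self.failed = set()   # names of drives that have failed

    def write(self, block, data):
        # Data goes to the original drive first, then is copied to the mirror.
        if "primary" not in self.failed:
            self.primary[block] = data
        if "mirror" not in self.failed:
            self.mirror[block] = data

    def fail(self, drive):
        # Simulate a drive failure by marking it failed and dropping its contents.
        self.failed.add(drive)
        getattr(self, drive).clear()

    def read(self, block):
        # Serve the read from whichever drive is still healthy.
        for name in ("primary", "mirror"):
            if name not in self.failed:
                return getattr(self, name)[block]
        raise IOError("both drives have failed")


vol = MirroredVolume()
vol.write(0, b"payroll")
vol.fail("primary")        # the original drive dies
recovered = vol.read(0)    # the mirror still holds the data
```

A remote mirror follows the same idea, with the second copy held by another server instead of a second local drive.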
Disk mirroring (RAID 1)
The replication of logical disk volumes onto separate physical hard disks in real time to ensure continuous availability, currency, and accuracy.
A mirrored volume is a complete logical representation of separate volume copies.
Server mirroring
Mirror sites are most commonly used to provide multiple sources of the same information, and are of particular value as a way of providing reliable access to large downloads.
Mirroring is a type of file synchronization.
Web server
- To preserve a website or page, especially when it is closed or is about to be closed.
- To counteract censorship and promote freedom of information.
Email server
- To protect against loss of email information.
FTP server
- To allow faster downloads for users at a specific geographical location.
- Load balancing.
Redundant arrays of inexpensive disks (RAID)
The organization distributes the data across multiple smaller disks, offering protection from a crash that could wipe out all data on a single, shared disk.
Benefits of RAID include the following:
- Increased storage capacity per logical disk volume
- High data transfer or I/O rates that improve information throughput
- Lower cost per megabyte of storage
- Improved use of data center floor space
RAID 0
RAID Level 0 (also known as a stripe set or striped volume) splits data evenly across two or more disks (striped) with no parity information for redundancy.
It is important to note that RAID 0 provides zero data redundancy.
RAID 0 is normally used to increase performance.
A RAID 0 can be created with disks of differing sizes, but the storage space added to the array by each disk is limited to the size of the smallest disk.
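A short sketch may make the striping and the size rule concrete. The function names are hypothetical, and simple round-robin block placement is assumed; real controllers stripe in configurable chunk sizes.

```python
def raid0_capacity(disk_sizes_gb):
    # Each disk contributes only as much space as the smallest member.
    if len(disk_sizes_gb) < 2:
        raise ValueError("RAID 0 needs two or more disks")
    return len(disk_sizes_gb) * min(disk_sizes_gb)


def raid0_locate(block, num_disks):
    # Assumed round-robin striping: consecutive blocks land on consecutive disks.
    return block % num_disks, block // num_disks   # (disk index, row on that disk)


capacity = raid0_capacity([120, 100])                # 2 disks x 100 GB usable each
placement = [raid0_locate(b, 2) for b in range(4)]   # blocks alternate between disks
```

Because no block is stored twice, losing either disk loses part of every striped file, which is why the level offers no redundancy.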
RAID 1
A RAID 1 creates an exact copy (or mirror) of a set of data on two or more disks.
This is useful when read performance or reliability is more important than data storage capacity.
Such an array can only be as big as the smallest member disk.
A classic RAID 1 mirrored pair contains two disks (see diagram), which increases reliability.
RAID 2
A RAID 2 stripes data at the bit (rather than block) level, and uses a Hamming code for error correction.
Extremely high data transfer rates are possible.
RAID 2 is the only standard RAID level which can automatically recover accurate data from single-bit corruption in data.
At the moment, there are no commercial implementations of RAID 2.
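To illustrate how a Hamming code recovers from single-bit corruption, here is a minimal sketch of the textbook Hamming(7,4) code. It is an assumption that this small code is a fair stand-in; the text does not say which Hamming variant or bit layout a RAID 2 array would use, and the function names are illustrative.

```python
def hamming74_encode(data_bits):
    # Four data bits -> seven-bit codeword laid out as p1 p2 d1 p4 d2 d3 d4.
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    # Recompute the parity checks; the syndrome is the 1-based position of a flipped bit.
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # correct the single corrupted bit
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[4] ^= 1                        # simulate single-bit corruption
repaired = hamming74_decode(word)   # the original four data bits come back
```

The three parity bits are the redundancy cost: seven stored bits for every four bits of data in this particular code.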
RAID 3
RAID Level 3 uses byte-level striping with a dedicated parity disk.
RAID 3 is very rare in practice.
One of the side effects of RAID 3 is that it generally cannot service multiple requests simultaneously.
This comes about because any single block of data will, by definition, be spread across all members of the set and will reside in the same location. So, any I/O operation requires activity on every disk.
RAID 4
RAID Level 4 uses block-level striping with a dedicated parity disk.
This allows each member of the set to act independently when only a single block is requested.
RAID 4 looks similar to RAID 3 except that it stripes at the block level, rather than the byte level.
In the example, a read request for block "A1" would be serviced by disk 0. A simultaneous read request for block B1 would have to wait, but a read request for B2 could be serviced concurrently by disk 1.
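The read example can be checked with a small sketch. It assumes the block labelling implied by the example (a stripe letter followed by a 1-based data-disk number, so "A1" and "B1" both live on disk 0); the helper names are hypothetical.

```python
def disk_for_block(label):
    # "A1" -> stripe "A" on disk 0, "B2" -> stripe "B" on disk 1 (assumed layout).
    return int(label[1:]) - 1


def can_run_concurrently(label_a, label_b):
    # Two single-block reads proceed in parallel only if they hit different disks.
    return disk_for_block(label_a) != disk_for_block(label_b)


a1_with_b1 = can_run_concurrently("A1", "B1")   # same disk, so the second read waits
a1_with_b2 = can_run_concurrently("A1", "B2")   # different disks, so both proceed
```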
RAID 5
A RAID 5 uses block-level striping with parity data distributed across all member disks.
RAID 5 has achieved popularity due to its low cost of redundancy.
A minimum of 3 disks is generally required for a complete RAID 5 configuration.
In the example, a read request for block "A1" would be serviced by disk 0.
A simultaneous read request for block B1 would have to wait, but a read request for B2 could be serviced concurrently by disk 1.
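The parity idea can be shown in a few lines. This sketch assumes the usual single-parity scheme, in which the parity block is the byte-wise XOR of the data blocks in a stripe; the text itself does not spell out the parity function, and the block contents here are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# One stripe on a three-disk array: two data blocks plus one parity block.
a1 = b"\x10\x20\x30"
a2 = b"\x0f\x0e\x0d"
parity = xor_blocks([a1, a2])

# If the disk holding a2 fails, XOR-ing the survivors rebuilds it.
rebuilt_a2 = xor_blocks([a1, parity])
```

Only one block per stripe is spent on parity, whatever the number of data disks, which is the "low cost of redundancy" mentioned above.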
RAID 6
A RAID 6 extends RAID 5 by adding an additional parity block; thus it uses block-level striping with two parity blocks distributed across all member disks.
This improves reliability.
Like RAID 5, the parity is distributed in stripes, with the parity blocks in a different place in each stripe.
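The capacity trade-offs of the levels described so far can be summarized in one hedged sketch. The function name is hypothetical, and the four-disk minimum for RAID 6 is an assumption based on common practice rather than on the text; the other rules follow the descriptions above.

```python
def usable_capacity(level, disk_sizes):
    """Approximate usable space for RAID 0, 1, 5, or 6 (same unit as the inputs)."""
    n, smallest = len(disk_sizes), min(disk_sizes)
    if level == 0:
        minimum, data_disks = 2, n        # striping only, no redundancy
    elif level == 1:
        minimum, data_disks = 2, 1        # every disk holds the same copy
    elif level == 5:
        minimum, data_disks = 3, n - 1    # one disk's worth of parity
    elif level == 6:
        minimum, data_disks = 4, n - 2    # two disks' worth of parity (assumed minimum)
    else:
        raise ValueError("only RAID 0, 1, 5, and 6 are covered in this sketch")
    if n < minimum:
        raise ValueError(f"RAID {level} needs at least {minimum} disks")
    # Every member is limited to the size of the smallest disk.
    return data_disks * smallest


raid1 = usable_capacity(1, [500, 400])
raid5 = usable_capacity(5, [500, 500, 500])
raid6 = usable_capacity(6, [500, 500, 500, 500])
```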
Nested RAID
Storage Model
Storage Area Network
The Storage Networking Industry Association (SNIA) defines the SAN as a network whose primary purpose is the transfer of data between computer systems and storage elements.
A SAN consists of a communication infrastructure, which provides physical connections; and a management layer, which organizes the connections, storage elements, and computer systems so that data transfer is secure and robust.
SAN's definition
Put in simple terms, a SAN is a specialized, high-speed network attaching servers and storage devices.
It is sometimes referred to as “the network behind the servers.”
A SAN introduces the flexibility of networking to enable one server or many heterogeneous servers to share a common storage utility, which may comprise many storage devices, including disk, tape, and optical storage.
SAN components
SAN connectivity
- The connectivity of storage and server components, typically using Fibre Channel (FC).
SAN storage
- Tape / RAID / ESS (Enterprise Storage System) / JBOD (Just a Bunch of Disks) / SSA (Serial Storage Architecture)
SAN servers
- Windows / Unix / Linux, etc.
Switched Fabric
An infrastructure specially designed to handle storage communications is called a fabric.
A typical Fibre Channel SAN fabric is made up of a number of Fibre Channel switches.
Today, all major SAN equipment vendors also offer some form of Fibre Channel routing solution, and these bring substantial scalability benefits to the SAN architecture by allowing data to cross between different fabrics without merging them.
Fibre Channel protocol
Fibre Channel is a layered protocol. It consists of 5 layers, namely:
- FC0: The physical layer, which includes cables, fiber optics, connectors, pinouts, etc.
- FC1: The data link layer, which implements the 8b/10b encoding and decoding of signals.
- FC2: The network layer, defined by the FC-PI-2 standard, consists of the core of Fibre Channel, and defines the main protocols.
- FC3: The common services layer, a thin layer that could eventually implement functions like encryption or RAID.
- FC4: The protocol mapping layer, in which other protocols, such as SCSI, are encapsulated into an information unit for delivery to FC2.
IP Storage Networking
FCIP (Fibre Channel over IP)
- A method for allowing the transmission of Fibre Channel information to be tunneled through the IP network.
iFCP (Internet Fibre Channel Protocol)
- A mechanism for transmitting data to and from Fibre Channel storage devices in a SAN, or on the Internet, using TCP/IP.
Internet SCSI (iSCSI)
- A transport protocol that carries SCSI commands from an initiator to a target.
FCIP (Fibre Channel over IP)
- FCIP encapsulates FC frames within TCP/IP, allowing islands of FC SANs to be interconnected over an IP-based network.
- TCP/IP is used as the underlying transport to provide congestion control and in-order delivery of FC frames.
- All classes of FC frames are treated the same, as datagrams.
- End-station addressing, address resolution, message routing, and other elements of the FC network architecture remain unchanged.
iFCP
- iFCP is a gateway-to-gateway protocol for implementing a Fibre Channel fabric over a TCP/IP network.
- Traffic between Fibre Channel devices is routed and switched by the TCP/IP network.
- The iFCP layer maps Fibre Channel frames to a predetermined TCP connection for transport.
- FC messaging and routing services are terminated at the gateways, so the fabrics are not merged with one another.
iSCSI
- iSCSI is a SCSI transport protocol for mapping block-oriented storage data over TCP/IP networks.
- The iSCSI protocol enables universal access to storage devices and Storage Area Networks (SANs) over standard TCP/IP networks.
Backup site
A backup site is a location where a business can easily relocate following a disaster, such as fire, flood, or terrorist threat. This is an integral part of the disaster recovery plan of a business.
A backup site can be another location operated by the business, or contracted via a company that specializes in disaster recovery services.
In some cases, a business will have an agreement with a second business to operate a joint disaster recovery facility.
Cold Sites
A cold site is the most inexpensive type of backup site for a business to operate.
It provides office space in which to operate.
It does not include backed-up copies of data and information from the original location of the business, nor does it include hardware already set up.
The lack of hardware contributes to the minimal startup costs of the cold site, but requires additional time following the disaster to have the operation running at a capacity close to that prior to the disaster.
Warm Sites
A warm site is a location to which the business can relocate after the disaster that is already stocked with computer hardware similar to that of the original site, but does not contain backed-up copies of data and information.
Hot Sites
A hot site is a duplicate of the original site of the business, with full computer systems as well as near-complete backups of user data.
Ideally, a hot site will be up and running within a matter of hours. This type of backup site is the most expensive to operate.
Hot sites are popular with stock exchanges and other financial institutions, which may need to evacuate due to potential bomb threats and must resume normal operations as soon as possible.
How to choose
Choosing the type is mainly decided by a company's cost vs. benefit strategy.
Hot sites are traditionally more expensive than cold sites, since much of the equipment the company needs has already been purchased and thus the operational costs are higher.
However, if the same company loses a substantial amount of revenue for each day it is inactive, then it may be worth the cost.
The advantage of a cold site is simple: cost. It requires far fewer resources to operate a cold site because no equipment has been bought prior to the disaster.
The downside of a cold site is the potential cost that must be incurred in order to make the cold site effective. The costs of purchasing equipment on very short notice may be higher, and the disaster may make the equipment difficult to obtain.
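The cost vs. benefit reasoning above can be turned into a rough comparison. Every figure below is hypothetical, the function name is illustrative, and the model assumes exactly one disaster in the year under consideration; a real analysis would weight the outcome by the likelihood of a disaster.

```python
def total_cost(annual_site_cost, downtime_days, revenue_lost_per_day,
               emergency_purchase_cost=0.0):
    # Cost of keeping the site, plus revenue lost while inactive,
    # plus any equipment bought on short notice after the disaster.
    return (annual_site_cost
            + downtime_days * revenue_lost_per_day
            + emergency_purchase_cost)


# Hypothetical company losing 200,000 per day of inactivity.
hot = total_cost(annual_site_cost=500_000, downtime_days=0.5,
                 revenue_lost_per_day=200_000)
cold = total_cost(annual_site_cost=50_000, downtime_days=10,
                  revenue_lost_per_day=200_000,
                  emergency_purchase_cost=300_000)
cheaper = "hot" if hot < cold else "cold"
```

With a much smaller daily revenue loss the same comparison tips toward the cold site, which is the trade-off described above.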
Discovery Planning steps (1)
I. Information Gathering
Step One – Organize the Project
- Appoint coordinator/project leader, if the leader is not the dean or chairperson.
- Determine most appropriate plan organization for the unit (e.g., single plan at college level or individual plans at unit level)
- Set project timetable
- Draft project plan, including assignment of task responsibilities
Discovery Planning steps (2)
Step Two – Conduct Business Impact Analysis
In order to complete the business impact analysis, most units will perform the following steps:
- Identify functions, processes and systems
- Interview information systems support personnel
- Interview business unit personnel
- Analyze results to determine critical systems, applications and business processes
- Prepare impact analysis of interruption on critical systems
Discovery Planning steps (3)
Step Three – Conduct Risk Assessment
The risk assessment will assist in determining the probability of a critical system becoming severely disrupted and documenting the acceptability of these risks to a unit.
Discovery Planning steps (3/1)
- Review physical security (e.g., secure office, building access off hours, etc.)
- Review backup systems
- Review data security
- Review policies on personnel termination and transfer
- Identify systems supporting mission-critical functions
- Identify vulnerabilities (such as flood, tornado, physical attacks, etc.)
- Assess probability of system failure or disruption
- Prepare risk and security analysis
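One simple way to turn the assessed probabilities into a ranked risk analysis is to score each vulnerability as probability times impact. This scoring rule is an assumption, not something the steps above prescribe, and the probabilities and impact scores are invented for illustration.

```python
def rank_risks(risks):
    # risks maps a vulnerability name to (annual probability 0..1, impact score 1..5).
    # Both numbers are estimates the unit would supply; the ones below are hypothetical.
    scored = {name: probability * impact
              for name, (probability, impact) in risks.items()}
    return sorted(scored, key=scored.get, reverse=True)


ranking = rank_risks({
    "flood": (0.02, 5),
    "power failure": (0.30, 3),
    "theft": (0.10, 2),
})
```

The highest-scoring entries are the ones whose acceptability to the unit most needs to be documented.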
Discovery Planning steps (4/1)
Step Four – Develop Strategic Outline for Recovery
1 Assemble groups as appropriate for:
- Hardware and operating systems
- Communications
- Applications
- Facilities
- Other critical functions and business processes as identified in the Business Impact Analysis
Discovery Planning steps (4/2)
2 For each system/process above, quantify the following processing requirements:
- Light, normal and heavy processing days
- Transaction volumes
- Dollar volume (if any)
- Estimated processing time
- Allowable delay (days, hours, minutes, etc.)
Discovery Planning steps (4/3)
3 Detail all the steps in your workflow for each critical business function (e.g., for student payroll processing, each step that must be completed and the order in which to complete them).
4 Identify systems and applications
- Component name and technical ID (if any)
- Type (online, batch process, script)
- Frequency
- Run time
- Allowable delay (days, hours, minutes, etc.)
Discovery Planning steps (4/4)
Identify vital records (e.g., libraries, processing schedules, procedures, research, advising records, etc.)
- Name and description
- Type (e.g., backup, original, master, history, etc.)
- Where are they stored
- Source of item or record
- Can the record be easily replaced from another source (e.g., reference materials)
Discovery Planning steps (4/5)
Backup
- Backup generation frequency
- Number of backup generations available onsite
- Number of backup generations available off-site
- Location of backups
- Media type
- Retention period
- Rotation cycle
- Who is authorized to retrieve the backups?
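The backup attributes listed above could be recorded in a small structure like the following. The class, its field names, and the sample values are all hypothetical, and the window calculation is only an approximation that assumes evenly spaced backups.

```python
from dataclasses import dataclass


@dataclass
class BackupPolicy:
    frequency_days: int        # how often a new backup generation is made
    generations_onsite: int
    generations_offsite: int
    media_type: str
    location: str

    def offsite_window_days(self):
        # Rough span of history recoverable from off-site copies alone,
        # which is what remains if the primary site is lost.
        return self.generations_offsite * self.frequency_days


policy = BackupPolicy(frequency_days=7, generations_onsite=2,
                      generations_offsite=4, media_type="tape",
                      location="off-site vault (hypothetical)")
window = policy.offsite_window_days()
```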
Discovery Planning steps (4/6)
6 Identify, if a severe disruption occurred, what would be the minimum requirements/replacement needs to perform the critical function during the disruption.
- Type (e.g., server hardware, software, research materials, etc.)
- Item name and description
- Quantity required
- Location of inventory, alternative, or offsite storage
- Vendor/supplier
Discovery Planning steps (4/7)
7 Identify if alternate methods of processing either exist or could be developed, quantifying where possible, impact on processing. (Include manual processes.)
8 Identify person(s) who supports the system or application
9 Identify primary person to contact if system or application cannot function as normal
10 Identify secondary person to contact if system or application cannot function as normal
Discovery Planning steps (4/8)
11 Identify all vendors associated with the system or application
12 Document unit strategy during recovery (conceptually how will the unit function?)
13 Quantify resources required for recovery, by time frame (e.g., 1 pc per day, 3 people per hour, etc.)
14 Develop and document recovery strategy, including:
- Priorities for recovering system/function components
- Recovery schedule
- Form – critical system processing requirement for recovery
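As one possible starting point for the recovery priorities, systems can be ordered by the allowable delay quantified earlier in this step. Sorting by allowable delay alone is an assumption; a unit may weigh other factors. The system names and hours below are invented for illustration.

```python
def recovery_order(systems):
    # systems: list of (name, allowable delay in hours); shortest allowable delay first.
    return [name for name, _delay in sorted(systems, key=lambda item: item[1])]


order = recovery_order([
    ("email", 24),
    ("student payroll", 72),
    ("registration", 4),
])
```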
Discovery Planning steps (5)
Step Five – Review Onsite and Offsite Backup and Recovery Procedures
The planning team as identified in Step 1 Task 3 would normally perform this task.
Review current records (OS, Code, System Instructions, documented processes, etc.) requiring protection
Review current offsite storage facility or arrange for one
Review backup and offsite storage policy or create one
Present to unit leader for approval
Discovery Planning steps (6)
Step Six – Select Alternate Facility
ALTERNATE SITE: A location, other than the normal facility, used to process data and/or conduct critical business functions in the event of a disaster.
- Determine resource requirements
- Assess platform uniqueness of unit systems (e.g., Macintosh, IBM compatible, Oracle database, Windows 3.1, etc.)
- Identify alternative facilities
- Review cost/benefit
- Evaluate and make recommendation
Discovery Planning steps (7/1)
II. Plan Development and Testing
Step Seven – Develop Recovery Plan
This step would ordinarily be completed by the coordinator/project manager working with the planning team.
Sample Plan Outline
Discovery Planning steps (7/2)
1 Objective
2 Plan Assumptions
3 Criteria for invoking the plan
- Document emergency response procedures to occur during and after an emergency
- Document procedures for assessment and declaring a state of emergency
- Document notification procedures for alerting unit and university officials
- Document notification procedures for alerting vendors
- Document notification procedures for alerting unit staff and notifying of alternate work procedures or locations
Discovery Planning steps (7/3)
4 Roles, Responsibilities and Authority
- Identify unit personnel
- Recovery team description and charge
- Recovery team staffing
- Transportation schedules for media and teams
Discovery Planning steps (7/4)
5 Procedures for operating in contingency mode
- Process descriptions
- Minimum processing requirements
- Determine categories for vital records
- Identify location of vital records
- Identify forms requirements
- Document critical forms
- Establish equipment descriptions
- Document equipment in the recovery site
- Document equipment in the unit
Discovery Planning steps (7/4)
- Software descriptions
- Software used in recovery
- Software used in production
- Produce logical drawings of communication and data networks in the unit
- Produce logical drawings of communication and data networks during recovery
- Vendor list
- Review vendor restrictions
- Miscellaneous inventory
- Communication needs in production
- Communication needs in the recovery site
Discovery Planning steps (7/5)
6 Resource plan for operating in contingency mode
7 Criteria for returning to normal operating mode
8 Procedures for returning to normal operating mode
9 Procedures for recovering lost or damaged data
10 Testing and Training
- Document testing dates
- Complete disaster/disruption scenarios
- Develop action plans for each scenario
Sample Testing Diagram
Discovery Planning steps (7/6)
11 Plan Maintenance
- Document Maintenance Review Schedule (yearly, quarterly, etc.)
- Maintenance Review action plans
- Maintenance Review recovery teams
- Maintenance Review team activities
- Maintenance Review/revise tasks
- Maintenance Review/revise documentation
Discovery Planning steps (7/7)
12 Appendices for Inclusion
- inventory and report forms
- maintenance forms
- hardware lists and serial numbers
- software lists and license numbers
- contact list for vendors
- contact list for staff with home and work numbers
Discovery Planning steps (7/8)
- contact list for other interfacing departments
- network schematic diagrams
- equipment room floor grid diagrams
- contract and maintenance agreements
- special operating instructions for sensitive equipment
- cellular telephone inventory and agreements
Discovery Planning steps (8)
Step Eight – Test the Plan
1 Develop test strategy
2 Develop test plans
3 Conduct tests
4 Modify the plan as necessary
Samples
- Test Plan Strategy
- Test Plan Scenario
- Test Results/Test Evaluation
Discovery Planning steps (9)
III. Ongoing Maintenance
Step Nine – Maintain the Plan
Dean/Director/Unit Administrator will be responsible for overseeing this.
1 Review changes in the environment, technology, and procedures
2 Develop maintenance triggers and procedures
3 Submit changes for systems development procedures
4 Modify unit change management procedures
5 Produce plan updates and distribute
Discovery Planning steps (10)
Step Ten – Perform Periodic Audit
1 Establish periodic review and update procedures
Important factors (1/3)
Communication
- Personnel: notify all key personnel of the problem and assign them tasks focused toward the recovery plan.
- Customers: notifying clients about the problem minimizes panic.
Recall backups
- If backup tapes are taken offsite, these need to be recalled. If using remote backup services, a network connection to the remote backup location (or the Internet) will be required.
Important factors (2/3)
Facilities
- Larger companies can maintain backup hot sites or cold sites. Mobile recovery facilities are also available from many suppliers.
Prepare your employees
- During a disaster, employees are required to work longer, more stressful hours, and a support system should be in place to alleviate some of the stress. Prepare them ahead of time to ensure that work runs smoothly.
Important factors (3/3)
Business information
- Backups should be stored in a completely separate location from the company.
Testing the plan
- Provisions, directions, and frequency for testing the plan should be stipulated.
Things to do in DRP (1/4)
Here are 10 absolute basics your plan should cover:
1. Develop and practice a contingency plan that includes a succession plan for your CEO.
2. Train backup employees to perform emergency tasks. The employees you count on to lead in an emergency will not always be available.
3. Determine offsite crisis meeting places for top executives.
Things to do in DRP (2/4)
4. Make sure that all employees, as well as executives, are involved in the exercises so that they get practice in responding to an emergency.
5. Make exercises realistic enough to tap into employees' emotions so that you can see how they'll react when the situation gets stressful.
6. Practice crisis communication with employees, customers and the outside world.
Things to do in DRP (3/4)
7. Invest in an alternate means of communication in case the phone networks go down.
8. Form partnerships with local emergency response groups (firefighters, police) to establish a good working relationship. Let them become familiar with your company and site.
Things to do in DRP (4/4)
9. Evaluate your company's performance during each test, and work toward constant improvement. Continuity exercises should reveal weaknesses.
10. Test your continuity plan regularly to reveal and accommodate changes. Technology, personnel, and facilities are in a constant state of flux at any company.
Top mistakes in disaster recovery (1/3)
1. Inadequate planning: Have you identified all critical systems, and do you have detailed plans to recover them to the current day?
Everybody thinks they know what they have on their networks, but most people don't really know how many servers they have, how they're configured, or what applications reside on them: what services were running, and what version of software or operating systems they were using.
Top mistakes in disaster recovery (2/3)
2. Failure to bring the business into the planning and testing of your recovery efforts.
3. Failure to gain support from senior-level managers.
The largest problems here are:
- Not demonstrating the level of effort required for full recovery.
- Not conducting a business impact analysis and addressing all gaps in your recovery model.
Top mistakes in disaster recovery (3/3)
- Not building adequate recovery plans that outline your recovery time objective, critical systems and applications, vital documents needed by the business, and the business functions and operational activities to be continued after a disaster.
- Not having proper funding that will allow for a minimum of semiannual testing.