disaster recovery solution - eil

5
Disaster Recovery Solution A Case Study (A design using VMware/Netapp/Doubletake/esxpress) EIL Global Pty. Ltd. Level 30 Westpac House, 91 King William Street Adelaide, SA 5000, Australia Phone: +61 871 001350 Website: www.eilglobal.com.au

Upload: others

Post on 04-Apr-2022

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Disaster Recovery Solution - EIL

Disaster Recovery Solution

A Case Study (A design using VMware/Netapp/Doubletake/esxpress)

EIL Global Pty. Ltd.

Level 30 Westpac House,

91 King William Street Adelaide,

SA 5000, Australia

Phone: +61 871 001350

Website: www.eilglobal.com.au

Page 2: Disaster Recovery Solution - EIL

`

A hosting firm with leased datacenter space wanted to offer one

of its hosting clients ABC, a disaster recovery solution.

Hosting Firm had created a basic design for ABC with their in-

house engineers, but the design had some obvious glitches.

The firm approached EIL Global for help on the design and

implementation of the DR solution.

Consultants from EIL Global were accurately able to find out the

design issues and rectify the problem areas, leading to project

success.

Scenario

ABC operated an environment of both physical and virtual

servers (VMware based) in their production environment.

The DR solution designed in-house provided a DR solution with 2

NetApp, one FAS2040 at the ABC production site and the second

(FAS 2050) at the DR site (Hosti g Fir ’s datacenter)

The idea was to use Netapp snapmirror to transfer data from

production to DR.

A 15 Mbps MPLS dedicated site to site link connected had been

obtained from AT&T for replication traffic.

For the physical servers at the production, double-take software

was being used to dump data to volumes on the FAS2040.

The virtual servers (VMware VI3) at production site were backed

up by esxpress and backups targeted to volumes on the

FAS2040.

The FAS2040 replicated these volumes to FAS2050 at the DR site

via snapmirror.

The DR would involve restoring the esxpress and Doubletake

backups (from FAS2050) to a VI3 environment in the DR data

center.

Case Study (Our design using

Netapp/Doubletake/esxpress)

Issues and solutions

These are just a few of the many issues

that were rectified.

1. The replication needed more than the

estimated maximum bandwidth on the

link, this lead to snapmirror lagging back

by a few days for each volume. This

means the data on the DR site was older

than production site by a few days (high

RPO)

Solution:

We enabled dedupe on all source

volumes and changed the original NetApp

snapmirror replication design to use

Volume snapmirror (VSM) instead of

Qtree snapmirror (QSM). This led to

dedupe savings being passed to

replication traffic savings. Replication

traffic came down by 30%.

We also changed esxpress settings to

daily DELTA backups and to take FULL

backups only when DELTA backup differs

by 60% or every 2 months. This reduced

the load of monthly FULL backups of

virtual machines clogging the link around

the same dates.

Page 3: Disaster Recovery Solution - EIL

2. In an event of an actual disaster, restoration of backups had to

been done manually. This increased the RTO of the DR solution.

Solution:

We used a feature of esxpress called auto-restore, which

automatically restores backups from a target when newer

backups are placed. We configured the cron jobs needed to

achieve this and were able to successfully demonstrate this by

performing a DR test. The cron job ran every hour to check if new

backups had arrived to the FAS2050 volume, and restore these

backups to the DR VI3 environment.

During the DR test, instead of manually restoring backups on the

DR side, Fir ’s engineers simply needed to power on the DR

VMs.

3. Exchange and SQL databases came up "dirty" during DR

testing. Database repairs had to be performed to get the

databases to mount successfully.

Solution:

The root cause of the problem was Netapp snapmirror usage with

Doubletake. Doubletake backs up data continuously to the

FAS2040 volumes.

Snapmirror snapshots this volume and sends it over to the

NetApp FAS2050 on the DR site. Although the backups taken by

Doubletake are "quiesced" backups, the snapmirror snapshots

can happen at anytime rendering the backups "unquiesced". This

problem does not affect regular files too much, it considerable

affects databases. We found that 20% of all databases came up

corrupt on the DR site.

To have a proper working solution, we implemented Exchange

SCR (standby continuous replication) instead of using NetApp

snapmirror. SCR is a well tested and documented solution

recommended by Microsoft, and can be used over WAN links.

For SQL, we found an easier solution by redirecting regular daily

backups of SQL databases to a NetApp 2040 volume. This NetApp

2040 volume was configured to be replicated over to the DR site

by using Netapp VSM.

Page 4: Disaster Recovery Solution - EIL

FAS

2050

A B

Ap

pN

et

FAS

2050

A B

Ap

pN

et

SN

AP

Mirr

or

Asy

nchr

onou

s

App

licat

ion

Ope

ratin

g S

yste

m ES

X S

erve

r

Ope

ratin

g S

yste

m

App

licat

ion

Sto

rage

1S

tora

ge2

CIF

S/S

MB

VM

war

e E

SX

CIF

S/S

MB

Aut

o R

esto

re m

ost r

ecen

t del

ta +

full

back

up

A X

I O

MS

LM 5

00

DA

TA

SY

ST

EM

S

FC

Tier

1Ti

er2

VM

FS w

/ VM

DK

rest

ores

VM

war

e E

SX

Clu

ster

Sto

rage

1S

tora

ge2

esX

pres

s R

eplic

atio

n

Sto

rage

2

Page 5: Disaster Recovery Solution - EIL

FAS

2050

A B

Ap

pN

et

FAS

2050

A B

Ap

pN

et

SN

AP

Mirr

or

Asy

nchr

onou

s

vmdk

vmdk

vmdk

NFS

Vol

ume

VM

war

e E

SX

NFS

2003

32b

it

Phy

sica

l Ser

vers

2003

x64

LAN

vmdk

vmdk

vmdk

NFS

Vol

ume

vmdk

vmdk

vmdk

Net

App

Flex

clon

e

writ

able

volu

me

VM

war

e E

SX

Clu

ster

2003

64b

it

Phy

sica

l Ser

vers

2003

x86Dou

ble

Take

Rep

licat

ion