Virtual Linux Server Disaster Recovery Planning Rick Barlow Nationwide Insurance March 2012 Session 10326


Page 1: Virtual Linux Server DR Planning - the Conference Exchange · 2012. 2. 24. · Virtual Linux Server Disaster Recovery Planning Rick Barlow Nationwide Insurance March 2012 Session

Virtual Linux Server Disaster Recovery Planning

Rick Barlow
Nationwide Insurance

March 2012
Session 10326

Agenda

• Definitions
• Our Environment
• Business Recovery Philosophy at Nationwide
• Planning
• Execution

This information is for sharing only and not an endorsement by Nationwide Insurance.

Definitions

• High Availability
  – “With any IT system it is desirable that the system and its components (be they hardware or software) are up and running and fully functional for as long as possible, at their highest availability. The most desirable high availability rate is known as “five 9s”, or 99.999% availability. A good deal of planning for high availability centers around backup and failover processing and data storage and access.”
  – Deal with significant outage within data center
    • LPAR failure
    • Operating System outage
    • Application ABEND
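To put the “five 9s” figure in concrete terms, the short sketch below converts an availability percentage into allowed downtime per year. The function name `downtime_minutes_per_year` and the 365-day year are assumptions made for illustration; they are not part of the original material.

```shell
# Allowed downtime per year for a given availability percentage.
# Assumes a 365-day year (525,600 minutes).
downtime_minutes_per_year() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", (100 - pct) / 100 * 365 * 24 * 60 }'
}

downtime_minutes_per_year 99.999   # five 9s  -> 5.26
downtime_minutes_per_year 99.9     # three 9s -> 525.60
```

At five 9s the whole yearly budget is a little over five minutes, which is why failover inside the data center has to be automatic rather than a manual recovery.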

High Availability

[Diagram: production high-availability layout. LPARs 1, 2, 3, 4, 6 and 8 host HTTP servers, WAS nodes, session nodes and DB servers spread across VLANs 1 to 5. Each LPAR runs a VM TCP/IP stack and vSwitches that connect through OSA adapters to Cisco switches with firewalls, grouped as Front End, Back End and Tools, reaching the Internet and the intranet.]

Definitions

• Disaster Recovery
  – “Disaster recovery in information technology is the ability of an infrastructure to restart operations after a disaster. While many of today's larger computer systems contain built-in programs for disaster recovery, standalone recovery programs often provide enhanced features. Disaster recovery is used both in the context of data loss prevention and data recovery.”
  – Deal with complete outage
    • Natural catastrophe
    • Data center
    • Major hardware failure

Our Environment

• Four z196 installed in 2011
• Current configuration:
  – Production boxes
    • 36 IFLs – 18 on each z196
    • 404GB memory
    • 6 z/VM LPARs (plus system programmer configuration test)
    • Tier 4+ data center
      – Fully redundant power, telecom, generators, etc.
  – Development boxes
    • 21 IFLs – 10 on one z196 and 11 on the other
    • 663GB memory
    • 6 z/VM LPARs (plus system programmer test)

Planning

• Design
• Challenges
• Priorities
• Setup
• Automation
• Documentation
• Teamwork

Design

• What
  – Identify what needs to be recovered
    • Everything or subset?
    • Priorities – recovery order
• When
  – Need to know recovery objectives
• Where
  – Identify where recovery will occur
    • Second site
    • Vendor site
• How
  – Identify how to transfer programs and data
  – Identify how to perform recovery

Challenges

• Production configuration changes may require DR configuration changes
  – Processor model changes (not necessarily)
  – Capacity
  – Network configuration
  – Transition
• Consistent distribution and synchronization of boot and DR scripts
• Commitment to regular testing
• DR Infrastructure ≠ Application DR

Priorities

• Prioritizing your application recovery must be done by the people in your organization who understand the business processes.
• Business requirements that drive recovery time-frame
  – regulated
  – financial / investments
  – responsive to customers

Setup

• Asynchronous replication with SRDF between production and recovery sites
• Replicated volumes at recovery site
  – z/VM: different unit address between prod and dev/test/DR
  – SAN target WWPN and LUN differ between production and recovery
• Changes for recovery are automated in a Linux script run at boot
• Manual processes
  – Initiate Clone copies
  – Vary ECKD DR volumes online
  – Start Linux servers
  – Update DNS to reflect DR IP addresses for all servers
  – Middleware and application hard-coded parameters (e.g. IP addresses)

Setup

• The DR process would be the same for the following failures
  – System z failure
  – DASD / SAN storage frame failure
• Long distance (inter-data center) fiber failure
  – No DR required (redundant links installed)

Setup

[Diagram: at each site, IBM z196 systems attach to EMC V-Max storage over FICON (x8) for ECKD volumes and over FCP (x12) through the SAN for open systems volumes. Replication fibre links the Prod and Dev sites, and SRDF replication runs between the two V-Max frames.]

Normal Production

[Diagram: the Production and DR sites, each with a SAN and a DMX storage frame, are joined by fiber links. Replication runs from the R1 volumes at Production to the R2 volumes at DR.]

Failure Happens

If failure occurs: manually stop replication; initiate DR Clones; bring ECKD volumes online at DR site; start Linux servers.

[Diagram: the Production and DR sites, each with a SAN and a DMX storage frame, joined by fiber links, with no replication shown.]

Recovery Begins

Servers identify DR configuration, change IP address, change SAN parameters, register new IP with DNS, start replication south to north.

[Diagram: the Production and DR sites, each with a SAN and a V-Max storage frame, joined by fiber links.]

Recovery Begins (cont’d)

Servers identify DR configuration, change IP address, change SAN parameters, register new IP with DNS, start replication south to north.

[Diagram: the Production and DR sites, each with a SAN and a V-Max storage frame, joined by fiber links. Replication now runs in the reverse direction: Production holds the R2 volumes and DR holds the R1 volumes.]

Automation

• Avoid manual processes
  – Dependence on key individuals
  – Prone to mistakes
  – Slow
• Automated processes
  – Requires only basic knowledge of environment and technologies in use
  – Accuracy
  – Repeatable
  – Faster
  – Does not mean build it once then ignore; requires regular review and updates

Automation

• Automation begins at provisioning
  – DR configuration stored with production configuration
  – CMS NAMES file
    • Contains all information about provisioned server
    • Copy stored on DR disk also
    • Also used to generate report of server definitions for easy lookup
  – Linux PARM file stored on CMS disk
    • Stored on disk accessible at boot time
    • Copy stored on DR disk also
  – Define everything needed to provision server and at boot time

Automation

• Extract from LINUX NAMES file for one guest

    :nick.WS001
    :userid.PZVWS001
    :node.VN2
    :desc.Prod web server 1
    :env.PROD
    :hostname.PZVMWS001
    :load.LINUXWS
    :ip.10.1.1.1
    :vswitch.PRODVSW1
    :vlan.2102
    :ip_nb.10.2.1.1
    :vsw_nb.NETBKUP1
    :vlan_nb.3940
    :ip_dr.10.221.1.1
    :vsw_dr.PRODVSW1
    :vlan_dr.2102
    :ip_drbu.10.222.1.1
    :vsw_drbu.NETBKUP1
    :vlan_drbu.3940
    :oth_ip.10.1.1.5 10.1.1.15
    :dr_oth_ip.10.221.1.5 10.221.1.15
    :status.2005-09-08
    :gold.V1.2
    :memory.256M
    :cpus.1
    :share.200,LS-200
    :comments.
    :storage.2.3G
    :storage_os.7.1G
    :bootdev.251
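The `:tag.value` layout of a NAMES entry lends itself to simple lookups from a script. The sketch below is a minimal illustration, assuming the entry has been written with one `:tag.value` pair per line (CMS NAMES files also allow several tags on one line, which this sketch does not handle). The function name `names_tag` is hypothetical; the sample values are taken from the extract.

```shell
# Print the value of one tag from a NAMES-style entry read on stdin.
# Assumes one ":tag.value" pair per line.
names_tag() {
    awk -v tag="$1" '
        index($0, ":" tag ".") == 1 {
            # Skip the leading ":" the tag name and the "." separator.
            print substr($0, length(tag) + 3)
            exit
        }'
}

entry=':nick.WS001
:ip.10.1.1.1
:ip_dr.10.221.1.1'

printf '%s\n' "$entry" | names_tag ip      # production address
printf '%s\n' "$entry" | names_tag ip_dr   # DR address
```

Matching on the full `:tag.` prefix keeps `ip` from also matching `ip_dr` or `ip_nb`.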

Automation

• Extract from LINUX NAMES file for one guest (cont’d)

    :storage_san.16.86G
    :sanluns.R1:0100:5006048AD52D2588:0059000000000000:8.43
             R1:0100:5006048AD52D2588:005A000000000000:8.43
             R1:0200:5006048AD52D2587:0059000000000000:8.43
             R1:0200:5006048AD52D2587:005A000000000000:8.43
    :storage_san_dr.16.86G
    :sanluns_dr.R2:0100:5006048AD52E4F87:006A000000000000:8.43
                R2:0100:5006048AD52E4F87:006B000000000000:8.43
                R2:0200:5006048AD52E4F88:006A000000000000:8.43
                R2:0200:5006048AD52E4F88:006B000000000000:8.43

Automation

• PARM file

    HOST=pzvmws001
    ADMIN=10.1.1.1
    BCKUP=10.2.1.1
    DRADMIN=10.221.1.1
    DRBCKUP=10.222.1.1
    ENV=PROD
    DRVIP=10.1.1.5,10.1.1.15
    BOOTDEV=251
    VIP=10.221.1.5,10.221.1.15
    SAN_1=0100:5006048AD52D2588:005A000000000000,0200:5006048AD52D2587:005A000000000000
    SAN_2=0100:5006048AD52D2588:0059000000000000,0200:5006048AD52D2587:0059000000000000
    SAN_3=0100:5006048AD52E4F87:006A000000000000,0200:5006048AD52E4F88:006A000000000000
    SAN_4=0100:5006048AD52E4F87:006B000000000000,0200:5006048AD52E4F88:006B000000000000
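Each `SAN_n` value in the PARM file carries two comma-separated paths. The sketch below shows one way a boot script might split such a value into its parts. It assumes each triple is FCP device number, target WWPN and LUN, and prints them in the lowercase `0.0.xxxx` / `0x…` style commonly used for zfcp devices. That interpretation, the output format and the function name `san_paths` are assumptions, not something the slides state.

```shell
# Split one SAN_n value into its paths and print "busid wwpn lun" per path.
# Treating each triple as FCP device:WWPN:LUN is an assumption.
san_paths() {
    printf '%s\n' "$1" | tr ',' '\n' | tr 'A-F' 'a-f' |
        awk -F: 'NF == 3 { printf "0.0.%s 0x%s 0x%s\n", $1, $2, $3 }'
}

san_paths '0100:5006048AD52D2588:005A000000000000,0200:5006048AD52D2587:005A000000000000'
```

Each output line describes one path to the same LUN, so a script could loop over the lines to configure both paths.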

Automation

• Alternate Start-up Scripts
  – Identify production or DR mode
    • VMCP – interact with CP
    • CMSFS – read CMS files
  – Set parameters for environment
    • Hostname to /etc/HOSTNAME
    • IP addresses to /etc/sysconfig/network/ifcfg-qeth-bus-ccw-0.0.xxxx
    • SAN LUN information
    • Color prompt by environment
      – Prod = Red
      – DR = Yellow
      – Tools = Green

Automation

• Extract from boot.config

    # Setup variables
    echo "1" > /sys/bus/ccw/devices/0.0.0191/online
    sleep 5
    # modprobe required just in case
    modprobe cpint
    PARMDEV=`grep 191 /proc/dasd/devices|awk '{print $7}'`
    QUSERID=`hcp query userid`              # NZVWS001 AT VN1
    GUEST=`echo $QUSERID|cut -d" " -f 1`    # NZVWS001
    LOCO=`echo $QUSERID|cut -c14`           # N
    LPAR=`echo $QUSERID|cut -c13-15`        # VN1
    BOX=`echo $QUSERID|cut -c1`             # N
    cmsfscat -d /dev/$PARMDEV -a ${GUEST}.PARMFILE > /tmp/sourceinfo
    . /tmp/sourceinfo
    echo "0" > /sys/bus/ccw/devices/0.0.0191/online
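The production-or-DR decision rests on two characters of the `query userid` response: the first letter of the guest name and the second letter of the LPAR name. The sketch below isolates that comparison so it can be tried without a z/VM guest. It assumes an 8-character guest name, so that the LPAR name starts at column 13 as in the `cut` commands from boot.config. The function name `dr_mode` and the LPAR name `VS1` are hypothetical.

```shell
# Decide PROD vs DR from the text returned by "hcp query userid".
# Character 1 is the first letter of the guest name (BOX in boot.config);
# character 14 is the second letter of the LPAR name (LOCO in boot.config).
dr_mode() {
    box=$(printf '%s\n' "$1" | cut -c1)
    loco=$(printf '%s\n' "$1" | cut -c14)
    if [ "$loco" = "$box" ]; then
        echo PROD
    else
        echo DR
    fi
}

dr_mode "NZVWS001 AT VN1"   # letters match: guest is at its home site
dr_mode "NZVWS001 AT VS1"   # hypothetical LPAR at the other site
```

Because the guest name travels with the replicated disks while the LPAR name belongs to the hardware it boots on, a mismatch is a cheap signal that the server has been started at the recovery site.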

Automation

• Result of cmsfscat

    cat /tmp/sourceinfo
    HOST=pzvmws001
    ADMIN=10.1.1.1
    BCKUP=10.2.1.1
    DRADMIN=10.221.1.1
    DRBCKUP=10.222.1.1
    ENV=PROD
    DRVIP=10.1.1.5,10.1.1.15
    BOOTDEV=251
    VIP=10.221.1.5,10.221.1.15
    SAN_1=0100:5006048AD52D2588:005A000000000000,0200:5006048AD52D2587:005A000000000000
    SAN_2=0100:5006048AD52D2588:0059000000000000,0200:5006048AD52D2587:0059000000000000
    SAN_3=0100:5006048AD52E4F87:006A000000000000,0200:5006048AD52E4F88:006A000000000000
    SAN_4=0100:5006048AD52E4F87:006B000000000000,0200:5006048AD52E4F88:006B000000000000
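Once the PARM file has been sourced, both the production and the DR addresses are available as shell variables, and the script only has to pick one set. The sketch below illustrates that selection step. The function `select_addresses` and the `ACTIVE_*` variable names are hypothetical and are not taken from the original boot script.

```shell
# Pick the admin and backup addresses for the current mode.
# Expects ADMIN, BCKUP, DRADMIN and DRBCKUP to be set already,
# for example by sourcing /tmp/sourceinfo.
select_addresses() {
    if [ "$1" = "DR" ]; then
        ACTIVE_ADMIN=$DRADMIN
        ACTIVE_BCKUP=$DRBCKUP
    else
        ACTIVE_ADMIN=$ADMIN
        ACTIVE_BCKUP=$BCKUP
    fi
}

# Values from the PARM file shown in the slides.
ADMIN=10.1.1.1 BCKUP=10.2.1.1 DRADMIN=10.221.1.1 DRBCKUP=10.222.1.1

select_addresses DR
echo "$ACTIVE_ADMIN $ACTIVE_BCKUP"
```

Keeping both address sets in one file is what lets the same disk image boot correctly at either site without anyone editing it during a disaster.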

Automation

• More extract from boot.config

    case "$ENV" in
    PROD)
        if [ "$LOCO" == "$BOX" ]
        then
            CLR="41"; #Red
        else
            CLR="43"; #Yellow/Gold
            ENV="DR";
        fi
        ;;
    DEV | JT | TOOLS | TOOL)
        CLR="42"; #Green
        ;;
    ST)
        CLR="44"; #Blue
        ;;
    PT)
        CLR="45"; #Purple
        ;;
    UAT | IT)
        CLR="46"; #Turq
        ;;
    *)
        CLR="42"; #Green
        ENV="UNK";
        ;;
    esac

Examples:

    barlowr@szvmjt002:JT:barlowr>
    barlowr@nzvmws001:PROD:barlowr>
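The values 41 to 46 assigned to `CLR` are the standard ANSI background colour codes (red, green, yellow, blue, magenta, cyan). The sketch below shows one way such a code could be turned into a bash prompt string. The `user@host:ENV:user>` layout is a guess based on the example prompts, and the function name `build_prompt` is hypothetical; the original script's actual prompt definition is not shown in the slides.

```shell
# Build a bash PS1 string with the given ANSI background colour code
# and environment name. \[ and \] mark the escape sequences as
# non-printing so bash can compute the prompt width correctly.
build_prompt() {
    printf '\\[\\e[%sm\\]\\u@\\h:%s:\\u>\\[\\e[0m\\] ' "$1" "$2"
}

PS1=$(build_prompt 43 DR)   # yellow background while running in DR mode
printf '%s\n' "$PS1"
```

A prompt that changes colour with the environment gives administrators an immediate visual cue about whether they are typing on a production, DR or test server.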

Documentation

• Document everything
  – Declaration criteria
  – Contact information
    • Operating System
    • Middleware
    • Application
    • Network
    • Security
  – Lists of servers
  – Recovery process
  – Verification process
  – Fail-back process

Documentation

• DR Procedure:
  – Confirm DISASTER declaration
  – Begin shutdown of all test/development guests to ensure sufficient capacity.
  – Bring up production DR guests identified by business units for each application environment.
  – Make appropriate emergency DNS changes to point users to DR environment per definitions for each application environment.

Documentation

• Return Procedure:
  – Confirm DISASTER OVER declaration
  – Reverse disk replication; confirm synchronization
  – Follow instructions for confirmation of original production environment for each application.
  – Bring down DR guests identified by business units for each application environment.
  – Make appropriate DNS changes to point users to non-DR environment per definitions for each application environment.
  – Resume normal disk replication.

Teamwork

• Recovery coordinator
• z/VM System Programmers
• Linux System Administrators
• Middleware
  – WAS Administrators
  – Database Administrators
  – MQ Administrators
• Application Teams
  – Testing methodology
  – Expected results

Avoid processes that are dependent on subject matter experts (SMEs) when a disaster happens.

Execution

• Test
• Document results
• Compare to plan
• Repeat

Execution

• Where…
  – … to recover the systems
    • Your own second site
    • A recovery vendor
  – … do the people go
    • Identify what personnel need to travel to recovery site
      – Document travel procedures
    • Identify alternate (local) office space
      – Some office locations may be able to access recovery site if connectivity is available

Execution

• Testing
  – Test as often as feasible
    • Frequency may depend on having your own site or contracting with a vendor
  – Tests should be as close as possible to real recovery conditions
  – Operating systems are easy
  – Some subsystems are not so easy (e.g. large database)
  – Multi-platform applications can be more complex
  – Automate as much as possible to avoid manual effort

Document Results
Compare to Plan

• Detailed plans for all test scenarios
• Carefully track tests
• Document action items and follow up for improvements
• Build on successes

Repeat

• Do it again
• Do it regularly
• Corporate emphasis may be required to encourage all applications to test

Contact Information

“And I thought we were busy before Linux showed up!”

Rick Barlow
Senior z/VM Systems Programmer

Phone: (614) 249-5213

Internet: [email protected]