


A Comparison of HP-UX Disaster Tolerant Solutions (Formerly titled “Design Consideration for HP-UX Disaster Tolerant Solutions”)

Executive Summary
Section 1: Introduction
    Target Audience
    Purpose of Document
Section 2: What is a Disaster Tolerance Architecture
Section 3: General Requirements
Section 4: Cluster File System (CFS) Support
Section 5: Oracle 10g
Section 6: DTS and HP’s Virtualization Strategy
Section 7: Types of Disaster Tolerant Clusters
    Extended Campus Cluster
        Benefits of Extended Campus Cluster
        Limitations of Extended Campus Cluster
    Extended Cluster for RAC
        Benefits of Extended Cluster for RAC
        Limitations of Extended Cluster for RAC
    Metrocluster
        Benefits of Metrocluster
        Limitations of Metrocluster
    Continentalclusters
        Benefits of Continentalclusters (CC)
        Limitations of Continentalclusters
    Comparison of Solutions
        Differences Between Extended Campus Cluster and Metrocluster
        Comparison - All DTS Solutions
Section 8: Disaster Tolerant Cluster Limitations
Section 9: Recommendations
Appendix A – DTS Design Considerations
    Cluster Arbitration
        Dual cluster lock disks
        Quorum Server in a third location
        Arbitrator node(s) in a third location
    Protecting Data through Replication
        Off-line Data Replication
        On-line Data Replication
    Using Alternative Power Sources
    Creating Highly Available Networking
    Managing a Disaster Tolerant Environment
For more information


Revision History

Printing history

0.1 Review

1.0 Initial publication (written and published by Hue Vu)

1.5 Revised to include the following configurations/additions, and distributed for review:
    • Extended Cluster for RAC
    • Continentalclusters with RAC
    • Continentalclusters with single IP subnet configuration

2.0 Updated to reflect feedback from review of version 1.5:
    • Document wording revised, per feedback
    • Executive summary enhanced
    • New section (General Requirements) added, to precede the “Types of Disaster Tolerant Clusters” section
    • For readability, design considerations and implementation attributes have been moved to an Appendix
    • Section 4 (comparison of 4 HP-UX solutions) table reformatted for usability
    • Metrocluster section updated to reflect the policy used to determine maximum supported distance

3.0 Second publication of document

3.1 Updated to reflect enhancements to DTS, including:
    • CFS support
    • Oracle 10g discussion
    • Virtual Server Environment (VSE) support
    • Support of SRDF asynchronous data replication with MC/SRDF

Please send all feedback directly to Deb Alston ([email protected])


Executive Summary

In a Serviceguard cluster configuration, high availability is achieved by using redundant hardware to eliminate single points of failure. This protects the cluster against hardware faults, such as a single node failure. This architecture, which is typically implemented on one site in a single data center, is sometimes called a local cluster. For some installations, the level of protection provided by a local cluster is insufficient for the business. Consider an order-processing center where power outages are common during harsh weather. Or consider the systems running the stock market, where multiple system failures, for any reason, have a significant financial impact. For these types of installations, and many more like them, it is important to guard not only against single points of failure, but against multiple points of failure (MPOF), or against single massive failures that cause many components to fail (such as the failure of a data center, an entire site, or a small area).

Creating clusters that are resistant to multiple points of failure or single massive failures requires a different type of cluster architecture from the local cluster. This architecture is called a disaster tolerant architecture – often referred to as a Disaster Tolerant Solution (DTS). This architecture provides you with the ability to fail over automatically to another part of the cluster or manually to a different cluster after certain disasters. Specifically, the disaster tolerant solution provides appropriate failover in the case where an entire data center becomes unavailable. HP has a rich portfolio of disaster tolerant cluster offerings, including Extended Campus Cluster1, Metrocluster, and Continentalclusters. While each of these solutions has its own characteristics, their common goal is to protect users from a site-wide outage. To achieve this, the common feature they all implement is multiple data centers with multiple copies of the user’s data. Effectively, if one data center fails, a second data center is available to continue processing.

Both Metrocluster and Extended Campus Cluster solutions are single Serviceguard clusters, meaning an application can automatically fail over from one data center to the other in the event of a failure. Although similar in nature, these topologies have key differences that provide different levels of disaster tolerance. For example, a key difference between these two topologies is the method of data replication used. Metrocluster implements storage-based data replication with one of the following three storage subsystems

– HP StorageWorks Continuous Access XP (aka Metrocluster/CAXP)
– EMC’s Symmetrix arrays (aka Metrocluster/SRDF)
– HP StorageWorks Continuous Access EVA (aka Metrocluster/CAEVA)

Extended Campus Cluster is a host-based data replication product. While Extended Campus Cluster spans two data centers up to 100km apart, the distance between Metrocluster sites is based on the cluster network and data replication link. In a Metrocluster configuration, maximum distance is the shortest of the distances defined by:

• Cluster network – maximum distance cannot exceed roundtrip cluster heartbeat network latency requirement of 200ms

• DWDM provider – distance cannot exceed the maximum as specified for the product supplied by the DWDM provider

• Data replication link – maximum supported distance as stated by the storage partner

The third solution – Continentalclusters – is built on top of two individual Serviceguard clusters, and uses semi-automatic failover to start up an application on its recovery cluster. When a site fails, the user is notified, and must initiate a “recovery” process on the secondary site for the affected applications to be brought up. Continentalclusters has no distance limitation (i.e., it may span very short to very long distances, implementing both LAN and WAN technologies).

1 Extended Campus Cluster is also known as “CampusCluster” and “Extended Distance Cluster”. Throughout this document, this configuration will be referred to as “Extended Campus Cluster”.

Continentalclusters also supports a configuration with three data centers. In the three data center configuration, the first two data centers implement Metrocluster. The third data center is a traditional single data center Serviceguard cluster. This configuration is suited for environments that (may already) have two data centers implemented, but for business reasons, require a third data center. Deployment of this configuration is rare. Typically, Continentalclusters is implemented with two data centers (i.e., two single data center Serviceguard clusters), with semi-automatic failover between data centers.

From initial observation, the solutions appear to be interchangeable. The key to selecting the appropriate fit for a customer’s environment is often driven by the customer’s Recovery Time and Recovery Point Objectives (referred to as RTO and RPO). Customers requiring the least amount of downtime will require a solution that tightly integrates data currency with application availability. The best solution for this customer is one that offers automatic failover of the application. On the other hand, customers who want control over application failover would prefer a solution that allows the user to decide when an application starts at the recovery site. Please refer to “Section 9: Recommendations” for guidelines on selecting and recommending a disaster tolerant solution.

Section 1: Introduction

Many decisions have to be made when designing a disaster tolerant solution. These decisions can have a tremendous impact on the availability of the solution, consistency of the data, and the overall cost of the solution. This paper discusses the overall disaster tolerant architecture and its general requirements, solutions that HP currently offers for HP-UX, differences between them, and offers a high-level design guideline. Architectures discussed include:

• Extended Campus Cluster2
• Extended distance support for Oracle Real Application Clusters (RAC)
    – In an active/active configuration (Extended Cluster with RAC)
    – In an active/standby configuration (Continentalclusters with RAC)
• Metrocluster
• Continentalclusters

Target Audience

This paper is only available internally to HP personnel. It is intended for use by HP’s pre-sales force to aid in providing recommendations to customers on disaster tolerant solutions.

Purpose of Document

The purpose of this document is two-fold:

• Discuss and compare disaster tolerant cluster solutions that HP currently offers for HP-UX
• Provide recommendations on positioning our products relative to each other, enabling HP Field personnel to help customers determine the best disaster tolerant solution for their environments

As this document specifically discusses HP-UX solutions, it does not address implementations on platforms other than HP-UX.

2 Extended Campus Cluster is also known as “CampusCluster” and “Extended Distance Cluster”. Throughout this document, this configuration will be referred to as “Extended Campus Cluster”.


Section 2: What is a Disaster Tolerance Architecture

In a conventional Serviceguard cluster configuration, all components are in a single data center. This is referred to as a local cluster. High availability is achieved by using redundant hardware to guard against single points of failure, such as protection against the node failure in Figure 1.


Figure 1. High Availability Architecture

However, for many types of installations, it is important to guard not only against single points of failure, but against multiple points of failure (MPOF), or against single massive failures that cause many components to fail, such as the failure of a data center, of an entire site, or of a small area. A data center, in the context of disaster recovery, is a physically proximate collection of servers, storage, network, and power source that can be used to run a business application(s), usually all in one room. Creating clusters that are resistant to multiple points of failure or single massive failures requires a different type of cluster architecture called a disaster tolerant architecture. This architecture provides you with the ability to fail over automatically to another part of the cluster or manually to a different cluster after certain disasters. Specifically, the disaster tolerant cluster provides appropriate failover in the case where an entire data center becomes unavailable, as in the sample configuration in Figure 2.



Figure 2. Disaster Tolerant Sample Configuration

Section 3: General Requirements

What do customers need in a disaster tolerant solution? As a first step, the customer needs to go through business impact and risk assessment exercises to understand the application’s availability requirements. The customer also needs to define the recovery time objectives (RTO) for the applications that are critical to the business, and the recovery point objectives (RPO) - the point in time to which data must be restored to resume transaction processing. Two common design requirements of disaster tolerant architecture affecting RTO and RPO are the ability to protect the data from data loss or corruption, and the ability to access the data. Since a solution that keeps the applications running but allows data to become corrupt is useless, data protection should always take precedence over application availability.

Depending on the type of disaster you are protecting against and the available technology, the nodes can be as close as partitions within a single node, nodes in another room in the same building, or as far away as another continent. Whatever the distance, the goal of a disaster tolerant architecture is to survive the loss of a data center that contains critical resources to run a business application. Putting clustered nodes further apart increases the likelihood that alternate nodes will be available for failover in the event of a disaster. The most significant losses during a disaster are the loss of access to data, and the loss of data itself. You protect against this loss through data replication (i.e., creating extra copies of the data). Data replication should:

• Ensure data consistency by replicating data in a logical order so that it is immediately usable or recoverable. Inconsistent data is unusable and is not recoverable for processing. Consistent data may or may not be current.

• Ensure data currency by replicating data quickly so that a replica of the data can be recovered to include all committed disk writes that were applied to the local disks.

• Ensure data recoverability so that there are some actions that can be taken to make the data consistent, such as applying logs or rolling a database.

• Minimize data loss by configuring data replication to address consistency, currency, and recoverability.
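To make the currency requirement concrete, the following minimal Python sketch (illustrative only; it is not taken from any HP product, and the lag figures are assumptions) estimates the worst-case data loss window for a replica from its replication lag and compares it against an RPO target. It shows that a consistent replica may still trail the primary, and that this gap is what the RPO bounds.

    # Illustrative sketch only: estimate the worst-case data-loss window for a
    # replica and compare it against an RPO target. The ReplicaState fields and
    # the example figures are assumptions for this example, not HP DTS behavior.
    from dataclasses import dataclass

    @dataclass
    class ReplicaState:
        consistent: bool          # writes applied in logical order, so the copy is usable
        replication_lag_s: float  # seconds the replica trails the primary (0 = synchronous)

    def assess_replica(replica: ReplicaState, rpo_target_s: float) -> str:
        """Classify a replica against the consistency/currency rules above."""
        if not replica.consistent:
            # Inconsistent data is unusable and not recoverable for processing.
            return "unusable: replica is not consistent"
        if replica.replication_lag_s == 0:
            return "current: no committed writes would be lost"
        if replica.replication_lag_s <= rpo_target_s:
            return f"within RPO: up to {replica.replication_lag_s:.0f}s of writes may be lost"
        return f"RPO violated: up to {replica.replication_lag_s:.0f}s of writes may be lost"

    # Example: an asynchronous replica trailing by 90 seconds, against a 5-minute RPO.
    print(assess_replica(ReplicaState(consistent=True, replication_lag_s=90), rpo_target_s=300))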


Section 4: Cluster File System (CFS) Support

Traditionally, the only storage management options in Serviceguard (SG) environments have been either Logical Volume Manager (LVM) or Symantec Volume Manager (VxVM). Similarly, the only options available to SG Extension for RAC (SG/SGeRAC) were the Shared Logical Volume Manager (SLVM) and the Symantec Cluster Volume Manager (CVM), where Oracle’s application software is typically installed on a local file system. In December 2005, support was extended to include Symantec’s Cluster File System (CFS) by both SG and SG/SGeRAC. With CFS, executables and data alike can be managed by the file system (e.g., Oracle data files and Oracle binaries can both be put in a CFS). CFS provides major enhancements such as improved manageability and improved maintenance. For instance, with CFS, Oracle binaries are installed only once, and are visible to all cluster nodes. A central location is available to store runtime logs, archive logs, etc. From a maintenance perspective, software updates, patches, and changes have to be applied only once.

CFS support – which requires CVM 4.1 – is currently available for single data centers only. Support for CFS and CVM 4.1 with Extended Campus Cluster, Extended Cluster for RAC, and Continentalclusters is targeted for (calendar year) 2006. Until that support is provided, CVM 4.1 is not supported in DTS configurations. Please note that CFS support requires the HP Storage Management Suite, which includes the appropriate versions of both CFS and CVM. There are presently no plans to support CFS with Metrocluster.

Section 5: Oracle 10g

The advent of Oracle 10g has introduced several new Oracle features, including Automatic Storage Management (ASM). ASM was introduced as a component of the Oracle database. ASM provides an alternative to platform file systems and volume managers for the management of file types used to store most Oracle files, including data files, control files, and redo logs. A big advantage of ASM is the ease of management it provides for Oracle database files. However, there are several file types that are not supported – and cannot be managed – by ASM, including Oracle database server binaries, trace files, audit files, alert logs, backup files, export files, tar files, core files, and Oracle’s (clusterware) quorum and registry devices.

Support of ASM by SG/SGeRAC (version A.11.17 and beyond) is available on HP-UX 11iv2 for RAC databases only (i.e., there is no ASM support for Oracle single instance database with SG). Additionally, SG/SGeRAC configurations using ASM must use raw logical volumes managed by SLVM (i.e., ASM “sits on top of” SLVM). The primary reason SLVM is required is to leverage the multipathing capabilities provided by SLVM so that ASM can be supported by SG/SGeRAC on HP-UX 11iv2. There are presently no plans to support any Disaster Tolerant cluster configuration with ASM.

Extended Distance SG/SGeRAC and Continentalclusters currently support the Oracle 10g RAC database server in non-ASM, non-CFS configurations. Additionally, Metrocluster and Continentalclusters support the Oracle 10g single instance database server in non-ASM, non-CFS configurations. CFS support by Extended Distance SG/SGeRAC and Continentalclusters is targeted for 2006.

More information on SG/SGeRAC integration with Oracle 10g may be found at the HA ATC website: http://haweb.cup.hp.com/ATC/, and in the product user’s guides (i.e., Designing Disaster Tolerant High Availability Clusters 14th Edition, and Using Serviceguard Extension for RAC 3rd Edition)


Section 6: DTS and HP’s Virtualization Strategy

DTS products support HP’s VSE strategy. Serviceguard is integrated with HP VSE products related to partitioning, utility pricing, workload management, and tools for managing the overall VSE environment. The addition of DTS leverages this integration to extend support from a single data center to multiple data centers. More information on the integration of DTS with VSE may be found in the following document: http://www.hp.com/products1/unix/operating/docs/wlm.serviceguard.pdf. Additionally, a demo that implements Metrocluster in a VSE may be downloaded from the HA ATC website, http://haweb.cup.hp.com/ATC/. Once on the website, the download is available on the “Demos” webpage.

Section 7: Types of Disaster Tolerant Clusters

Four HP disaster-tolerant cluster configurations are described in this guide, including:

• Extended Campus Cluster
• Extended Cluster for RAC
• Metrocluster
• Continentalclusters

Extended Campus Cluster

An Extended Campus Cluster is a normal Serviceguard cluster with nodes spread over two data centers. All nodes are on the same IP subnet. An application runs on one node in the cluster, with other nodes configured to take over in the event of a failure in an active/standby configuration. Either HP-UX MirrorDisk/UX or Symantec VERITAS VxVM mirroring is used to replicate application packages' data between the two data centers in an Extended Campus Cluster, even if the data is stored on RAID.

Extended Campus Cluster relies on the capability of the Fibre Channel (FC) technology. It uses FC switches and/or hubs, and Dense Wavelength Division Multiplexing (DWDM) to provide host-to-storage connectivity across two data centers up to 100km apart.

In the Extended Campus Cluster architecture, each clustered server is directly connected to the storage in both data centers. The following diagram depicts a 4-node Extended Campus Cluster using dual cluster lock disks for arbitration. Cluster locks are discussed in Appendix A of this document.



Figure 3. Extended Campus Cluster with two Data Centers (dual cluster lock disks used for cluster arbitration)

Benefits of Extended Campus Cluster

• This configuration implements a single Serviceguard cluster across two data centers, and uses either MirrorDisk/UX or Symantec VERITAS VxVM mirroring for data replication. No (cluster) license beyond SG is required for this solution, making it the least expensive to implement. The addition of CFS support is targeted for 2006.

• Customers may choose any storage supported by Serviceguard, and the storage can be a mix of any SG-supported storage.

• This configuration may be the easiest for customers to understand and manage, as it “looks and feels” just like SG.

• Application failover is minimized. All disks are available to all nodes, so that if a primary disk fails but the node stays up and the replica is available, there is no failover (i.e., the application continues to run on the same node while accessing the replica).

• Data copies are peers, so there is no issue with reconfiguring a replica to function as a primary disk after failover.

• Writes are synchronous unless the link or disk is down, so data remains current between the primary disk and its replica.

Limitations of Extended Campus Cluster

• Extended Campus Cluster provides no built-in mechanism for Serviceguard to determine the state of the data before starting up the application. An application package will start successfully if volume group activation is successful. For example, nothing prevents an application from starting if the Logical Volume Manager (LVM) mirrors are split. This scenario increases the exposure to data loss in the event of a site disaster. Only a carefully designed architecture coupled with proper implementation (e.g., adding additional intelligence to package control scripts, selecting appropriate volume group activation options, incorporating monitoring tools like Event Monitoring Services, etc.) can help to avoid undesirable behavior or consequences; a hypothetical example of such a pre-start check is sketched after this list.

• Extended Campus Cluster does not support asynchronous data replication. While data currency is maintained between the two data centers in normal operations, longer distances between the data centers increase the likelihood of a performance impact.


• With MirrorDisk/UX, there is an increased I/O load for writes, since each write has to be done twice by the host. If data resynchronization is required, based on the amount of data involved, this can have a major performance impact on the host.
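As noted in the first limitation above, any such safeguard has to be built by the implementer. The following Python sketch is a hypothetical illustration of that kind of added intelligence (it is not HP code; get_mirror_copy_states() is a stub for whatever site-specific mechanism, such as parsing volume manager status or an EMS resource, reports the state of each mirror copy): the package is only allowed to start when the mirror copies are in an acceptable state.

    # Hypothetical pre-start guard for an Extended Campus Cluster package.
    # Not HP code: get_mirror_copy_states() is a stub standing in for a
    # site-specific check (e.g., volume manager status or an EMS monitor).

    def get_mirror_copy_states(volume_group: str) -> dict:
        """Stub: return the state of each mirror copy, e.g. {"dc1": "current", "dc2": "stale"}."""
        raise NotImplementedError("replace with a site-specific LVM/VxVM or EMS query")

    def ok_to_start_package(volume_group: str, allow_single_copy: bool = False) -> bool:
        """Return True only when it is considered safe to start the application."""
        states = get_mirror_copy_states(volume_group)
        current_copies = [site for site, state in states.items() if state == "current"]
        if len(current_copies) == len(states):
            return True                  # both copies current: safe to start
        if current_copies and allow_single_copy:
            return True                  # degraded operation explicitly permitted by policy
        # Mirrors split or stale: starting now increases exposure to data loss,
        # so fail the package start and leave the decision to the operator.
        return False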

Extended Cluster for RAC

Serviceguard Extension for RAC (SGeRAC) is a specialized configuration that enables Oracle Real Application Clusters (RAC) to run in an HP-UX environment on high availability clusters. RAC in a Serviceguard environment lets you maintain a single (Oracle) database image that is accessed by the servers in parallel in an active/active configuration, thereby providing greater processing power without the overhead of administering separate databases.

Extended Cluster for RAC merges Extended Campus Cluster with SGeRAC. One key difference between the two configurations is the volume manager. While Extended Campus Cluster uses LVM and VxVM, Extended Cluster for RAC implements SLVM and CVM 3.5. Additionally, CFS support is targeted for (calendar year) 2006.

Benefits of Extended Cluster for RAC

• In addition to the benefits of Extended Campus Cluster, RAC runs in active/active mode in the cluster, so that all resources in both data centers are utilized. The database and data are synchronized and replicated across two data centers up to 100km apart. In event of a site failure, no failover is required, since the instance is already running at the remote site.

• Extended Cluster for RAC implements SLVM so that SGeRAC has a “built-in” mechanism for determining the status of volume group extents in both data centers (i.e., the state of the volume groups is kept in memory at the remote site), and SLVM will not operate on non-current data.

Limitations of Extended Cluster for RAC

• There is a limit on cluster size, based on the underlying volume manager. If the volume manager used is SLVM, the (RAC) configuration is limited to 2 nodes (i.e., while the actual cluster size can be up to 16 nodes, only 2 nodes in the cluster can be configured with RAC, since SLVM supports 2-node mirroring; all other nodes can be configured to run “non-RAC” applications). In the Extended Cluster for RAC configuration, if one of the RAC nodes is unreachable, the surviving node has no backup.

• With MirrorDisk/UX, there is an increased I/O load for writes, since each write has to be done twice by the host. If data resynchronization is required, based on the amount of data involved, this can have a major performance impact on the host.

• In addition to SLVM, Extended Cluster for RAC also supports Symantec’s Cluster Volume Manager (CVM 3.5). With CVM 3.5, the cluster may be increased up to four nodes, but the distance for a 4-node cluster is limited to 10km (like SLVM, a 2-node CVM 3.5 cluster supports a maximum distance of 100km).

• Link distance and latency may affect the application’s performance, as RAC uses the network for data block passing (Oracle’s Cache Fusion architecture).

Metrocluster

Similar to Extended Campus Cluster, a Metrocluster is a normal Serviceguard cluster that has clustered nodes and storage devices located in different data centers separated by some distance. Applications run in an active/standby mode (i.e., application resources are only available to one node at a time). The distinct characteristic of Metrocluster is its integration with array-based data replication. Currently, Metrocluster implements three different solutions:

• Metrocluster/CAXP – HP StorageWorks Continuous Access XP


• Metrocluster/CAEVA – HP StorageWorks Continuous Access EVA
• Metrocluster/SRDF – EMC’s Symmetrix arrays

Each data center has a set of nodes connected to the storage local to that data center. Disk arrays in the two data centers are physically connected to each other. Since the data replication/mirroring is done by the storage subsystem, there is no need to have a storage connection from a local server to the disk array at the remote data center. Either arbitrator nodes, located in a third location, or a quorum server is used for cluster arbitration.

The following diagram provides an example of Metrocluster/CAXP, configured with arbitrator nodes at a location separate from either of the two data centers.


Figure 4. Metrocluster/CAXP CA with two data centers & a 3rd location for arbitrator nodes

NOTE: DETAILED INFORMATION ON ARBITRATOR NODES AND QUORUM SERVERS IS DISCUSSED IN APPENDIX A OF THIS DOCUMENT.

The distance separating the data centers in a Metrocluster is based on the cluster network and data replication link. In a Metrocluster configuration, maximum distance is the shortest of the distances defined by:

– Cluster network – maximum distance cannot exceed roundtrip cluster heartbeat network latency requirement of 200ms

– DWDM provider – distance cannot exceed the maximum as specified for the product supplied by the DWDM provider

– Data replication link – maximum supported distance as stated by the storage partner

Since this is a single SG cluster, all cluster nodes have to be on the same IP subnet for cluster network communication.
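Expressed as a simple calculation (an illustrative sketch rather than an HP sizing tool; the example figures are placeholders), the supported Metrocluster distance is simply the tightest of the three limits listed above:

    # Illustrative only: the supported site separation is bounded by the shortest
    # of the three constraints described above. Real values come from the network
    # design, the DWDM vendor, and the storage replication partner.

    def max_metrocluster_distance_km(network_limit_km: float,
                                     dwdm_limit_km: float,
                                     replication_link_limit_km: float) -> float:
        """Maximum supported distance is the tightest of the three limits."""
        return min(network_limit_km, dwdm_limit_km, replication_link_limit_km)

    # Example: the latency budget allows 100 km, the DWDM product is rated for
    # 80 km, and the storage vendor supports replication to 200 km -> 80 km.
    print(max_metrocluster_distance_km(100, 80, 200))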


Benefits of Metrocluster

• Metrocluster offers a more resilient solution than Extended Campus Cluster, as it provides full integration between Serviceguard’s application package and the data replication subsystem. The storage subsystem is queried to determine the state of the data on the arrays. Metrocluster knows that application package data is replicated between two data centers. It takes advantage of this knowledge to evaluate the status of the local and remote copies of the data, including whether the local site holds the primary copy or the secondary copy of data, whether the local data is consistent or not, and whether the local data is current or not. Depending on the result of this evaluation, it decides if it is safe to start the application package, whether a resynchronization of data is needed before the package can start, or whether manual intervention is required to determine the state of the data before the application package is started. Metrocluster allows for customization of the startup behavior for application packages depending on the customer's requirements, such as data currency or application availability. This means that by default, Metrocluster will always prioritize data consistency and data currency over application availability. If, however, the customer chooses to prioritize availability over currency, s/he can configure Metrocluster to start up even when the state of the data cannot be determined to be fully current (but the data is consistent).

• Users wishing to prioritize performance over data currency between the data centers have a choice of Metrocluster CAXP or Metrocluster SRDF, as each supports both synchronous and asynchronous replication modes.

• Because data replication and resynchronization are performed by the storage subsystem, Metrocluster may provide significantly better performance than Extended Campus Cluster during recovery. Unlike Extended Campus Cluster, Metrocluster does not require any additional CPU time, which minimizes the impact on the host.

• There is little or no lag time writing to the replica, so the data remains very current.
• Data can be copied in both directions, so that if the primary site fails and the replica takes over, data can be copied back to the primary site when it comes back up.
• Disk resynchronization is independent from CPU failure (i.e., if the hosts at the primary site fail but the disk remains up, the disk knows it does not have to be resynchronized).

Limitations of Metrocluster

• Specialized storage hardware is required in a Metrocluster environment, meaning customers are not allowed to choose their own storage component. Supported storage subsystems include HP StorageWorks XP, HP StorageWorks EVA, and EMC Symmetrix with SRDF. In addition to specialized storage, disk arrays from different vendors are incompatible (i.e., a pair of disk arrays from the same vendor is required).

• There are no plans to support Oracle RAC (either 9i or 10g) in a Metrocluster configuration.
• There are no plans to support CFS in a Metrocluster configuration.

Continentalclusters

Continentalclusters provides an alternative disaster tolerant solution in which short to long distances separate distinct Serviceguard clusters, with either a local area network (LAN) or a wide area network (WAN) between the clusters. Unlike Metrocluster and Extended Campus Cluster, which have a single-cluster architecture, Continentalclusters uses multiple clusters to provide application recovery. Applications run in active/standby mode, with application data replicated between data centers by either storage array-based data replication products (such as Continuous Access XP or EMC's SRDF) or software-based data replication (such as Oracle 8i Standby DBMS and Oracle 9i Data Guard).


Two types of connections are needed between the two Serviceguard clusters in this architecture; one for the inter-cluster communication, and another for the data replication. Depending on the distance between the two sites, either LAN (i.e., single IP subnet) or WAN connections may be used for cluster network communication. For data replication, depending on the type of connection (ESCON or FC) that is supported by the data replication software, the data can be replicated over DWDM, 100Base-T and Gigabit Ethernet using Internet Protocol (IP), ATM, and T1 or T3/E3 leased lines or switched lines. The Ethernet links and ATM can be implemented over multiple T1 or T3/E3 leased lines.

Continentalclusters provides the ability to monitor a Serviceguard cluster and fail over mission critical applications to another cluster if the monitored cluster should become unavailable. In addition, Continentalclusters supports mutual recovery, which allows for mission critical applications to run on both clusters, with each cluster configured to recover the mission critical applications of the other. As of March 2003, Continentalclusters supports SGeRAC in addition to Serviceguard. In an SGeRAC configuration, Oracle RAC database instances are simultaneously accessible by nodes in the same cluster (i.e., the database is only accessible to one site at a time). The Oracle database and data are replicated to the 2nd data center, and the RAC instances are configured for recoverability, so that the 2nd data center stands by, ready to begin processing in the event of a site failure at the 1st data center (i.e., across sites, this is an active/standby configuration such that the database is only accessible to one site at a time).

If a participating cluster in Continentalclusters should become unavailable, Continentalclusters sends the administrator a notification of the problem. The administrator should verify that the monitored cluster has failed and then issue a recovery command to transfer mission critical applications from the failed cluster to the recovery cluster.
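The semi-automatic model can be pictured as the small loop below. This is a conceptual Python sketch only, not Continentalclusters code: the monitoring check, the notification call, and the recovery action are placeholders for the product's actual monitoring, alerting, and recovery command.

    # Conceptual sketch of semi-automatic recovery; all helpers are placeholders
    # (assumptions), not Continentalclusters APIs.
    import time

    def primary_cluster_reachable() -> bool:
        """Placeholder: replace with the product's cluster monitoring check."""
        return True

    def notify_administrator(message: str) -> None:
        """Placeholder: replace with the configured alert mechanism (e-mail, pager, console)."""
        print(message)

    def run_recovery_command() -> None:
        """Placeholder: replace with issuing the recovery command on the recovery cluster."""
        print("recovery initiated on the recovery cluster")

    def monitor_and_recover(poll_interval_s: int = 60) -> None:
        while True:
            if not primary_cluster_reachable():
                # The product notifies; it never fails over on its own.
                notify_administrator("Monitored cluster unreachable - verify the failure.")
                if input("Confirm the primary cluster is down and start recovery? [y/N] ").lower() == "y":
                    run_recovery_command()   # the operator 'pushes the button'
                    return
            time.sleep(poll_interval_s)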

NOTE: THE MOVEMENT OF AN APPLICATION FROM ONE CLUSTER TO ANOTHER CLUSTER DOES NOT REPLACE LOCAL FAILOVER. APPLICATION PACKAGES SHOULD BE CONFIGURED TO FAIL OVER BETWEEN NODES (OR PARTITIONS) IN THE LOCAL CLUSTER.

The following diagram depicts a Continentalclusters configuration with two data centers.

Figure 5. Continentalclusters with XP CA over IP


Benefits of Continentalclusters (CC)

• Customers can build data centers virtually anywhere and still have the data centers provide disaster tolerance for each other. Since Continentalclusters uses two clusters, theoretically there is no limit to the distance between the two clusters. The distance between the clusters is dictated by the required rate of data replication to the remote site, the level of data currency, and the quality of networking links between the two data centers.

• Inter-cluster communication can be implemented with either WAN or LAN. LAN support is a great advantage for customers who have data centers in proximity of each other, but for whatever reason, do not want the data centers configured into a single cluster. One example may be a customer who already has two SG clusters close to each other. For business reasons, the customer cannot merge these two clusters into a single cluster, but is concerned about having one of the centers become unavailable. Continentalclusters can be added to provide disaster tolerance.

• Customers can integrate Continentalclusters with any storage component of choice that is supported by Serviceguard. Continentalclusters provides a structure to work with any type of data replication mechanism. A set of guidelines for integrating customers’ chosen data replication scheme with Continentalclusters is included in the “Designing Disaster Tolerant High Availability Clusters” manual.

• Besides selecting their own storage and data replication solution, customers can also take advantage of the following (HP) pre-integrated solutions:
    – Storage subsystems implemented by Metrocluster are also pre-integrated with Continentalclusters. Continentalclusters uses the same data replication integration module that Metrocluster implements to check the data status of the application package before package start-up.

    – If either the Oracle 8i or Oracle 9i DBMS is used and logical data replication is the preferred method, depending on the version, either Oracle 8i Standby or Oracle 9i Data Guard with log shipping is used to replicate the data between the two data centers. HP provides a supported integration toolkit for Oracle 8i Standby DB in the Enterprise Cluster Master Toolkit (ECMT). Contributed integration templates for Oracle 9i Data Guard are available at the following location: http://haweb.cup.hp.com/ATC/. While the integration templates for Oracle 9i Data Guard have been tested with Continentalclusters by ACSL, the scripts are provided at no charge, with no support from HP.

• Both Oracle 9i and Oracle 10g RAC are supported by Continentalclusters by integrating CC with SGeRAC. In this configuration, multiple nodes in a single cluster can simultaneously access the database (i.e., nodes in one data center can access the database). If the site fails, the RAC instances can be recovered at the second site.

• In a 2-data center configuration, Continentalclusters supports a maximum of 32 nodes – i.e., a maximum of 16 nodes per data center.

• Continentalclusters supports up to three data centers. In this configuration, the first two data centers must implement Metrocluster so that applications automatically fail over between the first two data centers before migrating to the third data center. The third data center is a traditional (single) Serviceguard data center. If both the first and second data centers fail, the customer will be notified and advised to migrate the application to the third site.

NOTE: THIS CONFIGURATION MUST BE VERY CAREFULLY DEPLOYED, AS APPLICATION AND DATA FAILBACK IS VERY MANUALLY INTENSIVE.

• Failover for Continentalclusters is semi-automatic. If a data center fails, the administrator is advised, and is required to take action to bring the application up on the surviving cluster. Per customer feedback via our Field personnel, some customers prefer notification that the site is down before the application migrates to the recovery site.

• CFS support is targeted for 2006.


Limitations of Continentalclusters

• Semi-automatic failover is a concern for some customers, depending on their Recovery Time Objectives (RTO). Per feedback from Field personnel, some customers would like the option of automatic failover as well as semi-automatic failover.

• Although not a limitation of the Continentalclusters product, it should be noted that increased distance can significantly complicate the solution. For example, operational issues, such as working with different staff with different processes and conducting failover rehearsals, are more difficult the further apart the clusters are. In addition, for configurations whose separation requires a WAN between the clusters, the physical connection is one or more leased lines managed by a common carrier, and common carriers cannot guarantee the same reliability as a dedicated physical cable. The distance can also introduce a time lag for data replication, which creates an issue with data currency, and can increase the overall solution cost by requiring higher speed connections to improve data replication performance and reduce latency.

Comparison of Solutions

One of the major problems the Field faces is distinguishing between Extended Campus Cluster and Metrocluster. The following section is provided to highlight key differences between the two.

Differences Between Extended Campus Cluster and Metrocluster

The major differences between an Extended Campus Cluster and a Metrocluster include:

• The methods used to replicate data between the storage devices in the two data centers. Generally speaking, there are two basic methods available for replicating data between the data centers for HP-UX clusters – either host-based or storage array-based. Extended Campus Cluster always uses host-based replication (either MirrorDisk/UX or Symantec VERITAS VxVM mirroring). Any (mix of) SG-supported storage can be implemented in an Extended Campus Cluster. Metrocluster always uses array-based replication/mirroring, and requires storage from the same vendor in both data centers (i.e., a pair of XPs with CA, a pair of Symmetrix arrays with SRDF, or a pair of EVAs with CA).

• Data centers in an Extended Campus Cluster can span up to 100km, whereas the distance between data centers in a Metrocluster is defined by the shortest of the distances for:
    – the maximum distance that guarantees a network latency of no more than 200ms
    – the maximum distance supported by the data replication link
    – the maximum supported distance for DWDM as stated by the provider

• In an Extended Campus Cluster, there is no built-in mechanism for determining the state of the data being replicated. When an application fails over from one data center to another, the package is allowed to start up if the volume group(s) can be activated. A Metrocluster implementation provides a higher degree of data integrity - the application is only allowed to start up based on the state of the data and the disk arrays.

• Extended Campus Cluster supports active/active access by implementing SGeRAC, whereas Metrocluster only supports active/standby access.

• Extended Campus Cluster reads may outperform Metrocluster in normal operations. On the other hand, Metrocluster performance is better than Extended Campus Cluster for data resynchronization and recovery.
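The startup-decision difference in the third bullet above is often the deciding factor. The following conceptual Python sketch (an illustration of the behavior described in this paper, not code from either product; the data-state labels and helper names are assumptions) contrasts the two: Extended Campus Cluster starts the package whenever the volume group activates, while Metrocluster first classifies the replicated data and may resynchronize or hold for manual intervention.

    # Conceptual contrast of package-startup behavior, as described in this paper.
    # The state labels and helper names are illustrative assumptions.

    def extended_campus_cluster_start(vg_activated: bool) -> str:
        # No built-in data-state check: if the volume group activates, the package
        # starts, even if the mirrors are split.
        return "start package" if vg_activated else "fail package startup"

    def metrocluster_start(local_data: str, prefer_availability: bool = False) -> str:
        # local_data: "current", "consistent_not_current", or "inconsistent"
        if local_data == "current":
            return "start package"
        if local_data == "consistent_not_current":
            # Default policy favors data currency; an availability-first policy may start anyway.
            if prefer_availability:
                return "start package (availability prioritized over currency)"
            return "resynchronize data, or wait for manual intervention"
        return "do not start: manual intervention required"

    print(extended_campus_cluster_start(vg_activated=True))   # start package
    print(metrocluster_start("consistent_not_current"))       # resynchronize data, or wait ...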


Comparison - All DTS Solutions

The following table extends the comparison to include Extended Cluster with RAC and Continentalclusters.

The following attributes are included, as they must be considered, based on the type of disaster(s) about which the customer is concerned.

Key Benefit
• Extended Campus Cluster: Excellent in “normal” operations and partial failure. Since all hosts have access to both disks, in a failure where the node running the application is up but the disk becomes unavailable, no failover occurs. The node will access the remote disk to continue processing.
• Extended Cluster with RAC: Excellent in “normal” operations and partial failure. The active/active configuration provides maximum data throughput and reduces the need for failover (since both data centers are active, the application is already up on the 2nd site).
• Metrocluster: Two significant benefits: it provides maximum data protection (the state of the data is determined before the application is started and, if necessary, data resynchronization is performed before the application is brought up), and it offers better performance than Extended Campus Cluster for resynchronization, as replication is done by the storage subsystem (no impact to the host).
• Continentalclusters (CC): Increased data protection by supporting unlimited distance between data centers (protects against such disasters as those caused by earthquakes or violent attacks, where an entire area can be disrupted).

Key Limitation
• Extended Campus Cluster: No ability to check the state of the data before starting up the application. If the volume group (vg) can be activated, the application will be started; if mirrors are split or PV links are down, as long as the vg can be activated, the application will be started. Data resynchronization can have a big impact on system performance, as this is a host-based solution.
• Extended Cluster with RAC: The SLVM configuration is limited to 2 nodes. A CVM 3.5 configuration supports up to 4 nodes; however, the 4-node configuration is limited to a distance of 10km. Data resynchronization can have a big impact on system performance, as this is a host-based solution.
• Metrocluster: Specialized storage is required. Currently, XP with Continuous Access, EVA with Continuous Access, and EMC’s Symmetrix with SRDF are supported.
• Continentalclusters (CC): No automatic failover between clusters.

Maximum Distance¹
• Extended Campus Cluster: 100 kilometers
• Extended Cluster with RAC: 100 km (maximum 2 nodes, with either SLVM or CVM 3.5); 10 km (maximum 4 nodes, with CVM 3.5)
• Metrocluster: Shortest of the distances defined by cluster network latency (not to exceed 200ms), data replication maximum distance, and DWDM provider maximum distance²
• Continentalclusters (CC): No distance restrictions³

The following attributes are included, as they directly affect data consistency, currency, and availability, and must be considered when evaluating the customer’s RTO.

Data Replication Mechanism
• Extended Campus Cluster: Host-based, via MirrorDisk/UX or (Symantec) VERITAS VxVM. Replication can affect performance (writes are synchronous). Re-syncs can impact performance (a full re-sync is required in many scenarios that have multiple failures).⁴
• Extended Cluster with RAC: Host-based, via MirrorDisk/UX or (Symantec) VERITAS CVM 3.5. Replication can impact performance (writes are synchronous). Re-syncs can impact performance (a full re-sync is required in many scenarios that have multiple failures).⁴
• Metrocluster: Array-based, via CAXP, CAEVA, or EMC SRDF. Replication and resynchronization are performed by the storage subsystem, so the host does not experience a performance hit. Incremental re-syncs are done, minimizing the need for full re-syncs.
• Continentalclusters (CC): Customers have a choice of either selecting their own SG-supported storage and data replication mechanism, or implementing one of HP’s pre-integrated solutions (including CAXP, CAEVA, and EMC SRDF for array-based, or Oracle 8i Standby for host-based). Also, customers may choose Oracle 9i Data Guard as a host-based solution. Contributed (i.e., unsupported) integration templates for Oracle 9i Data Guard are available for download at the following location: http://haweb.cup.hp.com/ATC/

Application Failover Type
• Extended Campus Cluster: Automatic (no manual intervention required)
• Extended Cluster with RAC: Instance is already running at the 2nd site
• Metrocluster: Automatic (no manual intervention required)
• Continentalclusters (CC): Semi-automatic (user must “push the button” to initiate recovery)

Access Mode⁵
• Extended Campus Cluster: Active/standby
• Extended Cluster with RAC: Active/active
• Metrocluster: Active/standby
• Continentalclusters (CC): Active/standby

Client Transparency
• Extended Campus Cluster: Client detects the lost connection. The user must reconnect once the application is recovered at the 2nd site.
• Extended Cluster with RAC: Client may already have a standby connection to the remote site.
• Metrocluster: Client detects the lost connection. The user must reconnect once the application is recovered at the 2nd site.
• Continentalclusters (CC): The user must reconnect once the application is recovered at the 2nd site.

The following attributes are included, as they directly impact system scalability.

Maximum Cluster Size Allowed
• Extended Campus Cluster: 2 to 16 nodes (up to 4 when using dual lock disks)
• Extended Cluster with RAC: 2 nodes with SLVM or CVM 3.5, with a maximum distance of 100km; 4 nodes with CVM 3.5, with a maximum distance of 10km
• Metrocluster: 3 to 16 nodes
• Continentalclusters (CC): 1 to 16 nodes in each cluster (maximum total of 32 nodes – 16 nodes per cluster in a 2-data center configuration)

The following attributes are included, as they directly affect cost of implementation and maintenance.

Storage
• Extended Campus Cluster: Identical storage is not required (replication is host-based, with either MirrorDisk/UX or VxVM mirroring)
• Extended Cluster with RAC: Identical storage is not required (replication is host-based, with either MirrorDisk/UX or CVM 3.5 mirroring)
• Metrocluster: Identical storage is required
• Continentalclusters (CC): Identical storage is required if storage-based mirroring is used; identical storage is not required for other data replication implementations

Data Replication Link
• Extended Campus Cluster: Dark fiber
• Extended Cluster with RAC: Dark fiber
• Metrocluster: Dark fiber, FC over IP, FC over ATM
• Continentalclusters (CC): WAN or LAN; dark fiber, FC over IP, and FC over ATM are available as pre-integrated solutions

Cluster Network
• Extended Campus Cluster: Single IP subnet
• Extended Cluster with RAC: Single IP subnet
• Metrocluster: Single IP subnet
• Continentalclusters (CC): Two configurations: a single IP subnet for both clusters (LAN connection between clusters), or two IP subnets – one per cluster (WAN connection between clusters)

DTS Software/Licenses Required
• Extended Campus Cluster: SG (no other clustering software is required)
• Extended Cluster with RAC: SG + SGeRAC
• Metrocluster: SG + Metrocluster
• Continentalclusters (CC): SG + Continentalclusters + (Metrocluster CAXP, Metrocluster CAEVA, Metrocluster SRDF, or the Enterprise Cluster Master Toolkit), or a customer-selected data replication subsystem. For CC with RAC: SG + SGeRAC + Continentalclusters

¹ Data centers that are farther apart increase the likelihood that alternate nodes will be available for failover in the event of a disaster.
² Metrocluster distance is determined by the shortest of the maximum distance that guarantees a network latency of no more than 200ms, the maximum supported distance for the data replication link, or the DWDM provider’s maximum supported distance. As such, these values will vary between configurations, based on these factors.
³ Continentalclusters has no limitation on distance between the two data centers. The distance is dictated by the required rate of data replication to the remote site, the level of data currency, and the quality of networking links between the two data centers.
⁴ A full re-sync is required if a failure that caused one of the mirrors to be unavailable (such as a path failure to the remote site) is followed by a failure that causes a failover to the host at the remote site that uses the mirror that was unavailable.
⁵ Active/standby access means one node at a time is accessing the application’s resources. Active/active access means all resources are available to multiple nodes.


Section 8: Disaster Tolerant Cluster Limitations

Disaster tolerant clusters have limitations, some of which can be mitigated by good planning. Some examples of multiple points of failure that may not be covered by disaster tolerant configurations include:

• Failure of all networks among the data centers. This can be mitigated by routing each network’s cables along a physically different path.

• Loss of power in more than one site (for example, a data center plus the site housing the arbitrator nodes). This can be mitigated by making sure the sites are on different power circuits, redundant power supplies are on different circuits, and the power circuits are fed from different grids. If power outages are frequent in your area and downtime is expensive, you may want to invest in a backup generator.

• Loss of all copies of the on-line data. This can be mitigated by replicating data off-line (frequent backups). It can also be mitigated by taking snapshots of consistent data and storing them on-line; Business Copy XP and EMC Symmetrix BCV (Business Consistency Volumes) provide this functionality, with the additional benefit of quick recovery should anything happen to both copies of the on-line data.

• A rolling disaster, which is a disaster that occurs before the cluster has recovered from a non-disastrous failure. An example is a data replication link that fails and then, while it is being restored and data is being resynchronized, a disaster takes down an entire data center. Ensuring that a copy of the data is stored either off-line or on a separate disk that can quickly be restored can mitigate the effects of a rolling disaster. The trade-off is a lack of currency in the off-line copy.

Section 9: Recommendations

As previously stated, customers’ recovery time and recovery point objectives (RTO and RPO) typically drive the type of disaster tolerant solution selected. The following guidelines help determine which solution to recommend.

• When should I recommend Extended Campus Cluster or Extended Cluster for RAC? These solutions are recommended in any of the following situations:

Ø A customer needs to provide some level of protection, but wants to use storage he already owns. Since any storage supported by SG is approved for Extended Campus Cluster, this may be the best solution for this customer.

Ø A customer has a requirement to implement disaster tolerance on a very limited budget. Metrocluster would be the customer’s choice, but the cost to deploy it exceeds his budget. Extended Campus Cluster is a good recommendation – as long as the customer understands and accepts its limitations.

Ø A customer’s business is in the financial industry (such as banking), with an extraordinarily large volume of real-time transactions, and the customer needs to maximize resource usage. The customer is also concerned about natural events such as flooding. In this instance, you may recommend Extended Cluster for RAC.

• When should I recommend Metrocluster? Metrocluster is recommended for any of the following situations:


Ø A customer has one data center running SG. The shared storage is a disk array (XP, EMC, or EVA). The customer is investigating building a 2nd data center a few miles away to be used primarily for development and test. This data center can also be used as a backup for the existing data center.

Ø A customer has two data centers that are within Metrocluster distance limits. One data center is running an SG cluster. The 2nd data center is used strictly to back up the data via physical data replication (such as EMC’s SRDF). The 2nd data center is not running any (business critical) applications. In this situation, the data is protected, such that in the event of an outage at the primary data center, the data can be physically moved to a location where a cluster can be brought up and transaction processing restored. This process is manually intensive. Because of its automatic failover capability, Metrocluster shortens recovery time, offering a much better solution.

Ø A customer has three data centers running independently from each other, and realizes the vulnerability of having unprotected data at each of the data centers. HP offers a solution for three data centers. The first two data centers implement Metrocluster for automatic failover. Continentalclusters is then implemented so that the third data center backs up the first two. In this configuration, if the entire Metrocluster fails, the third data center will take over operations.

• When should I recommend Continentalclusters? Continentalclusters is recommended for any of the following situations:

Ø A customer needs disaster tolerance, but wants to decide when an application is recovered (i.e., the customer wants to be informed/consulted before bringing up an application on the 2nd data center after its main site fails).

Ø A customer has two existing data centers that cannot be disrupted, each configured as a local cluster, and is concerned about a site failure. Regardless of the distance, the customer can “add on” disaster tolerance with data replication and Continentalclusters.

Ø A customer has geographically dispersed data centers and is concerned about an outage at one of the sites. For example, a customer’s location is subject to natural disasters – such as tornadoes – that could impact an entire metropolitan area. Continentalclusters is an excellent solution, as it has no distance limitations.

Ø A customer is interested in disaster tolerance for a RAC application, is located in an area vulnerable to natural disasters that can affect an entire metropolitan area (e.g., earthquakes), and finds that an active/passive configuration meets their business needs.

Ø A customer has three data centers running independently from each other, and realizes the vulnerability of having unprotected data at each of the data centers. HP offers a solution for three data centers. The first two data centers implement Metrocluster for automatic failover. Continentalclusters is then implemented so that the third data center backs up the first two. In this configuration, if the entire Metrocluster fails, the third data center will take over operations.

As you can see, disaster tolerant solutions require a significant investment: hardware in geographically dispersed data centers, a means to continuously replicate data from the primary site to the recovery site, clustering software to monitor faults and manage the failover of applications, and IT staff in all data centers to operate the environment. With RTO and RPO defined, the customer can then decide whether implementing a disaster tolerant solution is worth the investment.
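The guidelines above can be restated as a rough decision aid. The following Python sketch is illustrative only (it is not an HP tool); the 100 km threshold and the input parameters are simplifying assumptions that restate the qualitative guidance in this section, and a real engagement would weigh RTO, RPO, budget, and distance in much more detail.

# Illustrative decision aid only; thresholds and parameter names are
# simplifying assumptions, not HP-defined rules.
def candidate_solutions(distance_km, needs_active_active_rac,
                        limited_budget, manual_recovery_acceptable):
    """Return a list of DTS solutions worth evaluating for a scenario."""
    candidates = []
    if distance_km <= 100:                      # roughly metropolitan range
        if needs_active_active_rac:
            candidates.append("Extended Cluster for RAC")
        elif limited_budget:
            candidates.append("Extended Campus Cluster")
        else:
            candidates.append("Metrocluster")
    if manual_recovery_acceptable or distance_km > 100:
        candidates.append("Continentalclusters")
    return candidates

# Example: a site a few miles away, no RAC requirement, tight budget.
print(candidate_solutions(10, False, True, False))
# -> ['Extended Campus Cluster']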


Appendix A – DTS Design Considerations

Once a customer defines his requirements and chooses to implement a disaster tolerant solution, he must make many decisions about the actual implementation. The following information is included to help with the selection of solution components.

Cluster Arbitration

To protect application data integrity, Serviceguard uses a process called arbitration to prevent more than one incarnation of a cluster from running and starting up a second instance of an application. In the Serviceguard user’s manual, this process is known as tie breaking, because it is a means to decide on a definitive cluster membership when different competing cluster nodes are independently trying to re-form a cluster. Cluster re-formation takes place when there is a change in cluster membership. In general, the algorithm for cluster re-formation requires the new cluster to achieve a cluster quorum of a strict majority (that is, more than 50%) of the nodes previously running. If both halves (exactly 50%) of a previously running cluster were allowed to re-form, there would be a split-brain situation in which two instances of the same cluster were running. Serviceguard employs a lock disk, a quorum server, or arbitrator nodes to provide definitive arbitration to prevent split-brain conditions.

Serviceguard Cluster Quorum Requirements

Strictly more than 50% of the active members from the previous cluster membership. (All cluster members are required when a cluster is initially started, unless a manual override is specified.)

All supported Serviceguard cluster arbitration schemes apply to individual clusters in Continentalclusters (including cluster lock disk, quorum server, and arbitrator nodes).
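As a simple illustration of the strict-majority rule described above, the following Python sketch (not HP code) shows why a partition holding exactly half of the previously active nodes cannot re-form the cluster on its own and needs a tie-breaker.

# Strict-majority quorum rule: a partition may re-form the cluster only if
# it holds MORE than 50% of the nodes active in the previous membership.
def has_quorum(previously_active, surviving):
    return surviving > previously_active / 2

print(has_quorum(previously_active=4, surviving=3))  # True: strict majority
print(has_quorum(previously_active=4, surviving=2))  # False: exactly 50%; a
                                                     # tie-breaker (lock disk,
                                                     # quorum server, or
                                                     # arbitrator node) is needed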

Dual cluster lock disks

Extended Campus Cluster and Extended Cluster for RAC can use cluster lock disks for cluster quorum.

In an Extended Campus Cluster where the cluster nodes are running in two separate data centers, a single cluster lock disk would be a single point of failure if the data center it resides in suffers a catastrophic failure. In this solution, there should be one lock disk in each of the two data centers, and all nodes must have access to both lock disks. In the event of a failure of one of the data centers, the nodes in the remaining data center will be able to acquire their local lock disk, allowing them to successfully re-form a new cluster. A solution that uses dual cluster lock disks is susceptible to split-brain syndrome. If the solution is properly designed, configured, and deployed, it would be very difficult for split-brain to occur (all storage links and all cluster network links must fail), but it is still possible. Dual cluster lock disks are only supported with Extended Cluster for RAC and Extended Campus Cluster in a cluster of four nodes or fewer.
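The split-brain window described above can be illustrated with a small Python sketch. This is a conceptual model only, not Serviceguard behavior code: it simply shows that when every inter-site link fails at once, each half of a four-node cluster can reach its own local lock disk and re-form, whereas a true site failure leaves only one half able to do so.

# Conceptual model of dual cluster lock disk arbitration (illustrative only).
def partition_can_reform(partition_nodes, previously_active, holds_a_lock_disk):
    # A strict majority re-forms on its own; an exact 50% partition needs a
    # lock disk to break the tie; anything smaller never re-forms.
    if partition_nodes * 2 > previously_active:
        return True
    if partition_nodes * 2 == previously_active:
        return holds_a_lock_disk
    return False

# Four-node cluster split 2/2, one lock disk per data center.
# Total inter-site failure: both halves reach their local lock -> split brain.
print(partition_can_reform(2, 4, True), partition_can_reform(2, 4, True))   # True True
# Site B destroyed: only the surviving half reaches a lock disk and re-forms.
print(partition_can_reform(2, 4, True), partition_can_reform(2, 4, False))  # True False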

Quorum Server in a third location

Extended Campus Cluster, Extended Cluster for RAC, and Metrocluster can use a quorum server for cluster quorum.

The quorum server is an alternate form of cluster lock that uses a server program running on a separate system for tie breaking, rather than a lock disk. Should equal-sized groups of nodes become separated from each other, the quorum server allows one group to achieve quorum and form the cluster, while the other group is denied quorum and cannot re-form the cluster. The quorum server process runs on a machine outside of the cluster for which it is providing quorum services. In a disaster tolerant solution, the quorum server should therefore be located in a third location, away from the two data centers. The farther the third location is from the two data centers, the higher the disaster protection the solution can provide. If the customer chooses a building within the same campus as one of the data centers to house the quorum server, the customer may be protected from a fire or power outage, but may not be protected from an earthquake or a hurricane. One advantage of the quorum server is that additional cluster nodes do not have to be configured for arbitration. Also, one quorum server can serve multiple clusters.

Since you cannot configure a redundant quorum server, an entire cluster will fail if a quorum server failure is followed by a failure that requires cluster re-formation. To reduce this exposure, make sure the quorum server is packaged in its own SG cluster, so that when a disaster occurs at one of the main data centers, the quorum server is available to provide cluster quorum to the remaining cluster nodes so they can form a new cluster. A solution using a quorum server is not susceptible to split-brain syndrome.

Quorum server software is available free of charge.

Arbitrator node(s) in a third location

Extended Campus Cluster, Extended Cluster for RAC, and Metrocluster can use arbitrator node(s) for cluster quorum.

An arbitrator node is the same as any other cluster node and is not configured in any special way in the cluster configuration file. It is used to make an even partition of the cluster impossible, or at least extremely unlikely. A single failure in a four-node cluster could result in two equal-sized partitions, but a single failure in a five-node cluster could not. The fifth node in the cluster, then, performs the job of arbitration by virtue of the fact that it makes the number of nodes in the cluster odd.

If one data center in the solution went down due to a disaster, the surviving data center would still remain connected to the arbitrator node, so the surviving group of nodes would be larger than 50% of the previously running nodes in the cluster. It could therefore obtain quorum and re-form the cluster. As in the case of the quorum server, the arbitrator node should be located at a site separate from the two data centers to provide the appropriate degree of disaster tolerance. The farther the site is from the two data centers, the higher the disaster protection the solution can provide. A properly designed cluster solution with two data centers and a third site using arbitrator node(s) will always be able to achieve cluster quorum after a site failure, because a cluster quorum of a strict majority (that is, more than 50%) of the nodes previously running will always be available to form a new cluster.

It is recommended that two arbitrator nodes be configured in a site separate from either of the data centers to eliminate the single arbitrator node being an SPOF of the solution. The arbitrator nodes can be used to run an application that doesn’t need disaster tolerant protection. The arbitrator nodes can be configured to share some common local disk storage. A Serviceguard package can be configured to provide local fail over of the application between the two arbitrator nodes.
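The arithmetic behind this recommendation is simple; a minimal sketch (continuing the quorum illustration above, and again not HP code) follows.

# Two data centers with two nodes each plus one arbitrator node at a third
# site: losing an entire data center still leaves a strict majority.
previously_active_one_arbitrator = 2 + 2 + 1
surviving = 2 + 1
print(surviving > previously_active_one_arbitrator / 2)   # True -> quorum held

# With the recommended two arbitrator nodes at the third site:
previously_active_two_arbitrators = 2 + 2 + 2
surviving = 2 + 2
print(surviving > previously_active_two_arbitrators / 2)  # True -> quorum held,
                                                          # and no single-node SPOF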

Recommended Arbitration Method

For a single-cluster disaster tolerant solution, it is recommended to select the cluster arbitration method in the following order:

• Two arbitrator nodes in a site separate from either of the two data centers provide the highest protection at the highest cost.

• One arbitrator node or a quorum server in a site separate from the data centers is medium cost, but the single node itself can potentially become an SPOF of the solution.

• Dual cluster lock disks are lowest cost, but are susceptible to a slight chance of split-brain syndrome, and are only supported with Extended Campus Cluster and Extended Cluster for RAC.

Protecting Data through Replication

Different data replication methods have different advantages with respect to data consistency and currency. Your requirements will dictate your choice of data replication method.

Off-line Data Replication

Off-line data replication is the method most commonly used today. The data is stored on tape and kept in a vault at a remote location, away from the primary data center. If a disaster occurs at the primary data center, the off-line copy of the data is used and a remote site functions in place of the failed site. Because data is replicated using physical off-line backup, data consistency is fairly high, barring human error or an untested corrupt backup. However, data currency is compromised by the amount of time that elapses between backups.

Off-line data replication is fine for many applications for which recovery time is not critical to the business. Although data might be replicated weekly or even daily, recovery could take from a day to a week depending on the volume of data. Some applications, depending on the role they play in the business, may need to have a faster recovery time, within hours or even minutes. For these applications, off-line data replication would not be appropriate.

On-line Data Replication

On-line data replication is a method of copying data from one site to another across a link. It is used when a very short recovery time, from minutes to hours, is required. To be able to recover use of an application in a short time, the data at the alternate site must be replicated in real time on all disks.

Data can be replicated either synchronously or asynchronously. Synchronous replication requires one disk write to be completed and replicated before another disk write can begin. This method guarantees data consistency and currency during replication. However, as the distance between data centers increases, it greatly reduces data replication capacity and application performance, as well as system response time. Asynchronous replication does not require the primary site to wait for one disk write to be replicated before beginning another. This can be an issue for data currency, depending on the volume of transactions: an application with a very large volume of transactions can get hours behind in replication using asynchronous replication. If the application fails over to the remote site, it would start up with data that is not current, which may not be acceptable. Whereas data consistency and currency are inherent traits of synchronous replication, in asynchronous replication guaranteed write ordering must be provided to ensure data consistency, and the level of data currency is based on customer requirements and the cost the customer is willing to pay. Note that not all asynchronous data replication facilities guarantee write ordering.
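The following Python sketch is a purely conceptual model (not a real replication driver) of the difference described above: a synchronous write is not acknowledged until the remote copy has it, while an asynchronous write completes locally and is queued for later shipment, so queued writes are the data lost if the primary site fails.

# Conceptual model of synchronous vs. asynchronous replication (illustrative).
primary, replica, async_queue = {}, {}, []

def write_sync(block, data):
    # The remote copy is updated before the write completes locally,
    # so the replica is always current (at a latency/throughput cost).
    replica[block] = data
    primary[block] = data

def write_async(block, data):
    # The write completes locally at once; shipment to the replica happens later.
    primary[block] = data
    async_queue.append((block, data))

write_sync("blk0", "txn-0")    # replica now current for blk0
write_async("blk1", "txn-1")
write_async("blk2", "txn-2")
# If the primary site is lost now, the queued writes never reach the replica:
print(f"{len(async_queue)} async writes would be lost; "
      f"replica currently holds {len(replica)} of {len(primary)} blocks")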

Currently the two ways of replicating data on-line are physical data replication and logical data replication. Either of these can be configured to use synchronous or asynchronous writes.

Physical Data Replication

Each physical write to disk is replicated on another disk at another site. Because the replication is a physical write to disk, it is not application dependent. This allows each node to run different applications under normal circumstances. Then, if a disaster occurs, an alternate node can take ownership of applications and data, provided the replicated data is current and consistent.

Physical data replication can be done in software or hardware. MirrorDisk/UX is an example of physical replication done in software; a disk I/O is written to each storage device connected to the node, requiring the node to issue multiple disk I/Os. Continuous Access XP on the HP StorageWorks Disk Array XP series is an example of physical replication done in hardware; a single disk I/O is replicated across the Continuous Access link to a second XP disk array.

Replication Mode

Currently, there are three hardware physical data replication products integrated and supported with HP-UX Disaster Tolerant Solutions: CAXP, CAEVA, and EMC SRDF. Both CAXP and EMC SRDF are supported in both synchronous and asynchronous modes.

Advantages of physical replication in hardware are:

• There is little or no lag time writing to the replica, which means that data remains very current.

• Replication consumes no additional CPU.

• The hardware deals with resynchronization if the link or disk fails. Moreover, resynchronization is independent of CPU failure; if the CPU fails and the disk remains up, the disk knows it does not have to be resynchronized.

• Data can be copied in both directions, so that if the primary fails and the replica takes over, data can be copied back to the primary when it comes back up.

• Data recovery is easier and faster because the data is already available on the remote storage; there is no need to restore from tape.

Disadvantages of physical replication in hardware are:

• The logical order of data writes is not maintained during resync recovery after a link failure and restoration. When a replication link goes down and transactions continue at the primary site, writes to the primary disk are queued in a bit-map. When the link is restored, if there has been more than one write to the primary disk, there is no way to determine the original order of transactions (see the sketch after this list). This increases the risk of data inconsistency in the replica during resynchronization.

• Because the replicated data is a write to a physical disk block, database corruption and human errors, such as the accidental removal of a database table, are replicated at the remote site.

• Redundant disk hardware and cabling are required. This at least doubles data storage costs. Also, because the technology is in the disk itself, this solution requires specialized hardware.

• For architectures using dedicated cables, the distance between the sites is limited by the cable interconnect technology. Different technologies support different distances and provide different data throughput performance.

• For architectures using common carriers, the costs can vary dramatically, and the reliability of the connection can vary, depending on the Service Level Agreement.
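The write-ordering disadvantage noted in the first bullet above can be illustrated with a short sketch (conceptual only): a resynchronization bit-map records which blocks changed while the link was down, but not the order or the intermediate values of those changes.

# Writes that occurred at the primary while the replication link was down,
# in their original order (block, value).
writes_while_link_down = [("blk7", "txn-1"), ("blk3", "txn-2"),
                          ("blk7", "txn-3"), ("blk9", "txn-4")]

# A resync bit-map collapses this history to an unordered set of dirty blocks;
# the original transaction order and blk7's intermediate state are lost.
dirty_blocks = {block for block, _ in writes_while_link_down}
print(sorted(dirty_blocks))   # ['blk3', 'blk7', 'blk9']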

Advantages of physical replication in software are:

• There is little or no time lag between the initial and replicated disk I/O, so data remains very current.

• The solution is independent of disk technology, so you can use any supported disk technology.


• Data copies are peers, so there is no issue with reconfiguring a replica to function as a primary disk after failover.

• Because there are multiple read devices, that is, the node has access to both copies of data, there may be improvements in read performance.

• Writes are synchronous unless the link or disk is down.

Disadvantages of physical replication in software are:

• As with physical replication in hardware, the logical order of data writes is not maintained. When the link is restored, if there has been more than one write to the primary disk, there is no way to determine the original order of transactions.

• Distance between sites is limited by the physical disk link capabilities.

• Performance is affected by many factors: CPU overhead for mirroring, double I/O writes, degraded write performance, and CPU time for resynchronization. In addition, a CPU failure may cause a resynchronization even if it is not needed, further affecting system performance.

Logical Data Replication

Logical data replication is a method of replicating data by repeating the sequence of transactions at the remote site. Logical replication often must be done at both the file system level and the database level in order to replicate all of the data associated with an application. Most database vendors have one or more database replication products; an example is Oracle Standby Database. Logical replication can be configured to use synchronous or asynchronous writes. Transaction processing monitors (TPMs) can also perform logical replication.
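A generic sketch of the idea (illustrative only; this is not Oracle Standby, Data Guard, or any HP-supplied integration) is shown below: transactions, rather than disk blocks, are logged at the primary site and replayed in order at the remote site.

# Conceptual model of logical (transaction-level) replication.
primary_db = {"alice": 100, "bob": 50}
replica_db = dict(primary_db)
shipping_log = []

def execute_on_primary(account, delta):
    primary_db[account] = primary_db.get(account, 0) + delta
    shipping_log.append((account, delta))     # logged for shipment to the replica

def replay_at_remote_site():
    while shipping_log:
        account, delta = shipping_log.pop(0)  # applied in original order
        replica_db[account] = replica_db.get(account, 0) + delta

execute_on_primary("alice", -25)
execute_on_primary("bob", +25)
replay_at_remote_site()
print(replica_db == primary_db)   # True once all shipped transactions are replayed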

For logical data replication, the Continentalclusters product currently has a fully integrated and supported solution with Oracle 8i Standby Database; the integration script is available in the Enterprise Cluster Master Toolkit. Contributed integration templates for Continentalclusters with Oracle 9i Data Guard are available for download from http://haweb.cup.hp.com/ATC/. While the integration templates for Oracle 9i Data Guard have been tested with Continentalclusters by ACSL, the scripts are provided at no charge, with no support from HP.

Advantages of using logical replication are:

• The distance between nodes is limited only by the networking technology.

• No additional hardware is needed to do logical replication, unless you choose to boost CPU power and network bandwidth.

• Logical replication can be implemented to reduce the risk of duplicating human error. For example, if a database administrator erroneously removes a table from the database, a physical replication method will duplicate that error at the remote site as a raw write to disk. A logical replication method can be implemented to replicate only database transactions, not database commands, so such errors would not be replicated at the remote site. This also means that administrative tasks, such as adding or removing database tables, have to be repeated at each site.

• With database replication you can roll transactions forward or backward to achieve the level of currency desired on the replica, although this functionality is not available with file system replication.

Disadvantages of logical replication are:

• It uses significant CPU overhead, because transactions are often replicated more than once and logged to ensure data consistency, and all but the simplest database transactions take significant CPU. It also uses network bandwidth, whereas most physical replication methods use a separate data replication link. As a result, there may be a significant lag in replicating transactions at the remote site, which affects data currency.

• When a site disaster occurs, logical records or logs that are being prepared for shipment or are in the process of being transferred to the recovery site will be lost. The amount of data loss can be significant, depending on the number of transactions contained within the logical records or logs (for example, an Oracle archive log can potentially contain hundreds of database transactions). Data loss can be minimized by reducing the number of transactions contained within each data transfer and increasing the frequency of the transfers, which also improves data currency.

• If the primary database fails and is corrupt, and the replica takes over, the process for restoring the primary database so that it can be used as the replica is complex. It often involves recreating the database and doing a database dump from the replica.

• Logic errors in applications or in the RDBMS code itself that cause database corruption will be replicated to remote sites. This is also an issue with physical replication. However, Oracle Standby can be configured so that replicated logs are not applied immediately to the standby database, providing a window for DBA intervention.

• Most logical replication methods do not support personality swapping, which is the ability after a failure to allow the secondary site to become the primary and the original primary to become the new secondary site. This capability can provide increased up time.

Recommended Data Replication

The recommended disaster tolerant architecture, if budgets allow, is the following combination:

• For performance and data currency: physical data replication.

• For data consistency: either create a second physical replica at the remote site as a point-in-time snapshot using BC or BCV, or use logical data replication, which would be used only in cases where the primary physical replica is corrupt.

Using Alternative Power Sources

In a high-availability cluster, redundancy is applied to cluster components, such as PV links, redundant network cards, power supplies, and disks. In disaster tolerant architectures, another level of protection is required for these redundancies. Each data center that houses part of a disaster tolerant cluster should be supplied with power from a different circuit. In addition to a standard UPS (uninterruptible power supply), each node in a disaster tolerant cluster should be on a separate power circuit.

Housing remote nodes in another building often implies they are powered by a different circuit, so it is especially important to make sure all nodes are powered from a different source if the disaster tolerant cluster is located in two data centers in the same building. Some disaster tolerant designs go as far as ensuring their redundant power source is supplied by a different power substation on the grid, and the power circuits are fed from different grids. This adds protection against large-scale power failures, such as brownouts, sabotage, or electrical storms.

Creating Highly Available Networking

The two critical elements in a disaster tolerant solution are the cluster communication link and the data replication link or host-to-storage connections.

Standard high-availability guidelines require redundant networks. Redundant networks may be highly available, but they are not disaster tolerant if a single accident can interrupt both network connections. For example, if you use the same trench to lay cables for both networks, you do not have a disaster tolerant architecture, because a single accident, such as a backhoe digging in the wrong place, can sever both cables at once. This may lead to split-brain syndrome in an Extended Campus Cluster using dual cluster lock disks. In a disaster tolerant architecture, the reliability of the network is paramount. To reduce the likelihood of a single accident causing both networks to fail, redundant network cables should be installed along physically different routes for each network.

In addition to redundant lines, you also need to consider what bandwidth is needed to support the data replication method you have chosen. Bandwidth affects the rate of data replication, and therefore the currency of the data at the remote site. For Extended Campus Cluster, Extended Cluster for RAC, and Metrocluster, the networking link for cluster communication should have no more than 200 milliseconds of latency.
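A back-of-the-envelope calculation (with made-up numbers) shows why bandwidth matters: if the application changes data faster than the replication link can carry it, the remote copy falls steadily behind, and data currency suffers.

# Illustrative numbers only -- substitute measured values for a real design.
change_rate_mb_per_s = 12      # rate at which the application writes changed data
link_bandwidth_mb_per_s = 10   # usable bandwidth of the replication link

backlog_growth_mb_per_s = change_rate_mb_per_s - link_bandwidth_mb_per_s
if backlog_growth_mb_per_s > 0:
    print(f"Remote copy falls {backlog_growth_mb_per_s} MB further behind per second")
else:
    print("The link can keep the remote copy current")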

The reliability of the data replication link affects whether or not data replication happens, and therefore the consistency of the data at the remote site. Dark fiber is more reliable but more costly than leased lines.

Cost influences both bandwidth and reliability. It is best to address data consistency issues first by installing redundant lines, then weigh the price of data currency and select the line speed accordingly.

Managing a Disaster Tolerant Environment

In addition to the changes in hardware and software needed to create a disaster tolerant architecture, there are also changes in the way you manage the environment. Configuration of a disaster tolerant architecture needs to be carefully planned, implemented, and maintained. There are additional resources needed, and additional decisions to make, concerning the maintenance of a disaster tolerant architecture.

• Manage it in-house, or hire a service?

Hiring a service can remove the burden of maintaining the capital equipment needed to recover from a disaster. Most disaster recovery services provide their own off-site equipment, which reduces maintenance costs. Often the disaster recovery site and equipment are shared by many companies, further reducing cost. Managing disaster recovery in-house gives complete control over the type of redundant equipment used and the methods used to recover from a disaster.

• Implement automated or manual recovery?

Manual recovery costs less to implement and gives more flexibility in making decisions while recovering from a disaster. Evaluating the data and making decisions can add to recovery time, but this is justified in some situations, for example if applications compete for resources following a disaster and one of them has to be halted. Automated recovery reduces the amount of time, and in most cases eliminates the human intervention, needed to recover from a disaster. You may want to automate recovery for any number of reasons:

– Automated recovery is usually faster.
– Staff may not be available for manual recovery, as is the case with “lights-out” data centers.
– Reduction in human intervention is also a reduction in human error. Disasters do not happen often, so lack of practice and the stressfulness of the situation may increase the potential for human error.
– Automated recovery procedures and processes can be transparent to the clients.


Even if recovery is automated, you may choose to, or need to, recover from some types of disasters manually. A rolling disaster, which is a disaster that happens before the cluster has recovered from a previous failure, is an example of when you may want to switch over manually. If the data replication link failed and, as it was coming back up and resynchronizing data, a data center failed, you would want human intervention to judge whether the remote site has consistent data before failing over.

• Who manages the environment and how are they trained?

Putting a disaster tolerant architecture in place without planning for the people aspects is a waste of money. Training and documentation are more complex because the cluster is in multiple data centers.

Each data center often has its own operations staff with their own processes and ways of working. These operations people will now be required to communicate with each other and coordinate maintenance, failover rehearsals, change control, and IT processes, as well as work together to recover from an actual disaster. If the remote nodes are placed in a “lights-out” data center, the operations staff may want to put additional processes or monitoring software in place to maintain the nodes in the remote location. Rehearsals of failover scenarios are important to keep everyone prepared. Changes made to the production environment (such as OS and/or application upgrades) must also be tested at the recovery site, to ensure applications fail over correctly in the event of a disaster. A written plan should outline what to do in case of disaster, with a minimum recommended rehearsal schedule of once every 6 months, and ideally once every 3 months.

• How is the environment maintained?

Planned downtime and maintenance, such as backups or upgrades, must be more carefully thought out because they may leave the cluster vulnerable to another failure. For example, when doing system maintenance in a Serviceguard cluster, nodes need to be brought down for maintenance in pairs: one node at each site, so that quorum calculations do not prevent automated recovery if a disaster occurs during planned maintenance. Rapid detection of failures and rapid repair of hardware is essential so that the cluster is not vulnerable to additional failures. Testing is more complex and requires personnel in each of the data centers. Site failure testing should be added to the current cluster testing plans.


For more information

• Product User’s Guides and Release Notes, found at http://docs.hp.com/en/ha.html
• Current Unix Server Configuration Guide (by chapter), found under “Ordering/Configuration Guides” at http://source.hp.com/portal/site/source/
• DTS Whitepapers and Customer Presentations, found with the search key “DTS” at http://source.hp.com/portal/site/source/
• HA ATC links, found at http://haweb.cup.hp.com/ATC/
• Clusters for High Availability, Second Edition, Peter S. Weygant
• DWDM: A White Paper, Joseph Algieri and Xavier Dahan
• Evaluation of Data Replication Solutions, Bob Baird
• Extended SAN: A Performance Study, Xavier Dahan
• Extended MC/Serviceguard Cluster Configurations (Campus Cluster), Joseph Algieri and Xavier Dahan
• High Availability Technical Documentation, at http://docs.hp.com/hpux/ha/index.html
• HP Extended Cluster for RAC – 100 Kilometer Separation Becomes a Reality

© 2003 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

Itanium is a trademark or registered trademark of Intel Corporation in the U.S. and other countries and is used under license.

XXXX-XXXXEN, 03/2006