
Understanding high availability with WebSphere MQ

Mark Hiscock, Software Engineer, IBM Hursley Park Lab, United Kingdom
Simon Gormley, Software Engineer, IBM Hursley Park Lab, United Kingdom

May 11, 2005

© Copyright International Business Machines Corporation 2005. All rights reserved.

This whitepaper explains how you can easily configure and achieve high availability using IBM's enterprise messaging product, WebSphere MQ V5.3 and later. This paper is intended for:

o Systems architects who make design and purchase decisions for the IT infrastructure and may need to broaden their designs to incorporate HA.

o System administrators who wish to implement and configure HA for their WebSphere MQ environment.

Table of Contents

1. Introduction
2. High availability
3. Implementing high availability with WebSphere MQ
3.1. General WebSphere MQ recovery techniques
3.2. Standby machine - shared disks
3.2.1. HA clustering software
3.2.2. When to use standby machine - shared disks
3.2.3. When not to use standby machine - shared disks
3.2.4. HA clustering active-standby configuration
3.2.5. HA clustering active-active configuration
3.2.6. HA clustering benefits
3.3. z/OS high availability options
3.3.1. Shared queues (z/OS only)
3.4. WebSphere MQ queue manager clusters
3.4.1. Extending the standby machine - shared disk approach
3.4.2. When to use HA WebSphere MQ queue manager clusters
3.4.3. When not to use HA WebSphere MQ queue manager clusters
3.4.4. Considerations for implementation of HA WebSphere MQ queue manager clusters
3.5. HA capable client applications
3.5.1. When to use HA capable client applications
3.5.2. When not to use HA capable client applications
4. Considerations for WebSphere MQ restart performance
4.1. Long running transactions
4.2. Persistent message use
4.3. Automation
4.4. File systems
5. Comparison of generic versus specific failover technology
6. Conclusion
Appendix A – Available SupportPacs
Resources
About the authors


1. Introduction

With an ever-increasing dependence on IT infrastructure to perform critical business processes, the availability of this infrastructure is becoming more important. The failure of an IT infrastructure results in large financial losses, which increase with the length of the outage [5]. The solution to this problem is careful planning to ensure that the IT system is resilient to any hardware, software, local, or system-wide failure. This capability is termed "resilience computing", which addresses the following topics:

o High availability
o Fault tolerance
o Disaster recovery
o Scalability
o Reliability
o Workload balancing and stress

This whitepaper addresses the most fundamental concept of resilience computing, high availability (HA). That is, “An application environment is highly available if it possesses the ability to recover automatically within a prescribed minimal outage window” [7]. Therefore, an IT infrastructure that recovers from a software or hardware failure, and continues to process existing and new requests, is highly available.


2. High availability

The HA nature of an IT system is its ability to withstand software or hardware failures so that it is available as much of the time as possible. Ideally, despite any failure that may occur, this would be 100% of the time. However, there are factors, both planned and unplanned, that prevent this from being a reality for most production IT infrastructures. These factors lead to periods of unavailability, so availability is normally measured as the percentage of the year for which the system was available. For example:

Figure 1. Number of 9's availability per year

Availability (%)    Downtime per year
99                  3.65 days
99.9                8.76 hours
99.99               52.6 minutes
99.999              5.26 minutes
99.9999             31.5 seconds
Figure 1 shows that an outage of roughly half a minute per year is known as "six 9's availability", because the system is available for 99.9999% of the year. Factors that cause a system outage and reduce the number of 9's of uptime fall into two categories: planned and unplanned. Planned disruptions are either systems management (upgrading software or applying patches) or data management (backup, retrieval, or reorganization of data). Conversely, unplanned disruptions are system failures (hardware or software failures) or data failures (data loss or corruption).

Maximizing the availability of an IT system therefore means minimizing the impact of these failures on the system. The primary method is the removal of any single point of failure (SPOF), so that should a component fail, a redundant or backup component is ready to take over. Also, to ensure enterprise messaging solutions are made highly available, the software's state and data must be preserved in the event of a failure and made available again as soon as possible. The preservation and restoration of this data removes it as a single point of failure in the system.

Some messaging solutions remove single points of failure, and make software state and data available, by using replication technologies. These may be in the form of asynchronous or synchronous replication of data between instances of the software in a network. However, these approaches are not ideal: asynchronous replication can cause duplicated or lost data, and synchronous replication incurs a significant performance cost because data is being backed up in real time. It is for these reasons that WebSphere MQ does not use replication technologies to achieve high availability. The next section describes methods for making a WebSphere MQ queue manager highly available. Each method describes a technique for HA and when you should and should not consider it as a solution.


3. Implementing high availability with WebSphere MQ

This section discusses the various methods of implementing high availability in WebSphere MQ, with examples of when you should and should not use each technique.

• “Standby machine – shared disks” and “z/OS high availability options” describe HA techniques for distributed and z/OS queue managers, respectively.

• “WebSphere MQ queue manager clusters” describes a technique available to queue managers on all platforms.

• “HA capable client applications” describes a client-side technique applicable on all platforms.

By reading each section, you can select the best HA methodology for your scenario. This paper uses the following terminology:

• Machine – A computer running an operating system.

• Queue manager – A WebSphere MQ queue manager that contains queue and log data.

• Server – A machine that runs a queue manager and other third-party services.

• Private message queues – Queues owned by a particular queue manager and accessible, via WebSphere MQ applications, only when the owning queue manager is running. These are to be contrasted with shared message queues (explained below), which are a particular type of queue available only on z/OS.

• Shared message queues – Queues that reside in a Coupling Facility and are accessible by a number of queue managers that are part of a Queue Sharing Group. These are only available on z/OS and are discussed later.

3.1. General WebSphere MQ recovery techniques

On all platforms, WebSphere MQ uses the same general techniques for recovering private message queues after a failure of a queue manager. With the exception of shared message queues (see "Shared queues"), messages are cached in memory and backed by disk storage if the volume of message data exceeds the available memory cache. When persistent messaging is used, WebSphere MQ logs messages to disk storage. Therefore, in the event of a failure, the combination of the message data on disk plus the queue manager logs can be used to reconstruct the message queues. This restores the queue manager to a consistent state at the point just before the failure occurred. This recovery involves completing normal Unit of Work resolution, with in-flight messages being rolled back, in-commit messages being completed, and in-doubt messages waiting for coordinator resolution. The following sections describe how this general restart process is used in conjunction with platform-specific facilities, such as HACMP on AIX or ARM on z/OS, to quickly restore message availability after failures.
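To make the unit-of-work behaviour concrete, the sketch below puts a persistent message inside a unit of work and then commits it, using pymqi, a Python binding for the WebSphere MQ MQI. This is an illustration added here, not part of the original paper; the queue manager, channel, connection, and queue names are assumptions. Until the commit, the message is in flight and would be rolled back by the recovery described above.

# Hedged sketch: a persistent message put under syncpoint. Names and connection
# details are illustrative assumptions; the equivalent MQI calls exist in C,
# COBOL, Java, and the other supported languages.
import pymqi

qmgr = pymqi.connect('QM1', 'APP.SVRCONN', 'host1(1414)')

md = pymqi.MD()
md.Persistence = pymqi.CMQC.MQPER_PERSISTENT        # logged to disk, survives restart

pmo = pymqi.PMO()
pmo.Options = pymqi.CMQC.MQPMO_SYNCPOINT            # the put is part of a unit of work

queue = pymqi.Queue(qmgr, 'ORDERS.REQUEST')
queue.put(b'order 12345', md, pmo)                  # in flight until committed

qmgr.commit()                                       # from here, restart recovery preserves the message

queue.close()
qmgr.disconnect()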


WebSphere MQ also provides a mechanism for improving the availability of new messages by routing messages around a failed queue manager, transparently to the application producing the messages. This is called WebSphere MQ clustering and is covered in "WebSphere MQ queue manager clusters". Finally, on z/OS, WebSphere MQ supports shared message queues that are accessible to a number of queue managers. Failure of one queue manager still allows the messages to be accessed by the other queue managers. These are covered in "z/OS high availability options".

3.2. Standby machine - shared disks

As described above, when a queue manager fails, a restart is required to make its private message queues available again. Until then, the messages stored on the queue manager are "stranded": you cannot access them until the machine and queue manager are returned to normal operation. To avoid this stranded-messages problem, stored messages need to be made accessible even if the hosting queue manager or machine is inoperable. In the standby machine solution, a second machine is used to host the queue manager when the original machine or queue manager fails. The standby machine needs to be an exact replica, at any given point in time, of the master machine, so that when a failure occurs, the standby machine can start the queue manager correctly. That is, the WebSphere MQ code on the standby machine should be at the same level, and the standby machine should have the same security privileges as the primary machine. A common method for implementing the standby machine approach is to store the queue manager data files and logs on an external disk system that is accessible to both the master and standby machines. WebSphere MQ writes its data synchronously to disk, which means a shared disk will always contain the most recent data for the queue manager. Therefore, if the primary machine fails, the secondary machine can start the queue manager and resume from its last known good state.
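As an illustration of that layout (added here, not part of the original paper), the sketch below creates a queue manager whose data and log directories are placed on a shared disk mount. The mount point and queue manager name are assumptions; the SupportPacs listed in Appendix A supply the complete, supported configuration scripts.

# Hedged sketch: create a queue manager with its data and logs on a shared disk,
# so a standby machine that mounts the same disk can later start it.
import subprocess

QMGR = 'QM1'
SHARED = '/MQHA/QM1'                    # assumed mount point of the shared disk

subprocess.run(
    ['crtmqm',
     '-md', f'{SHARED}/data',           # queue manager data directory on the shared disk
     '-ld', f'{SHARED}/log',            # queue manager log directory on the shared disk
     QMGR],
    check=True)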


Figure 2. An active-standby setup: the standby machine is ready to read the queue manager data and logs from the shared disk and to assume the IP address of the primary machine [3].

A shared external disk device is used to provide a resilient store for queue data and queue manager logs so that replication of messages is avoided. This preserves the once-and-once-only delivery characteristic of persistent messages. If the data were replicated to a different system, the messages stored on the queues would be duplicated on the other system, and once-and-once-only delivery could not be guaranteed. For instance, if data were replicated to a standby server, and the connection between the two servers failed, the standby would assume that the master had failed, take over the master server's role, and start processing messages. However, as the master is still operational, messages are processed twice, hence duplicated messages occur. This is avoided when using a shared disk because the data exists in only one physical location and concurrent access is not allowed.

The external disk used to store queue manager data should also be RAID enabled (using a RAID configuration that protects against data loss, such as mirroring) to prevent it from being a single point of failure (SPOF) [8]. The disk device may also have multiple disk controllers and multiple physical connections to each of the machines, to provide redundant access channels to the data. In normal operation, the shared disk is mounted by the master machine, which uses the storage to run the queue manager in the same way as if it were a local disk, storing both the queues and the WebSphere MQ log files on it. The standby machine cannot mount the shared disk and therefore cannot start the queue manager, because the queue manager data is not accessible. When a failure is detected, the standby machine automatically takes on the master machine's role and, as part of that process, mounts the shared disk and starts the queue manager. The standby queue manager replays the logs stored on the shared disk to return the queue manager to the correct state, and resumes normal operations. Note that messages on queues that are failed over to another queue manager retain their order on the queue. This failover operation can also be performed without the intervention of a server administrator. It does require external software, known as "HA clustering" software, to detect the failure and initiate the failover process.

Only one machine has access to the "shared" disk partition at a time (a more accurate name would be "switchable" disks), and only one instance of the queue manager runs at any one time, to protect the integrity of message data. The objective of the shared disk is to move the storage of important data (for example, queue data and queue manager logs) to a location external to the machine, so that when the master machine fails, another machine may use the data.

3.2.1. HA clustering software

Much of the functionality in the standby machine configuration is provided by external software, often termed HA clustering software [4]. This software addresses high availability issues using a more holistic approach than single applications, such as WebSphere MQ, can provide. It also recognizes that a business application may consist of many software packages and other resources, all of which need to be highly available. A further complication is introduced when a solution consists of several applications that depend on each other. For example, an application may need access to both WebSphere MQ and a database, and may need to run on the same physical machine as these services. HA clustering provides the concept of "resource groups", where applications are grouped together. When a failure occurs in one of the applications in the group, the entire group is moved to a standby server, satisfying the dependencies of the applications. However, this only occurs if the HA clustering software fails to restart the application on its current machine. It is also possible to move the network address and any other operating system resources with the group so that the failover is transparent to the client. If an individual software package were responsible for its own availability, it might not be able to transfer to another physical machine and would not be able to move any other resources on which it depends. By using HA clustering to cope with these low-level considerations, such as network address takeover, disk access, and application dependencies, the higher level applications are relieved of this complexity.

Although there are several vendors providing HA clustering, each package tends to follow the same basic principles and provide a similar set of basic functionality. Some solutions, such as Veritas Cluster Server and SteelEye LifeKeeper, are also available on multiple platforms to provide a similar solution in heterogeneous environments. In the same way that WebSphere MQ removed the complexity of application connectivity from the programmer, HA clustering techniques help provide a simple, generic solution for HA. This means applications, such as messaging and data management, can focus on their core competencies, leaving HA clustering to provide a more reliable availability solution than "resource-specific" monitors. HA clustering also covers both hardware and software resources, and is a proven, recognized technology used in many other HA situations. HA clustering products are designed to be scalable and extensible to cope with changing requirements. IBM's AIX HACMP product, SteelEye LifeKeeper, and Veritas Cluster Server scale up to 32 servers. HACMP, LifeKeeper, and Cluster Server have extensions available to allow replication of disks to a remote site for disaster recovery purposes.

3.2.2. When to use standby machine - shared disks

The standby machine solution is ideal for messages that must be delivered once and only once. For example, in billing and ordering systems, it is essential that messages are not duplicated so that customers are not billed twice, or sent two shipments instead of one. As HA clustering software is a separate product that sits alongside existing applications, this methodology is also suited to converting an existing server, or set of servers, to be highly available, and the conversion can be done gradually. In large installations where there are many servers, HA clustering is a cost-effective choice through the use of an n+1 configuration. In this approach, a single machine is used as a backup for a number of live servers. Hardware redundancy is reduced, and therefore cost is reduced, as only one extra machine is required to provide high availability to a number of active servers. As already shown, HA clustering software is capable of converting an existing application and its dependent resources to be highly available. It is therefore suited to situations where there are several applications or services that need to be made highly available. If those applications are dependent on each other, and rely on operating system resources, such as network addresses, to function correctly, HA clustering is ideally suited.

3.2.3. When not to use standby machine - shared disks

HA clustering is not always necessary when considering an HA solution. Although the examples given below would be served by an HA clustering method, other solutions would serve just as well, and it would be possible to adopt HA clustering at a later date if required. If the trapped messages problem does not apply, that is, there is no need to restart a failed queue manager with its messages intact, then shared disks are not necessary. This is the case if the system is only used for event messages that are re-transmitted regularly, for messages that expire in a relatively short time, or for non-persistent messages (where an application is not relying on WebSphere MQ for assured delivery). For these situations, you can make a system highly available by using WebSphere MQ queue manager clustering only. This technology load balances messages and routes around failed servers. See "WebSphere MQ queue manager clusters" for more information on queue manager clusters.


In situations where it is not important to process messages as soon as possible, HA clustering may provide too much availability at too much expense. For example, if trapped messages can wait until an administrator restarts the machine, and hence the queue manager (using an internal RAID disk to protect the queue manager data), then HA clustering is too comprehensive a solution. In this situation, it is possible to allow access for new messages using WebSphere MQ queue manager clustering, as in the case above. The shared disk solution requires the machines to be physically close to each other, as the distance from the shared disk device needs to be small. This makes it unsuitable for use in a disaster recovery solution. However, some HA clustering software can provide disaster recovery functionality. For example, IBM's HACMP package has an extension called HAGEO, which provides data replication to remote sites. By backing up data in this fashion, it is possible to retrieve it if a site-wide failure occurs. However, the off-site data may not be the most up to date, because the replication is often delayed by a few minutes. This is because instantaneous replication of data to an off-site location incurs a significant performance hit. Therefore, the more important the data, the smaller the time interval will be, but the greater the performance impact. Time and performance must be traded against each other when implementing a disaster recovery solution. Such solutions do not provide all of the benefits of the shared disk solution and are beyond the scope of this document. The following sections describe two possible configurations for HA clustering, termed active-standby and active-active configurations.

3.2.4. HA clustering active-standby configuration

In a generic HA clustering solution, when two machines are used in an active-standby configuration, one machine runs the applications in a resource group and the other is idle. In addition to network connections to the LAN, the machines also have a private connection to each other, either in the form of a serial link or a private Ethernet link. The private link provides a redundant connection between the machines for the purpose of detecting a complete failure. As previously mentioned, if a link between the machines fails, then both machines may try to become active. Therefore, the redundant link reduces the risk of communication failure between the two. The machines may also have two external links to the LAN. Again, this reduces the risk of external connectivity failure, but also allows the machines to have their own network address. One of the adapters is used for the "service" network address, that is, the network address that clients use to connect to the service, and the other adapter has a network address associated with the physical machine. The service address is moved between the machines upon failure to make the failover transparent to any clients.

The standby machine monitors the master machine via the use of heartbeats. These are periodic checks by the standby machine to ensure that the master machine is still responding to requests. The master machine also monitors its disks and the processes running on it to ensure that no hardware failure has occurred. For each service running on the machine, a custom utility is required to inform the HA clustering software that it is still running. In the case of WebSphere MQ, the SupportPacs describing HA configurations provide utilities to check the operation of queue managers, which can easily be adapted for other HA systems. Details of these SupportPacs are listed in Appendix A.

A small amount of configuration is required for each resource group to describe what should happen at start-up and shutdown, although in most cases this is simple. In the case of WebSphere MQ, this could be a start-up script containing commands to start the queue manager (for example, strmqm), the listener (for example, runmqlsr), or any other queue manager programs. A corresponding shutdown script is also needed, and depending on the HA clustering package in use, a number of other scripts may be required; a simplified sketch follows at the end of this section. Samples for WebSphere MQ are provided with the SupportPacs described in Appendix A.

As the heartbeat mechanism is the primary method of failure detection, if a heartbeat does not receive a response, the standby machine assumes that the master server has failed. However, heartbeats may go unanswered for a number of reasons, such as an overloaded server or a communication failure, and there is a possibility that the master server will resume processing at a later stage, or is in fact still running. This can lead to duplicate messages in the system, which is not desirable. Managing this problem is also the role of the HA clustering package. For example, Red Hat Cluster services and IBM's HACMP work around this problem by having a watchdog timer with a lower timeout than the cluster's, which ensures that the machine reboots itself before another machine in the cluster takes over its role. Programmable power supplies are also supported, so other machines in the cluster can power cycle the affected machine to ensure that it is no longer operational before starting the resource group. Essentially, the machines in the cluster have the capability to turn the other machines off.

Some HA clustering software suites also provide the capability to detect other types of failure, such as system resource exhaustion or process failure, and try to recover from these failures locally. For WebSphere MQ on AIX, you can use the appropriate SupportPac (see Appendix A) to locally restart a queue manager that is not responding. This can avoid the more time-consuming operation of completely moving the resource group to another server.

You should design the machines used in HA clustering to have identical configurations to each other. This includes installed software levels, security configurations, and performance capabilities, to minimize the possibility of resource group start-up failure. This ensures that machines in the network all have the capability to take on another machine's role. Note that for active-standby configurations, only one instance of an application is running at any one moment and therefore, software vendors may only charge for one instance of the application, as is the case for WebSphere MQ.
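The sketch below is only a simplified illustration of the start, stop, and monitor actions such scripts perform, using the standard control commands. It is not from the paper: the queue manager name, listener port, and the dspmq status text it checks are assumptions, and the SupportPacs in Appendix A contain the real, supported scripts.

# Simplified illustration of HA resource-group actions for a queue manager.
# Queue manager name, listener port, and the dspmq output being checked are
# assumptions; the supported scripts ship with the SupportPacs in Appendix A.
import subprocess

QMGR = 'QM1'
PORT = '1414'

def start():
    # Start the queue manager, then run its TCP listener in the background.
    subprocess.run(['strmqm', QMGR], check=True)
    subprocess.Popen(['runmqlsr', '-m', QMGR, '-t', 'tcp', '-p', PORT])

def stop():
    # Controlled (immediate) shutdown, as used during a managed failover.
    subprocess.run(['endmqm', '-i', QMGR], check=True)

def monitor():
    # Tell the HA clustering software whether the queue manager is still running.
    result = subprocess.run(['dspmq', '-m', QMGR], capture_output=True, text=True)
    return 'Running' in result.stdout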

3.2.5. HA clustering active-active configuration

It is also possible to run services on the redundant machine in what is termed an active-active configuration. In this mode, the servers are both actively running programs and acting as backups for each other. If one server fails, the other continues to run its own services as well as the failed server's. This enables the backup server to be used more effectively, although when a failure does occur, the performance of the surviving system is reduced because it has taken on extra processing. In Figure 3, the second active machine runs both queue managers if a failure occurs.

Figure 3. An active-active configuration

In larger installations, where several resource groups exist and more than one server needs to be made highly available, it is possible to use one backup machine to cover several active servers. This setup is known as an n+1 configuration, and has the benefit of reduced redundant hardware costs, because the servers do not each have a dedicated backup machine. However, if several servers fail at the same time, the backup machine may become overloaded. The savings must therefore be weighed against the potential cost of more than one server failing at the same time and more than one backup machine being required.

3.2.6. HA clustering benefits

HA clustering software provides the capability to perform a controlled failover of resource groups. This allows administrators to test the functionality of a configured system, and also allows machines to be gracefully removed from an active cluster. This can be for maintenance purposes, such as hardware and software upgrades or data backup. It also allows failed servers, once repaired, to be placed back in the cluster and to resume their services. This is known as "fail-back" [4]. A controlled failover operation also results in less downtime because the cluster does not need to detect the failure; there is no need to wait for the cluster timeout. Also, as the applications, such as WebSphere MQ, are stopped in a controlled manner, the start-up time is reduced because there is no need for log replay.

Using abstract resource groups makes it possible for a service to remain highly available even when the machine that normally runs it has been removed from the cluster. This is only true as long as the other machines have comparable software installed and access to the same data, meaning any machine can run the resource group. The modular nature of resource groups also helps the gradual uptake of HA clustering in an existing system and easily allows services to be added at a later date. This also means that in a large queue manager installation, you can convert mission-critical queue managers to be highly available first, and convert the less critical queue managers later, or not at all.

Many of the requirements for implementing HA clustering are also desirable in more bespoke, or product-centric, HA solutions. For example, RAID disk arrays [8], extra network connections, and redundant power supplies all protect against hardware failure. Therefore, improving the availability of a server results in additional cost, whether a bespoke or an HA clustering technique is used. HA clustering may require additional hardware over and above some application-specific HA solutions, but this enables an HA clustering approach to provide a more complete HA solution.

You can easily extend the configuration of HA clustering to cover other applications running on the machine. The availability of all services is provided via a standard methodology and presented through a consistent interface, rather than being implemented separately by each service on the machine. This in turn reduces complexity and staff training times, and reduces errors being introduced during administration activities. By using one product to provide an availability solution, you can take a common approach to decision making. For instance, if a number of the servers in a cluster are separated from the others by a network failure, a unanimous decision is needed on which servers should remain active in the cluster. If there were several HA solutions in place (that is, each product using its own availability solution), each with separate quorum algorithms (a quorum being the minimum number of members needed to conduct the business of the group), then it is possible that each algorithm would reach a different outcome. This could result in an invalid selection of active servers in the cluster that may not be able to communicate. By having a separate entity, in the form of the HA clustering software, decide which part of the cluster has the quorum, only one outcome is possible, and the cluster of servers continues to be available.

Summary

The shared disk solution described above is a robust approach to the problem of trapped messages, and allows access to stored messages in the event of a failure. However, there will be a short period of time when there is no access to the queue manager while the failure is being detected and the service is being transferred to the standby server. During this time it is possible to use WebSphere MQ clustering to provide access for new messages, because its load balancing capabilities will route messages around the failed queue manager to another queue manager in the cluster. How to use HA clustering with WebSphere MQ clustering is described in "When to use HA WebSphere MQ queue manager clusters".


3.3. z/OS high availability options

z/OS provides a facility for operating system restart of failed queue managers called the Automatic Restart Manager (ARM). It provides a mechanism, via ARM policies, for a failed queue manager to be restarted "in place" on the failing logical partition (LPAR) or, in the case of an LPAR failure, to be started on a different LPAR along with other subsystems and applications "grouped" together, so that the subsystem components that provide the overall business solution are restarted together. In addition, with a Parallel Sysplex, Geographically Dispersed Parallel Sysplex (GDPS) provides the ability to automatically restart subsystems, via remote DASD copying techniques, in the event of a site failure. These are restart techniques similar to those discussed earlier for distributed platforms. We will now look at a capability that maximizes the availability of message queues in the event of queue manager failure and that does not require a queue manager restart.

3.3.1. Shared queues (z/OS only)

WebSphere MQ shared queues exploit the z/OS-unique Coupling Facility (CF) technology, which provides high-speed access to data across a sysplex via a rich set of facilities to store and retrieve data. WebSphere MQ stores shared message queues in the Coupling Facility, which in turn means that, unlike private message queues, they are not owned by any single queue manager. Queue managers are grouped into Queue Sharing Groups (QSGs), analogous to DB2 Data Sharing Groups. All queue managers within a QSG can access shared message queues for putting and getting of messages via the WebSphere MQ API. This enables multiple putters and getters on the same shared queue from within the QSG. Also, WebSphere MQ provides peer recovery, such that in-flight shared queue messages are automatically rolled back by another member of the QSG in the event of a queue manager failure. WebSphere MQ still uses its logs for capturing persistent message updates, so that in the extremely unlikely event of a CF failure, you can use the normal restart procedures to restore messages. In addition, z/OS provides system facilities to automatically duplex the CF structures used by WebSphere MQ. The combination of these facilities gives WebSphere MQ shared message queues extremely high availability characteristics. Figure 4 shows three queue managers, QM1, QM2 and QM3, in the QSG GRP1 sharing access to queue A in the Coupling Facility. This setup allows all three queue managers to process messages arriving on queue A.


Figure 4. Three queue managers in a QSG share queue A on a Coupling Facility

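Getting a message from a shared queue uses exactly the same MQI calls as a private queue, so any member of the QSG can serve an application. A minimal sketch in pymqi (Python) notation follows; it is an illustration added here, with assumed connection details, and a z/OS application would more typically be a CICS, IMS, or batch program issuing the same calls.

# Minimal sketch: get a message from shared queue A (Figure 4) through QM1.
# An identical get issued through QM2 or QM3 competes for the same messages,
# and peer recovery backs out the unit of work if the serving queue manager fails.
import pymqi

qmgr = pymqi.connect('QM1', 'APP.SVRCONN', 'lpar1(1414)')   # connection details are assumptions
queue = pymqi.Queue(qmgr, 'A')                              # queue A resides in the Coupling Facility

gmo = pymqi.GMO()
gmo.Options = pymqi.CMQC.MQGMO_WAIT | pymqi.CMQC.MQGMO_SYNCPOINT
gmo.WaitInterval = 5000                                     # wait up to five seconds for a message

md = pymqi.MD()
message = queue.get(None, md, gmo)                          # raises MQMIError 2033 if no message arrives
qmgr.commit()

queue.close()
qmgr.disconnect()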

A further benefit of using shared queues is the ability to use shared channels. You can use shared channels in two different scenarios to further extend the high availability of WebSphere MQ. First, using shared channels, an external queue manager can connect to a specific queue manager in the QSG using channels. It can then put messages to the shared queue via this queue manager. This allows queue managers in a distributed environment to utilize the HA functionality provided by shared queues. Therefore, the target application of messages put by the queue manager can be any of those running on a queue manager in the QSG. Second, you can use a generic port so that a channel connecting to the QSG can be connected to any queue manager in the QSG. If the channel loses its connection (because of a queue manager failure), then it is possible for the channel to connect to another queue manager in the QSG by simply reconnecting to the same generic port.

3.3.1.1. Benefits of shared message queues

The main benefit of a shared queue is its high availability. There are numerous customer-selectable configuration options for CF storage, ranging from running on standalone processors with their own power supplies to the Internal Coupling Facility (ICF) that runs on spare processors within a general zSeries server. Another key factor is that the Coupling Facility Control Code (CFCC) runs in its own LPAR, where it is isolated from any application or subsystem code. In addition, a shared queue naturally balances the workload between the queue managers in the QSG. That is, a queue manager will only request a message from the shared queue when the application that is processing messages is free to do so. Therefore, the availability of the messaging service is improved because queue managers are not flooded by messages directly. Instead, they consume messages from the shared queue when they are ready to do so. Also, should greater message processing performance be required, you can add extra queue managers to the QSG to process more incoming messages. With persistent messages, both private and shared, the message processing limit is constrained by the speed of the log. With shared message queues, each queue manager uses its own log for updates. Therefore, deploying additional queue managers to process a shared queue means the total logging cost is amortized across a number of queue managers. This provides a highly scalable solution. Conversely, if a queue manager requires maintenance, you can remove it from the QSG, leaving the remaining queue managers to continue processing the messages. Both the addition and removal of queue managers in a QSG can be performed without disrupting the already existing members.

Lastly, should a queue manager fail during the processing of a Unit of Work, the other members of the QSG will spot this and "Peer Recovery" is initiated. That is, if the unit of work was not completed by the failed queue manager, another queue manager in the QSG will complete the processing. This arbitration of queue manager data is achieved via hardware and microcode on z/OS. This means that the availability of the system is increased, as the failure of any one queue manager does not result in trapped messages or inconsistent transactions, because Peer Recovery either completes the transaction or rolls it back. For more information on Peer Recovery and how to configure it, see the z/OS System Administration Guide [6].

The benefits of shared queues are not solely limited to z/OS queue managers. Although you cannot set up shared queues in a distributed environment, it is possible for distributed queue managers to place messages onto them through a member of the QSG. This allows the QSG to process a distributed application's messages in a z/OS HA environment.

3.3.1.2. Limitations of shared message queues

With WebSphere MQ V5.3, physical shared messages are limited to less than 63KB in size. Any application that attempts to put a message greater than this limit receives an error on the MQPUT call. However, you can use the message grouping API to construct a logical message greater than 63KB, which consists of a number of physical segments.

The Coupling Facility is a resilient and durable piece of hardware, but it is a single point of failure in this high availability configuration. However, z/OS provides duplexing facilities, where updates to one CF structure are automatically propagated to a second CF. In the unlikely event of failure of the "primary" CF, z/OS automatically switches access to the "secondary" while the primary is being rebuilt. This system-managed duplexing is supported by WebSphere MQ. While the rebuild is taking place, there is no noticeable application effect. However, this duplexing will clearly have an effect on overall performance.

Finally, a queue manager can only belong to one QSG, and all queue managers in a QSG must be in the same sysplex. This is a small limitation on the flexibility of QSGs. Also, a QSG can only contain a maximum of 32 queue managers. For more information on shared queues, see WebSphere MQ for z/OS – Concepts and Planning Guide [1].


3.4. WebSphere MQ queue manager clusters

A WebSphere MQ queue manager cluster is a cross-platform workload balancing solution that allows WebSphere MQ messages to be routed around a failed queue manager. It allows a queue to be hosted across multiple queue managers, thus allowing an application to be duplicated across multiple machines. It provides a highly available messaging service, allowing incoming messages to be forwarded to any queue manager in the cluster for application processing. Therefore, if any queue manager in the cluster fails, new incoming messages continue to be processed by the remaining queue managers. In Figure 5, an application connected to QM2 puts a message to a cluster queue that is defined locally on QM1, QM4, and QM5. One of these queue managers will receive the message and process it.

Figure 5. Queue managers 1, 4, and 5 in the cluster receive messages in turn


By balancing the workload between QM1, QM4, and QM5, the application's processing is distributed across multiple queue managers, making it highly available. If a queue manager fails, the incoming messages are balanced among the remaining queue managers. While WebSphere MQ clustering provides continuous messaging for new messages, it is not a complete HA solution, because it is unable to handle messages that have already been delivered to a queue manager for processing. As we have seen above, if a queue manager fails, these "trapped" private messages are only processed when the queue manager is restarted. However, by combining WebSphere MQ clustering with the recovery techniques covered above, you can create an HA solution for both new and existing messages. The following section shows this in action in a distributed, shared disk environment.
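From the application's point of view, using the cluster queue in Figure 5 is just a normal put to a single queue name; the cluster workload algorithm chooses which hosting queue manager receives each message. A minimal sketch in pymqi (Python) notation follows, added here as an illustration; the queue, channel, and connection names are assumptions.

# Hedged sketch: an application connected to QM2 puts to a cluster queue hosted
# on QM1, QM4, and QM5. Names and connection details are assumptions.
import pymqi

qmgr = pymqi.connect('QM2', 'APP.SVRCONN', 'qm2host(1414)')

# MQOO_BIND_NOT_FIXED lets every message be workload balanced individually,
# so new messages are routed around a failed queue manager.
open_options = pymqi.CMQC.MQOO_OUTPUT | pymqi.CMQC.MQOO_BIND_NOT_FIXED
queue = pymqi.Queue(qmgr, 'CLUSTER.Q', open_options)

for i in range(10):
    queue.put(f'request {i}'.encode())

queue.close()
qmgr.disconnect()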


3.4.1. Extending the standby machine - shared disk approach

By hosting cluster queue managers on active-standby or active-active setups, trapped messages on private or cluster queues are made available when the queue manager is failed over to a standby machine and restarted. The queue manager will be failed over and will begin processing messages within minutes, instead of the longer time it would take to manually recover and repair the failed machine or failed queue manager in the cluster. The added benefit of combining queue manager clusters with HA clustering is that the high availability nature of the system becomes transparent to any clients using it, because they are putting messages to a single cluster queue. If a queue manager in the cluster fails, the client's outstanding requests are processed when the queue manager is failed over to a backup machine. In the meantime, the client needs to take no action, because its new requests will be routed around the failure and processed by another queue manager in the cluster. The client need only tolerate its requests taking slightly longer than normal to be returned in the event of a failover. Figure 6 shows each queue manager in the cluster in an active-active, standby machine - shared disk configuration. The machines are configured with separate shared disks for queue manager data and logs to decrease the time required to restart the queue manager. See "Considerations for WebSphere MQ restart performance" for more information.

Figure 6. Queue managers 1, 4, and 5 have active standby machines


In this example, if queue manager 4 fails, it fails over to the same machine as queue manager 3, where both queue managers will run until the failed machine is repaired.


3.4.2. When to use HA WebSphere MQ queue manager clusters

Because this solution is implemented by combining external HA clustering technology with WebSphere MQ queue manager clusters, it provides the ultimate high availability configuration for distributed WebSphere MQ. It makes both incoming and queued messages available, and it fails over not only a queue manager but also any other resources running on the machine. For instance, server applications, databases, or user data can fail over to a standby machine along with the queue manager.

When using HA WebSphere MQ clustering in an active-standby configuration, it is a simpler task to apply maintenance or software updates to machines, queue managers, or applications. This is because you can first update a standby machine and then fail a queue manager over to it, ensuring that the update works correctly. If it is successful, you can update the primary machine and then fail the queue manager back onto it.

HA WebSphere MQ queue manager clusters also greatly reduce the administration of the queue managers within them, which in turn reduces the risk of administration errors. Queue managers that are defined in a cluster do not require channel or queue definitions to be set up for every other member of the cluster. Instead, the cluster handles these communications and propagates the relevant information to each member of the cluster through a repository.

HA WebSphere MQ queue manager clusters are able to scale applications linearly, because you can add new queue managers to the cluster to aid in the processing of incoming messages. Conversely, you can remove queue managers from the cluster for maintenance and the cluster will still continue to process incoming requests. If the queue manager's presence in the cluster is required but the hardware must be maintained, you can use this technique in conjunction with failing the queue manager over to a standby machine. This frees the machine but keeps the queue manager running.

It is also possible for administrators to write their own cluster workload exits. This allows finer control of how messages are delivered to queue managers in the cluster. For example, you can target messages at machines in different ratios based on the performance capabilities of each machine (rather than in a simple round-robin fashion).

3.4.3. When not to use HA WebSphere MQ queue manager clusters

HA WebSphere MQ queue manager clusters require additional proprietary HA hardware (shared disks) and external HA clustering software (such as HACMP). This increases the administration costs of the environment, because you also need to administer the HA components. This approach also increases the initial implementation costs, because extra hardware and software are required. Therefore, balance these initial costs against the potential costs incurred if a queue manager fails and messages become trapped.

Note that non-persistent messages do not survive a queue manager failover. This is because the queue manager restarts once it has been failed over to the standby machine, causing it to process its logs and return to its most recent known state. At this point, non-persistent messages are discarded. Therefore, if your application relies on non-persistent messages, take this factor into account.

If trapped messages are not a problem for the applications (for example, the response time of the application is irrelevant or the data is updated frequently), then HA WebSphere MQ queue manager clusters are probably not required. That is, if the amount of time required to repair a machine and restart its queue manager is acceptable, then having a standby machine to take over the queue manager is not necessary. In this case, it is possible to implement WebSphere MQ queue manager clusters without any additional HA hardware or software.

3.4.4. Considerations for implementation of HA WebSphere MQ queue manager clusters

When configuring an active-active or active-standby setup in a cluster, administrators should test that the failover of a given node works correctly. Where possible, fail nodes over to their backup machines to confirm that the failover processes work as designed and that no problems are encountered when a failover is actually required. Perform this procedure at the discretion of the administrators; if failover does not happen smoothly, it can cause problems or outages in a future production environment.

As with queue manager clusters, do not code WebSphere MQ applications to be machine or queue manager specific, for example by relying on resources that are only available on a single machine. When applications are failed over to a standby machine, along with the queue manager they are running on, they may not have access to those resources. To avoid these problems, machines should be as alike as possible with respect to software levels, operating system environments, and security settings, so that any failed-over applications have no problems running.

Avoid message affinities when programming applications, because there is no guarantee that messages put to the cluster queue will be processed by the same queue manager every time. You can use the MQOPEN bind option MQOO_BIND_ON_OPEN to ensure that an application's messages are always delivered to the same queue manager in the cluster. However, an application using this option accepts reduced availability, because that queue manager may fail during message processing. In this case, the application must wait until the queue manager is failed over to a backup machine before it can continue processing the application's requests. If affinities had not been used, there would be no delay in message processing, because another queue manager in the cluster would continue processing any new requests. A short sketch at the end of this section shows both bind options.

Application programmers should avoid long running transactions in their applications, because these greatly increase the restart time of the queue manager when it is failed over to a standby machine. See "Considerations for WebSphere MQ restart performance" for more information.

When implementing a WebSphere MQ cluster solution, whether for an HA configuration or for normal workload balancing, be careful to define at least two full cluster repositories. These repositories should be on machines that are highly available.

For example, they should have redundant power supplies, network access, and hard disks, and they should not be heavily loaded with work. Repositories are vital to the cluster because they contain cluster-wide information that is distributed to each cluster member. If both full repositories are lost, the cluster cannot propagate any cluster changes, such as new queues or queue managers. However, the cluster continues to function with each member's partial repository until the full repositories are restored.
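As noted earlier in this section, an application that does not need message affinity should let the cluster workload-balance each message, while an application that does need affinity can bind the queue when it is opened. The following minimal sketch uses the WebSphere MQ base classes for Java (com.ibm.mq); the queue manager name QM1 and the cluster queue name CLUSTER.APP.QUEUE are illustrative assumptions only, not names taken from this paper.

import com.ibm.mq.MQC;
import com.ibm.mq.MQException;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQPutMessageOptions;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;

public class ClusterQueueBindExample {
    public static void main(String[] args) throws MQException, java.io.IOException {
        // Connect to a local queue manager (QM1 is an assumed name).
        MQQueueManager qmgr = new MQQueueManager("QM1");

        // No affinity: MQOO_BIND_NOT_FIXED lets the cluster workload algorithm
        // choose a target queue manager for every message.
        // If affinity is genuinely required, use MQC.MQOO_BIND_ON_OPEN instead,
        // which fixes the target queue manager when the queue is opened.
        int openOptions = MQC.MQOO_OUTPUT | MQC.MQOO_BIND_NOT_FIXED;

        MQQueue queue = qmgr.accessQueue("CLUSTER.APP.QUEUE", openOptions);

        MQMessage message = new MQMessage();
        message.writeString("example payload");
        queue.put(message, new MQPutMessageOptions());

        queue.close();
        qmgr.disconnect();
    }
}

With MQOO_BIND_NOT_FIXED, messages put while one cluster member is unavailable are simply routed to the remaining members; with MQOO_BIND_ON_OPEN, the application waits until its chosen queue manager has been failed over and restarted.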

3.5. HA capable client applications

You can achieve high availability on the client side rather than using the HA clustering, HA WebSphere MQ queue manager cluster, or shared queue server-side techniques described previously. HA capable clients are an inexpensive way to implement high availability, but the usual result is a large client with complex logic. This is not ideal, and a server-side approach is recommended; however, HA capable clients are discussed here for completeness.

Most queue manager failures result in a connection failure with the client. Even if the queue manager is returned to normal operation, the client remains disconnected until the code used to connect the client to the queue manager is executed again. One possible solution to the problem of a server failure is to design the client applications to reconnect, or to connect to a different but functionally identical server. The client's application logic has to detect a failed connection and reconnect to another specified server.

The method of detecting and handling a failed connection depends on the MQ API in use. MQ JMS, for instance, provides an exception listener mechanism that allows the programmer to specify code to be run when a failure event occurs. The programmer can also use Java try-catch blocks to handle failures during code execution. The MQI API reports a failure on the next function call that requires communication with the queue manager. In either case, it is the programmer's responsibility to resolve the failure.

How the failure is managed depends on the type of application and on whether any other high availability solutions are in place. A simple reconnect to the same queue manager may be attempted and, if successful, the application can resume processing. Alternatively, you can configure the application with a list of queue managers that it may connect to; upon failure, it reconnects to the next queue manager in the list.

In an HA clustering solution, clients still experience a failed connection if a server is failed over to a different physical machine, because it is not possible to move open network connections between servers. The client may also need to be configured to perform several reconnect attempts and/or to wait for a period of time to allow the server to restart. If the application is transactional and the connection fails mid-transaction, the entire transaction must be re-executed once a new connection is established, because WebSphere MQ queue managers roll back any uncommitted work at start-up time.

You can supplement many server-side HA solutions with client-side application code designed to cope with the temporary loss of, or the need to reconnect to, a queue manager. A client that contains no such code may need user intervention, or even a complete restart, to resume full functionality. Coding the client application to be HA aware obviously requires extra effort, but the end result is a more autonomous client. The sketch below shows one way such reconnection logic might look.
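The following minimal sketch illustrates the exception listener and reconnect-list approach described above, using the standard JMS API (javax.jms). The JNDI names of the connection factories (one per functionally identical queue manager) are illustrative assumptions; a real client would obtain them from its own configuration.

import java.util.List;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.ExceptionListener;
import javax.jms.JMSException;
import javax.naming.InitialContext;

public class ReconnectingClient implements ExceptionListener {

    private final List<String> factoryJndiNames;  // one entry per functionally identical queue manager
    private Connection connection;                // sessions would be created from this connection

    public ReconnectingClient(List<String> factoryJndiNames) {
        this.factoryJndiNames = factoryJndiNames;
    }

    /** Try each configured queue manager in turn until a connection succeeds. */
    public synchronized void connect() {
        for (String jndiName : factoryJndiNames) {
            try {
                ConnectionFactory factory =
                        (ConnectionFactory) new InitialContext().lookup(jndiName);
                connection = factory.createConnection();
                connection.setExceptionListener(this);  // be told when the connection breaks
                connection.start();
                return;  // connected successfully
            } catch (Exception e) {
                // This queue manager is unavailable; try the next one in the list.
            }
        }
        throw new IllegalStateException("No queue manager in the list is available");
    }

    /** Called by JMS when the connection to the queue manager is lost. */
    public void onException(JMSException brokenConnection) {
        // Reconnect to the next available queue manager. Any in-flight
        // transaction is rolled back by the queue manager and must be redone.
        connect();
    }
}

In an HA clustering configuration the same logic applies; the reconnect loop may simply need to retry, or pause, until the failed-over queue manager accepts connections again.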

3.5.1. When to use HA capable client applications

HA capable clients are ideally suited when an application has a number of clients that need to reconnect in the event of a failure and no HA solution has been implemented on the server side. This allows clients to connect themselves to alternative services while the failed service is restored.

3.5.2. When not to use HA capable client applications

When a robust, extensible high availability solution is required, the HA focus belongs on the server side rather than the client side. Clients with complex HA logic become large and must be maintained, and any new clients coming onto the system must implement the same logic. A transparent server-side HA solution removes the need for this client logic. Also, if there is a requirement for a thin client, then there is no room for bulky HA logic, and you must implement the HA solution on the server side.

4. Considerations for WebSphere MQ restart performance

The most important factor in making an IT system highly available is the length of time required to recover from a failure. The methods described for making a WebSphere MQ queue manager highly available all involve situations where a queue manager has failed and must be restarted, either on the same machine or on a standby machine. The quicker you can restart a queue manager, the quicker it can complete any outstanding work and begin to process new requests.

The quickest approach is to attempt to restart the queue manager first on the machine it failed on. This is only possible if the queue manager did not fail because of a hardware problem (external HA clustering technology can determine this). Restarting in place results in a much quicker restart and a less disruptive failover, because there is no need to move resources, such as network addresses, queue managers, applications, and shared disks, to the standby machine. If this is not possible, the queue manager must be failed over to a standby machine.

In either case, minimizing the amount of start-up processing the queue manager must do to regain its state minimizes the time for which the queue manager is unavailable. The next sections discuss the factors that affect the start-up time of the queue manager.

4.1. Long running transactions

If your client applications have long running transactions that use persistent messages, the queue manager takes longer to start up. Design applications to avoid long running transactions, because these increase the amount of log data that must be replayed during recovery. Committing transactions as frequently as possible reduces the amount of log replay required to recover a transaction.

WebSphere MQ uses automatically generated checkpoints to determine the point from which the log is replayed. A checkpoint is a point at which the log and the queue files/pagesets are consistent (4). If a transaction is not committed for several checkpoints, the amount of log required to recover the queue manager grows. Therefore, short transactions reduce the amount of data to be processed when recovering a queue manager. On z/OS, a checkpoint can be forced by archiving the log, and a checkpoint is also taken when the number of log records written matches the LOGLOAD value.

Shorter transactions also reduce the possibility of the queue manager exhausting the available log space (and reduce the quantity of log space required). On distributed platforms, exhausting the log space results in a long running transaction being rolled back to release space. On z/OS, the transaction is not rolled back in this situation; instead, the archive logs must be accessed if the transaction backs out, which can significantly extend the time that the backout takes. Note also that if the transaction backs out and not all of the log records are available, the queue manager terminates.

(4) For z/OS, note that pagesets are only consistent at every third checkpoint.

For instance, if the queue manager has a long running unit of work (UOW), it must scan back over a number of logs to recover it. Introducing frequent commits into the application code minimizes long start-up times caused by large UOWs and reduces the number of log files required to recover the queue manager. If those log files have been backed up onto another medium, such as tape, retrieving them significantly increases the restart time of the queue manager. The sketch below shows one way to keep units of work short by committing at regular intervals.
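This is a minimal sketch of the practice using a transacted JMS session (javax.jms), committing after every batch of messages rather than holding one long unit of work. The connection factory, queue name, and batch size of 50 are illustrative assumptions only.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class BatchedCommitSender {

    private static final int BATCH_SIZE = 50;  // commit after this many messages (illustrative)

    public static void sendAll(ConnectionFactory factory, String queueName,
                               String[] payloads) throws Exception {
        Connection connection = factory.createConnection();
        try {
            // true = transacted session; work is not visible until commit.
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            Queue queue = session.createQueue(queueName);
            MessageProducer producer = session.createProducer(queue);

            int inBatch = 0;
            for (String payload : payloads) {
                TextMessage message = session.createTextMessage(payload);
                producer.send(message);
                if (++inBatch == BATCH_SIZE) {
                    session.commit();   // frequent commits keep the unit of work short
                    inBatch = 0;
                }
            }
            if (inBatch > 0) {
                session.commit();       // commit any remaining messages
            }
        } finally {
            connection.close();
        }
    }
}

The batch size is a trade-off: larger batches reduce commit overhead, while smaller batches keep the recoverable unit of work, and therefore the restart time, small.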

4.2. Persistent message use

Persistent messages are written first to the queue manager log (for recovery purposes) and then to the queue file/pageset if the message is not retrieved immediately. The queue manager replays the log during recovery, so reducing the amount of log to be reprocessed reduces the time required for recovery. Non-persistent messages are not written to the log, so they do not increase the queue manager's restart time. However, if an application relies on WebSphere MQ to provide data integrity, you must use persistent messages to ensure message delivery. Also, because non-persistent messages are not logged, they do not survive a queue manager restart.

A new class of message service, positioned between persistent and non-persistent messaging, was introduced with WebSphere MQ 5.3 CSD 6. It allows non-persistent messages to survive a queue manager restart, although some messages may be lost because these messages do not have the logging that persistent messaging provides. On platforms other than z/OS, you enable this message class by setting the queue attribute NPMCLASS to HIGH. On z/OS, this behavior is an emergent property of using shared queues: non-persistent messages are stored in the Coupling Facility and are not removed when the queue manager restarts.
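From the application's point of view, the choice between these reliability levels is made when sending each message. The following minimal JMS sketch (javax.jms) shows a producer selecting persistent delivery for business-critical data and non-persistent delivery for data that can be lost on restart. The connection factory and the queue names ORDERS and PRICES are illustrative assumptions.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Session;

public class DeliveryModeExample {

    public static void send(ConnectionFactory factory) throws Exception {
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // Persistent: logged by the queue manager, survives restart and failover.
            MessageProducer orders = session.createProducer(session.createQueue("ORDERS"));
            orders.setDeliveryMode(DeliveryMode.PERSISTENT);
            orders.send(session.createTextMessage("order data that must not be lost"));

            // Non-persistent: not logged, so it does not slow queue manager restart,
            // but it is discarded if the queue manager restarts.
            MessageProducer ticker = session.createProducer(session.createQueue("PRICES"));
            ticker.setDeliveryMode(DeliveryMode.NON_PERSISTENT);
            ticker.send(session.createTextMessage("price update that can be refreshed"));
        } finally {
            connection.close();
        }
    }
}

The NPMCLASS(HIGH) queue attribute mentioned above is set by the administrator on the queue itself; it does not change the application code.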

4.3. Automation

The detection of the failure, the failover to a standby machine, and the restart of the queue manager (and applications) should all be automated. Reducing operator intervention significantly reduces the time required to fail a queue manager over to a backup machine, allowing normal service to resume as quickly as possible. You can automate this process by using HA clustering software, as described in "HA clustering software".

4.4. File systems

Using a journaled file system on distributed platforms is recommended to reduce the time required to recover a file system to a working state. A journaling file system maintains a journal of the file transactions being written to the disk. In the event of a failure, the disk structure remains in a consistent state because it can be rebuilt at boot time (that is, recovered from the journal) and used immediately. On a non-journaling file system, the state of the file system after a failure is not known, and a utility such as scandisk or e2fsck must be used to find and fix errors. Because the journal avoids this problem, there is no need to perform a time-consuming file system scan to verify the integrity of the file system before you can use it.

Common journaled file systems include Windows NTFS, Linux ext3, ReiserFS, and JFS.

On z/OS, WebSphere MQ provides facilities for taking backup copies of the message data while the system is running. You can use these, in conjunction with the logs, to recover WebSphere MQ in the event of a media failure. Taking periodic backups is recommended to reduce the amount of log data that must be processed at restart.

Finally, to decrease the start-up time of a queue manager that has been failed over, store the queue manager log and the queue files on separate disks. This improves recovery performance, because replaying the log faces no disk contention from the queue files.

5. Comparison of generic versus specific failover technology

The WebSphere MQ high availability methods "standby machine - shared disks" and "HA WebSphere MQ queue manager clusters" both rely on external HA clustering software and hardware to monitor hardware resources, application data, and running processes, and to perform a failover if any of these fail. The alternative is to use a product-specific HA approach. Such approaches provide an "out of the box" experience and are usually tailored to a single software application. They primarily provide data replication to a specified partner so that failover can occur if the primary instance fails.

You should fully investigate a product-specific high availability approach before considering its use in a serious HA implementation. The primary reason is that the software may rely on the synchronization of data between product instances. Data replication of this kind is discussed in the section "High availability" at the beginning of this paper, which notes that these approaches are not ideal: asynchronous replication can cause duplicated or lost data, and synchronous replication incurs a significant performance cost. Therefore, replicating data in this manner is not a good basis for high availability.

Another reason to avoid product-specific approaches is that they tend to allow only a single software product to be failed over. An external HA clustering solution, in contrast, can fail over and restart interdependent groups of resources, such as other software applications and hardware resources. For instance, it is possible to fail over WebSphere Business Integration Message Brokers together with WebSphere MQ and DB2 using IBM's HACMP technology. This extensibility is vital when considering the wider scope of high availability for all server applications and hardware resources.

An external HA clustering approach also uses the available machines on the network more effectively. It can dynamically fail over an application and any other resources to a single backup machine shared by a number of queue managers in the network (often called an N+1 solution). This means a standby machine is not required for every active machine in the network.

HA clustering technology also detects subtle failures, such as an unexpected increase in network latency (so that heartbeats are not received) or the primary machine stalling for a short period because of increased I/O. In either of these situations, the secondary machine may conclude that its primary peer has failed and begin to take over its work. External HA clustering technologies, such as HACMP, handle these complex cases; product-specific technologies may not. The result can be that both the primary machine and the secondary machine believe they are the primary, which leads to a "split brain" problem and duplicate message processing.

You can avoid the "split brain" situation by using external HA clustering technology, which arbitrates all resources in the network and decides which machines have access to the data.

Therefore, in the event of a failure, the HA clustering software can provide the standby machine with access to the shared resources, and that machine is then recognized as the primary machine by all.

To conclude, investigate product-specific approaches carefully, because their HA mechanisms may not be flexible or expandable enough to meet the much wider demands of a highly available IT infrastructure.

6. Conclusion

This paper discussed approaches for implementing high availability solutions using the WebSphere MQ messaging product. Choosing a solution for a highly available system depends on the HA requirements of that system. For instance, is each message important? Can a trapped message wait a few hours until a machine is restarted, or must it be made available as soon as possible? If it is the former, a simple clustering approach is enough; the latter requirement calls for HA clustering software and hardware. Also, are software applications reliant on specific software or hardware resources? If so, an HA clustering solution is critical, because interdependent groups of resources must be failed over together.

Note that the approaches discussed in this paper for implementing high availability with WebSphere MQ all employ common HA principles, and you should adhere to those principles when implementing any highly available IT system.

The first is to keep a single copy of any data. This makes the data much easier to manage: there is no ambiguity about who owns the real data, and there are no issues in reconciling the data after a corruption. When a failover occurs, only one instance of the software has access to the real data, avoiding any confusion. The only exception is a disaster recovery solution that moves copies of critical data off site. In that case, the copy is not used to remove a single point of failure or to provide high availability; instead, if a site-wide failure occurs, the backup is used to restore critical data and to resume services (possibly at another site).

Second, always verify that software which stores persistent state on disk performs synchronous writes, so that the data is truly hardened. With asynchronous writes, software can believe that data has been hardened to disk when, in fact, it has not. WebSphere MQ always writes persistent data synchronously to disk to ensure it has been hardened and is therefore recoverable in the event of a queue manager failure.

Third, implement redundancy at the hard disk level to remove the disk as a single point of failure; this is a simple step that prevents the loss of critical data if a disk fails. Even though synchronous writes ensure the data has been hardened to disk, a disk failure can still destroy it. Therefore, use technologies such as RAID to provide disk-level redundancy of data.

Fourth, and often overlooked, implement process controls for the administration of production IT systems. Outages are often caused by administrative errors: improperly tested software updates, incorrect parameter settings, or destructive actions performed by administrators. Proper process controls and security restrictions minimize these errors. In addition, HA clustering software provides a single administration view of all machines in an HA cluster, which minimizes administration effort.

Lastly, program applications to avoid affinities between clients and servers and to avoid long running units of work; both are good practices.

The first allows applications to be failed over to any machine and continue running. The second allows servers to restart quickly, because they do not have large amounts of outstanding work to process.

We can conclude that implementing high availability using an external HA clustering solution can bring large benefits to an IT infrastructure: groups of resources can be failed over, single copies of data can be maintained, and resources are simpler to administer. IBM WebSphere MQ, DB2, WebSphere Application Server, and WebSphere Business Integration Message Broker all support high availability through HA clustering software, and all provide resources that make them easy to configure. This approach is considerably more flexible than a product-specific solution and can be expanded well beyond its initial scope.

Ultimately, high availability is a combination of implementing the correct server-side infrastructure, avoiding single points of failure wherever they may lie (in hardware or software), and remaining flexible in the HA approach. The cost of implementing HA can initially seem high, but you must always balance it against the potential cost of losing IT systems or critical data. External HA clustering software can solve many issues of high availability, but high availability is only a small part of resilient computing. You must also address concepts such as disaster recovery, fault tolerance, scalability, and reliability to provide a 24x7 solution that is available 100% of the time.

Appendix A – Available SupportPacs

These SupportPacs are provided free of charge by IBM and assist in the setup and configuration of WebSphere MQ with different HA clustering technologies.

MC41: Configuring WebSphere MQ for iSeries High Availability
http://www-1.ibm.com/support/docview.wss?rs=203&uid=swg24006894&loc=en_US&cs=utf-8&lang=en

MC63: WebSphere MQ for AIX - Implementing with HACMP
http://www-1.ibm.com/support/docview.wss?rs=203&uid=swg24006416&loc=en_US&cs=utf-8&lang=en

MC68: Configuring WebSphere MQ with Compaq TruCluster for high availability
http://www-1.ibm.com/support/docview.wss?rs=203&uid=swg24006383&loc=en_US&cs=utf-8&lang=en

MC69: Configuring WebSphere MQ with Sun Cluster 2.X
http://www-1.ibm.com/support/docview.wss?rs=203&uid=swg24000112&loc=en_US&cs=utf-8&lang=en

MC6A: Configuring WebSphere MQ for Sun Solaris with Veritas Cluster Server
http://www-1.ibm.com/support/docview.wss?rs=203&uid=swg24000678&loc=en_US&cs=utf-8&lang=en

MC6B: WebSphere MQ for HP-UX - Implementing with Multi Computer/Service Guard
http://www.ibm.com/support/docview.wss?rs=203&uid=swg24004772&loc=en_US&cs=utf-8&lang=en



About the authors

Mark Hiscock joined IBM in 1999 while studying for his Computer Science degree. He has worked in the Hursley Park Laboratory in the United Kingdom, testing IBM's middleware suite of applications from WebSphere MQ Everyplace to WebSphere Business Integration Message Brokers. He now works as a customer scenarios tester for WebSphere MQ and WebSphere Business Integration Message Brokers, basing his testing on real-world customer scenarios. You can reach him at [email protected].

Simon Gormley joined IBM in 2000 as a software engineer and works at the Hursley Park Laboratory in the United Kingdom. He is currently working in the WebSphere MQ and WebSphere Business Integration Brokers test team, focusing on recreating customer scenarios to form the basis of tests. You can reach him at [email protected].