Session 1156

Paul Dennis - [email protected]

WebSphere MQ for z/OS Restart and Recovery

1. Durable Data--Preparing for a Crash

Where Durable Data is Kept

Creation of Some Durable Data

Transaction Related Updates

Message and Queue Related Updates

Checkpoints

A Quick Note on Backout

2. Restart

Current Status Rebuild

Historic Status Rebuild

Active UR Backout

Other Stuff

Sample Restart Messages

3. Backup and Recovery

Restart and Recovery Problems

Backing up

Restarting After a Disaster

A Backup and Recovery Example

Agenda


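[Figure: Where Durable Data is Kept. The log: active logs on VSAM linear datasets and archived logs on tape, holding transactional information and checkpoints (CP). The BSDS: a VSAM indexed dataset holding the highest RBA, the checkpoint list and the log inventory. Pagesets: VSAM linear datasets holding private objects, private messages and recovery RBAs. The CF: shared messages. DB2: shared objects and group objects.]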

To be able to perform restart and recovery, WebSphere MQ keeps durable data in several places.

The bootstrap dataset (BSDS) contains inventory information about the logs, including:

The RBA of the highest written record. This is not updated every time that WMQ writes to the log, so it just points to near the end of the log.

A list of recent checkpoints; more on checkpoints later.

Information about which datasets contain which portions of the log.

The BSDS is kept in a VSAM indexed dataset. You can have single or dual copies.

The log is the most important durable store used for restart and recovery. Logically, it is a linear stream of records starting from Relative Byte Address (RBA) '0000 00000000'x up to RBA 'FFFF FFFFFFFF'x. WMQ writes to the log in increasing RBA order. Physically it is divided into the active log and the archive log. The active log is kept on a set of at least three VSAM linear datasets which are used in rotation. As active log datasets fill up, records are offloaded to the archive log which is often kept on tape.

Information on the state and actions of transactions involving persistent messages is kept on the log. This includes every persistent update to the pagesets, so the log can be used for media recovery. Coupling Facility (CF) backups are also written to the log.

The log is always written to before any other persistent medium, so it is a definitive record.

Pagesets hold non-shared objects and non-shared persistent messages. They may also hold non-persistent messages. Each page of each pageset includes a record of the log RBA corresponding to when that page was last written out to disk (the Recovery RBA).

Group object definitions and shared object definitions are kept in shared DB2 tables. Shared non-persistent messages are kept in CF list structures (they survive the failure of WMQ, and so are quite durable). Shared persistent messages are also kept in CF list structures, but backed up to the log. Information about transactions involving shared messages is kept in CF list structures and on the log.


Creation of Some Durable Data: An Application

[Figure: the application APPL, the coordinator and the two resource managers (WMQ and DB). First transaction: get request in syncpoint, update DB, put reply in syncpoint, then two-phase commit; the resource managers register interest and the coordinator prepares and then commits them. Second transaction: get request in syncpoint (bad message), put error message out of syncpoint, then backout; the resource managers register interest and the coordinator aborts them.]

Creation of Some Durable Data

The chart shows a possible WMQ application.

The application, APPL, gets request messages from a WMQ request queue, updates a table in a database, and puts a reply message to a WMQ reply queue.

All the updates are persistent. The application uses a two-phase coordinator (such as RRS, CICS or IMS) to coordinate the transaction between the application and the two Resource Managers (the database and WMQ).

In the chart, the application encounters an error; it gets a message from the WMQ request queue that it does not understand. In response to this, it puts a message indicating the error to a third WMQ queue. It puts this message out of syncpoint scope, because the next thing that it does is to backout the transaction.

All this application activity causes WMQ to create durable data to ensure transactional integrity, and to retain persistent updates.

Transaction-Related Updates


[Figure: log records for the application APPL, with the transaction-related records highlighted. UR 1: BEGIN UR, COMMIT PH1 END, COMMIT PH2 BEG, COMMIT PH2 END, moving through the states In Flight, In Doubt, In Commit and Complete. UR 2: BEGIN UR, compensating log record (CLR), BEGIN ABORT, END ABORT, moving through the states In Flight, In Abort and Complete. The log is forced at COMMIT PH1 END, COMMIT PH2 BEG, the CLR, BEGIN ABORT and END ABORT. Each UR has a URID and an in-memory URE holding its state.]

Transaction-Related Updates

WMQ creates durable updates that represent the state of any transactions in which it is taking part. Durable transactional information is kept on the log.

In the chart, log records that WMQ writes to record the state of the transactions for the application APPL, described on the previous chart, are highlighted. Transactions are sometimes called Units of Recovery (URs). Here the transactions are labelled UR1 and UR2.

A UR starts with a Begin UR record, written to the log just before WMQ does the first persistent update, in this case the destructive get of the request message. The newly started UR is identified by the log RBA of this Begin UR record, and all log records written for this UR will contain this RBA as a Unit of Recovery Identifier (URID). When the Begin UR record has been written, corresponding to the first persistent update, the UR is in the In Flight state. Each UR also has an in-memory control block holding its state, called a URE.

After its Begin UR, UR1 has some records written to the log that represent the get of the request message and the put of the reply message; more on that later. The next transaction-related log record is an End Phase1 Commit record. This is in response to the Prepare call from the coordinator because the application has initiated a two-phase commit. When WMQ returns to the coordinator with an OK return code, this is a guarantee that WMQ can commit the transaction. To back this guarantee, WMQ ensures that it is ready to commit, then it writes out the record and forces the log. Most log writes are "lazy", with data written to buffers and I/O only performed when the buffers are full. A log force initiates I/O and waits for it to complete before continuing. When the End Phase1 record has been written, UR1 is in the In Doubt State.

The coordinator prepares the database, and calls WMQ to commit UR1. WMQ writes a Begin Phase2 Commit record and forces the log. UR1 is in the In Commit state. WMQ does what it needs to do to commit the UR, then writes an End Phase2 Commit record, completing UR1.

UR2 again starts with a Begin UR and some records representing the get of the "bad" request message. Before it writes the records representing the out of syncpoint (OOS) put, WMQ saves the last RBA it wrote in virtual storage. After it has written the OOS data, it writes a Compensating Log Record (CLR) that includes the RBA it saved, and forces the log. When WMQ backs out UR2 in response to an abort call from the coordinator because the application has initiated abort, it writes a Begin Abort record and forces the log. UR2 is in the In Abort state. It then reads the log backward and uses the records that it finds with the URID for UR2 to undo the changes that it made. When it reads the CLR, it uses the RBA in it to skip over the records for the OOS updates as these must not be undone. WMQ then writes an End Abort record and again forces the log. Now UR2 is complete.
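The state changes described for UR1 and UR2 can be summarised as a small state machine. This is a minimal Python sketch: the record names are the labels used on the chart, and `TRANSITIONS`, `next_state` and `replay` are hypothetical helpers for illustration, not WMQ internals or log formats.

```python
# Illustrative only: record and state names follow the chart, not WMQ's real log format.
TRANSITIONS = {
    (None, "BEGIN UR"): "In Flight",
    ("In Flight", "COMMIT PH1 END"): "In Doubt",
    ("In Doubt", "COMMIT PH2 BEG"): "In Commit",
    ("In Commit", "COMMIT PH2 END"): "Complete",
    ("In Flight", "BEGIN ABORT"): "In Abort",
    ("In Abort", "END ABORT"): "Complete",
}


def next_state(state, record):
    """Return the UR state after `record` is logged; other records leave it unchanged."""
    return TRANSITIONS.get((state, record), state)


def replay(records):
    """Apply a sequence of log record names and return the final UR state."""
    state = None
    for record in records:
        state = next_state(state, record)
    return state


ur1 = ["BEGIN UR", "DEC", "DELETE", "INC", "INSERT",
       "COMMIT PH1 END", "COMMIT PH2 BEG", "COMMIT PH2 END"]
ur2 = ["BEGIN UR", "DEC", "DELETE", "CLR", "BEGIN ABORT", "END ABORT"]

print(replay(ur1[:6]))  # state of UR1 once the End Phase1 record is written
print(replay(ur2[:5]))  # state of UR2 once the Begin Abort record is written
```

Replaying only a prefix of the records gives the state that restart would find if the queue manager failed at that point.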

Message and Queue Updates


[Figure: the same log, with the message and queue records highlighted (DEC, DELETE, INC, INSERT and message data). For the get from ReqQ: the queue object on PSET 0, page 6 has its depth decremented by 1 (Depth=d-1), and the message record on PSET 2, page 5 is flagged as deleted (Flags=Deleted). Each page has a buffer and a disc record, each carrying a log RBA.]

Message and Queue Updates

Private queues and the messages on them are kept on pagesets; the changes made to those pagesets are also recorded on the log. These log records are highlighted in this chart for the application described earlier.

Each pageset consists of an array of 4 KB pages which can be accessed randomly. Each page consists of a disc image of that page and, when that page is being used, a matching in-memory buffer.

A queue has durable data on two pagesets. The object itself always resides on pageset 0, and the messages and queue structure reside on another pageset corresponding to the storage class of the queue. The object is contained in a single record in a single page, and includes a count of the queue depth. The queue is a collection of pages containing message data scattered around the pageset and chained together to form the queue structure. The chart shows the request queue, and the updates associated with the get of a persistent message from that queue.

Before any persistent update is done, an exclusive lock is obtained for the message, so that no other getter can get it. This ensures transactional isolation.

Next the queue depth is decremented. A log record is written to describe the decrement, and after that (remember the log must always be ahead), the queue depth is decremented in the buffer for the page containing the queue. At the same time, the RBA at the end of the decrement log record is written to the buffer as the recovery RBA.

Now the message must be deleted. Again a log record describing the delete action is written, and after that the message record, in the buffer for page 5 of pageset 2 in this example, is updated to show that it has been deleted. This just involves setting on a flag; WMQ cannot remove the record, as it may be needed if the get is backed out. At the same time, the recovery RBA in the buffer for this page is set to the RBA at the end of the delete log record. Note at this time that no durable data has been written. The pageset updates have been done to the in-memory buffers, and the log has not been forced.

The put of the reply message consists of an increment of the queue depth and an insert of the data to the page at the end of the queue in the same manner as the decrement and delete for the get.

The log force on the prepare for UR1 causes all log records up to that point to be written to the disc, so at the end of this, the durable record of the updates has been made.

The actual disc records of the pagesets still do not have the updates, as they are in the buffers. A buffer will be written out to the disc when it is required for an in memory image of a different page, or when it has not been written for 3 checkpoints (more later). Note that the recovery RBA in the disc version of the page is the RBA after the last update in the disc version of the page (again, more later).
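The ordering rule in this section (log record first, then the buffer, with the disc page written later) can be sketched as follows. This is a simplified Python model, not WMQ code: the `Log` and `Page` classes, the record contents and the rule that a flush forces the log when needed are all assumptions made for illustration.

```python
class Log:
    """Append-only log; RBAs are byte offsets. Records stay buffered until forced."""

    def __init__(self):
        self.end_rba = 0      # RBA just past the last record written
        self.forced_rba = 0   # everything below this RBA is on disc

    def write(self, record: bytes) -> int:
        self.end_rba += len(record)
        return self.end_rba   # RBA at the end of this record

    def force(self):
        self.forced_rba = self.end_rba


class Page:
    """A page with an in-memory buffer and a disc image (hypothetical layout)."""

    def __init__(self, depth):
        self.buffer = {"depth": depth, "recovery_rba": 0}
        self.disc = dict(self.buffer)


def decrement_depth(log, page):
    rba = log.write(b"DEC depth by 1")   # the log record is written first...
    page.buffer["depth"] -= 1            # ...then the in-memory buffer is updated
    page.buffer["recovery_rba"] = rba    # RBA at the end of the log record


def flush(log, page):
    # Assumption: the log is forced if it is not yet ahead of the page being written.
    if page.buffer["recovery_rba"] > log.forced_rba:
        log.force()
    page.disc = dict(page.buffer)


log, page = Log(), Page(depth=5)
decrement_depth(log, page)
print(page.buffer["depth"], page.disc["depth"], log.forced_rba)
flush(log, page)
print(page.disc["depth"], log.forced_rba == log.end_rba)
```

Until `flush` runs, only the buffer reflects the update, which matches the point above that no durable data has been written until the log is forced.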

Checkpoints

[Figure: Checkpoints. A checkpoint consists of CHKPT BEG, URIDs and states (from the UREs), the low recovery RBA for each pageset (found by scanning used pages, and also written to page 0 of the pageset), CF UOWIDs and states, CF backup information, and CHKPT END, followed by a log force. Buffers more than 2 checkpoints old are flushed. The CHKPT BEG RBA is added to the checkpoint list in the BSDS. The next checkpoint is taken after LOGLOAD log records have been written, when the active log dataset changes, or at clean shutdown or successful restart. Note: 3 active log datasets => 3 checkpoints in active log => all page updates are on the pagesets plus the active log.]

Checkpoints

A checkpoint is a collection of log records that give an indication of the state of the queue manager at a point in time. It is also used as a term for the point in time when the queue manager writes these records.

A checkpoint is taken:

When the number of log records described by the LOGLOAD parameter in the ZPARMS for the queue manager has been written since the last checkpoint.

When the active log dataset which holds the current log RBA changes.

After a successful restart.

At shutdown. This is a special checkpoint as the queue manager can end work and achieve a consistent state.
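The four triggers can be expressed as a single predicate. The Python sketch below is illustrative only; `checkpoint_due` and its parameter names are hypothetical, and only LOGLOAD comes from the text.

```python
def checkpoint_due(records_since_checkpoint, logload,
                   log_dataset_switched=False,
                   restart_completed=False,
                   shutting_down=False):
    """Return True if any of the checkpoint triggers described above applies."""
    return (records_since_checkpoint >= logload
            or log_dataset_switched
            or restart_completed
            or shutting_down)


print(checkpoint_due(120_000, logload=500_000))
print(checkpoint_due(120_000, logload=500_000, log_dataset_switched=True))
```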

A checkpoint starts with a Begin Checkpoint record.

The URID and state for each active UR, as described in the URE, is written out.

Each pageset is scanned. If any page has a buffer that has not been written out for more than 2 checkpoints, it is flushed to the pageset. If there are three active log datasets (each containing at least one checkpoint), then any update that has not been flushed out to a pageset will be in the active log. The lowest page Recovery RBA for all the pages in the pageset is written in a log record, and also copied into page zero of that pageset. This represents the earliest RBA that would have to be read to bring the pageset up to date if all the data in the buffers were lost.

Information about transactions involving messages on shared queues is kept in-memory, and when the transaction is past the In Flight state, also in the CF. The in-memory information is written to a log record as part of a checkpoint. Information about which queue manager has taken a backup of an application structure, and which queue manager has made persisted shared updates is also written to the log.

An End Checkpoint record is written, and the log is forced. After the force, so that the log is always ahead, the begin checkpoint RBA is added to the list in the BSDS.


[Figure: A Quick Note on Backout. Log records for the UR: BEGIN UR; for the put in syncpoint, INC (UNDO/REDO), ALLOC PAGE (REDO), CHAIN PAGE (REDO) and INSERT (UNDO/REDO), the message not fitting on the queue so that a new page is allocated and chained to it; for the put out of syncpoint, INC (UNDO/REDO), INSERT (UNDO/REDO) and a CLR; then BEGIN ABORT, the backout records DELETE and DEC, and END ABORT. Backout reads the log backward, undoing, to BEGIN UR: skip back to the RBA in the CLR; undo the insert (action is delete); the chain page and alloc page records have no undo; undo the inc (action is dec).]


The chart shows a new application which does a put in syncpoint followed by a put out of syncpoint and then, for some reason, does a backout (sometimes called abort).

In this case, the first put does not fit on the pages allocated to the queue, and so a new page is allocated and chained into the queue. As always, each of these actions is recorded on the log before it is done to the pageset.

Each log record that describes an update to a record on a pageset must describe what action should be taken if the transaction is committed (the Redo action) and what action should be taken if the transaction is backed out (the Undo action). The log record contains the redo action and whether that record applies to redo and/or undo. The actual undo action is inferred from the redo action. For example, the undo action of an increment is a decrement.

When a UR is backed out, WMQ reads the log backward from the last record written by this UR to the Begin UR record, looking at each record with the same URID.

The first record read is the compensating log record for the out of syncpoint put. This holds an RBA that points backward in the log to the last record before the out of syncpoint put started. The out of syncpoint put has completed, and the backout of the UR should not affect this, so WMQ skips over these records without doing anything.

The next record read is the insert of the message data. This is an Undo/Redo record, and so applies to backout. The undo action is performed, which is a delete of the record holding the message.

The next two records are the chain of the page into the queue and the allocate of the page. Each of these is a Redo only record, and so backout processing takes no action. Why? The out of syncpoint put just has an increment record and an insert record, so it has clearly gone onto the same page as the in syncpoint put. If the page is unchained and deallocated now, that message will be lost. This could equally apply to a put by a different application.

The final record read is an Undo/Redo record for an increment of the queue depth. Backout processing undoes this with a decrement of the queue depth.
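The backward pass just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the record layout (dicts with `rba`, `urid`, `op`, `kind` and `skip_to`), the RBA values and the `UNDO_OF` table are hypothetical, and the out of syncpoint records are assumed to carry the same URID so that the CLR skip is visible.

```python
# Undo action inferred from the redo action (names follow the chart labels).
UNDO_OF = {"INC": "DEC", "DEC": "INC", "INSERT": "DELETE", "DELETE": "UNDELETE"}


def backout(log, urid):
    """Read the log backward and return the undo actions for one UR."""
    actions = []
    skip_to = None
    for rec in reversed(log):
        if rec["urid"] != urid:
            continue
        if skip_to is not None:
            if rec["rba"] > skip_to:
                continue              # out of syncpoint update: must not be undone
            skip_to = None
        if rec["op"] == "BEGIN UR":
            break
        if rec["op"] == "CLR":
            skip_to = rec["skip_to"]  # RBA saved before the out of syncpoint put
        elif "UNDO" in rec["kind"]:
            actions.append((UNDO_OF[rec["op"]], rec["rba"]))
        # Redo-only records (ALLOC PAGE, CHAIN PAGE) need no action here.
    return actions


log = [
    {"rba": 100, "urid": 100, "op": "BEGIN UR"},
    {"rba": 110, "urid": 100, "op": "INC", "kind": "UNDO/REDO"},
    {"rba": 120, "urid": 100, "op": "ALLOC PAGE", "kind": "REDO"},
    {"rba": 130, "urid": 100, "op": "CHAIN PAGE", "kind": "REDO"},
    {"rba": 140, "urid": 100, "op": "INSERT", "kind": "UNDO/REDO"},
    {"rba": 150, "urid": 100, "op": "INC", "kind": "UNDO/REDO"},     # out of syncpoint
    {"rba": 160, "urid": 100, "op": "INSERT", "kind": "UNDO/REDO"},  # out of syncpoint
    {"rba": 170, "urid": 100, "op": "CLR", "skip_to": 140},
]

print(backout(log, urid=100))
```

Only the in-syncpoint insert and increment produce undo actions; the page allocation and chaining are left in place, as the text explains.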


[Figure: Current Status Rebuild. The BSDS (highest RBA, checkpoint list, log inventory) locates the last checkpoint; the log is read forward to find the actual end of the log. RUREs are created from the URIDs and states in the checkpoint, which also holds pageset recovery RBAs and CF data. Reading forward from the checkpoint, a BEGIN UR adds a RURE, an END COMMIT PH1 changes the RURE state, and an END COMMIT PH2 deletes the RURE. RURE states: In Flight, In Doubt, In Commit, In Abort.]


The queue manager has ended abnormally, so what happens at restart?

Almost the first activity is Current Status Rebuild. This is the process where the queue manager uses the log to establish the states of any URs that were active at the time of failure. The queue manager also establishes the RBA of the last complete record written to the log, the RBA of the earliest record required for recovery of the pagesets, and the RBA required for CF recovery.

The BSDS is used as a shortcut into the log. The highest written RBA and the inventory of log records are used together to find a starting point in the log datasets. WMQ reads the log forward from this point until it finds the last complete log record written. This is where log recording will continue from, and is indicated with the output of a CSQJ099I message. WMQ needs this early in restart, as each stage of restart records its progress on the log.

The list of checkpoints in the BSDS is used to find the RBA of the last checkpoint written. Current Status Rebuild reads the log forward from this point to the RBA that was the end of the log when the queue manager abended.

From the checkpoint record containing the URIDs and states, WMQ creates a set of RUREs in memory. These are very similar to UREs, but are used for recovery. At this point, each one describes the state of a UR at the time when the checkpoint was written. The other checkpoint records are used to determine the lowest RBA required for media recovery of each pageset, the states of any URs involving shared queues, and the lowest RBAs required from each queue manager in the QSG for CF structure recovery.

Next, WMQ reads the log forward and uses transaction-related log records to update the states of the RUREs, create a new RURE for any Begin UR record, and remove the RURE for any record that ends a UR (End Abort, End Commit Phase2). The RBA of the last record written for each UR is also stored in the RURE.

Remember that the URID of a transaction is the RBA of the Begin UR record? This means that for each UR WMQ knows the lowest RBA that needs to be read to process the entire transaction. Hopefully, this falls in the active log, otherwise some tapes might need to be mounted!
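The RURE bookkeeping during Current Status Rebuild can be sketched as below. This is an illustrative Python model only: the tuple layout `(rba, urid, op)`, the record names and the `current_status_rebuild` function are assumptions, not WMQ's real structures.

```python
# State a UR moves to when the given record is read (chart labels, illustrative).
STATE_AFTER = {
    "COMMIT PH1 END": "In Doubt",
    "COMMIT PH2 BEG": "In Commit",
    "BEGIN ABORT": "In Abort",
}
ENDS_UR = {"COMMIT PH2 END", "END ABORT"}


def current_status_rebuild(checkpoint_urs, records):
    """checkpoint_urs: {urid: state} taken from the checkpoint.
    records: (rba, urid, op) tuples read forward from the checkpoint.
    Returns {urid: {"state": ..., "last_rba": ...}} for URs still active."""
    rures = {urid: {"state": state, "last_rba": None}
             for urid, state in checkpoint_urs.items()}
    for rba, urid, op in records:
        if op == "BEGIN UR":
            # The URID is the RBA of the Begin UR record.
            rures[rba] = {"state": "In Flight", "last_rba": rba}
        elif op in ENDS_UR:
            rures.pop(urid, None)
        elif urid in rures:
            rures[urid]["last_rba"] = rba
            rures[urid]["state"] = STATE_AFTER.get(op, rures[urid]["state"])
    return rures


checkpoint_urs = {0x100: "In Flight"}
records = [
    (0x200, 0x100, "COMMIT PH1 END"),
    (0x210, 0x210, "BEGIN UR"),
    (0x220, 0x210, "INC"),
    (0x230, 0x100, "COMMIT PH2 BEG"),
    (0x240, 0x100, "COMMIT PH2 END"),
    (0x250, 0x250, "BEGIN UR"),
    (0x260, 0x250, "BEGIN ABORT"),
]

rures = current_status_rebuild(checkpoint_urs, records)
for urid, rure in sorted(rures.items()):
    print(hex(urid), rure["state"], hex(rure["last_rba"]))
```

The lowest key left in the result is the lowest RBA that later restart phases would need to read for these URs.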

Historic Status Rebuild: Messages and Queues

[Figure: only the RUREs for In Commit and In Doubt URs are used, in order of increasing start RBA. The log is read forward from the lowest RBA, and records such as INC (UNDO/REDO), ALLOC PAGE (REDO), CHAIN PAGE (REDO) and INSERT (UNDO/REDO) are marked REDO or DONE for each page by comparing log RBAs. Message CSQR021D asks whether to commit a long UR.]

Historic Status Rebuild: Messages and Queues

The next stage is Historic Status Rebuild.

The aim is to make sure that all updates for URs that are In Commit or In Doubt are done, that media is up to date (except for changes that will be backed out), and that In Doubt URs hold the necessary locks for transactional recovery.

Firstly, the RUREs for In Commit and In Doubt URs are sorted into ascending order of start RBA (i.e. URID). The RUREs for In Flight and In Abort URs are sorted into descending order of highest RBA written. Hopefully, they will all fall within the active log datasets, but if not, message CSQR021D is issued, and the operator has a chance to commit the UR.

Each pageset is also examined to find the lowest recovery RBA which will be needed for media recovery. Hopefully this will be in the active log, as the active log should contain at least three checkpoints and pagesets are flushed out every third checkpoint. However, if a pageset has been lost, a new one may be inserted, and all log records for that pageset will need to be applied. The lower of the lowest pageset recovery RBA and the lowest RURE URID is used as the start point for Historic Status Rebuild.

The log is read forward from this point. Any log record for a change to a page with a redo type will be redone if the RURE with the corresponding URID is In Commit or In Doubt and the RBA of the record is higher than the log recovery RBA of the page. If the original update involved getting locks, then Historic Status Rebuild will get those locks and associate them with the RURE. If the UR is in commit, these will be released by the end of restart, but if it is in doubt, they will not be released until the UR is resolved, which requires external input. For records from RBAs lower than the lowest RURE URID, but higher than the lowest pageset recovery RBA, all updates to pagesets must be redone, but the locks do not need to be obtained.
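The redo decision can be written as a small predicate. The Python sketch below follows the rule exactly as worded in these notes; the record layout and the `should_redo` function are hypothetical, and real restart processing will have more cases than this.

```python
def should_redo(record, rures, page_recovery_rba, lowest_urid):
    """Decide whether a log record is reapplied to its page during the forward pass.

    record: dict with "rba", "urid" and "kind" ("REDO" or "UNDO/REDO").
    rures: {urid: {"state": ...}} built by Current Status Rebuild.
    page_recovery_rba: recovery RBA held in the disc version of the page.
    lowest_urid: lowest URID among the RUREs.
    """
    if "REDO" not in record["kind"]:
        return False
    if record["rba"] <= page_recovery_rba:
        return False                  # the disc page already contains this update
    if record["rba"] < lowest_urid:
        return True                   # media recovery only; no locks are needed
    state = rures.get(record["urid"], {}).get("state")
    return state in ("In Commit", "In Doubt")


rures = {0x300: {"state": "In Doubt"}, 0x400: {"state": "In Flight"}}
in_doubt = {"rba": 0x310, "urid": 0x300, "kind": "UNDO/REDO"}
in_flight = {"rba": 0x410, "urid": 0x400, "kind": "UNDO/REDO"}

print(should_redo(in_doubt, rures, page_recovery_rba=0x200, lowest_urid=0x300))
print(should_redo(in_flight, rures, page_recovery_rba=0x200, lowest_urid=0x300))
```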

For shared queues, there may be some log records about the change in state of URs involving the CF which may need redo-ing.

Active UR Backout

[Figure: only the RUREs for In Flight and In Backout URs are used, in order of decreasing high RBA. The log is read backward from the last log record written to BEGIN UR; each INC (UNDO/REDO) and INSERT (UNDO/REDO) record is marked UNDO or NOT DONE for its page, and redo-only updates (ALLOC PAGE, CHAIN PAGE) are not backed out. This may need archives, and reading backward is slow. A checkpoint is then written.]

Active UR Backout

The next stage is Active UR Backout.

The aim here is to undo all changes for URs that are to be backed out. Any UR that was in flight or in abort at the time when the queue manager abended must be backed out.

The sorted list of RUREs for these URs that was produced during Historic Status Rebuild is used.

The log is read backward from the highest written RBA by any of these URs. Backout is performed in a similar way to that described earlier. However, there is no need to undo any changes that were not flushed out to the pageset, as all the data in the buffers has been lost.

Note that reading the log backward is generally slow.

After active UR backout, a checkpoint is taken.

Other Stuff

[Figure: Other Stuff. Rebuild the index (for example by message CORREL ID) for indexed queues; rebuild group objects from DB2 onto PSET 0; on the other PSETs, discard all pages with non-persistent messages.]

Other Stuff

A few other things go on at queue manager start up. These include:

Rebuilding the in-memory index for each indexed queue. From V5.3, this does not have to complete before restart completes, only before the queue can be used.

Restoring group objects from DB2 tables onto pageset zero.

Removing all non-persistent messages that had been written out to the pagesets. Any one page in a pageset can hold either persistent or non-persistent messages, but not both. Those containing non-persistent messages are simply marked as available for allocation at queue manager startup.

Sample Restart Messages

16.51.01 STC29943 CSQY000I !MQ1A IBM WebSphere MQ for z/OS XXXXXXXX

16.51.01 STC29943 CSQY001I !MQ1A QUEUE MANAGER STARTING, USING PARAMETER MODULE MQ1AZPRM

16.51.01 STC29943 CSQ3111I !MQ1A CSQYSCMD - EARLY PROCESSING PROGRAM IS XXXXXXX

16.51.02 STC29943 CSQJ127I !MQ1A SYSTEM TIME STAMP FOR BSDS=2004-02-17 13:29:18.55

16.51.03 STC29943 CSQJ001I !MQ1A CURRENT COPY 1 ACTIVE LOG DATA SET IS 519

519 DSNAME=VICY.MQ1A.LOGCOPY1.DS01, STARTRBA=000000000000

519 ENDRBA=00000275FFFF

16.51.03 STC29943 CSQJ099I !MQ1A LOG RECORDING TO COMMENCE WITH 520

520 STARTRBA=000000150000

16.51.05 STC29943 CSQR001I !MQ1A RESTART INITIATED

16.51.05 STC29943 CSQR003I !MQ1A RESTART - PRIOR CHECKPOINT RBA=000000125DE2

16.51.05 STC29943 CSQR004I !MQ1A RESTART - UR COUNTS - 544

544 IN COMMIT=0, INDOUBT=0, INFLIGHT=2, IN BACKOUT=0

16.51.05 STC29943 CSQR007I !MQ1A UR STATUS 545

545 T CON-ID THREAD-XREF S URID TIME

545 - -------- ------------------------ ------------- -------------------

545 B RHARRAN3 000000000000000000000000 F00000014B6E5 2004-02-17 14:37:32

545 B RHARRAN4 000000000000000000000000 F00000014B269 2004-02-17 14:37:32

16.51.05 STC29943 CSQI049I !MQ1A Page set 0 has media recovery 546

546 RBA=000000125DE2, checkpoint RBA=000000125DE2

16.51.05 STC29943 CSQI049I !MQ1A Page set 1 has media recovery 550

550 RBA=000000125DE2, checkpoint RBA=000000125DE2

16.51.06 STC29943 CSQR030I !MQ1A Forward recovery log range 564

564 from RBA=000000125DE2 to RBA=00000014F298

16.51.06 STC29943 CSQR005I !MQ1A RESTART - FORWARD RECOVERY COMPLETE - 565

565 IN COMMIT=0, INDOUBT=0

16.51.06 STC29943 CSQR032I !MQ1A Backward recovery log range 566

566 from RBA=00000014F298 to RBA=00000014B269

16.51.06 STC29943 CSQR006I !MQ1A RESTART - BACKWARD RECOVERY COMPLETE - 567

567 INFLIGHT=0, IN BACKOUT=0

16.51.08 STC29943 CSQR002I !MQ1A RESTART COMPLETED

And Some More

16.51.08 STC29943 CSQP018I !MQ1A CSQPBCKW CHECKPOINT STARTED FOR ALL BUFFER POOLS

16.51.08 STC29943 CSQP019I !MQ1A CSQP1DWP CHECKPOINT COMPLETED FOR 570

570 BUFFER POOL 3, 2 PAGES WRITTEN



16.51.08 STC29943 !MQ1A DISPLAY THREAD(*) TYPE(INDOUBT)



16.51.08 STC29943 CSQP021I !MQ1A Page set 0 new media recovery 574

574 RBA=000000152AE8, checkpoint RBA=000000152AE8

16.51.08 STC29943 CSQP021I !MQ1A Page set 1 new media recovery 575

575 RBA=000000150850, checkpoint RBA=000000150850



16.51.08 STC29943 CSQI007I !MQ1A CSQIRBLD BUILDING IN-STORAGE INDEX FOR 582

582 QUEUE SYSTEM.CHANNEL.SYNCQ

16.51.08 STC29943 CSQV401I !MQ1A DISPLAY THREAD REPORT FOLLOWS -

16.51.08 STC29943 CSQV420I !MQ1A NO INDOUBT THREADS FOUND

16.51.08 STC29943 CSQ9022I !MQ1A CSQVDT ' DISPLAY THREAD' NORMAL COMPLETION

16.51.08 STC29943 CSQI006I !MQ1A CSQIRBLD COMPLETED IN-STORAGE INDEX 588

588 FOR QUEUE SYSTEM.CHANNEL.SYNCQ

16.51.12 STC29943 CSQJ322I !MQ1A DISPLAY SYSTEM report ... 597

16.51.12 STC29943 CSQJ322I !MQ1A DISPLAY LOG report ... 600

16.51.12 STC29943 CSQJ370I !MQ1A LOG status report ... 601

16.51.12 STC29943 CSQJ322I !MQ1A DISPLAY ARCHIVE report ... 603

16.51.12 STC29943 CSQY022I !MQ1A QUEUE MANAGER INITIALIZATION COMPLETE

16.51.12 STC29943 CSQ9022I !MQ1A CSQYASCP 'START QMGR' NORMAL COMPLETION

Timings


Current Status Rebuild

Just read from last checkpoint to end of log; pretty quick

About LOGLOAD records to read

Historic Status Rebuild

Read from earliest pageset recovery RBA or lowest URID for in commit or in doubt UR to log write commence RBA.

Range to read displayed after current status rebuild in message CSQR030I

If it falls outside the active logs for a UR, message CSQR020I is displayed

Assuming log reading at an average of 1MB/s, MQ1A completed HSR in less than 0.2 seconds. (OK it did not have much to do, only media recovery)

Progress (RBA being processed) displayed every 2 min. CSQR031I

Active UR Backout

Read backwards from highest RBA for any in flight or in abort UR to lowest URID for any in flight or in abort UR

Range to read displayed after Historic Status Rebuild in message CSQR032I

Perhaps 0.5MB/s is a reasonable data rate

Progress (RBA being processed) displayed every 2 min. CSQR033I
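The estimates above can be reproduced from the RBA ranges in the sample CSQR030I and CSQR032I messages. This is a minimal Python sketch; `scan_seconds` is a hypothetical helper, and taking 1MB/s as 1,000,000 bytes per second is an assumption.

```python
def scan_seconds(from_rba: str, to_rba: str, bytes_per_second: float) -> float:
    """Estimate the time to read the log between two hexadecimal RBAs."""
    return abs(int(to_rba, 16) - int(from_rba, 16)) / bytes_per_second


# Ranges taken from the CSQR030I and CSQR032I messages shown for MQ1A.
forward = scan_seconds("000000125DE2", "00000014F298", 1_000_000)
backward = scan_seconds("00000014F298", "00000014B269", 500_000)

print(f"forward recovery:  {forward:.3f} s")
print(f"backward recovery: {backward:.3f} s")
```

For MQ1A the forward range is 169,142 bytes, which is consistent with the figure of less than 0.2 seconds quoted above.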

What Can Go Wrong

[Figure: the durable data stores from the first chart (log, BSDS, pagesets, CF and DB2), any of which can be lost or damaged, and a UR for which BEGIN UR and COMMIT PH1 END have been logged.]

What Can Go Wrong

All that durable data is on discs and tapes (and a bit in the coupling facility and DB2). So the obvious thing that can go wrong is that some or all of this media gets destroyed or corrupted.

WMQ has the facility to have dual copies of active logs, BSDSs and archive logs, but it is still possible to lose both copies.

If one copy of the BSDS is unavailable, WMQ issues message CSQJ126E and switches to single BSDS mode. For the queue manager to restart, dual mode must be restored, by creating a new copy of the damaged BSDS and copying in the good copy.

As has been mentioned before, the most important store of durable data is the log. If there is a complete log, the queue manager can always be restarted and restored to the state that it was in shortly before the failure... It just may take some time.

It is important to make sure that logs are offloaded often enough to be able to make backups and then keep them in a safe place or places, perhaps off site. If the log is lost, the BSDS can be printed out to give an indication of which log datasets will need to be restored for restart to work.

If pagesets are lost, they can be restored from a fuzzy backup, and the pageset RBA will be correct so that media recovery can be performed at restart--that is, it should just work. More on restoring from backups later.

If CF structures are lost, those containing persistent messages can be restored from the log... So long as a backup has been taken on a queue manager whose logs are available, and the logs from any queue manager that may have done persistent updates to that structure are also available.

A problem that does not involve media failure is a long running unit of work. For example, this could be caused by a channel (long BATCHINT), or by bad application design. CSQJ160I and CSQJ161I warn of long URs. These can be committed at restart, or possibly resolved with the RESOLVE INDOUBT command.

The System Management and System Admin Guide describe disaster scenarios and responses.

Backing Up

[Figure: what to back up for each durable data store.]

BSDS: use dual copy; a copy is also held in the archive.

Active logs: copied to archive; not used in disaster recovery.

Archived logs: tapes.

Pagesets, fuzzy backup (more than once per day): DISPLAY USAGE (messages CSQI024I and CSQI025I give the start RBA); copy the datasets (e.g. FLASHCOPY); ARCHIVE LOG.

Pagesets, full backup (disruptive): quiesce workload; resolve indoubts; ARCHIVE LOG; quiesce the queue manager (messages CSQI024I and CSQI025I give the start RBA); copy (e.g. FLASHCOPY).

CF: structure backup records; back up structures after PSETs; BACKUP CFSTRUCT(...); ARCHIVE LOG (on all QMs).

DB2: back up the DSG.

Save object definitions: CSQUTIL MAKEDEFS; save ZPARMS.

Backing Up

A copy of the BSDS is produced for each archive log dataset. It has the same name as the archive, but ending .Bnnnnnnn rather than .Annnnnnn. The most recent copy should be sufficient for disaster recovery.

Pagesets can be backed up using the fuzzy backup process to achieve a restorable point of consistency without disrupting WMQ operations. This should be done at least once per day.

Issue the DISPLAY USAGE command to determine the earliest log RBA required to perform media recovery on all pagesets (and on recoverable CF structures). This is displayed in message CSQI024I. If there are pagesets that are offline and require media recovery, the corresponding RBA is displayed in message CSQI025I.

Copy all the pagesets to backups. You must have defined your pagesets with SHAREOPTS(2,3) to be able to do this while the queue manager is running. The copy must copy page zero first to ensure consistency.

Issue the ARCHIVE LOG command.

For recovery, you will need the backups of the pagesets plus copies of the archives from that containing the lower RBA from the CSQI024I or CSQI025I message to that produced by the ARCHIVE LOG command.
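Working out which archives to keep with a fuzzy backup is an RBA range overlap check against the log inventory. The Python sketch below is illustrative only: the dataset names, RBA ranges and the `archives_needed` helper are hypothetical, and in practice the inventory would come from a BSDS print.

```python
def archives_needed(inventory, start_rba, end_rba):
    """Return the names of archive datasets whose RBA range overlaps [start_rba, end_rba].

    inventory: list of (dataset name, start RBA, end RBA) tuples.
    """
    return [name for name, lo, hi in inventory
            if lo <= end_rba and hi >= start_rba]


# Hypothetical inventory for illustration.
inventory = [
    ("ARCH.A0000001", 0x000000000000, 0x0000000FFFFF),
    ("ARCH.A0000002", 0x000000100000, 0x0000001FFFFF),
    ("ARCH.A0000003", 0x000000200000, 0x0000002FFFFF),
]

# start_rba: lower RBA from CSQI024I/CSQI025I; end_rba: end of the log after ARCHIVE LOG.
needed = archives_needed(inventory, start_rba=0x125DE2, end_rba=0x2100AA)
print(needed)
```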

Pagesets can also be fully backed up, but this involves quiescing work to the queue manager, resolving all indoubts, and stopping the queue manager, before backing up the pagesets.

CF structures with RECOVER(YES) can be backed up by issuing the BACKUP CFSTRUCT(...) command. A fuzzy backup is written to the log of the queue manager on which the command is issued, and all persistent updates (e.g. MQPUT of a persistent message) to that structure are written to the log of the queue manager which does the update. The log containing the start of the latest backup and all subsequent logs for all queue managers are required for recovery, so CF structures should be backed up as often as pagesets. To gain a point of consistency for shared queues, BACKUP CFSTRUCT(...) should be issued for all recoverable application structures on a queue manager in the QSG, then the fuzzy pageset backup procedure should be followed, and all backups of archive logs and pagesets kept.

DB2 data should be backed up using the DB2 procedures.

Active logs are backed up by archiving. If you have all your active logs, you will be able to restart to the point at which the queue manager terminated.


[Diagram: BSDS (highest RBA, checkpoint list, log inventory), CF and DB2]

Restore DB2 from DB2 Backup

Minimum Requirements

Full Backup PSETs or

Fuzzy Backup PSETS plus relevant logs or

All logs (but slow)

plus DB2 backups

Restore BSDS

Allocate Datasets

Copy one from archive

CSQJU003 NEWLOG to add latest archive to BSDS

CSQJU003 DELETE and NEWLOG to add new active logs

If no logs or QSG, CSQJU003 CRESTART

Restore Pagesets

Allocate datasets

IDCAMS REPRO from backups

If no logs, CSQUTIL FORMAT REPLACE and RESETPAGE or

If no backup pagesets, CSQUTIL FORMAT RECOVER (slow)

... or Cold Start

Relevant logs are from that containing the RBA from the CSQI024I and CSQI025I messages before the fuzzy backup, to the latest log available

Restore CF Structures

Verify CFRM policies

RECOVER CFSTRUCT(...) on one QM

Need logs for all QMs

BACKUP ASAP

[Diagram: pagesets; active and archive log]

Restore Logs

Allocate datasets

IDCAMS REPRO archives from backups

CSQJUFMT format active logs

To recover to a point of consistency after a disaster, perhaps at a remote site, you will probably have:

a fuzzy backup of the pagesets

backups of the logs (including the archived BSDS), from the log containing the earliest RBA required to recover these pagesets, as identified by the CSQI024I message issued at backup time, to the latest available RBA

in a QSG, DB2 backups from that point of consistency

You could also recover from pageset backups produced in a full backup, from a complete copy of the log, or choose to cold-restart.

First, restore the BSDS from the archive copy:

Allocate new BSDS(s) using IDCAMS DEF CLUSTER

Copy the latest archive using IDCAMS REPRO

Print it out using CSQJU004 and have a look!

Print out the latest archive with CSQ1LOGP and determine the start and end RBA and LRSN (for QSG)

Add the latest archive using the CSQJU003 NEWLOG command

Delete active log inventory using the CSQJU003 DELETE command for all active logs (unless you have them)

Add in empty active log inventories using CSQJU003 NEWLOG, and define using IDCAMS DEF CLUSTER

You may need to add a Conditional Restart Record to indicate the highest LRSN that you wish to access

In a QSG, in the event of sysplex failure, use CRESTART with an ENDLRSN that is the lowest of the end LRSNs of the latest archives available for each queue manager, to achieve a point of consistency.

Restore the pagesets from the backups using IDCAMS DEF CLUSTER and IDCAMS REPRO.

You need to use CSQUTIL FORMAT TYPE(REPLACE) and CSQUTIL RESETPAGE if you have no logs that are newer than your pageset backups. This only works if the pagesets are consistent.

Restore DB2 from backups using DB2 procedures.

See the System Admin Guide, and prepare in advance


An Example: MQ1A

ARCHIVED LOGS

Tapes

CF

DB2

......

Pagesets

ACTIVE LOGS

A.QUEUE on PSID(1)

SHARED.Q on STRUCTURE APPLICATION1

Move Msgs

VICY.MQ1A.PAGE00...

VICY.MQ1A.PAGE05

VICY.MQ1A.ARCLOG1.A000000n

VICY.MQ1A.ARCLOG2.A000000n

VICY.MQ1A.ARCLOG1.B000000n

VICY.MQ1A.ARCLOG2.B000000n

(actually on DASD)

VICY.MQ1A.LOGCOPY1.DS01




VICY.MQ1A.LOGCOPY2. " "

MQ1A is a queue manager in a QSG with one other queue manager, MQ1B.

MQ1A has six pagesets, dual BSDS, four sets of dual active logs, and dual archive logs, named as in the chart.

An application generates a moderate workload by moving persistent messages between a private queue A.QUEUE on pageset 1, and a shared queue SHARED.Q (no, really), on CF structure APPLICATION1, which is the only application structure in the QSG.

The example will show the steps taken to produce a single point of consistency using fuzzy backups, and the steps taken to restore the queue manager and the QSG.

The backup procedure should be used at least once per day, and the recovery procedure should be tested regularly.

An Example: continued (2)

1. Console: /!MQ1A BACKUP CFSTRUCT(APPLICATION1)

14.21.43 STC28879 CSQE105I !MQ1A BACKUP task initiated for structure APPLICATION1

14.21.43 STC28879 CSQ9022I !MQ1A CSQELRBK ' BACKUP CFSTRUCT' NORMAL COMPLETION

14.21.43 STC28879 CSQE120I !MQ1A Backup of structure APPLICATION1 521

521 started at RBA=000003E3DA5F

14.21.44 STC28879 CSQE121I !MQ1A CSQELBK1 Backup of structure 523

523 APPLICATION1 completed at RBA=000003FF1A2C, size 2 MB

2. Console: /!MQ1A DISPLAY USAGE

14.24.25 STC28879 CSQI010I !MQ1A Page set usage ... 573

573 Page Buffer Total Unused Persist    Nonpersist Restart Expansion
573 set  pool   pages pages  data pages data pages extents count
573 _ 0  0      1618  1598   20         0          1       USER 0
573 _ 1  1      2698  1416   1268       14         1       USER 2
573 _ 2  2      1078  1078   0          0          1       USER 0
573 _ 3  3      538   538    0          0          1       USER 0
573 _ 4  0      538   538    0          0          1       USER 0
573 _ 5  1      538   538    0          0          1       USER 0
573 End of page set report

14.24.25 STC28879 CSQP001I !MQ1A Buffer pool 0 has 2000 buffers




14.24.25 STC28879 CSQI024I !MQ1A CSQIDUSE Restart RBA for system as 578

578 configured=000001B01486

14.24.25 STC28879 CSQ9022I !MQ1A CSQIDUSE ' DIS USAGE' NORMAL COMPLETION

It is time to do the backup...

Start by backing up CF structures with RECOVER(YES) on one queue manager.

This creates a fuzzy backup on the log of MQ1A. The response shows the RBA range from the start to the end of the fuzzy backup.

Issue the display usage command. This shows that all the persistent data is on Pageset 0 (objects) and Pageset 1 on which A.QUEUE resides.

Message CSQI024I indicates that the earliest RBA required to restore the pagesets and the CF structures if a fuzzy copy is taken now will be RBA 000001B01486. If pageset backups are kept, archive logs ending before this RBA may be discarded (but don't get it wrong!).



3. JCL: Copy Pagesets... Example uses ADRDSSU

//* TYPRUN=NORUN only simulates the copy; remove it to copy for real

//COPY1 EXEC PGM=ADRDSSU,REGION=32M,

// PARM='UTILMSG=YES,TYPRUN=NORUN'

//SYSPRINT DD SYSOUT=H

//SYSIN DD *

COPY -

ALLDATA(*) -

ALLEXCP -

CANCELERROR -

SHARE -

TOL(ENQF) -

RENAMEU((VICY.MQ1A.**,RHARRAN.MQ1ABA02.**)) -

DATASET(INCLUDE(VICY.MQ1A.PAGE00 -

VICY.MQ1A.PAGE01 -

VICY.MQ1A.PAGE02 -

VICY.MQ1A.PAGE03 -

VICY.MQ1A.PAGE04 -

VICY.MQ1A.PAGE05 ))

4. Console: /!MQ1A ARCHIVE LOG

14.25.15 STC28879 CSQJ033I !MQ1A FULL ARCHIVE LOG VOLUME 626

626 DSNAME=VICY.MQ1A.ARCLOG1.A0000004, STARTRBA=000003D0D000

626 ENDRBA=00000463EFFF, STARTLRSN=BBA993D649CE ENDLRSN=BBA994D56486,

626 UNIT=SYSDA, COPY1VOL=P5P44E, VOLSPAN=NO CATLG=YES

14.25.15 STC28879 CSQJ033I !MQ1A FULL ARCHIVE LOG VOLUME 627

627 DSNAME=VICY.MQ1A.ARCLOG2.A0000004, STARTRBA=000003D0D000

627 ENDRBA=00000463EFFF, STARTLRSN=BBA993D649CE ENDLRSN=BBA994D56486,

627 UNIT=SYSDA, COPY2VOL=P5P456, VOLSPAN=NO CATLG=YES

14.25.15 STC28879 CSQJ139I !MQ1A LOG OFFLOAD TASK ENDED

Copy the pagesets using utility of your choice (it must copy page zero first).

They are defined with SHAREOPTIONS(2,3) so that they can be copied while the queue manager is using them.

Issue the archive log command so that there is a point of consistency with the log ahead of the pagesets.

Note the archive number which is the highest in this backup. Noting the RBA and LRSN ranges can make recovery processing easier later.



5. JCL: Find the archive log with the earliest needed RBA (000001B01486) from the BSDS

//JU004 EXEC PGM=CSQJU004

//STEPLIB DD DSN=MQB1.V000.COM.OUT.SCSQANLE,DISP=SHR

// DD DSN=MQB1.V000.COM.OUT.SCSQAUTH,DISP=SHR

//SYSPRINT DD SYSOUT=*,DCB=BLKSIZE=629

//SYSUT1 DD DISP=SHR,DSN=VICY.MQ1A.BSDS01

//

Output: (section)

ARCHIVE LOG COPY 1 DATA SETS

START RBA/TIME/LRSN    END RBA/TIME/LRSN      CREATED     DATA SET INFORMATION
---------------------- ---------------------- ----------- --------------------
000000000000           0000001FCFFF           2004-08-13  DSN=VICY.MQ1A.ARCLOG1.A0000001
2004-08-13 13:45:41.6  2004-08-13 13:46:02.8  13:46       PASSWORD=********
BBA98BFFE3F0           BBA98C141ED7                       VOL=P5P45C UNIT=SYSDA CATALOGUED

0000001FD000           00000295CFFF           2004-08-13  DSN=VICY.MQ1A.ARCLOG1.A0000002
2004-08-13 13:46:02.8  2004-08-13 14:09:48.3  14:09       PASSWORD=********
BBA98C141ED8           BBA991638F9F                       VOL=P5P452 UNIT=SYSDA CATALOGUED

00000295D000           000003D0CFFF           2004-08-13  DSN=VICY.MQ1A.ARCLOG1.A0000003
2004-08-13 14:09:48.3  2004-08-13 14:20:45.5  14:20       PASSWORD=********
BBA991638FA0           BBA993D649CD                       VOL=P5P45F UNIT=SYSDA CATALOGUED

000003D0D000           00000463EFFF           2004-08-13  DSN=VICY.MQ1A.ARCLOG1.A0000004
2004-08-13 14:20:45.5  2004-08-13 14:25:13.0  14:25       PASSWORD=********
BBA993D649CE           BBA994D56486                       VOL=P5P44E UNIT=SYSDA CATALOGUED

Use the log summary print utility CSQJU004 to print the contents of the BSDS to determine the oldest archive needed to complete the backup.

The CSQI024I message displayed in Step 2 indicated an RBA of 000001B01486, which is in the second archive log, so we need to back up archive logs VICY.MQ1A.ARCLOG1.A0000002 to VICY.MQ1A.ARCLOG1.A0000004 and the archived BSDS VICY.MQ1A.ARCLOG1.B0000004 to get a consistent point of recovery with the pageset backups that we already have.
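The selection of archives can be checked mechanically. The following is a minimal Python sketch, not part of WMQ: the function name archives_needed is hypothetical, and the RBA ranges are typed in by hand from the CSQJU004 output for copy 1. It keeps every archive whose end RBA is at or after the restart RBA from CSQI024I.

```python
# Archive log RBA ranges copied by hand from the CSQJU004 print (copy 1 only).
ARCHIVES = [
    ("VICY.MQ1A.ARCLOG1.A0000001", 0x000000000000, 0x0000001FCFFF),
    ("VICY.MQ1A.ARCLOG1.A0000002", 0x0000001FD000, 0x00000295CFFF),
    ("VICY.MQ1A.ARCLOG1.A0000003", 0x00000295D000, 0x000003D0CFFF),
    ("VICY.MQ1A.ARCLOG1.A0000004", 0x000003D0D000, 0x00000463EFFF),
]


def archives_needed(archives, restart_rba):
    """Return the names of archives that end at or after restart_rba.

    Hypothetical helper: an archive that ends before the restart RBA
    holds nothing that media recovery of the pagesets would read.
    """
    return [name for name, _start, end in archives if end >= restart_rba]


# Restart RBA reported by CSQI024I in Step 2.
needed = archives_needed(ARCHIVES, 0x000001B01486)
for name in needed:
    print(name)
```

The same comparison, inverted, identifies archives that end before the restart RBA, which are the only ones that could be considered for discarding.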



6. JCL: Copy archive logs and BSDS (and possibly ship off site)

//* TYPRUN=NORUN only simulates the copy; remove it to copy for real

//COPY1 EXEC PGM=ADRDSSU,REGION=32M,

// PARM='UTILMSG=YES,TYPRUN=NORUN'

//SYSPRINT DD SYSOUT=H

//SYSIN DD *

COPY -

ALLDATA(*) -

ALLEXCP -

CANCELERROR -

SHARE -

TOL(ENQF) -

RENAMEU( (VICY.MQ1A.**,RHARRAN.MQ1ABA02.**)) -

DATASET(INCLUDE(VICY.MQ1A.ARCLOG1.A0000002 -

VICY.MQ1A.ARCLOG1.A0000003 -

VICY.MQ1A.ARCLOG1.A0000004 -

VICY.MQ1A.ARCLOG1.B0000004))

7. This is now a complete backup:

RHARRAN.MQ1ABA02.ARCLOG1.B0000004 (latest BSDS)

RHARRAN.MQ1ABA02.ARCLOG1.A0000002-4 (archive logs needed to recover pageset backups)

RHARRAN.MQ1ABA02.PAGE00-05 (pageset fuzzy backups)

Plus DB2 DSG backups


Finally, back up the selected archive logs and the latest archived BSDS. This is a WMQ backup from which you can recover to a consistent point.

For shared queues, you will also need a backup of the DB2 DSG taken no earlier than this time.

You will also need to know what inputs are needed to assemble the WMQ parameter modules on the recovery site.

Commands to regenerate all the objects on the queue manager can be generated using the CSQUTIL MAKEDEFS command, and these could even be used on a cold started queue manager. See the System Admin Guide for more details.

An Example: continued (6) Recovery

1. Unpack those tapes, and start to recreate... First define new BSDS

//COPY1 EXEC PGM=IDCAMS

//SYSPRINT DD SYSOUT=*

//SYSIN DD *

DEFINE CLUSTER -

(NAME(VICY.MQ1A.BACK.BSDS01) -

VOLUMES(LETSMS) -

UNIQUE -

SHAREOPTIONS(2 3) ) -

DATA -

(NAME(VICY.MQ1A.BACK.BSDS01.DATA) -

RECORDS(60 60) -

RECORDSIZE(4089 4089) -

CONTROLINTERVALSIZE(4096) -

FREESPACE(0 20) -

KEYS(4 0) ) -

INDEX -

(NAME(VICY.MQ1A.BACK.BSDS01.INDEX) -

RECORDS(5 5) -

CONTROLINTERVALSIZE(1024) )

And Copy 2...


In the event of a disaster, find the backups that make the latest point of recovery that you have... hopefully you have kept them together.

The first job is to recreate the BSDSs that will have the same properties as those for the "lost" queue manager.


2. JCL: Extract the BSDS from the archive into one copy of the new BSDS

//COPY1 EXEC PGM=IDCAMS


//SYSIN DD *

REPRO -

INDATASET(RHARRAN.MQ1ABA02.ARCLOG1.B0000004) -

OUTDATASET(VICY.MQ1A.BACK.BSDS01)

3. Print the BSDS using CSQJU004 as shown previously; keep the output

4. Find the RBA range of the last archive, which is not in the BSDS. You may have it from Step 4 of the backup procedure, or use the CSQ1LOGP utility

//CSQLOGPR EXEC PGM=CSQ1LOGP,REGION=0M

//STEPLIB DD DISP=SHR,DSN=MQB1.V000.CUR.OUT.SCSQLOAD

//ARCHIVE DD DISP=SHR,DSN=RHARRAN.MQ1ABA02.ARCLOG1.A0000004


//SYSSUMRY DD SYSOUT=*

//SYSIN DD *

SUMMARY(YES)

//

CSQ1212I FIRST LOG RBA ENCOUNTERED = 000003D0D360

000003D0D360 URID(000003CBEA6B) RM(DATA) LRID(00000001.00084301) TYPE( REDO )

SUBTYPE( FORMAT ) BACKWARD CHAIN(000842) FORWARD CHAIN(000000)

LRSN(BBA993D649D0) .....

00000463E3DC URID(0000045F1586) RM(RECOVERY) TYPE( END COMMIT2 )

LRSN(BBA994D4734A)

**** 00360036 00200010 03800000 045F1586 00000463 E3A66020 BBA994D4 734A0003

0000 00390000 0000D000 00000000 000003A9 94CDBEF3 E147

CSQ1213I LAST LOG RBA ENCOUNTERED = 00000463E3DC


Now reconstruct one BSDS, and later it can be copied into the second one.

Use IDCAMS REPRO to copy from the flat archived BSDS into the VSAM indexed dataset.

The BSDS was archived just before the archive that it is part of was made, so the last archive dataset for this backup is not recorded in this BSDS--at the time, it was the current active log being truncated by the offload process.

Print the latest archive using the CSQ1LOGP utility--see the System Admin Guide for details.

For a standalone queue manager, use SUMMARY(ONLY), as the start and end RBA for the archive will be displayed, and this is sufficient to add it to the BSDS. In a QSG, you must print the log, so as to find the LRSN for each of the start and end RBA.

The chart shows the top and the bottom of the output, with the bulk cut out.
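If the CSQ1LOGP report is saved to a file, the first and last RBA can be picked out of it rather than read by eye. The sketch below is illustrative only: it assumes the CSQ1212I and CSQ1213I lines look exactly as they do in the chart, and the function name first_last_rba and the SAMPLE text are made up for the example.

```python
import re

# Two lines in the layout shown in the chart; a real report has many more.
SAMPLE = """\
CSQ1212I FIRST LOG RBA ENCOUNTERED = 000003D0D360
CSQ1213I LAST LOG RBA ENCOUNTERED = 00000463E3DC
"""


def first_last_rba(report_text):
    """Return (first_rba, last_rba) as hex strings from a CSQ1LOGP report.

    Assumes the message layout shown in the chart: the message id,
    some text, an equals sign, then the RBA in hexadecimal.
    """
    first = re.search(r"CSQ1212I .*?=\s*([0-9A-F]+)", report_text)
    last = re.search(r"CSQ1213I .*?=\s*([0-9A-F]+)", report_text)
    if not first or not last:
        raise ValueError("CSQ1212I/CSQ1213I not found in report")
    return first.group(1), last.group(1)


first_rba, last_rba = first_last_rba(SAMPLE)
print(first_rba, last_rba)
```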


5. Add the latest archive to the BSDS using the CSQJU003 utility (remember dual archives)

//CSQJU003 EXEC PGM=CSQJU003,REGION=0M

//SYSUT1 DD DISP=SHR,DSN=VICY.MQ1A.BACK.BSDS01


//SYSIN DD *

NEWLOG DSNAME=VICY.MQ1A.ARCLOG1.A0000004,COPY1VOL=******,

UNIT=SYSDA,STARTRBA=03D0D000,ENDRBA=0463EFFF,

STRTLRSN=BBA993D649D0,ENDLRSN=BBA994D4734A,CATALOG=YES

NEWLOG DSNAME=VICY.MQ1A.ARCLOG2.A0000004,COPY2VOL=******,

UNIT=SYSDA,STARTRBA=03D0D000,ENDRBA=0463EFFF,

STRTLRSN=BBA993D649D0,ENDLRSN=BBA994D4734A,CATALOG=YES

6. Delete the old inventories for the active logs, and create new ones

//CSQJU003 EXEC PGM=CSQJU003,REGION=0M



//SYSIN DD *

DELETE DSNAME=VICY.MQ1A.LOGCOPY1.DS01

DELETE DSNAME=VICY.MQ1A.LOGCOPY2.DS01

...DS04

NEWLOG DSNAME=RHARRAN.MQ1A.LOGCOPY1.DS01,COPY1

NEWLOG DSNAME=RHARRAN.MQ1A.LOGCOPY2.DS01,COPY2

...DS04


Add the latest archive log which is part of this backup to the BSDS using the CSQJU003 utility NEWLOG command.

The start RBA is the start RBA taken from the print of this archive in Step 4 rounded down to end in '000', and the end RBA is the corresponding end RBA rounded up to end in 'FFF'.
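The rounding is simple bit arithmetic on the low 12 bits (three hex digits) of each RBA. As a minimal sketch in Python, with a hypothetical helper name newlog_rba_range and the RBAs taken from the CSQ1LOGP print in Step 4:

```python
def newlog_rba_range(first_rba_hex, last_rba_hex):
    """Round the first RBA down to end in '000' and the last up to end in 'FFF'.

    Hypothetical helper; inputs and outputs are hex strings, and the
    outputs are padded to the 12 digits used in the log print.
    """
    first = int(first_rba_hex, 16)
    last = int(last_rba_hex, 16)
    start = first & ~0xFFF   # clear the low three hex digits
    end = last | 0xFFF       # set the low three hex digits
    return f"{start:012X}", f"{end:012X}"


# First and last RBA encountered in the latest archive (CSQ1212I / CSQ1213I).
start_rba, end_rba = newlog_rba_range("000003D0D360", "00000463E3DC")
print(start_rba, end_rba)
```

For this archive the result agrees with the STARTRBA and ENDRBA that the CSQJ033I message reported when the archive was created.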

Remember to add an inventory for each copy of the archive log.

All the active logs are lost, so delete each using the CSQJU003 DELETE command, and add empty ones using the CSQJU003 NEWLOG commands.

MQ1A will have 4 active log datasets when it is recovered, and the names have changed from the originals.


7. Define the new active logs using IDCAMS DEFINE CLUSTER (one out of eight shown)

//DEFN1 EXEC PGM=IDCAMS


//SYSIN DD *

DEFINE CLUSTER -

(NAME (RHARRAN.MQ1A.LOGCOPY1.DS01) -

VOLUMES(LETSMS) -

LINEAR -

SHAREOPTIONS(2 3) -

RECORDS(10000)) -

DATA -

(NAME(RHARRAN.MQ1A.LOGCOPY1.DS01.DATA) )

8. Add Conditional Restart Records to BSDS for END RBA or LRSN of latest archive

//CSQJU003 EXEC PGM=CSQJU003,REGION=0M



//SYSIN DD *

CRESTART CREATE,ENDLRSN=BBA994D47349

9. Duplicate BSDS

//COPY1 EXEC PGM=IDCAMS


//SYSIN DD *

REPRO -

INDATASET(VICY.MQ1A.BACK.BSDS01) -

OUTDATASET(VICY.MQ1A.BACK.BSDS02)


Create the new empty active logs using the IDCAMS DEFINE CLUSTER command. MQ1A uses SHAREOPTIONS(2,3) because it is in a queue sharing group.

Create a Conditional Restart Record in the BSDS. This prevents the log from one queue manager from being read past the point of consistency for the backup of the QSG. The LRSN used is the lowest end LRSN of the latest archive log available from each queue manager in the QSG, minus 1. In this case, MQ1A had an end LRSN of BBA994D4734A for the latest archive log, and MQ1B, the other queue manager in the QSG, had BBA9956FFF7F, which is higher, so BBA994D47349 is used.
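The choice of ENDLRSN can be written as a small calculation. This is a sketch under the assumptions stated in the notes, namely that each queue manager contributes the end LRSN of its latest available archive; the function name crestart_endlrsn is hypothetical.

```python
def crestart_endlrsn(latest_archive_end_lrsns):
    """Return the ENDLRSN for CRESTART as a 12-digit hex string.

    latest_archive_end_lrsns maps each queue manager name to the end
    LRSN (hex string) of its latest available archive. The result is
    the lowest of those values minus 1.
    """
    lowest = min(int(lrsn, 16) for lrsn in latest_archive_end_lrsns.values())
    return f"{lowest - 1:012X}"


# End LRSNs quoted in the notes for the two queue managers in the QSG.
end_lrsn = crestart_endlrsn({"MQ1A": "BBA994D4734A", "MQ1B": "BBA9956FFF7F"})
print(end_lrsn)
```

The result matches the value on the CRESTART CREATE statement in Step 8. The same ENDLRSN would be used for every queue manager in the QSG so that none of them reads its log past the common point.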

The BSDS is now ready to be used to restart the queue manager, so duplicate it into the second copy using IDCAMS REPRO.

In this case, the BSDS has a different name in the backup than it did originally, so the queue manager JCL will need to reflect this.

At this point, a parameter module may need to be assembled to match the new queue manager.

Object definition commands may be added to the CSQINP2 datasets referenced in the queue manager JCL.


10. Issue /!MQ1A START QMGR (after restoring all queue managers in the QSG)

CRESTART msg:

*60 CSQJ295D !MQ1A RESTART CONTROL INDICATES TRUNCATION AT LRSN BBA994D47349. REPLY Y TO CONTINUE, N TO CANCEL

R 60,Y

CSQI049I !MQ1A Page set 0 has media recovery 291

....

CSQR002I !MQ1A RESTART COMPLETED

And the queue manager is restored to the point of consistency and available for new work.

But don't forget

RECOVER CFSTRUCT(APPLICATION1) when all queue managers are up

application dependencies,

distributed queuing,

other Resource Managers,

...


Now the queue manager can be restarted.

It issues a WTOR to confirm the conditional restart, then recovers the pagesets from the logs to the point at which the archive was taken.

The queue manager is now available for new work.

If the application structure was lost, it can be restored using the RECOVER CFSTRUCT(APPLICATION1) command once all the queue manager logs for the QSG are available.

For distributed queuing, if the queue manager is on a different system, different IP addresses and ports or LUs may be available. Channels may need to be reconfigured on this and other queue managers. The System Admin Guide describes how a queue manager and channel initiator can be restored to a different system but continue to serve in a cluster.

Other resource managers which may not be available might hold information which is needed to resolve WMQ work.

Applications which may not be available may hold information which is needed to restart WMQ work.

Disaster Recovery must take into account the whole enterprise!

Summary

WMQ keeps durable data

on logs (VSAM linear datasets and tapes)

on pagesets (VSAM linear datasets)

in the BSDS (VSAM indexed dataset)

in the Coupling Facility

in DB2

Queue manager restart uses the log

to determine the state of active transactions at the time of failure

to restore pagesets and CF structures

to commit or backout transactions as appropriate

If your data is important, make regular backups

Fuzzy pageset backups at least daily (and CF Structures too)

Copy archive logs and BSDS

Test recovery scenarios

Tailor to your enterprise
