session 1156 paul dennis -...
TRANSCRIPT
Agenda
1. Durable Data--Preparing for a Crash
  Where Durable Data is Kept
  Creation of Some Durable Data
  Transaction Related Updates
  Message and Queue Related Updates
  Checkpoints
  A Quick Note on Backout
2. Restart
  Current Status Rebuild
  Historic Status Rebuild
  Active UR Backout
  Other Stuff
  Sample Restart Messages
3. Backup and Recovery
  Restart and Recovery Problems
  Backing up
  Restarting After a Disaster
  A Backup and Recovery Example
Where Durable Data is Kept
[Chart: the durable data stores. BSDS (VSAM indexed dataset): highest RBA, checkpoint list, log inventory. Log: active logs (VSAM linear datasets) and archived logs (tapes), holding transactional information, with CP markers along the log. Pagesets (VSAM linear datasets): private objects, private messages, recovery RBAs. CF: shared messages. DB2: shared objects, group objects.]
To be able to perform restart and recovery, WebSphere MQ keeps durable data in several places.
The bootstrap dataset (BSDS) contains inventory information about the logs, including:
The RBA of the highest written record. This is not updated every time that WMQ writes to the log, so it just points to near the end of the log.
A list of recent checkpoints; more on checkpoints later.
Information about which datasets contain which portions of the log.
The BSDS is kept in a VSAM indexed dataset. You can have single or dual copies.
The log is the most important durable store used for restart and recovery. Logically, it is a linear stream of records starting from Relative Byte Address (RBA) '0000 00000000'x up to RBA 'FFFF FFFFFFFF'x. WMQ writes to the log in increasing RBA order. Physically it is divided into the active log and the archive log. The active log is kept on a set of at least three VSAM linear datasets which are used in rotation. As active log datasets fill up, records are offloaded to the archive log which is often kept on tape.
Information on the state and actions of transactions involving persistent messages is kept on the log. This includes every persistent update to the pagesets, so the log can be used for media recovery. Coupling Facility (CF) backups are also written to the log.
The log is always written to before any other persistent medium, so it is a definitive record.
Pagesets hold non-shared objects and non-shared persistent messages. They may also hold non-persistent messages. Each page of each pageset includes a record of the log RBA corresponding to when that page was last written out to disk (the Recovery RBA).
Group object definitions and shared object definitions are kept in shared DB2 tables. Shared non-persistent messages are kept in CF list structures (they survive the failure of WMQ, and so are quite durable). Shared persistent messages are also kept in CF list structures, but backed up to the log. Information about transactions involving shared messages is kept in CF list structures and on the log.
Creation of Some Durable Data: An Application
[Chart: timeline of the application (APPL), a two-phase commit COORDINATOR, WMQ and the database (DB). First transaction: get request in syncpoint, update DB, put reply in syncpoint, then two-phase commit (register interest, prepare RMs, commit RMs). Second transaction: get request in syncpoint (bad message), put error message out of syncpoint, then backout (register interest, abort RMs).]
The chart shows a possible WMQ application.
The application, APPL, gets request messages from a WMQ request queue, updates a table in a database, and puts a reply message to a WMQ reply queue.
All the updates are persistent. The application uses a two-phase coordinator (such as RRS, CICS or IMS) to coordinate the transaction between the application and the two Resource Managers (the database and WMQ).
In the chart, the application encounters an error; it gets a message from the WMQ request queue that it does not understand. In response to this, it puts a message indicating the error to a third WMQ queue. It puts this message out of syncpoint scope, because the next thing that it does is to back out the transaction.
All this application activity causes WMQ to create durable data to ensure transactional integrity, and to retain persistent updates.
Transaction-Related Updates
[Chart: the log records written for UR 1 and UR 2 of the application, with the transaction-related records highlighted. UR 1: BEGIN UR (In Flight), COMMIT PH1 END (In Doubt), COMMIT PH2 BEG (In Commit), COMMIT PH2 END (Complete). UR 2: BEGIN UR (In Flight), compensating log record (CLR), BEGIN ABORT (In Abort), END ABORT (Complete). FORCE marks the points where the log is forced. Checkpoint records (CHKPT BEG ... CHKPT END) are interleaved in the log. Each UR has a URID and an in-memory URE holding the UR state.]
WMQ creates durable updates that represent the state of any transactions in which it is taking part. Durable transactional information is kept on the log.
In the chart, log records that WMQ writes to record the state of the transactions for the application APPL, described on the previous chart, are highlighted. Transactions are sometimes called Units of Recovery (URs). Here the transactions are labelled UR1 and UR2.
A UR starts with a Begin UR record, written to the log just before WMQ does the first persistent update, in this case the destructive get of the request message. The newly started UR is identified by the log RBA of this Begin UR record, and all log records written for this UR will contain this RBA as a Unit of Recovery Identifier (URID). When the Begin UR record has been written, corresponding to the first persistent update, the UR is in the In Flight state. Each UR also has an in-memory control block holding its state, called a URE.
After its Begin UR, UR1 has some records written to the log that represent the get of the request message and the put of the reply message; more on that later. The next transaction-related log record is an End Phase1 Commit record. This is in response to the Prepare call from the coordinator because the application has initiated a two-phase commit. When WMQ returns to the coordinator with an OK return code, this is a guarantee that WMQ can commit the transaction. To back this guarantee, WMQ ensures that it is ready to commit, then it writes out the record and forces the log. Most log writes are "lazy", with data written to buffers and I/O only performed when the buffers are full. A log force initiates I/O and waits for it to complete before continuing. When the End Phase1 record has been written, UR1 is in the In Doubt State.
The coordinator prepares the database, and calls WMQ to commit UR1. WMQ writes a Begin Phase2 Commit record and forces the log. UR1 is in the In Commit state. WMQ does what it needs to do to commit the UR, then writes an End Phase2 Commit record, completing UR1.
UR2 again starts with a Begin UR and some records representing the get of the "bad" request msg. Before it writes the records representing the out of syncpoint put, it saves the last RBA it wrote in virtual storage. After it has written the OOS data, it writes a Compensating Log Record (CLR) that includes the RBA it saved, and forces the log. When WMQ backs out UR2 in response to an abort call from the coordinator because the application has initiated abort, it writes a Begin Abort record and forces the log. UR2 is in the In Abort state. It then reads the log backward and uses the records that it finds with the URID for UR2 to undo the changes that it made. When it reads the CLR, it uses the RBA in it to skip over the records for the OOS updates as these must not be undone. WMQ then writes an End Abort record and again forces the log. Now UR2 is complete.
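The state changes described in these notes can be summarised as a small state machine driven by the transaction-related log records. The Python sketch below is illustrative only: the record names, the `TRANSITIONS` table and the `replay` helper are hypothetical labels chosen for this example, not WMQ internals, and only the transitions shown on the chart are modelled.

```python
# Hypothetical record names; the states are those named in the notes.
TRANSITIONS = {
    (None, "BEGIN_UR"): "IN_FLIGHT",
    ("IN_FLIGHT", "END_COMMIT_PH1"): "IN_DOUBT",
    ("IN_DOUBT", "BEGIN_COMMIT_PH2"): "IN_COMMIT",
    ("IN_COMMIT", "END_COMMIT_PH2"): "COMPLETE",
    ("IN_FLIGHT", "BEGIN_ABORT"): "IN_ABORT",
    ("IN_ABORT", "END_ABORT"): "COMPLETE",
}


def next_state(state, record):
    """Return the UR state after a log record; data records leave it unchanged."""
    return TRANSITIONS.get((state, record), state)


def replay(records):
    """Fold the sequence of log records for one UR into its final state."""
    state = None
    for record in records:
        state = next_state(state, record)
    return state


# The two URs from the chart, as sequences of (simplified) log records.
UR1 = ["BEGIN_UR", "DEC", "DELETE", "INC", "INSERT",
       "END_COMMIT_PH1", "BEGIN_COMMIT_PH2", "END_COMMIT_PH2"]
UR2 = ["BEGIN_UR", "DEC", "DELETE", "INC", "INSERT", "CLR",
       "BEGIN_ABORT", "INC", "UNDELETE", "END_ABORT"]
```

Replaying a prefix of either list gives the state the UR would be found in if the queue manager failed at that point, which is the question restart has to answer.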
Message and Queue Updates
[Chart: the same log, with the message and queue records (DEC, DELETE, INC, INSERT and message data) highlighted. For the get from ReqQ: the decrement log record (decrement by 1) is followed by the update of the queue object (Depth=d-1) in the buffer for PSET 0, page 6; the delete log record is followed by the update of the message record (Flags=Deleted) in the buffer for PSET 2, page 5. Each buffer and each disc record carries a log RBA.]
Private queues and the messages on them are kept on pagesets; the changes made to those pagesets are also recorded on the log. These log records are highlighted in this chart for the application described earlier.
Each pageset consists of an array of 4 KB pages which can be accessed randomly. Each page consists of a disc image of that page and, when that page is being used, a matching in-memory buffer.
A queue has durable data on two pagesets. The object itself always resides on pageset 0, and the messages and queue structure reside on another pageset corresponding to the storage class of the queue. The object is contained in a single record in a single page, and includes a count of the queue depth. The queue is a collection of pages containing message data scattered around the pageset and chained together to form the queue structure. The chart shows the request queue, and the updates associated with the get of a persistent message from that queue.
Before any persistent update is done, an exclusive lock is obtained for the message, so that no other getter can get it. This ensures transactional isolation.
Next the queue depth is decremented. A log record is written to describe the decrement, and after that (remember the log must always be ahead), the queue depth is decremented in the buffer for the page containing the queue. At the same time, the RBA at the end of the decrement log record is written to the buffer as the recovery RBA.
Now the message must be deleted. Again a log record describing the delete action is written, and after that the message record, in the buffer for page 5 of pageset 2 in this example, is updated to show that it has been deleted. This just involves setting on a flag; WMQ cannot remove the record, as it may be needed if the get is backed out. At the same time, the recovery RBA in the buffer for this page is set to the RBA at the end of the delete log record. Note at this time that no durable data has been written. The pageset updates have been done to the in-memory buffers, and the log has not been forced.
The put of the reply message consists of an increment of the queue depth and an insert of the data to the page at the end of the queue in the same manner as the decrement and delete for the get.
The log force on the prepare for UR1 causes all log records up to that point to be written to the disc, so at the end of this, the durable record of the updates has been made.
The actual disc records of the pagesets still do not have the updates, as they are in the buffers. A buffer will be written out to the disc when it is required for an in memory image of a different page, or when it has not been written for 3 checkpoints (more later). Note that the recovery RBA in the disc version of the page is the RBA after the last update in the disc version of the page (again, more later).
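The ordering rule in these notes (log record first, then the buffer, with the recovery RBA taken from the end of the log record, and nothing durable until the log is forced) can be sketched as follows. This is a minimal model under stated assumptions: the `Log` and `PageBuffer` classes, the fixed record length of 100 bytes and the function name are all hypothetical and exist only for illustration.

```python
class Log:
    """Toy log: records are buffered until force() is called."""

    def __init__(self):
        self.records = []     # (end_rba, description) in increasing RBA order
        self.next_rba = 0     # RBA at which the next record will start
        self.forced_rba = 0   # everything below this RBA is on disc

    def write(self, description, length=100):
        """Lazy write: append to the log buffers, return the record's end RBA."""
        self.next_rba += length
        self.records.append((self.next_rba, description))
        return self.next_rba

    def force(self):
        """Force the log: all records written so far become durable."""
        self.forced_rba = self.next_rba


class PageBuffer:
    """In-memory buffer for the page that holds a queue object."""

    def __init__(self, depth):
        self.depth = depth
        self.recovery_rba = 0


def decrement_depth(log, page):
    """Log the decrement first, then update the buffer and its recovery RBA."""
    end_rba = log.write("DEC queue depth")
    page.depth -= 1
    page.recovery_rba = end_rba
```

After `decrement_depth` the buffer shows the new depth, but `forced_rba` is still behind the record; only a later `force()` (for example at prepare) makes the update durable.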
Checkpoints
[Chart: the records in a checkpoint. CHKPT BEG; URIDs and states (from the UREs); PSET n low recovery RBA (from a scan of used pages, during which buffers more than 2 checkpoints old are flushed, and the low recovery RBA is copied to page 0 of the pageset); CF UOWIDs and states; CF backup info; CHKPT END. The log is then forced and the CHKPT BEG RBA is added to the checkpoint list in the BSDS. The next checkpoint is taken after LOGLOAD log records have been written, when the active log dataset changes, or at clean shutdown or successful restart.]
Note: 3 active log datasets => 3 checkpoints in active log => all page updates are on pagesets + active log
A checkpoint is a collection of log records that give an indication of the state of the queue manager at a point in time. It is also used as a term for the point in time when the queue manager writes these records.
A checkpoint is taken:
When the number of log records described by the LOGLOAD parameter in the ZPARMS for the queue manager has been written since the last checkpoint.
When the active log dataset which holds the current log RBA changes.
After a successful restart.
At shutdown. This is a special checkpoint as the queue manager can end work and achieve a consistent state.
A checkpoint starts with a Begin Checkpoint record.
The URID and state for each active UR, as described in the URE, is written out.
Each pageset is scanned. If any page has a buffer that has not been written out for more than 2 checkpoints, it is flushed to the pageset. If there are three active log datasets (each containing at least one checkpoint), then any update that has not been flushed out to a pageset will be in the active log. The lowest page Recovery RBA for all the pages in the pageset is written in a log record, and also copied into page zero of that pageset. This represents the earliest RBA that would have to be read to bring the pageset up to date if all the data in the buffers were lost.
Information about transactions involving messages on shared queues is kept in-memory, and when the transaction is past the In Flight state, also in the CF. The in-memory information is written to a log record as part of a checkpoint. Information about which queue manager has taken a backup of an application structure, and which queue manager has made persisted shared updates is also written to the log.
An End Checkpoint record is written, and the log is forced. After the force, so that the log is always ahead, the begin checkpoint RBA is added to the list in the BSDS.
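The checkpoint triggers and the buffer flush rule can be sketched in a few lines of Python. This is a simplified reading of the notes, not the queue manager's algorithm: the `Buffer` class, its `checkpoints_since_write` counter and both function names are assumptions made for this example, and "more than 2 checkpoints" is interpreted as "flushed at the third checkpoint since the buffer was last written".

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    page_no: int
    dirty: bool                        # holds an update not yet on the pageset
    checkpoints_since_write: int = 0   # checkpoints seen since last written out


def checkpoint_due(records_since_checkpoint, logload,
                   log_dataset_switched=False,
                   restart_completed=False,
                   shutting_down=False):
    """True if any of the checkpoint conditions listed in the notes holds."""
    return (records_since_checkpoint >= logload
            or log_dataset_switched
            or restart_completed
            or shutting_down)


def take_checkpoint(buffers):
    """Age each dirty buffer; flush those unwritten for more than 2 checkpoints.

    Returns the page numbers flushed to the pageset at this checkpoint.
    """
    flushed = []
    for buf in buffers:
        if not buf.dirty:
            continue
        buf.checkpoints_since_write += 1
        if buf.checkpoints_since_write > 2:
            buf.dirty = False
            buf.checkpoints_since_write = 0
            flushed.append(buf.page_no)
    return flushed
```

With three active log datasets, each holding at least one checkpoint, a buffer aged this way is always flushed before the log records describing its updates leave the active log.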
A Quick Note on Backout
[Chart: an application does a put in syncpoint, a put out of syncpoint and then a backout. Log records: BEGIN UR; for the put in syncpoint INC(UNDO/REDO), ALLOC PAGE(REDO), CHAIN PAGE(REDO), INSERT(UNDO/REDO) (the message does not fit on the queue, so a new page is allocated and chained to the queue); for the put out of syncpoint INC(UNDO/REDO), INSERT(UNDO/REDO), CLR; then BEGIN ABORT, DELETE, DEC, END ABORT. Backout reads the log backward, "undoing" to Begin UR: skip back to the RBA in the CLR; undo insert (action is delete); chain page record has no undo; alloc page record has no undo; undo inc (action is dec).]
The chart shows a new application which does a put in syncpoint followed by a put out of syncpoint and then, for some reason, does a backout (sometimes called abort).
In this case, the first put does not fit on the pages allocated to the queue, and so a new page is allocated and chained into the queue. As always, each of these actions is recorded on the log before it is done to the pageset.
Each log record that describes an update to a record on a pageset must describe what action should be taken if the transaction is committed (the Redo action) and what action should be taken if the transaction is backed out (the Undo action). The log record contains the redo action and whether that record applies to redo and/or undo. The actual undo action is inferred from the redo action. For example, the undo action of an increment is a decrement.
When a UR is backed out, WMQ reads the log backward from the last record written by this UR to the Begin UR record, looking at each record with the same URID.
The first record read is the compensating log record for the out of syncpoint put. This holds an RBA that points backward in the log to the last record before the out of syncpoint put started. The out of syncpoint put has completed, and the backout of the UR should not affect this, so WMQ skips over these records without doing anything.
The next record read is the insert of the message data. This is an Undo/Redo record, and so applies to backout. The undo action is performed, which is a delete of the record holding the message.
The next two records are the chain of the page into the queue and the allocate of the page. Each of these is a Redo only record, and so backout processing takes no action. Why? The out of syncpoint put just has an increment record and an insert record, so it has clearly gone onto the same page as the in syncpoint put. If the page is unchained and deallocated now, that message will be lost. This could equally apply to a put by a different application.
The final record read is an Undo/Redo record for an increment of the queue depth. Backout processing undoes this with a decrement of the queue depth.
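The backward read described in these notes can be sketched as a short Python function. The record layout (dictionaries with `rba`, `urid`, `kind`, `undo` and `saved_rba` keys), the RBA values and the `UNDO_OF` table are assumptions made for this example; in particular, the sketch assumes the out-of-syncpoint records carry the same URID, which is why the CLR is needed to step over them.

```python
# Undo action inferred from the redo action (names are illustrative).
UNDO_OF = {"INC": "DEC", "DEC": "INC", "INSERT": "DELETE", "DELETE": "UNDELETE"}


def backout(log, urid):
    """Read the log backward for one UR and return the undo actions, in order.

    log: list of dict records in increasing RBA order.
    """
    actions = []
    skip_to = None  # RBA saved in a CLR; records above it are not undone
    for rec in reversed(log):
        if rec["urid"] != urid:
            continue
        if skip_to is not None:
            if rec["rba"] > skip_to:
                continue          # out-of-syncpoint update: leave it alone
            skip_to = None
        if rec["kind"] == "CLR":
            skip_to = rec["saved_rba"]
        elif rec["kind"] == "BEGIN_UR":
            break
        elif rec.get("undo"):     # redo-only records are skipped
            actions.append(UNDO_OF[rec["kind"]])
    return actions


# The log from the chart: put in syncpoint, put out of syncpoint, then backout.
LOG = [
    {"rba": 100, "urid": 100, "kind": "BEGIN_UR"},
    {"rba": 200, "urid": 100, "kind": "INC", "undo": True},
    {"rba": 300, "urid": 100, "kind": "ALLOC_PAGE", "undo": False},
    {"rba": 400, "urid": 100, "kind": "CHAIN_PAGE", "undo": False},
    {"rba": 500, "urid": 100, "kind": "INSERT", "undo": True},
    {"rba": 600, "urid": 100, "kind": "INC", "undo": True},     # out of syncpoint
    {"rba": 700, "urid": 100, "kind": "INSERT", "undo": True},  # out of syncpoint
    {"rba": 800, "urid": 100, "kind": "CLR", "saved_rba": 500},
]
```

For this log, `backout(LOG, 100)` yields a delete followed by a decrement: the out-of-syncpoint put and the page allocation and chaining are left untouched.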
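Current Status Rebuild
[Chart: the BSDS (highest RBA, checkpoint list, log inventory) gives the way into the log. The log is read forward to find the actual end of the log. RUREs are created from the URIDs and states in the last checkpoint, which also holds pageset recovery RBAs (for example PSET 0 low recovery RBA), CF data and other miscellaneous records. The log is then read forward from the checkpoint: a RURE is added for each BEGIN UR, the RURE state (In Flight, In Doubt, In Commit, In Abort) is changed for records such as END CMIT PH1, and the RURE is deleted for records such as END CMIT PH2.]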
The queue manager has ended abnormally, so what happens at restart?
Almost the first activity is Current Status Rebuild. This is the process where the queue manager uses the log to establish the states of any URs that were active at the time of failure. The queue manager also establishes the RBA of the last complete record written to the log, the RBA of the earliest record required for recovery of the pagesets, and the RBA required for CF recovery.
The BSDS is used as a shortcut into the log. The highest written RBA and the inventory of log records are used together to find a starting point in the log datasets. WMQ reads the log forward from this point until it finds the last complete log record written. This is where log recording will continue from, and is indicated with the output of a CSQJ099I message. WMQ needs this early in restart, as each stage of restart records its progress on the log.
The list of checkpoints in the BSDS is used to find the RBA of the last checkpoint written. Current Status Rebuild reads the log forward from this point to the RBA that was the end of the log when the queue manager abended.
From the checkpoint record containing the URIDs and states, WMQ creates a set of RUREs in memory. These are very similar to UREs, but are used for recovery. At this point, each one describes the state of a UR at the time when the checkpoint was written. The other checkpoint records are used to determine the lowest RBA required for media recovery of each pageset, the states of any URs involving shared queues, and the lowest RBAs required from each queue manager in the QSG for CF structure recovery.
Next, WMQ reads the log forward and uses transaction-related log records to update the states of the RUREs, create a new RURE for any Begin UR record, and remove a RURE for any record that ends a UR (End Abort, End Commit Phase2). The RBA of the last record written for each UR is also stored in the RURE.
Remember that the URID of a transaction is the RBA of the Begin UR record? This means that for each UR WMQ knows the lowest RBA that needs to be read to process the entire transaction. Hopefully, this falls in the active log, otherwise some tapes might need to be mounted!
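As a rough illustration of Current Status Rebuild, the sketch below seeds a table of RUREs from the checkpoint and then applies the log records that follow it. The record tuples, the record names and the dictionary layout of a RURE are hypothetical simplifications for this example only.

```python
# State entered when each transaction-related record is seen (names illustrative).
STATE_AFTER = {
    "END_COMMIT_PH1": "IN_DOUBT",
    "BEGIN_COMMIT_PH2": "IN_COMMIT",
    "BEGIN_ABORT": "IN_ABORT",
}
ENDS_UR = {"END_COMMIT_PH2", "END_ABORT"}


def current_status_rebuild(checkpoint_urs, records):
    """Rebuild the RURE table.

    checkpoint_urs: {urid: state} taken from the last checkpoint.
    records: (rba, urid, kind) tuples written after that checkpoint, in RBA order.
    Returns {urid: {"state": ..., "last_rba": ...}} for URs still active.
    """
    rures = {urid: {"state": state, "last_rba": None}
             for urid, state in checkpoint_urs.items()}
    for rba, urid, kind in records:
        if kind == "BEGIN_UR":
            rures[urid] = {"state": "IN_FLIGHT", "last_rba": rba}
        elif urid in rures:
            if kind in ENDS_UR:
                del rures[urid]
                continue
            rures[urid]["last_rba"] = rba
            if kind in STATE_AFTER:
                rures[urid]["state"] = STATE_AFTER[kind]
    return rures


# One UR was in flight at the checkpoint; two more start afterwards.
RECORDS = [
    (1500, 1000, "END_COMMIT_PH1"),
    (1600, 1600, "BEGIN_UR"),
    (1700, 1600, "DELETE"),
    (1800, 1000, "BEGIN_COMMIT_PH2"),
    (1900, 1000, "END_COMMIT_PH2"),
    (2000, 2000, "BEGIN_UR"),
    (2100, 2000, "END_COMMIT_PH1"),
]
RURES = current_status_rebuild({1000: "IN_FLIGHT"}, RECORDS)
```

Because each key is a URID, and a URID is the RBA of the Begin UR record, `min(RURES)` is the lowest RBA that must be readable to process every surviving UR.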
Historic Status Rebuild: Messages and Queues
[Chart: RUREs for In Commit and In Doubt URs only, in order of increasing start RBA. The log is read forward from the lowest RBA; redo records such as INC(UNDO/REDO), ALLOC PAGE(REDO), CHAIN PAGE(REDO) and INSERT(UNDO/REDO) are compared with the log RBA held in each page of the pagesets and marked REDO or DONE. CSQR021D: Commit long UR?]
The next stage is Historic Status Rebuild.
The aim is to make sure that all updates for URs that are In Commit or In Doubt are done, that media is up to date (except for changes that will be backed out), and that In Doubt URs hold the necessary locks for transactional recovery.
Firstly, the RUREs for In Commit and In Doubt URs are sorted into ascending order of start RBA (ie URID). The RUREs for In Flight and In Abort URs are sorted into descending order of highest RBA written. Hopefully, they will all fall within the active log datasets, but if not, message CSQR021D is issued, and the operator has a chance to commit the UR.
Each pageset is also examined to find the lowest recovery RBA which will be needed for media recovery. Hopefully this will be in the active log, as the active log should contain at least three checkpoints and pagesets are flushed out every third checkpoint. However, if a pageset has been lost, a new one may be inserted, and all log records for that pageset will need to be applied. The lowest of the lowest pageset recovery RBA and the lowest RURE URID is used as a start point for Historic Status Rebuild.
The log is read forward from this point. Any log record for a change to a page with a redo type will be redone if the RURE with the corresponding URID is In Commit or In Doubt and the RBA of the record is higher than the log recovery RBA of the page. If the original update involved getting locks, then Historic Status Rebuild will get those locks and associate them with the RURE. If the UR is in commit, these will be released by the end of restart, but if it is in doubt, they will not be released until the UR is resolved, which requires external input. For records from RBAs lower than the lowest RURE URID, but higher than the lowest pageset recovery RBA, all updates to pagesets must be redone, but the locks do not need to be obtained.
For shared queues, there may be some log records about the change in state of URs involving the CF which may need redo-ing.
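The start point and the redo test for Historic Status Rebuild can be written down directly from the two rules in these notes. The sketch below encodes only those rules as stated; the function names, the record dictionary and the way UR states are passed in are assumptions for this example, and lock handling is left out.

```python
def historic_start_rba(pageset_recovery_rbas, forward_urids):
    """Lowest of the pageset recovery RBAs and the In Commit / In Doubt URIDs."""
    return min(list(pageset_recovery_rbas) + list(forward_urids))


def should_redo(record, ur_states, page_recovery_rba, lowest_urid):
    """Decide whether a log record is reapplied to its page.

    record: dict with "rba", "urid" and "redo" (True if the record has a redo type).
    ur_states: {urid: state} for the RUREs known at restart.
    page_recovery_rba: recovery RBA stored in the disc copy of the page.
    lowest_urid: lowest URID among the In Commit / In Doubt RUREs.
    """
    if not record["redo"]:
        return False
    if record["rba"] <= page_recovery_rba:
        return False   # the disc copy of the page already contains this update
    if record["rba"] < lowest_urid:
        return True    # media recovery range: every pageset update is redone
    return ur_states.get(record["urid"]) in ("IN_COMMIT", "IN_DOUBT")
```

When there are no In Commit or In Doubt URs, `lowest_urid` can be passed as `float("inf")`, so the whole range is treated as media recovery, which matches a restart whose forward range starts at the pageset media recovery RBA.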
Active UR Backout
[Chart: RUREs for in flight and in backout URs only, in order of decreasing high RBA. The log is read backward from the last log record written; undo records such as INSERT(UNDO/REDO) and INC(UNDO/REDO) are compared with the log RBA held in each page and marked NOT DONE or UNDO. Redo-only updates (ALLOC PAGE, CHAIN PAGE) are not backed out. May need archives, and reading backward is slow. Now write a checkpoint.]
The next stage is Active UR Backout.
The aim here is to undo all changes for URs that are backed out. Any UR that was in flight or in abort at the time when the queue manager abended must be backed out.
The sorted list of RUREs for these URs that was produced during Historic Status Rebuild is used.
The log is read backward from the highest written RBA by any of these URs. Backout is performed in a similar way to that described earlier. However, there is no need to undo any changes that were not flushed out to the pageset, as all the data in the buffers has been lost.
Note that reading the log backward is generally slow.
After active UR backout, a checkpoint is taken.
Other Stuff
[Chart: rebuild index (message and CORREL ID entries such as CSQ1234, CSQ5646, CSQ6675); rebuild group objects from DB2 onto PSET 0; on other PSETs, discard all pages with non-persistent messages.]
A few other things go on at queue manager start up. These include:
Rebuilding the in-memory index for each indexed queue. From V5.3, this does not have to complete before restart completes, only before the queue can be used.
Restoring group objects from DB2 tables onto pageset zero.
Removing all non-persistent messages that had been written out to the pagesets. Any one page in a pageset can hold either persistent or non-persistent messages, but not both. Those containing non-persistent messages are simply marked as available for allocation at queue manager startup.
Sample Restart Messages
16.51.01 STC29943 CSQY000I !MQ1A IBM WebSphere MQ for z/OS XXXXXXXX
16.51.01 STC29943 CSQY001I !MQ1A QUEUE MANAGER STARTING, USING PARAMETER MODULE MQ1AZPRM
16.51.01 STC29943 CSQ3111I !MQ1A CSQYSCMD - EARLY PROCESSING PROGRAM IS XXXXXXX
16.51.02 STC29943 CSQJ127I !MQ1A SYSTEM TIME STAMP FOR BSDS=2004-02-17 13:29:18.55
16.51.03 STC29943 CSQJ001I !MQ1A CURRENT COPY 1 ACTIVE LOG DATA SET IS 519
519 DSNAME=VICY.MQ1A.LOGCOPY1.DS01, STARTRBA=000000000000
519 ENDRBA=00000275FFFF
16.51.03 STC29943 CSQJ099I !MQ1A LOG RECORDING TO COMMENCE WITH 520
520 STARTRBA=000000150000
16.51.05 STC29943 CSQR001I !MQ1A RESTART INITIATED
16.51.05 STC29943 CSQR003I !MQ1A RESTART - PRIOR CHECKPOINT RBA=000000125DE2
16.51.05 STC29943 CSQR004I !MQ1A RESTART - UR COUNTS - 544
544 IN COMMIT=0, INDOUBT=0, INFLIGHT=2, IN BACKOUT=0
16.51.05 STC29943 CSQR007I !MQ1A UR STATUS 545
545 T CON-ID THREAD-XREF S URID TIME
545 - -------- ------------------------ ------------- -------------------
545 B RHARRAN3 000000000000000000000000 F 00000014B6E5 2004-02-17 14:37:32
545 B RHARRAN4 000000000000000000000000 F 00000014B269 2004-02-17 14:37:32
16.51.05 STC29943 CSQI049I !MQ1A Page set 0 has media recovery 546
546 RBA=000000125DE2, checkpoint RBA=000000125DE2
16.51.05 STC29943 CSQI049I !MQ1A Page set 1 has media recovery 550
550 RBA=000000125DE2, checkpoint RBA=000000125DE2
16.51.06 STC29943 CSQR030I !MQ1A Forward recovery log range 564
564 from RBA=000000125DE2 to RBA=00000014F298
16.51.06 STC29943 CSQR005I !MQ1A RESTART - FORWARD RECOVERY COMPLETE - 565
565 IN COMMIT=0, INDOUBT=0
16.51.06 STC29943 CSQR032I !MQ1A Backward recovery log range 566
566 from RBA=00000014F298 to RBA=00000014B269
16.51.06 STC29943 CSQR006I !MQ1A RESTART - BACKWARD RECOVERY COMPLETE - 567
567 INFLIGHT=0, IN BACKOUT=0
16.51.08 STC29943 CSQR002I !MQ1A RESTART COMPLETED
And Some More
16.51.08 STC29943 CSQP018I !MQ1A CSQPBCKW CHECKPOINT STARTED FOR ALL BUFFER POOLS
16.51.08 STC29943 CSQP019I !MQ1A CSQP1DWP CHECKPOINT COMPLETED FOR 570
570 BUFFER POOL 3, 2 PAGES WRITTEN
16.51.08 STC29943 CSQP019I !MQ1A CSQP1DWP CHECKPOINT COMPLETED FOR 571
571 BUFFER POOL 2, 2 PAGES WRITTEN
16.51.08 STC29943 !MQ1A DISPLAY THREAD(*) TYPE(INDOUBT)
16.51.08 STC29943 CSQP019I !MQ1A CSQP1DWP CHECKPOINT COMPLETED FOR 573
573 BUFFER POOL 0, 15 PAGES WRITTEN
16.51.08 STC29943 CSQP021I !MQ1A Page set 0 new media recovery 574
574 RBA=000000152AE8, checkpoint RBA=000000152AE8
16.51.08 STC29943 CSQP021I !MQ1A Page set 1 new media recovery 575
575 RBA=000000150850, checkpoint RBA=000000150850
16.51.08 STC29943 CSQP019I !MQ1A CSQP1DWP CHECKPOINT COMPLETED FOR 576
576 BUFFER POOL 1, 58 PAGES WRITTEN
16.51.08 STC29943 CSQI007I !MQ1A CSQIRBLD BUILDING IN-STORAGE INDEX FOR 582
582 QUEUE SYSTEM.CHANNEL.SYNCQ
16.51.08 STC29943 CSQV401I !MQ1A DISPLAY THREAD REPORT FOLLOWS -
16.51.08 STC29943 CSQV420I !MQ1A NO INDOUBT THREADS FOUND
16.51.08 STC29943 CSQ9022I !MQ1A CSQVDT ' DISPLAY THREAD' NORMAL COMPLETION
16.51.08 STC29943 CSQI006I !MQ1A CSQIRBLD COMPLETED IN-STORAGE INDEX 588
588 FOR QUEUE SYSTEM.CHANNEL.SYNCQ
16.51.12 STC29943 CSQJ322I !MQ1A DISPLAY SYSTEM report ... 597
16.51.12 STC29943 CSQJ322I !MQ1A DISPLAY LOG report ... 600
16.51.12 STC29943 CSQJ370I !MQ1A LOG status report ... 601
16.51.12 STC29943 CSQJ322I !MQ1A DISPLAY ARCHIVE report ... 603
16.51.12 STC29943 CSQY022I !MQ1A QUEUE MANAGER INITIALIZATION COMPLETE
16.51.12 STC29943 CSQ9022I !MQ1A CSQYASCP 'START QMGR' NORMAL COMPLETION
Notes
Timings
Current Status Rebuild
Just read from last checkpoint to end of log; pretty quick
About LOGLOAD records to read
Historic Status Rebuild
Read from earliest pageset recovery RBA or lowest URID for in commit or in doubt UR to log write commence RBA.
Range to read displayed after current status rebuild in message CSQR030I
If it falls outside the active logs for a UR, message CSQR020I is displayed
Assuming log reading at an average of 1MB/s, MQ1A completed HSR in less than 0.2 seconds. (OK it did not have much to do, only media recovery)
Progress (RBA being processed) displayed every 2 min. CSQR031I
Active UR Backout
Read backwards from highest RBA for any in flight or in abort UR to lowest URID for any in flight or in abort UR
Range to read displayed after Historic Status Rebuild in message CSQR032I
Perhaps 0.5MB/s is a reasonable data rate
Progress (RBA being processed) displayed every 2 min. CSQR033I
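The timing estimates above can be checked with a little arithmetic on the RBA ranges reported in the sample CSQR030I and CSQR032I messages. The helper below is a sketch for that calculation only; its name is made up for this example, and it assumes 1 MB means 1,000,000 bytes and that the read rates quoted in the notes hold.

```python
def seconds_to_read(from_rba, to_rba, bytes_per_second):
    """Estimate the time to read the log between two hexadecimal RBAs."""
    span = abs(int(to_rba, 16) - int(from_rba, 16))
    return span / bytes_per_second


MB = 1_000_000  # assumption: decimal megabytes

# Forward recovery range from the sample CSQR030I message, at 1 MB/s.
forward = seconds_to_read("000000125DE2", "00000014F298", 1 * MB)

# Backward recovery range from the sample CSQR032I message, at 0.5 MB/s.
backward = seconds_to_read("00000014F298", "00000014B269", 0.5 * MB)

print(f"forward: {forward:.3f}s, backward: {backward:.3f}s")
```

The forward range is 169,142 bytes, so about 0.17 seconds, in line with the "less than 0.2 seconds" figure; the backward range is 16,431 bytes, or roughly 0.03 seconds at the slower rate.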
What Can Go Wrong
[Chart: the durable data stores from the "Where Durable Data is Kept" chart (BSDS, active and archived logs, pagesets, CF and DB2), with BEGIN UR and COMMIT PH1 END records on the log.]
All that durable data is on discs and tapes (and a bit in the coupling facility and DB2). So the obvious thing that can go wrong is that some or all of this media gets destroyed or corrupted.
WMQ has the facility to have dual copies of active logs, BSDSs and archive logs, but it is still possible to lose both copies.
If one copy of the BSDS is unavailable, WMQ issues message CSQJ126E and switches to single BSDS mode. For the queue manager to restart, dual mode must be restored, by creating a new copy of the damaged BSDS and copying in the good copy.
As has been mentioned before, the most important store of durable data is the log. If there is a complete log, the queue manager can always be restarted and restored to the state that it was in shortly before the failure... It just may take some time.
It is important to make sure that logs are offloaded often enough to be able to make backups and then keep them in a safe place or places, perhaps off site. If the log is lost, the BSDS can be printed out to give an indication of which log datasets will need to be restored for restart to work.
If pagesets are lost, they can be restored from a fuzzy backup, and the pageset RBA will be correct so that media recovery can be performed at restart--that is, it should just work. More on restoring from backups later.
If CF structures are lost, those containing persistent messages can be restored from the log... So long as a backup has been taken on a queue manager whose logs are available, and the logs from any queue manager that may have done persistent updates to that structure are also available.
A problem that does not involve media failure is a long-running unit of work. For example, this could be caused by a channel (long BATCHINT), or by a bad application design. CSQJ160I and CSQJ161I warn of long URs. These can be committed at restart, or possibly resolved with the RESOLVE INDOUBT command.
The System Management and System Admin Guide describe disaster scenarios and responses.
Backing Up
[Chart: backup actions for the pagesets, BSDS, active and archived logs (tapes), CF and DB2.]
Fuzzy Backup (>1 per day)
  DISPLAY USAGE: messages CSQI024I, CSQI025I give start RBA
  Copy DS (e.g. FLASHCOPY)
  ARCHIVE LOG
Full Backup (disruptive)
  Quiesce workload
  Resolve indoubts
  ARCHIVE LOG
  Quiesce Qmgr
  Messages CSQI024I, CSQI025I give start RBA
  Copy (e.g. FLASHCOPY)
Active logs
  Copied to archive
  Not used in disaster recovery
DB2
  Backup DSG
CF
  Structure backup records
  Backup structures after PSETs
  BACKUP CFSTRUCT(...)
  ARCHIVE LOG (on all QMs)
Save Object Definitions
  CSQUTIL MAKEDEFS
  SAVE ZPARMS
BSDS
  Use dual copy
  Copy in archive
A copy of the BSDS is produced for each archive log dataset. It has the same name as the archive, but ending .Bnnnnnnn, rather than .Annnnnnn. The most recent copy should be sufficient for disaster recovery.
Pagesets can be backed up using the fuzzy backup process to achieve a restorable point of consistency without disrupting WMQ operations. This should be done at least once per day.
Issue the DISPLAY USAGE command to determine the earliest log RBA required to perform media recovery on all pagesets (and on recoverable CF structures). This is displayed in message CSQI024I. If there are pagesets that are offline and require media recovery, the corresponding RBA is displayed in message CSQI025I.
Copy all the pagesets to backups. You must have defined your pagesets with SHAREOPTS(2,3) to be able to do this when the queue manager is running. The copy must copy page zero first to ensure consistency.
Issue the ARCHIVE LOG command.
For recovery, you will need the backups of the pagesets plus copies of the archives from that containing the lower RBA from the CSQI024I or CSQI025I message to that produced by the ARCHIVE LOG command.
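Working out which archives to keep with a fuzzy backup is a range lookup against the log inventory. The Python sketch below shows the idea only: the inventory layout, the dataset names and the `archives_needed` function are invented for this example, and in practice the RBA ranges would come from a printout of the BSDS.

```python
def archives_needed(inventory, start_rba, end_rba):
    """Return the archive datasets whose RBA range overlaps [start_rba, end_rba].

    inventory: list of (dsname, low_rba, high_rba) tuples, ranges inclusive.
    start_rba: lower of the RBAs reported by CSQI024I / CSQI025I before the copy.
    end_rba: highest RBA in the archive produced by ARCHIVE LOG after the copy.
    """
    return [name for name, low, high in inventory
            if high >= start_rba and low <= end_rba]


# Hypothetical inventory with made-up dataset names and small RBAs.
INVENTORY = [
    ("ARCHIVE.A0000001", 0, 999),
    ("ARCHIVE.A0000002", 1000, 1999),
    ("ARCHIVE.A0000003", 2000, 2999),
    ("ARCHIVE.A0000004", 3000, 3999),
]
```

For a backup whose start RBA is 1500 and whose closing ARCHIVE LOG ended at RBA 2999, the second and third archives are the ones to keep with the pageset copies.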
Pagesets can also be fully backed up, but this involves quiescing work to the queue manager, resolving all indoubts, and stopping the queue manager, before backing up the pagesets.
CF structures with RECOVER(YES) can be backed up by issuing the BACKUP CFSTRUCT(...) command. A fuzzy backup is written to the log of the queue manager on which the command is issued, and all persistent updates (e.g. MQPUT of a persistent message) to that structure are written to the log of the queue manager which does the update. The log containing the start of the latest backup and all subsequent logs for all queue managers are required for recovery, so CF structures should be backed up as often as pagesets. To gain a point of consistency for shared queues, BACKUP CFSTRUCT(...) should be issued for all recoverable application structures on a queue manager in the QSG, then the fuzzy pageset backup procedure should be followed, and all backups of archive logs and pagesets kept.
DB2 data should be backed up using the DB2 procedures.
Active logs are backed up by archiving...if you have all your active logs, then you will be able to restart to the point at which the queue manager terminated.
Restarting After a Disaster
BSDS
HIGHEST RBA
CHKPT LIST
LOG INVENTORY
CF
Restore DB2 from DB2 Backup
DB2
Minimum Requirements
Full Backup PSETs or
Fuzzy Backup PSETS plus relevant logs or
All logs (but slow)
plus DB2 backups
Restore BSDS
Allocate Datasets
Copy one from archive
CSQJU003 NEWLOG to add latest archive to BSDS
CSQJU003 DELETE and NEWLOG to add new active logs
If no logs or QSG, CSQJU003 CRESTART
Restore Pagesets
Allocate datasets
IDCAMS REPRO from backups
If no logs, CSQUTIL FORMAT REPLACE and RESETPAGE or
If no backup pagesets, CSQUTIL FORMAT RECOVER (slow)
... or Cold Start
Relevant logs are from that containing the RBA from the CSQI024I and CSQI025I messages before the fuzzy backup, to the latest log available
Restore CF Structures
Verify CFRM policies
RECOVER CFSTRUCT(...) on one QM
Need logs for all QMs
BACKUP ASAP
Pagesets
ACTIVE
ARCHIVE
Log
Restore Logs
Allocate datasets
IDCAMS REPRO archives from backups
CSQJUFMT format active logs
To recover to a point of consistency after a disaster, perhaps at a remote site, you will probably have:
a fuzzy backup of the pagesets
backups of the logs (including archived BSDS) containing the earliest RBA required to recover from these pagesets, identified by the CSQI024I message issued at backup time, and the latest available RBA
in a QSG, DB2 backups from that point of consistency
You could also recover from pageset backups produced in a full backup, from a complete copy of the log, or choose to cold-restart.
First, restore the BSDS from the archive copy:
Allocate new BSDS(s) using IDCAMS DEF CLUSTER
Copy the latest archive using IDCAMS REPRO
Print it out using CSQJU004 and have a look!
Print out the latest archive with CSQ1LOGP and determine the start and end RBA and LRSN (for QSG)
Add the latest archive using the CSQJU003 NEWLOG command
Delete active log inventory using the CSQJU003 DELETE command for all active logs (unless you have them)
Add in empty active log inventories using CSQJU003 NEWLOG, and define using IDCAMS DEF CLUSTER
You may need to add a Conditional Restart Record to indicate the highest LRSN that you wish to access
In a QSG in the event of sysplex failure, you need to use CRESTART with the ENDLRSN which is the lowest of that for the latest archives available for each queue manager to achieve a point of consistency.
Restore the pagesets from the backups using IDCAMS DEF CLUSTER and IDCAMS REPRO.
You need to use CSQUTIL FORMAT TYPE(REPLACE) and CSQUTIL RESETPAGE if you have no logs that are newer than your pageset backups. This only works if the pagesets are consistent.
Restore DB2 from backups using DB2 procedures.
See the System Admin Guide, and prepare in advance
An Example: MQ1A
ARCHIVED LOGS
Tapes
CF
DB2
......
Pagesets
ACTIVE LOGS
A.QUEUE on PSID(1)
SHARED.Q on STRUCTURE APPLICATION1
Move Msgs
VICY.MQ1A.PAGE00...
VICY.MQ1A.PAGE05
VICY.MQ1A.ARCLOG1.A000000n
VICY.MQ1A.ARCLOG2.A000000n
VICY.MQ1A.ARCLOG1.B000000n
VICY.MQ1A.ARCLOG2.B000000n
(actually on DASD)
VICY.MQ1A.LOGCOPY1.DS01
VICY.MQ1A.LOGCOPY1.DS02
VICY.MQ1A.LOGCOPY1.DS03
VICY.MQ1A.LOGCOPY1.DS04
VICY.MQ1A.LOGCOPY2. " "
MQ1A is a queue manager in a QSG with one other queue manager, MQ1B
MQ1A has six pagesets, dual BSDS, four sets of dual active logs, and dual archive logs, named as in the chart
An application generates a moderate workload by moving persistent messages between a private queue A.QUEUE on pageset 1, and a shared queue SHARED.Q (no, really), on CF structure APPLICATION1, which is the only application structure in the QSG.
The example will show the steps taken to produce a single point of consistency using fuzzy backups, and the steps taken to restore the queue manager and the QSG.
The backup procedure should be used at least once per day, and the recovery procedure should be tested regularly.
An Example: continued (2)
1. Console: /!MQ1A BACKUP CFSTRUCT(APPLICATION1)
14.21.43 STC28879 CSQE105I !MQ1A BACKUP task initiated for structure APPLICATION1
14.21.43 STC28879 CSQ9022I !MQ1A CSQELRBK ' BACKUP CFSTRUCT' NORMAL COMPLETION
14.21.43 STC28879 CSQE120I !MQ1A Backup of structure APPLICATION1 521
521 started at RBA=000003E3DA5F
14.21.44 STC28879 CSQE121I !MQ1A CSQELBK1 Backup of structure 523
523 APPLICATION1 completed at RBA=000003FF1A2C, size 2 MB
2. Console: /!MQ1A DISPLAY USAGE
14.24.25 STC28879 CSQI010I !MQ1A Page set usage ... 573
573 Page Buffer Total Unused Persist Nonpersist Restart Expansion
573 set pool pages pages data pages data pages extents count
573 _ 0 0 1618 1598 20 0 1 USER 0
573 _ 1 1 2698 1416 1268 14 1 USER 2
573 _ 2 2 1078 1078 0 0 1 USER 0
573 _ 3 3 538 538 0 0 1 USER 0
573 _ 4 0 538 538 0 0 1 USER 0
573 _ 5 1 538 538 0 0 1 USER 0
573 End of page set report
14.24.25 STC28879 CSQP001I !MQ1A Buffer pool 0 has 2000 buffers
14.24.25 STC28879 CSQP001I !MQ1A Buffer pool 1 has 50000 buffers
14.24.25 STC28879 CSQP001I !MQ1A Buffer pool 2 has 1050 buffers
14.24.25 STC28879 CSQP001I !MQ1A Buffer pool 3 has 1050 buffers
14.24.25 STC28879 CSQI024I !MQ1A CSQIDUSE Restart RBA for system as 578
578 configured=000001B01486
14.24.25 STC28879 CSQ9022I !MQ1A CSQIDUSE ' DIS USAGE' NORMAL COMPLETION
It is time to do the backup...
Start by backing up CF structures with RECOVER(YES) on one queue manager.
This creates a fuzzy backup on the log of MQ1A. The response shows the RBA range from the start to the end of the fuzzy backup.
Issue the display usage command. This shows that all the persistent data is on Pageset 0 (objects) and Pageset 1 on which A.QUEUE resides.
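As a rough cross-check of the reading above, the page set rows of the report can be scanned mechanically. The Python sketch below is an illustration only: the column positions (page set, buffer pool, total, unused, persistent and nonpersistent data pages) are an interpretation of the header as it appears in this transcript, and the function name is hypothetical.

```python
# Page set rows copied from the CSQI010I report shown in Step 2.
REPORT_ROWS = """\
573 _ 0 0 1618 1598 20 0 1 USER 0
573 _ 1 1 2698 1416 1268 14 1 USER 2
573 _ 2 2 1078 1078 0 0 1 USER 0
573 _ 3 3 538 538 0 0 1 USER 0
573 _ 4 0 538 538 0 0 1 USER 0
573 _ 5 1 538 538 0 0 1 USER 0
"""


def pagesets_with_persistent_data(rows: str) -> list[int]:
    """Return the page set numbers that hold persistent data pages.

    Assumed field positions after splitting on whitespace:
    [2] page set, [4] total, [5] unused, [6] persistent, [7] nonpersistent.
    """
    found = []
    for line in rows.splitlines():
        fields = line.split()
        psid, total, unused, persist, nonpersist = (
            int(fields[i]) for i in (2, 4, 5, 6, 7)
        )
        # Sanity check on the assumed layout: the page counts should add up.
        if total != unused + persist + nonpersist:
            raise ValueError(f"unexpected row layout: {line!r}")
        if persist > 0:
            found.append(psid)
    return found


if __name__ == "__main__":
    print(pagesets_with_persistent_data(REPORT_ROWS))
```

The page counts in each row add up under this interpretation, which is some reassurance that the columns are being read correctly.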
Message CSQI024I indicates that the earliest RBA required to restore the pagesets and the CF structures if a fuzzy copy is taken now will be RBA 000001B01486. If pageset backups are kept, archive logs ending before this RBA may be discarded (but don't get it wrong!).
An Example: continued (3)
3. JCL: Copy Pagesets... Example uses ADRDSSU
//COPY1 EXEC PGM=ADRDSSU,REGION=32M PARM='UTILMSG=YES,TYPRUN=NORUN'
//SYSPRINT DD SYSOUT=H
//SYSIN DD *
COPY -
ALLDATA(*) -
ALLEXCP -
CANCELERROR -
SHARE -
TOL(ENQF) -
RENAMEU((VICY.MQ1A.**,RHARRAN.MQ1AB02.**) ) -
DATASET(INCLUDE(VICY.MQ1A.PAGE00 -
VICY.MQ1A.PAGE01 -
VICY.MQ1A.PAGE02 -
VICY.MQ1A.PAGE03 -
VICY.MQ1A.PAGE04 -
VICY.MQ1A.PAGE05 ))
4. Console: /!MQ1A ARCHIVE LOG
14.25.15 STC28879 CSQJ033I !MQ1A FULL ARCHIVE LOG VOLUME 626
626 DSNAME=VICY.MQ1A.ARCLOG1.A0000004, STARTRBA=000003D0D000
626 ENDRBA=00000463EFFF, STARTLRSN=BBA993D649CE ENDLRSN=BBA994D56486,
626 UNIT=SYSDA, COPY1VOL=P5P44E, VOLSPAN=NO CATLG=YES
14.25.15 STC28879 CSQJ033I !MQ1A FULL ARCHIVE LOG VOLUME 627
627 DSNAME=VICY.MQ1A.ARCLOG2.A0000004, STARTRBA=000003D0D000
627 ENDRBA=00000463EFFF, STARTLRSN=BBA993D649CE ENDLRSN=BBA994D56486,
627 UNIT=SYSDA, COPY2VOL=P5P456, VOLSPAN=NO CATLG=YES
14.25.15 STC28879 CSQJ139I !MQ1A LOG OFFLOAD TASK ENDED
Copy the pagesets using the utility of your choice (it must copy page zero first).
They are defined with SHAREOPTIONS(2,3) so that they can be copied while the queue manager is using them.
Issue the archive log command so that there is a point of consistency with the log ahead of the pagesets.
Note the archive number which is the highest in this backup. Noting the RBA and LRSN ranges can make recovery processing easier later.
An Example: continued (4)
5. JCL: Find the archive log with the earliest needed RBA from BSDS (000001B01486)
//JU004 EXEC PGM=CSQJU004
//STEPLIB DD DSN=MQB1.V000.COM.OUT.SCSQANLE,DISP=SHR
// DD DSN=MQB1.V000.COM.OUT.SCSQAUTH,DISP=SHR
//SYSPRINT DD SYSOUT=*,DCB=BLKSIZE=629
//SYSUT1 DD DISP=SHR,DSN=VICY.MQ1A.BSDS01
//
Output: (section)
ARCHIVE LOG COPY 1 DATA SETS
START RBA/TIME/LRSN END RBA/TIME/LRSN CREATED DATA SET INFORMATION
---------------------- ---------------------- ----------- --------------------
000000000000 / 0000001FCFFF / 2004-08-13 DSN=VICY.MQ1A.ARCLOG1.A0000001
2004-08-13 13:45:41.6 2004-08-13 13:46:02.8 13:46 PASSWORD=********
/ BBA98BFFE3F0 / BBA98C141ED7 VOL=P5P45C UNIT=SYSDA
CATALOGUED
0000001FD000 / 00000295CFFF / 2004-08-13 DSN=VICY.MQ1A.ARCLOG1.A0000002
2004-08-13 13:46:02.8 2004-08-13 14:09:48.3 14:09 PASSWORD=********
/ BBA98C141ED8 / BBA991638F9F VOL=P5P452 UNIT=SYSDA
CATALOGUED
00000295D000 / 000003D0CFFF / 2004-08-13 DSN=VICY.MQ1A.ARCLOG1.A0000003
2004-08-13 14:09:48.3 2004-08-13 14:20:45.5 14:20 PASSWORD=********
/ BBA991638FA0 / BBA993D649CD VOL=P5P45F UNIT=SYSDA
CATALOGUED
000003D0D000 / 00000463EFFF / 2004-08-13 DSN=VICY.MQ1A.ARCLOG1.A0000004
2004-08-13 14:20:45.5 2004-08-13 14:25:13.0 14:25 PASSWORD=********
/ BBA993D649CE / BBA994D56486 VOL=P5P44E UNIT=SYSDA
CATALOGUED
Use the log summary print utility CSQJU004 to print the contents of the BSDS to determine the oldest archive needed to complete the backup.
The CSQI024I message displayed in Step 2 indicated an RBA of 000001B01486, which is in the second archive log, so we need to back up archive logs VICY.MQ1A.ARCLOG1.A0000002 to VICY.MQ1A.ARCLOG1.A0000004 and the archived BSDS VICY.MQ1A.ARCLOG1.B0000004 to get a consistent point of recovery with the pageset backups that we already have.
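The selection of archives can be written down as a simple range check. The following Python sketch uses the RBA ranges from the CSQJU004 output in Step 5 and the restart RBA from Step 2; the function name and data layout are assumptions made for illustration, not part of any WMQ utility.

```python
# (dataset name, start RBA, end RBA) as listed in the CSQJU004 print.
ARCHIVES = [
    ("VICY.MQ1A.ARCLOG1.A0000001", 0x000000000000, 0x0000001FCFFF),
    ("VICY.MQ1A.ARCLOG1.A0000002", 0x0000001FD000, 0x00000295CFFF),
    ("VICY.MQ1A.ARCLOG1.A0000003", 0x00000295D000, 0x000003D0CFFF),
    ("VICY.MQ1A.ARCLOG1.A0000004", 0x000003D0D000, 0x00000463EFFF),
]


def archives_needed(archives, restart_rba: int) -> list[str]:
    """Return every archive that ends at or after the restart RBA.

    Archives that end before the restart RBA are not needed to recover
    from a fuzzy pageset backup taken at this point.
    """
    return [dsn for dsn, _start, end in archives if end >= restart_rba]


if __name__ == "__main__":
    for dsn in archives_needed(ARCHIVES, 0x000001B01486):
        print(dsn)
```

With the values from the example this selects archives A0000002 to A0000004, matching the list in the notes.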
An Example: continued (5)
6. JCL: Copy archive logs and BSDS (and possibly ship off site)
//COPY1 EXEC PGM=ADRDSSU,REGION=32M PARM='UTILMSG=YES,TYPRUN=NORUN'
//SYSPRINT DD SYSOUT=H
//SYSIN DD *
COPY -
ALLDATA(*) -
ALLEXCP -
CANCELERROR -
SHARE -
TOL(ENQF) -
RENAMEU( (VICY.MQ1A.**,RHARRAN.MQ1ABA02.**)) -
DATASET(INCLUDE(VICY.MQ1A.ARCLOG1.A0000002 -
VICY.MQ1A.ARCLOG1.A0000003 -
VICY.MQ1A.ARCLOG1.A0000004 -
VICY.MQ1A.ARCLOG1.B0000004))
7. This is now a complete backup:
RHARRAN.MQ1ABA02.ARCLOG1.B0000004 (latest BSDS)
RHARRAN.MQ1ABA02.ARCLOG1.A0000002-4 (archive logs needed to recover pageset backups)
RHARRAN.MQ1ABA02.PAGE00-05 (pageset fuzzy backups)
Plus DB2 DSG backups
Finally, back up the selected archive logs and the latest archived BSDS. This is a WMQ backup from which you can recover to a consistent point.
For shared queues, you will also need a backup of the DB2 DSG taken no earlier than this time.
You will also need to know what inputs are needed to assemble the WMQ parameter modules on the recovery site.
Commands to regenerate all the objects on the queue manager can be generated using the CSQUTIL MAKEDEFS command, and these could even be used on a cold started queue manager. See the System Admin Guide for more details.
An Example: continued (6) Recovery
1. Unpack those tapes, and start to recreate... First define new BSDS
//COPY1 EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
DEFINE CLUSTER -
(NAME(VICY.MQ1A.BACK.BSDS01) -
VOLUMES(LETSMS) -
UNIQUE -
SHAREOPTIONS(2 3) ) -
DATA -
(NAME(VICY.MQ1A.BACK.BSDS01.DATA) -
RECORDS(60 60) -
RECORDSIZE(4089 4089) -
CONTROLINTERVALSIZE(4096) -
FREESPACE(0 20) -
KEYS(4 0) ) -
INDEX -
(NAME(VICY.MQ1A.BACK.BSDS01.INDEX) -
RECORDS(5 5) -
CONTROLINTERVALSIZE(1024) )
And Copy 2...
In the event of a disaster, find the backups that make the latest point of recovery that you have... hopefully you have kept them together.
The first job is to recreate the BSDSs that will have the same properties as those for the "lost" queue manager.
An Example: continued (7)
2. JCL: Extract the BSDS from the archive into one copy of the new BSDS
//COPY1 EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
REPRO -
INDATASET(RHARRAN.MQ1ABA02.ARCLOG1.B0000004) -
OUTDATASET(VICY.MQ1A.BACK.BSDS01)
3. Print the BSDS using CSQJU004 as shown previously; keep the output
4. Find the RBA range of the last archive, which is not in the BSDS
You may have it from Step 4 of the backup procedure, or use the CSQ1LOGP utility
//CSQLOGPR EXEC PGM=CSQ1LOGP,REGION=0M
//STEPLIB DD DISP=SHR,DSN=MQB1.V000.CUR.OUT.SCSQLOAD
//ARCHIVE DD DISP=SHR,DSN=RHARRAN.MQ1ABA02.ARCLOG1.A0000004
//SYSPRINT DD SYSOUT=*
//SYSSUMRY DD SYSOUT=*
//SYSIN DD *
SUMMARY(YES)
//
CSQ1212I FIRST LOG RBA ENCOUNTERED = 000003D0D360
000003D0D360 URID(000003CBEA6B) RM(DATA) LRID(00000001.00084301) TYPE( REDO )
SUBTYPE( FORMAT ) BACKWARD CHAIN(000842) FORWARD CHAIN(000000)
LRSN(BBA993D649D0) .....
00000463E3DC URID(0000045F1586) RM(RECOVERY) TYPE( END COMMIT2 )
LRSN(BBA994D4734A)
**** 00360036 00200010 03800000 045F1586 00000463 E3A66020 BBA994D4 734A0003
0000 00390000 0000D000 00000000 000003A9 94CDBEF3 E147
CSQ1213I LAST LOG RBA ENCOUNTERED = 00000463E3DC
Now reconstruct one BSDS, and later it can be copied into the second one.
Use IDCAMS REPRO to copy from the flat archived BSDS into the VSAM indexed dataset.
The BSDS was archived just before the archive that it is part of was made, so the last archive dataset for this backup is not recorded in this BSDS--at the time, it was the current active log being truncated by the offload process.
Print the latest archive using the CSQ1LOGP utility--see the System Admin Guide for details.
For a standalone queue manager, use SUMMARY(ONLY), as the start and end RBA for the archive will be displayed, and this is sufficient to add it to the BSDS. In a QSG, you must print the log, so as to find the LRSN for each of the start and end RBA.
The chart shows the top and the bottom of the output, with the bulk cut out.
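If the CSQ1LOGP output is captured to a file, the first and last RBA can be picked out of the CSQ1212I and CSQ1213I lines rather than read by eye. This Python sketch assumes the message wording shown on the chart; the function name is hypothetical and the sample text is abridged in the same way as the chart.

```python
import re

# Top and bottom of the CSQ1LOGP output from the chart, bulk omitted.
SAMPLE_OUTPUT = """\
CSQ1212I FIRST LOG RBA ENCOUNTERED = 000003D0D360
.....
CSQ1213I LAST LOG RBA ENCOUNTERED = 00000463E3DC
"""


def first_and_last_rba(text: str) -> tuple[str, str]:
    """Extract the first and last log RBA reported by CSQ1LOGP."""
    first = re.search(r"CSQ1212I FIRST LOG RBA ENCOUNTERED = ([0-9A-F]+)", text)
    last = re.search(r"CSQ1213I LAST LOG RBA ENCOUNTERED = ([0-9A-F]+)", text)
    if first is None or last is None:
        raise ValueError("CSQ1212I or CSQ1213I message not found")
    return first.group(1), last.group(1)


if __name__ == "__main__":
    print(first_and_last_rba(SAMPLE_OUTPUT))
```

In a QSG the LRSN values still have to be read from the first and last printed log records, as the notes describe.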
An Example: continued (8)
5. Add the latest archive to the BSDS using the CSQJU003 utility (remember dual archives)
//CSQJU003 EXEC PGM=CSQJU003,REGION=0M
//SYSUT1 DD DISP=SHR,DSN=VICY.MQ1A.BACK.BSDS01
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
NEWLOG DSNAME=VICY.MQ1A.ARCLOG1.A0000004,COPY1VOL=******,
UNIT=SYSDA,STARTRBA=03D0D000,ENDRBA=0463EFFF,
STRTLRSN=BBA993D649D0,ENDLRSN=BBA994D4734A,CATALOG=YES
NEWLOG DSNAME=VICY.MQ1A.ARCLOG2.A0000004,COPY2VOL=******,
UNIT=SYSDA,STARTRBA=03D0D000,ENDRBA=0463EFFF,
STRTLRSN=BBA993D649D0,ENDLRSN=BBA994D4734A,CATALOG=YES
6. Delete the old inventories for the active logs, and create new ones
//CSQJU003 EXEC PGM=CSQJU003,REGION=0M
//SYSUT1 DD DISP=SHR,DSN=VICY.MQ1A.BACK.BSDS01
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
DELETE DSNAME=VICY.MQ1A.LOGCOPY1.DS01
DELETE DSNAME=VICY.MQ1A.LOGCOPY2.DS01
...DS04
NEWLOG DSNAME=RHARRAN.MQ1A.LOGCOPY1.DS01,COPY1
NEWLOG DSNAME=RHARRAN.MQ1A.LOGCOPY2.DS01,COPY2
...DS04
Add the latest archive log which is part of this backup to the BSDS using the CSQJU003 utility NEWLOG command.
The start RBA is the start RBA taken from the print of this archive in Step 4 rounded down to end in '000', and the end RBA is the corresponding end RBA rounded up to end in 'FFF'.
Remember to add an inventory for each copy of the archive log.
All the active logs are lost, so delete each using the CSQJU003 DELETE command, and add empty ones using the CSQJU003 NEWLOG commands.
MQ1A will have 4 active log datasets when it is recovered, and the names have changed from the originals.
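The rounding rule for the NEWLOG start and end RBA described above is plain bit arithmetic on the last three hexadecimal digits. A minimal Python sketch, with a hypothetical function name and the RBAs from the CSQ1LOGP print in Step 4:

```python
def newlog_rba_range(first_rba: str, last_rba: str) -> tuple[str, str]:
    """Round the printed RBAs to the boundaries used on NEWLOG.

    The start RBA is rounded down to end in '000' and the end RBA is
    rounded up to end in 'FFF'. Results are 12 hex digits; the slide
    writes the same values with leading zeros omitted.
    """
    start = int(first_rba, 16) & ~0xFFF  # clear the low 12 bits
    end = int(last_rba, 16) | 0xFFF      # set the low 12 bits
    return f"{start:012X}", f"{end:012X}"


if __name__ == "__main__":
    print(newlog_rba_range("000003D0D360", "00000463E3DC"))
```

The result agrees with the STARTRBA and ENDRBA reported by message CSQJ033I when the archive was created.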
An Example: continued (9)
7. Define the new active logs using IDCAMS DEFINE CLUSTER (one out of eight shown)
//DEFN1 EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
DEFINE CLUSTER -
(NAME (RHARRAN.MQ1A.LOGCOPY1.DS01) -
VOLUMES(LETSMS) -
LINEAR -
SHAREOPTIONS(2 3) -
RECORDS(10000)) -
DATA -
(NAME(RHARRAN.MQ1A.LOGCOPY1.DS01.DATA) )
8. Add Conditional Restart Records to BSDS for END RBA or LRSN of latest archive
//CSQJU003 EXEC PGM=CSQJU003,REGION=0M
//SYSUT1 DD DISP=SHR,DSN=VICY.MQ1A.BACK.BSDS01
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
CRESTART CREATE,ENDLRSN=BBA994D47349
9. Duplicate BSDS
//COPY1 EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
REPRO -
INDATASET(VICY.MQ1A.BACK.BSDS01) -
OUTDATASET(VICY.MQ1A.BACK.BSDS02)
Create the new empty active logs using the IDCAMS DEFINE CLUSTER command. MQ1A uses SHAREOPTIONS(2,3) because it is in a queue sharing group.
Create a Conditional Restart Record in the BSDS. This prevents the log from one queue manager from being read past the point of consistency for the backup of the QSG. The LRSN used is the lowest end LRSN of the latest archive logs available from each queue manager in the QSG, minus 1. In this case, MQ1A had an end LRSN of BBA994D4734A for the latest archive log, and MQ1B, the other queue manager in the QSG, had BBA9956FFF7F, which is higher, so BBA994D47349 is used.
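The choice of ENDLRSN can be checked with a short calculation. This Python sketch takes the end LRSN of the latest archive for each queue manager, as quoted in the notes, and applies the "lowest, minus 1" rule; the function name and dictionary layout are illustrative assumptions.

```python
def crestart_endlrsn(end_lrsns: dict[str, str]) -> str:
    """Return the ENDLRSN for CRESTART: lowest end LRSN in the QSG minus 1."""
    lowest = min(int(lrsn, 16) for lrsn in end_lrsns.values())
    return f"{lowest - 1:012X}"


if __name__ == "__main__":
    latest_archive_end_lrsn = {
        "MQ1A": "BBA994D4734A",
        "MQ1B": "BBA9956FFF7F",
    }
    print(crestart_endlrsn(latest_archive_end_lrsn))
```

The same value would be used on the CRESTART statement for every queue manager in the QSG, so that none of them reads its log past the common point of consistency.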
The BSDS is now ready to be used to restart the queue manager, so duplicate it into the second copy using IDCAMS REPRO.
In this case, the BSDS has a different name in the backup than it did originally, so the queue manager JCL will need to reflect this.
At this point, a parameter module may need to be assembled to match the new queue manager.
Object definition commands may be added to the CSQINP2 datasets referenced in the queue manager JCL.
An Example: continued (10)
10. Issue /!MQ1A START QMGR (after restoring all qmgrs in QSG)
CRESTART msg:
*60 CSQJ295D !MQ1A RESTART CONTROL INDICATES TRUNCATION AT LRSN BBA994D47349. REPLY Y TO CONTINUE, N TO
CANCEL
R 60,Y
CSQI049I !MQ1A Page set 0 has media recovery 291
....
CSQR002I !MQ1A RESTART COMPLETED
And the queue manager is restored to the point of consistency and available for new work.
But don't forget
RECOVER CFSTRUCT(APPLICATION1) when all queue managers are up
application dependencies,
distributed queuing,
other Resource Managers,
...
Now the queue manager can be restarted.
It issues a WTOR to confirm the conditional restart, then recovers the pagesets from the logs to the point at which the archive was taken.
The queue manager is now available for new work.
If the application structure was lost, it can be restored using the RECOVER CFSTRUCT(APPLICATION1) command once all the queue manager logs for the QSG are available.
For distributed queuing, if the queue manager is on a different system, different IP addresses and ports or LUs may be available. Channels may need to be reconfigured on this and other queue managers. The System Admin Guide describes how a queue manager and channel initiator can be restored to a different system but continue to serve in a cluster.
Other resource managers which may not be available might hold information which is needed to resolve WMQ work.
Applications which may not be available may hold information which is needed to restart WMQ work.
Disaster Recovery must take into account the whole enterprise!
Summary
WMQ keeps durable data
on logs (VSAM linear datasets and tapes)
on pagesets (VSAM linear datasets)
in the BSDS (VSAM indexed dataset)
in the Coupling Facility
in DB2
Queue manager restart uses the log
to determine the state of active transactions at the time of failure
to restore pagesets and CF structures
to commit or backout transactions as appropriate
If your data is important, make regular backups
Fuzzy pageset backups at least daily (and CF Structures too)
Copy archive logs and BSDS
Test recovery scenarios
Tailor to your enterprise