Unisys ClearPath Systems Management · 2014-12-01

Unisys ClearPath Systems Management Maximising IT Service Availability By Peter Bye




Contents

Introduction
Managing normal operation
   Managing the flow of work
   Housekeeping functions
   Monitoring status
Providing for the worst – handling abnormal conditions
   Degrees of negligence
   Three recommendations
The technology
   Medium to large IT environments and systems management
   ClearPath system attributes and system-level products
      Business Continuity Accelerator (BCA)
      Extended Transaction Capacity (XTC)
   Systems management products
      Operations Sentinel
      SMA OpCon
      Enterprise Output Manager
      TeamQuest and SightLine
      Other products
   Managing ClearPath fabric-based systems
Conclusions
More information
Appendix: Managing IT service interruption: a process view
   Service interruption and restoration: the components
   A process view of service restoration with minimal automation
   Enhancing the environment
   Towards the dark data centre
About the author


Introduction

Organisations of any size increasingly depend on IT services to conduct their business. In some cases

they are absolutely critical: banks and online operations such as Google and Amazon cannot operate

without them. Their IT systems really are their business. And while IT may be less critical in other

organisations, it will still play a vital role. The chances are that business functions rapidly become

impaired if IT services are not working properly. For example, although airlines are in the business of

moving people and freight, they very quickly stop being able to operate if their IT systems are unavailable.

Passengers cannot make reservations, check in or board; freight gets lost; and a host of technical

functions cannot be performed. Ultimately, whole economies can seize up if IT systems do not work as

they should or even at all.

Maximising IT service availability as cost-effectively as possible is the task of systems management. This

paper discusses the functions of systems management and the technology available with ClearPath

systems. The emphasis is on managing environments where ClearPath systems coexist and collaborate

with other system types in the same organisation, or externally, in partner organisations of some kind.

As will become clear, this paper strongly emphasises the importance of automation. Any large

organisation such as a bank, airline or government department, will require significant systems

management resources. Automation is essential, not only to manage costs but, more importantly, to

ensure the highest levels of performance and availability. Large-scale environments cannot rely on

manual processes for group-to-group and people-to-system interactions, even in locations where the

costs of people are relatively low. Manual operation is error-prone, as repeated studies have shown, and

in some cases, it may simply not be possible. High traffic volumes, for instance, require immediate – and

hence automated – action if the consequences of any failure are not to get out of control.

High levels of automation are, then, essential for systems management in today’s IT environments.

Human intervention should be reserved for the kind of complex decisions best made by people. Examples

include a decision to move production to a disaster recovery (DR) site following a major incident, and the

organisational and external public relations functions that may be required in the aftermath of such an

event. One consequence of the increased level of automation is that the role of operators changes. They

are now required to perform higher-level functions; their role is better described as operations analysis

and management.

The paper begins by reviewing the role of systems management in handling the delivery of IT services

under normal operational conditions, where the systems are working correctly or with no significant

problems. It then goes on to discuss dealing with abnormal conditions, up to and including a complete

loss of IT services, for example following a natural disaster such as a flood, or a more localised event

such as a fire in the data centre. The technology available for managing ClearPath systems within a wider

environment is then surveyed, including a discussion of the management of ClearPath fabric-based

systems. The paper concludes by reviewing the key points raised. Finally, some pointers to further

information are provided. An appendix describes how increasing levels of automation affect the

processes of handling service interruptions, and the groups of people involved, from operations staff to

end users.


Managing normal operation

The management of normal operation is taken to mean those management functions necessary to deliver

the IT services an organisation requires when there are no significant problems in the systems or the

environment in which they operate. In such conditions, a variety of systems management functions have

to be performed. They may be divided into three groups, each of which is discussed in the following

paragraphs:

1) Managing the flow of work required to support the organisation’s business processes.

2) The housekeeping functions necessary to gather information about the systems, and to ensure that

the work flows smoothly and safely, and will continue to do so.

3) Monitoring the overall status, or health, of the IT environment to detect failures or potential failures as

quickly as possible and take corrective action.

Managing the flow of work

Although the details will vary widely depending on the nature of the business, the flow of work comprises

the online, transaction and batch processing needed to deliver the current IT services. It also includes the

development and test processes required to correct problems, enhance performance and bring new

services into production.

Online services are accessed directly by members of various user constituencies. The applications may

be internal to the organisation concerned, or serving a wider set of people, including the general public.

The sources of online requests include:

- The public, using the Internet and browsers or other devices such as fixed-line telephones, smartphones and tablets of various kinds, emails and SMS messages, and specialised kiosks, for instance ATMs and self-check-in devices.

- Members of an organisation serving the public: people in call centres, bank tellers and travel agents are examples.

- Internal users, including specialised applications such as analysis of consumer behaviour, and administrative functions – expense claims, for example.

- Process control systems using specialised equipment.

Batch processes run in support of online processes or perform complementary functions. An example of

the former is reconciliation and verification of funds moved between accounts. The latter might be the

application of a declared stock dividend across all portfolios containing that stock. Processing standing

orders is another example.

Although online and batch processes are often regarded as separate activities, the boundary between the

two is blurred. Batch processes can be treated as sets of transactions executed against a database or

databases if the system is designed that way. Both types of processing share the following attributes,

which place requirements on work flow management:

- The processes may span more than one system, within the same organisation or externally.

- Elapsed- and real-time constraints require processes to finish within a certain period or by a specific clock time. For many organisations, the requirement is for online access to be 24x7, or as near to that as possible.

- A wide variety of events may initiate processes, including requests from different external access channels, time and date, and receipt of specific triggers such as emails or file transfers.


- Processes may be very long ('sagas' is a term sometimes used), with processing suspended for extended periods awaiting some event, such as an email or input message, to restart. Processing insurance claims, for instance, may require investigation by a claims adjuster, whose approval or otherwise is required to continue. Large claims may take many weeks or longer to complete. The process must not get lost during this time.

- A wide variety of output may be generated, requiring printing or distribution to other media.

- Time, date and specific days may vary the process flow: week, month and year ends, for example, can change processing volumes, functions and timing.
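The variety of initiating events described above lends itself to table-driven automation. As an illustrative sketch only (in Python; the trigger and job names are hypothetical, and real workload schedulers such as SMA OpCon provide far richer calendar, dependency and recovery handling):

```python
# A hypothetical registry mapping (trigger kind, detail) pairs to the jobs
# they should start. This only illustrates the table-driven idea; it is not
# a description of any product's configuration.
JOBS = {
    ("time", "02:00"): ["nightly_reconciliation"],
    ("file_arrival", "/incoming/portfolio.dat"): ["apply_stock_dividend"],
    ("email", "claims-approved"): ["resume_claim_process"],
}

def jobs_for(trigger_kind, detail):
    """Return the jobs a given trigger should launch (possibly none)."""
    return JOBS.get((trigger_kind, detail), [])

def on_trigger(trigger_kind, detail, launch):
    """Dispatch every job registered for the trigger via the launch callable."""
    for job in jobs_for(trigger_kind, detail):
        launch(job)
```

A file transfer arriving at the hypothetical path above would start the dividend batch run; an unrecognised trigger launches nothing.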

These characteristics can make operation very complex, well beyond the scope of manual intervention

other than as a rare exception. Management functions include starting and stopping operational

processes, for example online systems, and incorporating any planned interruptions in schedules.

Maintenance, development and test activities consume a significant share of IT resources and need

managing. Corrections and updates to existing systems, and the development and deployment of new

applications, can benefit from automation in a number of ways, including:

- Automating the scheduling, configuration and execution of test processes reduces test time, saves labour and therefore cost, and ensures repeatability.

- Automating system and volume testing is essential; there is no alternative, as labour costs would be prohibitive even if the resources required could be scheduled.

- Managing the transition from system test to production operation is particularly critical. The processes involved can be complex and are error-prone if performed manually, and the result can be extended downtime while problems are resolved. Automation of these processes substantially reduces the risk of error, and provides an audit trail for any subsequent investigations and for regulatory compliance. The processes should provide for a fall-back to an earlier version should a release fail. Automating the transition of upgrades to production should include upgrades to automation and other systems management software.
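The transition-with-fall-back logic in the last point might be sketched as follows. This is an illustration only, with invented version names and an abstract smoke test; the paper states only that the transition should be automated and reversible:

```python
def promote(state, new_version, smoke_test):
    """Promote new_version to production; roll back automatically on failure.

    'state' records the current and previous versions; 'smoke_test' is any
    callable returning True when the new release is judged healthy. A real
    transition would also cover data migration and write an audit trail.
    """
    state["previous"] = state.get("current")
    state["current"] = new_version
    if smoke_test(new_version):
        return "promoted"
    # Fall back to the earlier version, as the paper recommends.
    state["current"] = state["previous"]
    return "rolled-back"
```

Because the fall-back path is part of the same automated sequence, a failed release never leaves production stranded on a broken version while operators improvise.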

Housekeeping functions

Housekeeping comprises the activities necessary to ensure that systems perform and continue to perform

reliably and securely. The functions include the following:

- Backing up databases and other data sources for recovery and analytical purposes.

- Database reorganisations, if they are necessary.

- Gathering statistics on system performance for subsequent analysis, for example for capacity planning.

- Gathering the information required for auditing, cost attribution and billing purposes.

Many of these functions are performed at scheduled times and therefore benefit from automation.

Automation also simplifies handling variations of date and time, for example for month or year end.
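A sketch of how automation can fold date-driven variation (month and year end) into a housekeeping schedule; the task names are invented for illustration:

```python
import calendar
from datetime import date

def housekeeping_tasks_for(d):
    """Housekeeping tasks due on a given date (task names are illustrative).

    Backups and statistics gathering run every night; extra tasks are added
    at month end, and again at year end -- the calendar variations the paper
    says automation should handle.
    """
    tasks = ["backup_databases", "gather_performance_stats"]
    last_day = calendar.monthrange(d.year, d.month)[1]
    if d.day == last_day:
        tasks.append("month_end_billing_extract")
        if d.month == 12:
            tasks.append("year_end_audit_archive")
    return tasks
```

Encoding the calendar rules once, rather than relying on operators remembering them, is precisely the simplification the paragraph above describes.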

Monitoring status

The third set of management functions in normal operation is concerned with monitoring the health of the

IT systems and the environment in which they operate. Checking components’ availability and behaviour

provides information which may give advanced warning of problems and so can enable corrective action

to be taken as early as possible. For example, a database may be replicated synchronously or

asynchronously at a remote location for recovery purposes in the event of primary location failure. It is

therefore essential to know that the replication process is functioning correctly. If the link between the


sites is lost, for instance, the replication will be stopped. Should a major failure then occur, the recovery

would use an out-of-date copy of the database. Early detection allows the problem to be solved

immediately.
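The replication example amounts to a freshness check on the replica's last acknowledgement. A minimal sketch, assuming a heartbeat timestamp is available and an alert callback exists (both are assumptions, as is the lag threshold):

```python
def replication_ok(last_ack_time, now, max_lag_seconds=60):
    """True if the replica acknowledged recently enough to be trustworthy."""
    return (now - last_ack_time) <= max_lag_seconds

def check_and_alert(last_ack_time, now, alert):
    """Raise an alert the moment replication falls behind, so the problem
    is fixed before a recovery would need the (stale) copy."""
    if not replication_ok(last_ack_time, now):
        alert("replication lag exceeds threshold - recovery copy may be stale")
        return False
    return True
```

Run periodically, a check of this kind turns a silent replication failure into an immediate, actionable event rather than a surprise during recovery.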

Another example is monitoring the behaviour of outsourced components. A typical case is where an

organisation manages its own data centres but outsources the external network to a specialist network

provider. Failures within the network may result in the loss of some connections, for example to groups of

end users. If those managing the data centre do not receive any notification about the failure, the first

indication of a problem would be when the end users call the help desk. Being able to monitor the

external network, for example by receiving status updates from the network supplier’s management

systems, can provide the information necessary to take pre-emptive action or at least warn the people

likely to be affected.

Monitoring the IT environment and reporting the results should be highly automated to avoid missing

important indications.


Providing for the worst – handling abnormal conditions

Although they may be relatively rare, the consequences of unplanned IT service interruption vary from

inconvenient to catastrophic. Potentially large sums of money and possibly life may be at risk, depending

on the nature of the business. It therefore seems obvious that organisations would plan for interruptions

of services, up to and including complete loss of the systems running the business. Such a disaster could

be caused by loss of a data centre due to power failure or some more dramatic reason, such as a serious

fire or natural disaster.

The plans should of course be commensurate with the financial and other consequences of lack of IT

services. There is no point in spending more on provision for loss of services than it costs to be without

them. Emergency services and financial institutions would be more exposed than, for example, a

medium-sized engineering company. But whatever the size of the organisation or the nature of its

business, some provision should be made. And, depending on the jurisdiction, a number of critical

commercial or public sector businesses are required by law or regulation to provide disaster recovery

(DR) facilities. Banks, for example, are required in most jurisdictions to have a DR capability.

However, for reasons that are not fully clear, a surprising number of organisations do not make adequate,

or indeed any, provision for loss of systems. This is in spite of the fact that officers of commercial

companies, or senior management of other organisations, may be personally open to civil or criminal

proceedings for the consequences of failure to provide cover in the case of service interruption. The

damages could be substantial if an organisation goes out of business because of loss of IT. There are

organisations facing this level of risk, in spite of increasing threats as diverse as terrorism and natural

disasters such as hurricanes and floods.

Degrees of negligence

Failure to provide for adequate service continuity varies in its level of apparent negligence. In some

cases, the senior management appears collectively to bury its head in the sand, assuming naïvely that ‘it

won’t happen to us’, and so makes no provision at all. While such an attitude is not that common, it does

exist. The reasons advanced for the lack of provision include excuses such as ‘the systems are very

reliable, so we don’t need any DR’, and ‘we assume our supplier would provide a replacement

immediately’.

The author has heard both of these comments from a single organisation. The former statement ignores

the fact that the systems, while indeed reliable, were not fire- or water-proof, or sadly today, bomb-proof.

The latter remark was made in spite of there being no written agreement between the supplier and its

client. Nor were there any other provisions in place, for example for network switching or where the

‘replacement systems’ were to be installed. In fact, the organisation concerned did not have very stringent

requirements for DR as it could operate for two or three days without its systems; DR provision would

have been quite easy to make. But longer than two or three days to recover would have seriously

compromised the business.

Perhaps a more common fault is to make provision for DR but fail to ensure that it would work effectively.

A DR centre might be set up to contain back-up systems for the primary data centre, and perhaps some

live production as well. Procedures would be established for recovery in the event of various

contingencies up to and including the complete loss of the primary data centre. And yet still things do not

work out as planned when disaster or even lesser problems strike.


A real-life example is the case of the failure of a data centre belonging to a major bank (not a ClearPath

user). The failure, which occurred on a Saturday morning, made national news programmes in the

country concerned. It was reported that a power failure at one of the bank’s data centres early in the

morning had caused a major interruption of IT services: no ATMs, no over the counter services. Even

appointments could not be handled or cancelled as the required client information was in one of the lost

systems. The story continued over the weekend and was widely reported in the media. Recovery took

many hours.

It is not clear why it took so long to restore services. Evidently, uninterruptible power supply equipment

did not work immediately or systems would have been recovered more rapidly in situ. After all, the data

centre itself was not damaged as it would have been by a fire, for instance. And given that the bank has

more than one data centre, why were the key services not restored in another (DR) centre? Was there

critical equipment only in one data centre? Were the procedures for handling DR sufficiently documented

and automated? Whatever the reason, it is clear that improvements in DR handling were needed.

A data centre hosting outsourced systems for a number of organisations is another example of a problem

in waiting, happily resolved before disaster struck. A DR centre was established some hundreds of

kilometres away from the live centre, a provision that was important given that the primary data centre is

in an earthquake zone. The data centre has a variety of systems from different suppliers using different

technologies. The number of systems, and the fact that customers may have different service-level

agreements, means that DR in this case can be quite complex.

A DR process to manage a switch to the DR centre was established and documented. However, the

process was very complicated, taking over 100 pages to describe and requiring two to three days to

execute by a small team. In effect, the process was almost guaranteed not to work; the level of

complexity was such that there was little chance of executing it without errors. And, given the two to three

days required to execute, practising DR was impossible as the data centre could never be off the air for

that long. A project to automate the process reduced the time to execute to less than 30 minutes, with just

one operator required. It is now practised regularly.

Three recommendations

Any organisation should do the following:

1) Make a provision for DR commensurate with its exposure if systems are unavailable.

2) Apply a high level of automation to the DR process.

3) Practise the process regularly to make sure it works.

The first point seems obvious, although apparently not to everyone. However reliable systems are, they

can still go wrong. Much more likely, the environment they live in might fail, for reasons totally outside the

control of the data centre’s owner; floods and hurricanes are examples. It therefore also makes sense to

avoid putting data centres in exposed places and to ensure that two centres would not be affected by the

same disasters: don’t put them on the same flood plain or earthquake fault line, for instance.

The second recommendation is automation. This paper has already stressed the importance of

automation in normal operation; for DR it is critical. The need to execute a DR process does not often

arise but when it does, it is very serious. Even in relatively straightforward environments it can be

complicated, never mind the data centres in the examples discussed above. The complexity and

infrequency of use mean that operators executing manual procedures will make many mistakes; they are


doing something rare and under pressure1. (The author is aware of one case where the operators started

a DR script on page 2.)

Automation is essential to minimise the risk of error. It captures operational best practice during the long

stretches of calm conditions: these may be later invoked during more turbulent times that are far less

predictable. Wise data centres will involve all shifts of operators, systems analysts, and IT management in

the definition of these automated sequences. It is usually true that not everyone knows (or agrees on) the

optimal data centre procedures, so automation becomes a framework for quality improvement and

enhanced communication. When these automated sequences are employed in non-crisis scenarios they

improve data centre operations on a continuous, ongoing basis. Deploying them during a true disaster is

indeed beneficial, but their development and use pays recurring benefits as well.

Executing a DR process must be complemented by a return from DR, transferring production back to its

normal location. Note that some corrective action may be required before the return, depending on the

nature of the event causing the DR. The return process must also be automated, for the same reasons as

automating DR: it is rarely performed and therefore would be error-prone if manually executed. Fully

automated DR and return processes make the decision to execute a DR somewhat easier. And there is

scope for eliminating a time-wasting decision process altogether: if a failure occurs at a weekend or outside

prime time, DR may be invoked automatically; if a failure occurs at other times, management approval

may be required. The most advanced data centres have made the concept of a production site arbitrary: it

might be the local or remote location, and one-button DR automation makes the transition to either

painless. Under these conditions a site may run for weeks or months in one location, then switch to

another in a matter of minutes. When advanced DR automation is in place the location of the production

facility becomes of little consequence. Disaster may strike, but the business remains unaffected.

Automation can also provide more sophisticated handling of DR contingencies while improving process

visibility. An audit trail of the DR process is an obvious benefit. Less obvious but even more useful is a

graphical interface that represents the process in its entirety. Both technical and managerial staff can

intuitively grasp the state of the process. This is crucial because, as noted above, the process may be

highly complex. Automation hides this complexity and simplifies the recovery process for all audiences.

Additional sophistication may be added by allowing for recovery checkpoints, which graphically break up

the process into logical phases. The process may be halted at the click of an icon, should new

management direction be forthcoming in the middle of a DR switch. In addition, failed portions of the

process could be designated as blocking (the overall process will halt if this sub-process fails) or non-

blocking (the DR switch may continue while a less critical failure is flagged for asynchronous manual

attention).

Assume a failure occurs in one phase of the DR process and it is crucial (blocking). Technical support

staff may be automatically notified by the automation, perhaps through text messages to their mobile

phones. Lack of acknowledgement will escalate the problem and the next-level of support (and/or

management) will be sent a message. When the issue has been resolved operations may restart the

process at the prior checkpoint with a simple mouse click. With the underlying conditions having been

altered, the process now proceeds to the next checkpoint and, if all goes well, to completion.
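The checkpoint, blocking/non-blocking and restart behaviour described above can be sketched as a simple runbook executor. The step names are hypothetical, and products covered later in the paper, such as Operations Sentinel, supply the production-grade equivalent:

```python
def run_dr(steps, notify):
    """Execute DR steps in order; halt on a blocking failure, flag others.

    Each step is a (name, action, blocking) triple. Returns the index of the
    last completed checkpoint plus the names of flagged non-blocking
    failures, so a restart can resume from the halted checkpoint.
    """
    flagged = []
    for i, (name, action, blocking) in enumerate(steps):
        try:
            action()
        except Exception as exc:
            if blocking:
                notify(f"blocking failure in {name}: {exc}")
                return i, flagged          # halted; resume here after repair
            flagged.append(name)           # non-blocking: continue the switch
            notify(f"non-blocking failure in {name}, continuing")
    return len(steps), flagged
```

Restarting after the underlying problem is fixed is then a matter of calling the executor again with the remaining steps, which is the one-click resume described above.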

1 See the appendix for a discussion of how increasing automation affects recovery processes and the people involved.


It is difficult to automate the decision to execute DR in all cases. In highly sensitive environments, it may

be possible to distribute an application over a number of systems in different centres. Such a

configuration eliminates the need to decide on DR, as the remaining systems pick up the traffic if one

system is lost. Deciding on the level of sophistication required is a business decision, weighing the cost of

implementation against the financial, reputational and other exposure caused by IT service loss.

The final recommendation is crucial: DR and return processes must be practised to ensure they work as

intended. Regulations may require DR to be exercised periodically but, even if there are no rules or legal

requirements, the process should still be executed regularly. High levels of automation make frequent

rehearsals much easier. And rather than running normal production and DR data centres, an alternative is

for all data centres to be able to run production work. Applications can then be moved around from time to

time, ensuring that all equipment and processes work.

The technology

This section surveys the systems management technology available for ClearPath systems, showing how

it can be applied in normal operation and for dealing with abnormal conditions, for example in the event of

the loss of a data centre. No attempt is made to discuss all the products available; the reader is referred

to the MCP and OS 2200 product catalogues for complete lists. See the More Information section later in

the paper for pointers to the product and other documentation.

In many cases, the products can also manage the majority of other computer system types found in the

data centre, including cross-system management. In some cases, they can handle other equipment such

as storage subsystems, and even intruder alarm and air conditioning systems. They can in general

interface with each other and other management products, for example enterprise management systems,

and trouble-ticketing applications to record and keep track of problems or incidents needing attention.

The remainder of this section is organised as follows:

- Before looking at the technology, it is useful to put it into context by considering the kind of IT environment typical of today's medium to large organisation and the way it is managed.

- ClearPath OS 2200 and MCP have attributes, and provide system-level products, that contribute to the management goal of maximising the availability of IT services, although they may not always be viewed as systems management products.

- The key systems management products are described, in particular those that work across multiple systems, where the managed systems may be collaborating with each other.

Medium to large IT environments and systems management

Figure 1 is a schematic of a typical medium to large environment, and the systems management

arrangements used.

For most medium to large organisations, the IT infrastructure is distributed across two data centres,

shown in figure 1 as the primary and secondary, although some organisations may have more than two.

How the data centres are used varies. The primary may be used for production with the secondary

reserved for back-up in case of the loss of the primary, with perhaps development and test systems.

Alternatively, production may be split across the two data centres in various ways, even with the same


application distributed across the two centres in some form of clustering. The data centres are interlinked

by high speed network connections, either private or supplied by a network provider.

The data centres contain a small number of ClearPath systems and perhaps other mainframe-class

systems. In spite of virtualisation, there are also a large number of other servers variously running UNIX,

Windows or Linux. The systems will use dedicated or shared storage subsystems of various kinds, as

well as printers.

The end users and other organisations, for example business partners, service providers such as credit

card management services, and government departments are connected to the data centres through

private networks, shared networks such as ATM or interbank networks, and the Internet. The component

labelled ‘Integration Infrastructure’ in figure 1 links the external systems to the systems running the

required applications, and the applications in the data centre to each other. As well as local area network

(LAN) components, the integration infrastructure is likely to contain portals of various kinds and may

include a well-developed service-oriented architecture (SOA) infrastructure, built around products such as

an enterprise service bus (ESB).

Figure 1: Schematic of a medium to large configuration and systems management arrangements

Management of the environment is from a management centre, sometimes referred to as a bridge.

Systems management operators are equipped with a number of workstations, connected to systems

management tools, which typically run in servers that are in turn connected to the managed systems.

Usually, although not always, the tools require an agent to be installed in the managed systems; the

agents are shown as small, coloured circles in the figure. Not all management tools will be connected to

all managed systems, as indicated by the absence of some colours in some systems in the figure.

The entire environment may be managed from a single management centre or possibly more than one. In

any event, an alternative management centre or bridge is required in case of the loss of one centre. The

management servers and the managed systems do not have to be in the same data centre as the


managed systems; they could be anywhere as long as suitable network connections are available. And

the workstations do not have to be co-located with the servers containing the management applications.

For example, some large organisations provide information-only workstations showing a high-level status

view to senior management.

ClearPath system attributes and system-level products

ClearPath systems are frequently used in mission-critical environments in the private and public sectors.

In addition to performance, high levels of availability and security are expected, together with an ability to

recover from any (rare) problems quickly and securely. The systems have a number of attributes that

contribute to these goals and therefore enhance their manageability.

Both MCP and OS 2200 systems manage diverse workloads of online and other activity, maintaining effective performance even as system utilisation approaches 100%; few if any other operating systems can match

this degree of efficiency. At high loads, critical applications are kept fully operational while less critical

tasks wait until resources are available. Response times and other performance metrics can thus be

maintained at the levels required by the business, rather than deteriorating as loads increase.

For ClearPath OS 2200 systems, the features necessary to manage the workload are distributed across

the system. For MCP systems, Workload Management for ClearPath MCP simplifies the process of

managing workloads. The performance level required for applications to match the business goals and

priorities is specified. The system determines the amount of resources, such as processor usage, needed

to meet a goal. Workload Management for ClearPath MCP constantly monitors the system and

automatically adjusts processing to meet the specified goals.
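The goal-driven adjustment loop can be illustrated with a minimal sketch. This is a conceptual model only, not the product's actual policy language; the function name, step sizes and thresholds are invented for illustration:

```python
def adjust_share(current_share, measured_response_ms, goal_response_ms,
                 step=0.05, min_share=0.05, max_share=1.0):
    """Nudge an application's processor share toward its response-time goal.

    If the measured response time is worse than the goal, grant more
    resource; if it is comfortably better, release some for other work.
    """
    if measured_response_ms > goal_response_ms:
        current_share = min(max_share, current_share + step)
    elif measured_response_ms < 0.8 * goal_response_ms:
        current_share = max(min_share, current_share - step)
    return round(current_share, 2)

share = 0.30
share = adjust_share(share, measured_response_ms=450, goal_response_ms=300)
print(share)  # 0.35 -- the goal was missed, so the share is increased
```

The point of the feedback loop is that the administrator specifies only the business goal (the response time); the system, not the operator, decides how much resource that goal requires at any moment.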

ClearPath systems are engineered for reliability; systems may run for years without a failure. A significant

reason for the reliability – and performance – is the integrated hardware and software stack, which

enables full testing of the whole suite before release. And because Unisys provides the whole stack,

forward and backward compatibility can be maintained, enabling managed upgrades with minimal

disruption. Upgrades are typically performed in hours rather than days, and may in some circumstances

be done without downtime. Many updates may be made without taking the systems down and database

reorganisations are minimised.

Should problems occur, database recovery is rapid. A difference from other database approaches is that

recovery is from a point in time. Audit trails are maintained for database changes. If the database is

damaged, as a result of a rogue transaction for instance, the database is recovered from the audit trail. A

simple script is executed and the database recovered. Other databases such as DB2 and Oracle usually

have to perform a full recovery and then go forward, which can be a much longer process.
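The point-in-time idea can be sketched as follows; the record format is hypothetical and greatly simplified compared with a real audit trail:

```python
from datetime import datetime

def recover_to_point_in_time(snapshot, audit_trail, cutoff):
    """Rebuild a database state by replaying audited changes up to a cutoff.

    snapshot    -- dict of key -> value captured at a known-good point
    audit_trail -- list of (timestamp, key, new_value) records in commit order
    cutoff      -- datetime; changes after this point (e.g. a rogue
                   transaction) are discarded
    """
    db = dict(snapshot)
    for ts, key, value in audit_trail:
        if ts > cutoff:
            break  # stop before the damaging change
        db[key] = value
    return db

snapshot = {"acct-1": 100}
trail = [
    (datetime(2014, 6, 1, 9, 0), "acct-1", 150),    # valid update
    (datetime(2014, 6, 1, 9, 5), "acct-1", -9999),  # rogue transaction
]
state = recover_to_point_in_time(snapshot, trail, datetime(2014, 6, 1, 9, 1))
print(state)  # {'acct-1': 150}
```

The contrast with a full recovery is that only the audit records up to the chosen point need to be applied, rather than rebuilding the entire database and then rolling forward.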

Security is a major requirement for ClearPath users. Features providing high levels of security include:

Applications are protected from each other to prevent cross-application damage.

Database access is carefully controlled to manage who can access what data.

The system is designed to protect against common security attacks such as buffer overrun, which is a

leading cause of virus and worm propagation.

Auditing technology is an integral part of the operating system environments, not an add-on.


Unisys Stealth technology can be used to isolate critical functions such as systems management.

Powerful products such as Operations Sentinel and Call Home may be used safely. The former may

be opened securely to interact with other applications.

Two other system-level products stand out as contributing to the goal of maximising IT service availability.

Business Continuity Accelerator (BCA) for ClearPath MCP systems: provides a rapid, simplified switch to an alternative system in another location.

Extended Transaction Capacity (XTC) for ClearPath OS 2200 systems: enables up to six Dorado partitions to run applications sharing the same database.

Business Continuity Accelerator (BCA)

The Business Continuity Accelerator software is designed to maximise the availability of applications and

data by accelerating and automating the process of relocating an application workload and its associated

data from a primary server to an alternate server. The software then reinitiates the execution of those

applications on an alternate server.

Figure 2 shows two sites, each with a ClearPath MCP system. The sites can be any distance apart. The

ClearPath MCP server at the primary site is running a business-critical workload. The BC server at the

alternate site can be either dedicated to BC purposes or running a less critical workload such as

development and test. Both servers are running an MCP operating environment and image enabler. A

data replication product, for example SRDF from EMC, replicates data from the primary site to the

alternate site.

The Business Continuity Accelerator runs on both systems and uses a heartbeat mechanism (shown by

the red line in figure 2) to monitor the status of both servers. Business Continuity Accelerator is integrated

with the data replication product (black lines in the figure) to monitor status and control reconfiguration.
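The heartbeat principle can be sketched as follows. This is a conceptual illustration only; the actual Business Continuity Accelerator protocol is internal to the product:

```python
import time

class HeartbeatMonitor:
    """Detect loss of a peer when its heartbeats stop arriving.

    A clock function is injected so that the timeout logic can be
    exercised without real waiting.
    """
    def __init__(self, timeout_seconds, now=time.monotonic):
        self.timeout = timeout_seconds
        self.now = now
        self.last_seen = now()

    def beat(self):
        """Record receipt of a heartbeat from the peer."""
        self.last_seen = self.now()

    def peer_alive(self):
        """True while a heartbeat has been seen within the timeout window."""
        return (self.now() - self.last_seen) <= self.timeout

clock = [0.0]
mon = HeartbeatMonitor(timeout_seconds=5, now=lambda: clock[0])
mon.beat()
clock[0] = 3.0
print(mon.peer_alive())  # True
clock[0] = 9.0
print(mon.peer_alive())  # False -- the operator could now issue the failover command
```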

Figure 2: Business Continuity Accelerator configuration with ClearPath MCP systems in two sites

If the decision is made to transfer the workload to an alternate server, the administrator enters a single

command. The Business Continuity Accelerator then uses previously-generated scripts to complete the


move, eliminating the need for operators to perform a complex series of manual operations. All the

resources are automatically transferred to the BC server, including the host identity and the image

enabler, and the jobs restarted.

The configuration in figure 2 shows one common use of the Business Continuity Accelerator, where the

production and BC servers are in separate data centres. The product supports the following additional

configurations:

Recovery to a local server: This capability can be used when production and BC servers are in the

same data centre with a shared storage area network (SAN); there is no need for a storage

replication product. This type of configuration is intended for recovery from system problems, not

disaster recovery.

Three-server recovery configuration: This type of configuration consists of a production server, a

local BC server and a remote BC server at a disaster recovery site.

Many-to-one recovery configuration: In this type of configuration, up to four production servers can

share a single BC server. However, only one production server can be recovered at a time.

Apart from its use in maintaining IT service continuity in the event of a failure, the Business Continuity

Accelerator can be used in a number of other scenarios, including:

Moving a workload to an alternate server while the primary server is unavailable because of

maintenance or upgrade activities.

Migrating a workload to a new system or application software release while providing a secure fall-

back position.

Permanently moving a workload onto a newly purchased server.

Relocating a server environment to another physical server as needed for purposes such as testing,

modelling or backup.

Extended Transaction Capacity (XTC)

Extended Transaction Capacity (XTC) is a grouping of software features which enable up to six ClearPath

OS 2200 partitions (referred to as ‘hosts’) to be clustered in an active, cooperative-processing state to

provide increased capacity and resilience. Applications are active in all hosts and access the same

database. An external lock manager for TIP and UDS databases, the Extended Processing Complex-

Locking (XPC-L), handles the locking of the shared files and coordination of multiple hosts in a

transaction processing environment.

Figure 3 is a schematic of an XTC configuration with two OS 2200 systems.

The end user network connects to all hosts because transactions can be processed by any of them. A

mechanism is therefore required to distribute incoming requests across the hosts, based on whatever

algorithm is appropriate, for example round-robin or using an approach based on the current load in a

host. Different clients have implemented different schemes.
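Two of the distribution schemes mentioned above can be sketched as follows; the host names are invented, and a real front end would also remove failed hosts from the rotation:

```python
import itertools

class RoundRobinDispatcher:
    """Cycle incoming requests evenly across the active hosts."""
    def __init__(self, hosts):
        self._cycle = itertools.cycle(hosts)

    def pick(self):
        return next(self._cycle)

class LeastLoadDispatcher:
    """Send each request to the host currently reporting the lowest load."""
    def __init__(self, loads):
        self.loads = loads  # host -> current load metric

    def pick(self):
        return min(self.loads, key=self.loads.get)

rr = RoundRobinDispatcher(["host1", "host2"])
print([rr.pick() for _ in range(4)])  # ['host1', 'host2', 'host1', 'host2']

ll = LeastLoadDispatcher({"host1": 0.8, "host2": 0.3})
print(ll.pick())  # 'host2'
```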


Figure 3: Extended Transaction Capacity (XTC) configuration with two hosts

Resilience is increased because one or more of the active hosts in an XTC configuration can fail without

affecting the remaining hosts. There must, of course, be sufficient processing capacity in the remaining

host(s) to handle the maximum load during the outage. Standard recovery procedures are used to return

a stopped host to the XTC environment while the active hosts continue to process transactions. The

approach used for connecting the network has to make sure that incoming traffic is directed to the

remaining active host(s). The XPC-L is duplicated so that there is no single point of failure. One XPC-L is

the primary; the other one takes over the primary role in the event of any failure of the primary.

In addition to handling failures, the XTC configuration can be used for planned stops, for example for

software upgrades. Sites that invest in N+1 configurations realise enhanced high availability.

The loss of a single host has no impact on normal processing, so whether the disruption is due to system

upgrades or something less well planned is immaterial. Each host can be removed from the configuration,

updated and then returned. The remaining hosts continue to run the traffic thus eliminating end-user

service interruption.

The XTC configuration restricts the distance between systems to about 200 metres, so they have to be

within a single data centre, ideally in separate fire-proof cells, or in two close centres. Although the

distance restriction allows protection against system failure or a local incident such as a fire in a single

cell, additional action must be taken to guard against a wider problem, for example a significant flood or

earthquake. An XTC configuration can be combined with a remote DR site using other techniques such as

SRDF to protect against such a contingency.

Systems management products

The increasing amount of collaboration between systems means that events in one system often affect

others. It is therefore essential to be able to take a ‘cross-system’ view of management: batch activities

completing in one system may trigger activities in others, for instance. And problems within a system may

cause difficulties for others. Systems management products must therefore be able to act across

systems. The table below shows the key products, which are explained in the following paragraphs.


Operations Sentinel (formerly known as Single Point Operations (SPO)): allows ClearPath and other systems to be managed from a single point, including console and automation capabilities.

OpCon (from SMA Solutions): workflow management covering a wide range of business requirements.

Enterprise Output Manager (EOM, formerly known as DEPCON): takes print and other output and delivers it to printers or a wide variety of other media, such as Web pages, emails and faxes. Includes extensive formatting facilities.

TeamQuest (various products): performance management and management utility tools.

Sightline Systems (various products): performance management.

Other products: ClearPath systems contain SNMP (Simple Network Management Protocol) agents. Affinité and Locum provide security and other tools, primarily for MCP systems.

Operations Sentinel

Operations Sentinel from Unisys automates, consolidates access to, monitors and manages multiple

heterogeneous systems. The environments that Operations Sentinel can manage range from large,

centralised data centres with multiple mainframe-class systems to fully distributed environments.

The application runs in Windows Servers, with Windows-based workstations providing the operational

interface. Configurations can be resilient to avoid any single point of failure. The Operations Sentinel

Console and Topology displays enable operators to monitor the state of all connected systems, recognise

and respond to exception conditions, and assume control of any system, either local or remote. Low level

automation handles most routine exceptions. Secure remote access and advanced alerting options are

available through a wide range of industry-standard mechanisms. The managed systems contain agents

providing a variety of information about all monitored objects, whether hardware- or software-based.

Figure 4 is a schematic of the architecture.

Figure 4: Operations Sentinel architecture: resilient configuration and remote alerting/access options


The managed systems can be of one type or many different types, including ClearPath MCP and OS

2200 systems, Microsoft Windows, Solaris, UnixWare, Linux, AIX, HP-UX, SVR4, other UNIX systems,

VMS, and other system types, including environmental equipment.

Topological high-level graphic displays, defined by the operations analyst, provide a unified view of

dissimilar systems, with drill-down facilities to access detailed information. Many different views can be

created to suit different groups of users or functionality. Figure 5 is an example of the main window and figure 6 is a sample Topology display.

Figure 5: Operations Sentinel main window, showing the four distinct panes

Figure 6: Operations Sentinel Topology Display with sample alerts


The most significant features of Operations Sentinel are:

Automation facilities which extend self-healing with prescribed solutions, functional samples, and

capabilities for tailoring:

• Every operator message from each connected system passes through the Autoaction Message

System (AMS), where it is compared to a database of user-defined message patterns. When a

match occurs, AMS carries out a corresponding set of actions.

• Actions include raising an alert, updating data in Operations Sentinel Topologies, logging an audit

trail message, or sending a message to the originating system or another managed system, and

initiating a sequence of operations involving multiple systems.

• Correlation of events and messages includes easy-to-use conditional logic which does not require

programming expertise.

• An integrated toolset for change management and offline verification tests is provided.

• Resilient Operations Sentinel servers and workstations are supported, as there must be no single

point of failure once a data centre comes to rely on the automation for daily system operation.

These may be run in a primary/secondary role or could be used for concurrent processing.

Resource monitoring for MCP, UNIX, and OS 2200 systems tracking 30 to 40 different attributes for

each system.

Alert notification and escalation through multiple devices including cell-phones, text messages, e-

mail, serial devices such as an LED wall panel display and audible alert messaging.

Help is highly customisable and based on Alert Id: this allows for Operations Best Practices to be built

into the automation tool. This is much better than relying on paper manuals that are out of date,

cannot be found, and may contain much that is apocryphal. Online information is far easier to update

and keep current than paper tomes maintained by few and read by none. Contact information, for

instance, changes often and may be updated with a few keystrokes when the help file is electronic.

Total or direct control of local and remote servers and systems:

• Remote operator display sessions (ODT) to MCP systems

• Direct console access for OS 2200 partitions in the ClearPath Dorado Series

• Replicated console windows for UNIX and other systems

• Remote desktops to control workstations and servers that run Windows

A graphical application for customised monitoring of hardware objects, such as disk and tape drives,

and software objects, such as processes, on managed systems. Operations Sentinel Zones and

Views provide the right information for each audience. For instance, a Help Desk will require less

detailed information than data centre operators. Very high level displays may be made available on

large screen monitors for management, analysts, and others concerned with the overall business.

Security roles are used to enforce which users may see/access which systems and components.

Customisable to meet the needs of various users: operators, managers, help desk attendants, support personnel, and other data centre staff.

Simple Network Management Protocol (SNMP) trap service plus a predefined database of traps and

corresponding Operations Sentinel events for immediate results.

An audit trail of messages and events with search and filter capabilities for quick isolation of specific

information, and a consolidation function to correlate information involving multiple systems.

A ‘menu builder’ to provide user-friendly, context-sensitive menus. These may invoke scripted actions

and represent an extensible user interface for operator use, with sophisticated actions defined by the

operations analyst.

Customised solutions have been deployed around the world to meet many data centre automation

needs. They include, but are not limited to, the following:

• One-button disaster recovery, which includes Server Control Automation (SCA), or the ability to

repurpose development/test systems for DR use. See Figure 7 for an example.


• EMC Symmetrix Remote Data Facility (SRDF) automation: ensure data is flowing to the DR site on a

real time basis and automate DR operations. This automation package can be integrated with

SCA.

• Virtual Tape Library (VTL) system monitoring and automation, including the ability to

automatically load boot tapes during a DR event.

• Shared Object Manager Application (SOMA): automatically move system resources from where

they are not currently needed to system(s) that have a need.

• Audible alerting: transform visual exception alerts into audible alarms when attention is required.

• Degree of implementation automation metrics: tracking automation progress can lead to

unattended operations.
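The Autoaction Message System's pattern-to-action matching described above can be sketched as follows; the message patterns and action names are invented for illustration:

```python
import re

# Each rule pairs an operator-message pattern with a list of actions --
# conceptual stand-ins for AMS actions such as raising an alert or
# sending a command to a managed system.
RULES = [
    (re.compile(r"DISK (\S+) FULL"), ["raise_alert", "notify_operator"]),
    (re.compile(r"JOB (\S+) ABORTED"), ["raise_alert", "restart_job"]),
]

def match_actions(console_message):
    """Return the actions for the first rule matching an operator message."""
    for pattern, actions in RULES:
        if pattern.search(console_message):
            return actions
    return []  # unmatched messages simply pass through

print(match_actions("JOB PAYROLL ABORTED"))  # ['raise_alert', 'restart_job']
print(match_actions("SYSTEM IDLE"))          # []
```

The essential design point is that every console message is funnelled through one rule table, so routine exceptions are handled identically whether an operator is watching or not.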

Figure 7: Server Control Automation (SCA) User Interface and Topology Examples

Operations Sentinel integrates with a wide range of systems and network management products such as

CA Unicenter, HP OpenView and OpCon.

SMA OpCon

Originally designed to manage batch job scheduling, OpCon from SMA Solutions has evolved to become

a powerful workflow management tool with a wide and expanding variety of applications. It supports all

the major operating systems, including ClearPath OS 2200 and MCP systems, as well as some specific

application packages and environments, for example Java EE and SAP. OpCon includes cross-platform

management capabilities. Examples of flows are as diverse as jobs completing on one or more platforms

initiating jobs on other platforms, sequencing the start-up of middleware components in a distributed

environment and orchestrating a complete DR process involving multiple systems.

Figure 8 is a schematic of the architecture, showing the major components. The managed systems may

be in the same data centre or distributed across a number of locations.

The User Interface (U/I – Enterprise Manager) component provides the means for operations personnel

to access the system. The functions to be performed include defining permissions and privileges,

maintaining work-flow details such as schedules and calendars, and producing reports. The graphical

interface provides multiple views of process flows, for example PERT charts, and bar charts showing

summary information such as daily statistics. Figures 9 and 10 are just two examples.


Figure 8: OpCon architecture, showing the major components

The Schedule Activity Monitor (SAM) component, which runs on a dedicated server, is the management

part of the products. It monitors the database for the schedules to be run, communicating with the Local

Schedule Activity Monitors (LSAM) in the managed systems. The SAM runs the daily schedules, ensuring

that the activities are started at the right time, on the right server and in the correct sequence.

The LSAM runs on each monitored machine, performing the initiation and monitoring of each activity and

communicating with the SAM to report status.

The database ideally resides on the server running the SAM and contains the necessary information

about the activities, including schedules, calendars, frequencies and dependencies. It provides replication

to enable a hot standby in the event of the loss of the primary OpCon server.
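The SAM's core sequencing responsibility, starting each activity only when its predecessors have completed, can be sketched as a dependency-ordered dispatch; the job names are invented, and OpCon's real scheduler also handles start times, calendars and target servers:

```python
def run_order(jobs):
    """Return a start order that respects job dependencies (topological sort).

    jobs -- dict mapping job name -> set of jobs it depends on
    """
    order, done = [], set()
    pending = dict(jobs)
    while pending:
        # A job is ready once all of its predecessors have completed.
        ready = [j for j, deps in pending.items() if deps <= done]
        if not ready:
            raise ValueError("circular dependency among: %s" % sorted(pending))
        for j in sorted(ready):  # deterministic order for jobs that tie
            order.append(j)
            done.add(j)
            del pending[j]
    return order

schedule = {
    "extract":   set(),
    "transform": {"extract"},
    "load":      {"transform"},
    "report":    {"load", "extract"},
}
print(run_order(schedule))  # ['extract', 'transform', 'load', 'report']
```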

Figure 9: PERT view of process


Figure 10: Daily statistics using bar chart

The core functions of OpCon, in addition to its role of managing timely and reliable batch job scheduling, are:

Automatically reacting to external and internal events, including receipt of emails, user actions,

receipt of incoming files, detection of thresholds (for example from monitoring databases and networks), and requests from Operations Sentinel. The event can result in a visualisation and/or an action, including an alert.

Building processes dynamically, based on input received from an application.

Status monitoring, including databases, memory, processors and applications.

Managing secure file transfer.

Integrating with ERP and business workflow management products.

Managing distributed data centres in multiple time zones.

Producing the data required for auditing purposes, for example ISO or SOX. A wide variety of

predefined reports are available, with facilities for custom report generation.

OpCon integrates with Operations Sentinel and other systems management products.

Enterprise Output Manager

The primary function of Enterprise Output Manager (EOM – formerly called DEPCON) from Unisys is to

process and route print files and other application output files from any supported platform to any

supported output destination. Processing is based on the file name, file size, or other file characteristics

which are specified in advance. In addition to printing, Enterprise Output Manager supports a variety of

other delivery methods. It runs in a Windows server, which connects to the various input and output

systems and devices using industry-standard protocols.
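The attribute-based routing just described can be illustrated with a small sketch; the rule fields and destination names are invented, since EOM's own rules are defined through its administrative interface:

```python
def route(file_name, file_size, rules):
    """Pick the first destination whose rule matches the file's attributes."""
    for rule in rules:
        if rule["name_suffix"] and not file_name.endswith(rule["name_suffix"]):
            continue
        if rule["max_size"] is not None and file_size > rule["max_size"]:
            continue
        return rule["destination"]
    return "default-printer"  # fallback when no rule matches

rules = [
    {"name_suffix": ".inv", "max_size": None,      "destination": "email-to-client"},
    {"name_suffix": ".rpt", "max_size": 1_000_000, "destination": "web-folder"},
]
print(route("JUNE.inv", 2048, rules))   # email-to-client
print(route("DAILY.rpt", 500, rules))   # web-folder
print(route("OTHER.txt", 10, rules))    # default-printer
```

Because the rules act on file characteristics specified in advance, the routing decision needs no change to the applications producing the output.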

Figure 11 shows the input sources and output destinations currently supported.


Figure 11: Input sources and output destinations supported by Enterprise Output Manager

Enterprise Output Manager provides significant features and advantages, including:

Electronic forms can be used rather than pre-printed forms.

Output data can be enhanced without changing source applications.

Policy-based automation manages document workflow.

Information can be delivered to web, e-mail, printers, and more.

EOM has an alert handling service to assist operators/administrators. The EOM administrator can

configure the use of Operations Sentinel as the alert service. This helps centralise EOM alerts with

other managed alerts.

Figure 12 is an example of an input text file and the resulting formatted and enhanced output. It shows

the data from a metering source text file formatted as a report for a client. The report is generated as a PDF

file.

Figure 12: Example of reformatting and enhancement by Enterprise Output Manager

(Figure 11 detail: inputs include any operating system – Windows, ClearPath, UNIX, Linux, Oracle, HP, IBM – using open protocols such as HTTP, TCP, LPR/LPD, other Output Manager servers, monitored network directories, a COM interface, a .NET API, e-mail, Message Queuing (MSMQ) and HTTP. Outputs include any Windows printer or TCP/IP device, any operating system using open protocols, other Output Manager servers, e-mail, fax, CD/DVD creation, external applications, electronic file viewing, writing files to directories, posting files to web folders, XML transforms, FTP, HTTP including Web services, TIFF, PDF, secure e-mail and user-written programs.)

TeamQuest and SightLine

Both TeamQuest Corporation and Sightline Systems produce capacity planning tools for ClearPath OS

2200 and MCP systems, as well as many other system types; they can be used throughout the data

centre. The products support capacity planning based on historical data and trend analysis.

Real time monitoring is also supported. Thresholds can be set on resources such as memory availability,

processor loads and queue lengths. Passing a threshold triggers an event, which can be displayed and/or

passed to other management tools, such as Operations Sentinel, for further analysis and corrective

action.
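The threshold-to-event mechanism can be sketched as follows; the metric names and limits are invented for illustration:

```python
def check_thresholds(samples, thresholds):
    """Return events for any metric whose latest sample exceeds its threshold.

    samples    -- dict of metric name -> latest measured value
    thresholds -- dict of metric name -> limit that triggers an event
    """
    events = []
    for metric, limit in thresholds.items():
        value = samples.get(metric)
        if value is not None and value > limit:
            # In a real deployment the event would be displayed and/or
            # forwarded to a tool such as Operations Sentinel.
            events.append({"metric": metric, "value": value, "limit": limit})
    return events

thresholds = {"cpu_pct": 90, "queue_len": 50}
events = check_thresholds({"cpu_pct": 95, "queue_len": 12}, thresholds)
print(events)  # [{'metric': 'cpu_pct', 'value': 95, 'limit': 90}]
```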

TeamQuest also provides a variety of additional management utility tools for ClearPath OS 2200

systems.

Other products

Both ClearPath system families have an SNMP agent, allowing the systems to be visible within a TCP/IP

network using any management product supporting SNMP, for example HP OpenView.

There are various other management products that can be used with ClearPath systems. Two examples

are from Affinité and Locum. The Affinité Corporation offers an integrated set of software tools for

performance monitoring, capacity management, ClearPath MCP system security and PC integration. The

Affinité Corporation has also developed ArteMon, a system performance monitor for Linux servers,

Windows servers and network devices. Locum Software Services Limited provides security tools for

ClearPath MCP systems.

Managing ClearPath fabric-based systems

ClearPath fabric-based systems for both Libra and Dorado were first announced in mid-2014. These

systems combine ClearPath MCP or OS 2200 environments with Windows and/or Linux environments in a single

system. The fabric provides the connection between the systems, using a high-speed local connection to

link the different components. The different environments, including hundreds of secure partitions (s-Par®)

executing Linux or Windows, may contain application components which collaborate with each other or

run independently². Figure 13 is a schematic of the architecture. The MCP or OS 2200 environments are provided by the Processor/Memory Module (PMM). The fabric-based systems also have a high availability

(HA) capability. It is provided as a redundant PMM, which can be bootstrapped in minutes if the primary

PMM fails. HA is standard for the high-end systems but optional for the Libra 4300 and Dorado 4300.

The entire system may be managed as a single entity. Figure 13 shows one possible approach.

Fabric management is responsible for managing the underlying infrastructure such as creating secure

partitions and other fabric components. The various application environments are managed using

Operations Sentinel, SMA OpCon and TeamQuest or SightLine, performing the functions described earlier

in this section. Operations Sentinel could act as an overall manager, showing the status of the whole

system, including feeds from other management tools such as an SNMP network manager. Alternatively,

the tools could report to an overall manager.

2 For an introduction to ClearPath fabric-based systems, their architecture and examples of their use see the White Paper

‘ClearPath Systems with Fabric-Based Infrastructure’, 2014, which can be found at http://www.unisys.com/offerings/high-end-servers/clearpath-systems/Whitepaper/ClearPath-Systems-with-Fabric-based-Infrastructure-id-1364


Figure 13: Example management structure for ClearPath fabric-based systems

SP = specialty partition
I/O = input/output
W/L = Windows or Linux
EPP = Enterprise Partitionable Platform
PMM = Processor/Memory Module
ISM = I/O Specialty Partition Module

[Diagram: the ClearPath complex is built on the fabric infrastructure. A Unisys Intel platform (the PMM) runs Unisys firmware and the MCP or OS 2200 applications; a Unisys Intel server (the ISM) hosts I/O specialty partitions under s-Par; and up to 12 EPPs each run Windows or Linux partitions under s-Par. Fabric management, Operations Sentinel, SMA OpCon, TeamQuest/SightLine and other management tools manage the complex.]


Conclusions

The role of systems management is to maximise IT systems’ ability to deliver services. The requirements

in today’s environments, where services are more and more delivered by collaborating systems, are

particularly demanding. This paper has argued that successful systems management requires a high

level of automation. This is true even where labour costs are low: manual intervention is error-prone and

in some cases simply not possible.

While automation is essential to manage the normal flow of work, it is critical in managing the abnormal,

especially disaster recovery. Although catastrophic failure is rare, the DR nightmare will occur at the worst

possible moment. Explicit and carefully crafted DR contingencies must be considered and the required

infrastructure implemented. Unfortunately, this is where many organisations stop. The DR infrastructure

needs to be monitored automatically, with fail-safe alerts raised if and when it proves to be non-functional;

anything less means a potential loss of data. In other words, DR without automated monitoring is not a

DR solution.

Furthermore, the DR cut-over decision needs to be made easier and even automatic in some cases.

Automation and a graphical interface speed up the decision-making process. And it must be

implemented in reverse as well, to make coming back from DR equally painless and transparent. When

all this is in place, the ‘one button’ DR exercise is possible. Frequent testing of the end-to-end process is

now possible so that a disaster (natural or not) is not a disaster at all for the data centre.

An organisational consequence of applying high levels of automation to systems management is to

change the role of operators. They are now required to perform much higher-level functions; their role is

better described as operations analysis and management. Their activities are no longer concerned with

executing procedures manually, answering questions and so on. They are now responsible for running

the data centre, including all the planning and automation, with a supervisory role in case of any

unexpected event.

This role transformation is necessary. The complexity of current environments, coupled with the

pressures to reduce costs while maintaining high levels of service, means that the more traditional

approach to operations cannot be sustained. That should not ultimately be seen as a problem because

operations staff can now move to more highly skilled positions, enhancing their careers. However,

the change has to be managed as people cannot adapt overnight.

ClearPath systems operate in some of the world’s most critical environments. Mission-critical system

attributes and features such as the Business Continuity Accelerator and XTC combine with powerful

systems management tools from Unisys and its partners to provide an excellent systems management

foundation, for both normal and abnormal conditions, up to and including DR. And importantly, the

systems management products address all the platforms in the data centre, not just the ClearPath

systems.


More information

Additional information about standards, technologies and products is widely available, at both strategic

and detailed levels. The following are just a few of the useful sources.

White Papers about ClearPath systems and systems management tools, and other Unisys products and

services, can be found on the Unisys Web site:

http://www.unisys.com/search/unisys?k=UnisysAssetType:Whitepaper

Other sources of information about products and technologies mentioned in this paper are:

Affinité:

http://www.affinite.co.uk and http://www.affinite.com

The Internet Engineering Task Force (IETF):

http://www.ietf.org

Locum Software Services Limited:

http://www.locumsoftware.co.uk/

SightLine Systems:

http://www.sightlinesystems.com/

SMA Solutions:

http://smasolutions.it

TeamQuest Corporation:

http://www.teamquest.com/


Appendix: Managing IT service interruption: a process view

Handling service interruptions requires the execution of a number of processes. Low levels of automation

can result in the involvement of a number of different groups of people in the detection and diagnosis of

causes, and in service restoration. This appendix describes these processes and the groups involved,

and shows how they change with increasing levels of automation.

Service interruption and restoration: the components

Service interruptions can be divided into three classes, in descending order of severity of impact:

1) Complete loss of all service.

2) Partial loss of service, where some of the required service is delivered but not all.

3) The service level is degraded, typically because the response time has become extended beyond

that specified in the service level agreement, with a partial loss of service possible.
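The three classes can be expressed as a simple classification rule. The inputs below (a count of unavailable services and an SLA breach flag) are illustrative assumptions, not fields defined by any product.

```python
# Illustrative classification of a service interruption into the three
# classes described above. The inputs are invented fields, not part of
# any product's data model.
def classify(services_down, services_total, sla_breached):
    if services_down == services_total:
        return "complete"    # class 1: total loss of service
    if services_down > 0:
        return "partial"     # class 2: some services unavailable
    if sla_breached:
        return "degraded"    # class 3: delivered, but too slowly
    return "none"            # service level being met
```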

A complete loss of service most often arises from an environmental failure such as power supply, air

conditioning, fire, or something more sinister, for example terrorist action. However, there may be other

causes. For example, infrastructure failures such as loss of network connections or gateway systems

such as portals could cause the loss of all the services.

Partial loss of service could be caused by the failure of one of the component systems in a distributed

application to be able to execute its part of the overall service. An example might be the failure of the

customer management system. The severity of the impact would depend on the services provided by the

failed system.

Service degradation could result from an overload on one or more parts of the system, leading to queuing

and hence extended response times. It is also possible for some other, unrelated function to cause a

problem by overloading an essential component, for example the middleware infrastructure or the

network linking the various systems together.

The approach to handling service interruptions should be first, to minimise the probability of an

interruption, and secondly, to restore the service as quickly as possible in the event of a loss. Minimising

the probability of service loss is largely achieved by removing all single points of failure. Provision for

additional capacity in the event of an overload can reduce the risk of extended response times.

Restoring the service comprises a number of steps; figure A1 is a schematic of the process.

Figure A1: IT service interruption and restoration

The service interruption occurs at time T0, on the left of the figure. The service is restored at time T1, so

the goal is to minimise the interval T0 to T1, ideally reducing it to zero. The first step is to detect the service


interruption. This may be obvious, for example all the lights have gone out following a power failure,

which suddenly interrupts all services. There may also be some warning of an impending total loss of

service, as could be the case if the air conditioning failed or a weather system were approaching; there

would be some time to take action. In other cases, the problem may be more difficult to detect, for

example where response times are gradually deteriorating.

The next step is to determine the cause of the interruption so that an appropriate recovery strategy may

be followed. While the cause may sometimes be obvious, for example in the case of a power loss or a

hard system stop, determining the root cause of a failure may be quite difficult, especially in the kind of

distributed environments common today.

The next activity is recovery, the nature of which depends on a diagnosis of the cause. A single

component failure caused by a software error could be recovered simply by restarting the failed system.

At the other extreme, a complete loss of a data centre would require a recovery in a disaster recovery

(DR) site. How long this takes depends on the provisions that have been made in advance, which follow

from an analysis of the business consequences of a failure. The DR recovery time could be as short as a

few seconds for extremely critical systems and may extend to hours or even days for less critical cases.

Following service restoration, what then remains is to collect information for subsequent analysis, either

locally or perhaps by suppliers, for example by reporting software errors.
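The effect of the T0 to T1 interval on service availability is simple arithmetic, sketched below with illustrative figures.

```python
# Illustrative arithmetic: availability over a period given the outage
# intervals (each a T0-to-T1 restoration time). The figures are examples,
# not measurements.
def availability(period_hours, outage_minutes):
    """Percentage of the period during which service was delivered."""
    downtime_hours = sum(outage_minutes) / 60.0
    return 100.0 * (period_hours - downtime_hours) / period_hours

# A single 4.4-minute outage in a 720-hour month is roughly 'four nines'.
monthly = availability(720, [4.4])
print(f"{monthly:.4f}%")   # approximately 99.99%
```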

A process view of service restoration with minimal automation

Figure A2 and the following paragraphs explain the process flow with minimal automation.

Figure A2: Service failure and restoration process flow with minimal automation


The infrastructure of hardware and software is at the bottom of the figure, while the various groups which

interact with the infrastructure in some way are shown in the swim lanes above it. The arrows represent

the flow and interactions, with time flowing from left to right. The service interruption is marked on the left

by a dashed arrow while service restoration is shown on the right, also by a dashed arrow. The central

concern is the time taken by the processes between the interruption and restoration.

A study of an organisation with minimal automation would typically reveal the following:

• Although some causes of service interruption are obvious, and quickly detected, there is a considerable amount of interaction needed between operations, technical support, user support and even end users. Some service interruptions are not detected until reported by the users, leading to extended delays.

• Diagnosing the cause is similarly variable. Some causes are again obvious, but extensive interactions with technical support and other groups are common, increasing the duration of the service interruption.

• Having determined the cause, recovery action follows. While the normal procedure is to recover in situ, this depends on the cause; complete loss of the whole environment, for example, would make it impossible.

• A successful local recovery restores the service. All that then remains is to gather information for further analysis by technical support. The time taken to perform a local recovery depends on the nature of the problem, the systems involved and the operational procedures required. The times are often extended, frequently requiring assistance from technical support.

• If a local recovery is not possible, or the attempt to perform one fails, the next step is to invoke a failover to the DR site for some or all of the applications, depending on the problem. A failover to DR requires authorisation by business management and liaison with the users. On approval by management, the DR processes are executed, restoring the service. Obtaining business management approval is liable to be time-consuming, and the recovery procedures are complex and somewhat error-prone.

All in all, there is considerable room for improvement.
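The decision flow just described (diagnose, attempt local recovery where possible, otherwise seek approval for DR failover) can be sketched as follows. The callables passed in are hypothetical stand-ins for what are, at this level of automation, largely manual procedures.

```python
# Illustrative sketch of the restoration decision flow. The callables are
# hypothetical stand-ins for what are, at low levels of automation,
# manual procedures.
def restore_service(diagnose, can_recover_in_situ, local_recovery,
                    dr_approved, dr_process):
    cause = diagnose()                      # determine the root cause
    if can_recover_in_situ(cause):
        if local_recovery(cause):           # e.g. restart the failed system
            return "restored locally"
    # Local recovery impossible or failed: fail over to the DR site,
    # which in this model still requires business-management approval.
    if dr_approved(cause):
        dr_process(cause)
        return "restored at DR site"
    return "awaiting approval"

result = restore_service(
    diagnose=lambda: "data centre power loss",
    can_recover_in_situ=lambda cause: "power loss" not in cause,
    local_recovery=lambda cause: True,
    dr_approved=lambda cause: True,
    dr_process=lambda cause: None,
)
```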

Enhancing the environment

An analysis of the above would show that there is far too much manual intervention required. Detection of service interruptions and determination of causes suffer from a lack of effective information from the system,

leading to extensive interactions with other groups, such as support. A low level of automation in

executing procedures slows down recovery actions. Further, the lack of diagnostics and automation increases the likelihood of operational errors, compounding any problems.

Significant improvements could therefore be made by improved use of systems management tools,

particularly for diagnosis, and a greatly extended use of automation. This approach would maintain the

existing strategy of requesting management approval for any failover to DR, but would greatly accelerate

all the procedures required to restore services, both as a local recovery in the main data centre, and in

the DR site.

Enhanced systems management enables operations to become far more proactive. This is a major factor

in accelerating service restoration, as extensive interaction between the various groups is very time-consuming. A psychological benefit is that users feel more confident in the system if they see problems

detected and resolved quickly without their having to complain. And an economic benefit is to reduce the

labour required for effective operation by cutting out unnecessary and wasteful activities.


Figure A3 shows the effects of enhanced systems management for problem detection and cause

diagnosis. A number of the processes and interactions shown in figure A2 have been greyed out to

represent a reduction in the degree of interaction.

Figure A3: Process flow with enhanced problem detection and diagnosis

The environment described above is a significant improvement on the original state in that the level of

interaction required with technical support and user support teams, and the end users, is greatly reduced

for detecting and diagnosing problems. However, the basic model is unchanged. Intervention by

operators is still needed and business management approval is required for any DR activity.

There is clearly scope for improvement. Further automation can be undertaken in stages, each reducing

the need for operator or other intervention and the time required for service restoration. The goal is to

move towards the dark data centre, which runs with lights out and minimal intervention. How this can be

achieved is explained next.

Towards the dark data centre

Two snapshots are shown although the implementation may require more steps. The first snapshot

shows the results of automating the execution of recovery processes but not any decision to invoke a DR

process. The second snapshot is the final result with full automation. The first snapshot is shown in figure

A4. As can be seen, interventions required are largely confined to decisions about invoking DR

processes. Figure A5 shows the final stage, the dark data centre, which is fully automated.


Figure A4: Process flow with enhanced problem detection, diagnosis and automated recovery execution

Figure A5: The final stage of automation, resulting in the dark data centre


In this stage the entire process is automated, eliminating as far as possible the need for operator

intervention and any managerial approval for DR processes. Technologies that could be used include

dual active data centres or live/hot-standby configurations. Virtualisation technologies can help by allowing

virtual servers to be easily relocated to other physical servers. The degree of sophistication depends on a

cost/benefit analysis, based on the financial or human consequences of loss of service.

Figure A5 shows the process changes that would follow from achieving this state. The remaining human

interventions have been removed or largely removed. The elapsed time from service interruption to restoration may now be reduced to zero in some cases; in effect, there is no interruption at all.

Applying high levels of automation to handling abnormal conditions should be implemented in stages to

minimise risk. With each stage the need for operator intervention will be reduced and so the processes they

execute will change. Due allowance should be made for change management and ensuring that the new

procedures work effectively. And of course the processes should be tested regularly to minimise the

possibility of malfunction during a real incident.


About the author

Peter Bye

Now an independent consultant, Peter Bye was a senior system architect in Unisys, based in London. His

special area of interest is networked computing, including communications networking, middleware, and

architectures. He has many years of experience in information technology, working as a programmer,

analyst, team leader, project manager and consultant in large-scale customer projects in banking,

transportation, telecommunications and government. He has also worked in software development

centres, during which time he spent two years as member of an international standards committee

working on systems management.

He has worked for extended periods in Sweden, Denmark, Finland, Norway, the USA, France and Spain,

as well as the UK. He has presented at a wide variety of conferences and other events and is the author

of a number of papers on networking, systems management, and middleware. He is the co-author of a

book on middleware and system integration – IT Architectures and Middleware: Strategies for Building

Large, Integrated Systems (2nd Edition) – which was published by Addison-Wesley.


For more information visit www.unisys.com

© 2014 Unisys Corporation. All rights reserved.

Unisys and other Unisys products and services mentioned herein, as well as their respective logos, are trademarks or registered trademarks of Unisys

Corporation. All other trademarks referenced herein are the property of their respective owners.

Printed in the United States of America 14-0556