
Application infrastructure virtualization White paper

WebSphere Extended Deployment (XD) Compute Grid

Technical introduction

Snehal S. Antani, WebSphere XD Technical Lead
SOA Technology Practice, IBM Software Services

April 2008


1. Executive Summary
2. Batch processing and its role in IT
3. Technical Overview of WebSphere XD Compute Grid
4. WebSphere XD Compute Grid Use-cases
   4.1 Use-case 1: Batch Modernization
   4.2 Use-case 2: Highly Parallel Batch Jobs
   4.3 Use-case 3: Dynamic Online and Batch Infrastructure
   4.4 Use-case 4: Batch as a Service
   4.5 Use-case 5: Replacing Existing Java Batch Solutions
   4.6 Use-case 6: Sharing Reusable Business Services Across Batch and OLTP


1. Executive Summary

Overview

Batch processing is an integral part of an IT infrastructure. Core business processes – calculating interest and credit scores, payments processing, billing systems, end-of-month/quarter/year reporting, and so on – all rely on batch as the execution environment. The emergence and evolution of standards, middleware, interpreted languages such as Java, and new development tooling will have significant impacts on the strategic business and technical direction of batch. This whitepaper discusses the role an emerging technology, WebSphere XD Compute Grid, will play in batch modernization and enterprise grid computing.

Value Proposition for WebSphere XD Compute Grid

1. Delivers an enterprise java batch execution environment built on WebSphere (zAAP eligible batch on z/OS)

2. Enables the incremental migration of COBOL to Java (on z/OS) thereby reducing the risks associated with a batch modernization project

3. Integrates with existing enterprise batch schedulers such as Tivoli Workload Scheduler (TWS), CA7, Control-M, Zeke to help deliver a robust, cost-effective, WebSphere-based batch execution environment

4. Enables new execution patterns including: Dynamic OLTP and Batch runtime environment built on WebSphere; highly parallel batch jobs; and many others.

5. Integrates with the overall SOA strategy of reuse by enabling one to share business logic across both the OLTP and Batch paradigms.

6. Delivers high-performance batch processing by leveraging the System-z, z/OS, and WAS z/OS performance optimizations gained when executing within close proximity of the data.

Common Use-Cases

1. Batch Modernization
2. Highly Parallel Batch Jobs
3. Dynamic Online and Batch Infrastructure
4. Batch as a Service
5. Replace Existing, Homegrown Batch Frameworks
6. Share Business Logic Across Online and Batch Paradigms


2. Batch processing and its role in IT

Batch processing is an integral part of an IT infrastructure. It encompasses all non-interactive processes – processes that do not require direct user input; furthermore, batch workloads must complete within some constrained window of time. Financial institutions, for example, will reconcile banking activities after the business day has ended; business cannot start the next day until all account activities have been settled. Batch is not limited to the financial services sector by any means. Insurance, healthcare, retail, and really any enterprise application infrastructure will require batch. Retail companies, for instance, have incorporated batch workloads into their supply chain; at the end of the business day, current inventory is evaluated and items that need restocking are ordered from suppliers. Service providers are another example; utility companies generate billing statements at the end of each month, an activity that is achieved with programmatic access to the month's activities – no user input is needed.

The criticality of batch within a business imposes demanding requirements in terms of performance, recoverability, and availability. As modern enterprises evolve and adopt a 24x7 business model, batch windows – the constrained period of time that the workload must complete within – will only get smaller. Performance therefore is arguably the most important requirement. Recoverability is essential as well. If a batch job fails for whatever reason, the work completed thus far should be recoverable and, of course, must maintain data integrity; the overhead of recovering that failed job should be minimized. For example, suppose a batch job must process several million bank account records; if that job fails while processing the last record, reprocessing the millions of completed records should be avoided – the job should restart at (or near) the failed record and complete quickly. Finally, availability of the batch service is a must. Businesses depend on the completion of their batch workloads; revenue could be lost if, for example, the day's account activities have not been reconciled and business cannot continue until that has completed.

For the first technical requirement – performance – the proximity of the business logic to its data is highly influential. The closer the business logic is to the data, the faster the workload will execute. There are two approaches to modifying the proximity: bring the data to the business logic, typically achieved through caching; or bring the business logic to the data, normally realized by moving the application to hardware shared with the database (System Z, for example). On distributed platforms – UNIX, Linux, Solaris, Windows, etc. – the data tends to reside on separate physical machines from the business logic. Caching technologies and database tuning therefore become essential to the overall performance of the system. Data Grid, the distributed caching component of WebSphere XD, provides a sophisticated set of technologies for in-memory databases, data partitioning, and aggressive caching strategies. Compute Grid, the batch execution component of WebSphere XD, can integrate with this caching technology to deliver a high-performance batch processing system. Of course, the use of such a caching technology is highly dependent on the data access patterns of the application; there will be cases where data cannot be cached or stored in some large-scale in-memory database.

In that scenario, database tuning or bringing the business logic to the data become more viable options. If the data for a batch workload resides on the mainframe (System Z), the ideal location for the business logic will be the mainframe. Once again, close proximity is essential for high-performance batch systems. WebSphere XD Compute Grid for z/OS, as a batch execution runtime, integrates with both the z/OS operating system and WebSphere for z/OS to deliver highly optimized methods for accessing data stored on the mainframe, whether that data resides in traditional VSAM files, IMS, DB2 for z/OS, or other traditional and native data stores. Remote access of data – retrieving or storing data from a non-z/OS process – introduces a tremendous amount of overhead compared to the alternative of co-locating the business logic and data on z/OS. The performance impacts of this overhead are exacerbated with batch processing, where the workload is processing tens or hundreds of millions of records within some constrained period of time.

A common design pattern for building high-performance systems is divide-and-conquer: partition large workloads into smaller, more manageable pieces and execute them in parallel. This approach can significantly (and positively) impact the overall execution time of a workload; normally the elapsed time for completing a job is a function of the number of records to be processed. With highly parallel workloads, the elapsed time instead becomes a function of the size of a single partition. A concrete implementation of a parallel processing system includes a dispatcher, whose role is to manage the execution of partitions, and N workers, whose purpose is to execute jobs as discrete workloads. The architectural topology of WebSphere XD Compute Grid naturally fits this pattern; therefore it facilitates highly parallel batch jobs. XD Compute Grid takes this pattern a step further, however. The worker component within Compute Grid supports running multiple batch jobs concurrently; the worker hides the complexities of multi-threading from the application, thereby simplifying both development and runtime management. An additional benefit, which is especially important on z/OS, is that fewer hardware resources are consumed with this topology: XD Compute Grid can run multiple jobs in parallel within a single process.

For the second technical requirement – recoverability – batch workloads execute with certain expected qualities of service. Data integrity is of course a non-negotiable requirement; but in addition, failed batch jobs must have the ability to quickly restart. In batch terminology, workloads should be checkpointed – the state of the workload saved so that the job can restart from its latest saved state – at some interval throughout their execution. Checkpoint mechanisms are by no means novel; they have existed for several decades and are widely used in batch systems today. WebSphere XD Compute Grid, as a batch execution environment, delivers expected batch qualities of service such as a checkpoint facility to ensure workloads are recoverable.

Checkpoint intervals, in terms of WebSphere XD Compute Grid, essentially demarcate long-running transactions. The start of a checkpoint interval is the start of a transaction; the end of an interval is the commit of that transaction. All transactional work executed during that interval is governed by the underlying transaction manager; if the transaction fails and the work must be rolled back, the transaction manager will work with the transactional resources involved and roll the changes back. If the transaction succeeds and must be committed, the transaction manager will again work with the transactional resources involved and commit the changes. WebSphere XD Compute Grid leverages the underlying transaction manager used by the application server. For example, on z/OS the transaction manager is RRS; WebSphere XD Compute Grid on z/OS will therefore use RRS as its transaction manager. As it executes a batch job, WebSphere XD Compute Grid manages the checkpoint intervals for that job. As a checkpoint interval completes, Compute Grid stores metadata, such as the current positions within the input and output files, on behalf of the job. If the job fails and must be restarted, this metadata is retrieved and the data streams are repositioned appropriately. This work – managing checkpoint intervals and repositioning streams – is essentially hidden from the application.

The third requirement – availability – is one that is familiar not only to batch workloads, but to enterprise middleware infrastructures generally. Mission-critical applications and their dependent components must be both scalable and fault-tolerant; redundancy, delivered through clustering, is a significant part of the solution. If the demand for a component spikes, redundancy helps cover the additional load. If a component fails, redundancy ensures that at least one spare is available for use. WebSphere XD Compute Grid leverages the high availability options of WebSphere to ensure the batch runtime is always running. On distributed platforms, Compute Grid can integrate with the Operations Optimization component of WebSphere XD to deliver a virtualized and dynamically scalable runtime. On z/OS, Compute Grid leverages the multi-process architecture of WebSphere for z/OS and z/OS workload management to deliver a robust, scalable runtime. Additionally, the architectural topology of WebSphere XD Compute Grid ensures that its components are clustered and therefore redundant and highly available.

The three technical requirements – performance, recoverability, and availability – are important for one reason: batch workloads, which are critical to business operations, must complete within a constrained window, a window that is only getting smaller. This imposes a tremendous challenge for technology vendors and IT departments alike: to continuously improve in each of those three areas. The evolution of middleware and the emergence of interpreted languages such as Java will influence batch in many ways, but the three fundamental technical requirements will remain.
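To make the checkpoint-interval mechanism described above concrete, the following minimal sketch shows, in simplified Java, how a batch container might demarcate each interval as a transaction and persist restart metadata. The class and interface names (TransactionManager, RecordStream, JobRepository) are illustrative stand-ins, not the actual Compute Grid or RRS APIs.

    /**
     * Illustrative sketch of checkpoint-interval processing; NOT the actual
     * Compute Grid container code. All interfaces are hypothetical stand-ins.
     */
    public class CheckpointLoopSketch {

        interface TransactionManager { void begin(); void commit(); void rollback(); }

        interface RecordStream {
            boolean hasNext();
            String next();
            long position();             // current location, used as checkpoint metadata
            void seek(long position);    // reposition the stream on restart
        }

        interface JobRepository {
            long loadLastCheckpoint(String jobId);               // 0 for a brand new job
            void saveCheckpoint(String jobId, long position);
        }

        private final TransactionManager tx;
        private final RecordStream input;
        private final JobRepository repository;
        private final int checkpointInterval;   // records processed per interval

        public CheckpointLoopSketch(TransactionManager tx, RecordStream input,
                                    JobRepository repository, int checkpointInterval) {
            this.tx = tx;
            this.input = input;
            this.repository = repository;
            this.checkpointInterval = checkpointInterval;
        }

        public void run(String jobId) {
            // On start or restart, reposition the data stream to the last committed checkpoint.
            input.seek(repository.loadLastCheckpoint(jobId));

            while (input.hasNext()) {
                tx.begin();                                  // start of a checkpoint interval
                try {
                    int processed = 0;
                    while (input.hasNext() && processed < checkpointInterval) {
                        process(input.next());               // business logic for one record
                        processed++;
                    }
                    repository.saveCheckpoint(jobId, input.position()); // restart metadata
                    tx.commit();                             // end of the checkpoint interval
                } catch (RuntimeException e) {
                    tx.rollback();                           // discard the partial interval
                    throw e;                                 // job restarts later from the last checkpoint
                }
            }
        }

        private void process(String record) {
            // placeholder for the business logic applied to each record
        }
    }

In this sketch, a failure mid-interval causes only that interval's records to be reprocessed on restart; the records behind the last committed checkpoint are not touched again.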

The three technical requirements for batch – performance, recoverability, and availability – are critical; business operations depend on their batch workloads, and revenue is at stake. IT departments, however, are faced with non-technical business challenges as well, namely reducing execution and maintenance costs. Furthermore, IT departments are challenged to operate with a constant (or shrinking) budget while absorbing high growth in data and transaction volumes, a task that, on the surface at least, seems impossible. This is particularly important on the mainframe, the location of most enterprise-critical data and batch workloads. The emergence of virtualization, both hardware and software, on distributed platforms, as well as specialty engines – zAAP, zIIP, and IFL processors – on System Z, has helped ease the challenge of cost reduction. The question now is how software can ease the adoption of these technologies without sacrificing any of the three aforementioned technical requirements.

The first business challenge is reducing execution costs. Major contributors to execution costs differ across platforms. On distributed platforms – UNIX, Linux, Solaris, Windows, etc. – underutilized hardware and software licensing tend to be the issues. The overall distributed middleware infrastructure can become over-provisioned – more hardware resources are allocated to handle peak loads, but are on average underutilized; therefore hardware costs and software licensing can quickly grow out of control. Hardware and software virtualization have emerged to alleviate this problem; these virtualization technologies have the objective of reducing the overall hardware and software footprint of the infrastructure. WebSphere XD is one such technology in the software virtualization space. In addition to application virtualization and resiliency, the Operations Optimization component of XD delivers a goals-oriented runtime infrastructure that ensures workloads are meeting declared execution goals. Compute Grid, the Java batch component of WebSphere XD, integrates with the Operations Optimization component and delivers a batch execution environment that is both cognizant of the execution goals and able to leverage the application virtualization and resiliency features.

The mainframe, specifically System Z, delivered hardware virtualization and goals-oriented runtimes a few decades ago; therefore the technical solution for alleviating execution costs involves specialty processors – zAAP, zIIP, and IFL. Traditional batch processing on the mainframe was not written in Java; rather, COBOL, C, and PL/X were the dominant languages. With specialty engines – zAAP in this case – Java on the mainframe executes at a more predictable, and often cheaper, cost than other languages. Customers with a business case for migrating workloads to Java and zAAP engines have sought out technologies that help ensure the three technical requirements – performance, recoverability, and availability – are not jeopardized. Enter WebSphere XD Compute Grid for z/OS. WebSphere XD Compute Grid delivers a Java-based batch execution infrastructure built on WebSphere for z/OS; therefore Compute Grid and the batch applications it executes are zAAP-eligible on the mainframe. The technology preserves the three technical requirements in a few ways: first, XD Compute Grid
leverages the performance benefits of Java 5 on z/OS; second, the technology integrates with the z/OS operating system for more efficient execution; finally, it not only leverages the qualities of service (QoS) delivered by WebSphere for z/OS, but also delivers batch-centric QoS such as checkpointing.

The second business challenge is reducing maintenance costs. Maintenance costs for sustaining a batch and OLTP runtime infrastructure appear in many forms, including: writing and maintaining separate application code for OLTP and batch applications; supporting separate test, deployment, and production management processes; devising and maintaining separate security procedures; and purchasing separate tooling to support each of these tasks. Reducing maintenance costs is further complicated by the time it takes for tangible results to surface. In the long term, simplifying the overall runtime infrastructure will be beneficial. WebSphere XD Compute Grid can help reduce maintenance costs. Because it is a batch runtime infrastructure built on WebSphere, the separation of online and batch is eliminated. Merging these two paradigms leads to common processes for developing application code for both batch and OLTP; perhaps not all services will be shared between the two domains, but at the least test procedures, tooling, processes, and skill sets can be.

Prior to WebSphere XD Compute Grid, customers with a strong business case for Java-based batch processing took matters into their own hands. They built and maintained their own Java-based batch execution environments. For some customers, this meant building their own batch container and programming model; for others this involved a messaging system such as WebSphere MQ or some other mitigation technique. Most customers are not in the application infrastructure business; investing development dollars into building and sustaining this type of technology was not a long-term objective. Rather, the purpose of constructing this infrastructure code was to mitigate the gap later filled by WebSphere XD Compute Grid. For those that pursued this mitigation, the maintenance costs can be eliminated with the adoption of the product.

Batch processing is essential in enterprise computing and can form the foundation for a given business. The evolution of technology will continue to provide new and innovative techniques for solving the technical challenges of batch – achieving high performance, providing recoverability, and ensuring availability. The business challenges will continue to drive the strategic direction of middleware infrastructures; WebSphere XD Compute Grid, as an emerging enterprise batch technology, can help solve both the technical and business challenges.

3. Technical Overview of WebSphere XD Compute Grid

WebSphere Extended Deployment (XD) delivers a set of technologies for improving the overall resiliency, efficiency, and management of a middleware infrastructure; furthermore, XD provides features for both enterprise grid computing (Java batch) and extreme transaction processing. XD delivers these technologies through three components, each of which can be used independently or cooperatively, and each of which can be purchased individually or as a bundle. Compute Grid, a Java-based batch execution runtime built on WebSphere, delivers the following value to customers:

1. Delivers an enterprise java batch execution environment built on WebSphere (zAAP eligible batch on z/OS)

2. Enables the incremental migration of COBOL to Java (on z/OS) thereby reducing the risks associated with a batch modernization project

3. Integrates with existing enterprise batch schedulers such as Tivoli Workload Scheduler (TWS), CA7, Control-M, Zeke to help deliver a robust, cost-effective, WebSphere-based batch execution environment

4. Enables new execution patterns including: Dynamic OLTP and Batch runtime environment built on WebSphere; highly parallel batch jobs; and many others.

5. Integrates with the overall SOA strategy of reuse by enabling one to share business logic across both the OLTP and Batch paradigms.

6. Delivers high-performance batch processing by leveraging the System-z, z/OS, and WAS z/OS performance optimizations gained when executing within close proximity of the data.

Operations Optimization delivers the following features for improving resiliency and management of a middleware infrastructure:

1. A health management infrastructure, where customizable policies and corrective actions can be defined and enforced.

2. Continuous availability – interruption-free application updates – of applications with the Application Edition Manager.

3. Checkpointing the configuration of the WebSphere runtime to improve recoverability from administrative changes

4. Visualization technologies for viewing the relative health of the applications and infrastructure.

Furthermore, for distributed platforms, XD Operations Optimization provides the following features:

5. Application virtualization services enabling the dynamic allocation and provisioning of applications to meet stated execution goals, increase scalability, reduce cost, and improve reliability.

6. A goals-oriented runtime to help meet stated execution goals.

7. Service policies and relative application priorities to differentiate workloads for the goals-oriented runtime.

8. A goals-oriented, virtualized runtime infrastructure for non-WebSphere middleware including PHP (Hypertext Preprocessor), BEA WebLogic, JBoss, Apache Tomcat, and others.

9. Support for multimedia applications over voice and video using the Session Initiation Protocol (SIP).

Finally, there is Data Grid, a technology that enables extreme transaction processing (XTP) applications. This technology delivers several high-performance caching and data access options, including:

1. An in-memory cache, providing transactional access to temporary data stored within the JVM.

2. A shared-coherent cache, providing transactional access to data, similar to what databases provide today.

3. A database "shock absorber", a more traditional use-case for caching, where requests can be buffered and offloaded from the database to the cache.

4. A scalable, low-latency, highly available data store. Here Data Grid facilitates closer proximity of data and business logic through its management and replication features.

The focus of this article is WebSphere XD Compute Grid, a Java batch and enterprise grid execution environment. To understand this technology, its role among existing enterprise batch and grid artifacts must be understood. The batch landscape has four layers: schedulers, batch execution environments, batch application containers, and the business logic represented as batch applications. The following diagram depicts an overview of the batch landscape:

The role of each layer within the batch landscape is as follows:

- Schedulers manage job dependencies, resource dependencies, scheduled submissions, and some form of job lifecycle and execution management. Quartz, Flux, and other such open-source schedulers provide time-based scheduling and some form of dependency management. Tivoli Workload Scheduler, Control-M, Zeke, and other schedulers, however, provide more scheduling features and are the typical products found at larger customer shops. These shops have built complete batch infrastructures around the scheduler, including security models, auditing mechanisms, log archiving, and so on. The following diagram depicts some key players within the scheduling landscape.

- Batch Execution Environments (BEEs) host batch application containers and provide features such as transaction management, checkpointing, recoverability, security management, connection management, scalability, high availability, output processing, and so on; the inherent qualities of service and the integration with existing schedulers are provided by the BEE. XD Compute Grid delivers a BEE. The following diagram depicts the batch execution environment landscape.

- Batch Application Containers provide a well-formed invocation model for the business logic. The container manages the lifecycle of the application and gives control to the underlying transaction manager, security manager, and so on as needed. XD Compute Grid delivers a batch application container. I would argue that Spring Batch is a batch application container too: Spring Batch does not provide a transaction manager, security manager, explicit high availability, and so on, but it does allow them to be injected into the container and therefore made available to the application. The following diagram depicts the batch application container landscape.

- Batch Applications implement the actual business logic and run within a batch application container. There is nothing special to discuss here right now; perhaps portability among containers will become a topic in the future.

Three technologies in particular have come up frequently when discussing WebSphere XD Compute Grid: enterprise schedulers such as Tivoli Workload Scheduler; JZOS, a native Java invocation framework available for z/OS; and Spring Batch, an open-source batch container.

First, consider the role of Compute Grid relative to enterprise schedulers. Compute Grid is an execution runtime for enterprise grid and Java batch applications; it is not an enterprise scheduler. Enterprise schedulers serve as an integration point where the entire batch infrastructure is managed centrally. Artifacts such as enterprise-wide batch schedules, dependencies among jobs and external resources, the location where a job should execute, and so on are defined and managed at this point. XD Compute Grid works alongside enterprise schedulers and is essentially one destination to which enterprise schedulers can dispatch work. Compute Grid's primary objectives are to execute batch jobs with high performance, recoverability, and availability. The following figure depicts how Compute Grid and Tivoli Workload Scheduler would coexist on z/OS.

Compute Grid integrates with JES and allows jobs to be submitted via JCL. Since enterprise schedulers are familiar with how to manage JES batch jobs, they are by proxy able to manage Compute Grid jobs. Note that on z/OS, a native MQ client is used for job submission and monitoring; therefore job submissions do not require the initialization and termination of a Java Virtual Machine, ensuring high performance. For distributed platforms, the following figure describes the integration:

On distributed platforms, a Java-based adapter client bridges the gap, once again by proxy, between the enterprise scheduler and Compute Grid. Enterprise schedulers centrally manage operational plans and other such scheduler-specific artifacts. These schedulers then dispatch batch jobs to XD Compute Grid – via JES on z/OS; on distributed platforms the details differ, but the concept remains the same. XD Compute Grid will then execute the job, assuring that the defined qualities of service are met, and will notify the enterprise scheduler of job state information and other such execution data.

Second, consider the role of Compute Grid relative to JZOS. JZOS is a technology for enabling Java on z/OS to access and leverage traditional z/OS facilities. The following picture depicts JZOS and its roles.

This technology is composed of two components: the first is a launcher, whose job is to efficiently initialize a J2SE runtime from JCL; the second is a set of programming interfaces available to applications for accessing traditional z/OS resources such as datasets. The JZOS launcher initializes a J2SE runtime – a runtime that lacks features such as transaction, security, and connection management; checkpoint and recoverability facilities for batch jobs; and inherent high availability and other qualities of service provided by enterprise middleware such as WebSphere. Furthermore, for each step executed, JZOS will initialize a Java Virtual Machine (JVM). For a few batch steps this may not be an issue, but when executing tens, hundreds, or thousands of steps within a batch window, the overhead of JVM initialization and destruction will both dramatically decrease the overall performance of the system and significantly increase the number of CPU instructions executed, which, on z/OS, directly impacts the monetary cost of the system.

The JZOS programming interfaces (APIs) serve a different purpose than the launcher. The following diagram depicts the position of these interfaces relative to WebSphere XD Compute Grid.

The programming interfaces can be leveraged by Compute Grid applications, enabling these applications to access traditional z/OS artifacts such as ZFS, HFS, VSAM, and others. The JZOS programming interfaces coupled with Compute Grid deliver a strong integration point between enterprise Java batch applications and traditional z/OS.

The third technology to discuss is Spring Batch. Spring Batch is a batch application container. The technology does not provide a transaction manager, security manager, connection management, log management, inherent high availability, or other such infrastructure services. The technology enables the configuration of a batch application via dependency injection and delegates to both the business logic and infrastructure services as needed during batch execution. The following diagram depicts the role of a batch container, which really is a configurable delegation mechanism to the business logic and underlying execution environment.

WebSphere XD Compute Grid, on the other hand, provides technology for three of the four batch layers described: a scheduler, tightly integrated with the batch execution environment, for managing the execution of XD batch applications; a batch execution environment that provides infrastructure services such as the transaction manager, security manager, high availability, and so on; and a batch application container that delegates to the business logic and to the infrastructure services as appropriate. Spring Batch addresses one layer of the batch landscape; WebSphere XD Compute Grid addresses three layers. There are many technologies within the enterprise batch infrastructure; the three described – enterprise schedulers, JZOS, and Spring Batch – have surfaced as technologies of value and the source of much confusion with regard to Compute Grid.

Batch applications, regardless of the underlying technology, are well-defined. A single job consists of one or more steps. Each step is composed of the following components: the source(s) of input data; the business logic that must be applied to the data; the destination(s) of output data; and finally the qualities of service that will govern the execution of the step. The following diagram depicts the definition of a batch job.

WebSphere XD Compute Grid delivers a programming model for representing each of the components of a batch job. The source(s) and destination(s) of data are represented by Batch Data Streams; the business logic for each step is contained within Batch Job Steps; and other programming artifacts represent the other pieces. To ensure simple development and higher performance, Compute Grid jobs are built using plain old Java objects (POJOs) rather than alternative, heavyweight technologies. Each batch job is represented by metadata; in traditional batch this metadata was in the form of JCL. Compute Grid leverages an XML representation of the batch job named xJCL. The Compute Grid architecture follows the traditional dispatcher-worker pattern for high scalability and performance; the Long-Running Scheduler (LRS), as the dispatcher, can receive xJCL job definitions through multiple channels, including enterprise Java beans, web services, Java messaging, a command-line service, and a job management console. The following figure depicts this.
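As a rough illustration of this programming model, the sketch below shows what a POJO-based job step and its input data stream might look like. The interface and class names are simplified stand-ins for the Batch Job Step and Batch Data Stream concepts named above, not the product's actual API signatures; the return codes and property names are likewise hypothetical.

    import java.util.Properties;

    /**
     * Simplified stand-ins for the "Batch Data Stream" and "Batch Job Step"
     * concepts; these are NOT the product's actual interfaces or signatures.
     */
    interface BatchDataStreamSketch {
        void open(Properties props);     // position the stream, possibly from checkpoint data
        String getNextRecord();          // null when the stream is exhausted
        String externalizePosition();    // checkpoint metadata saved by the container
        void close();
    }

    interface BatchJobStepSketch {
        void createJobStep(Properties props);   // called once before processing begins
        int processJobStep();                   // called repeatedly inside checkpoint intervals
        int destroyJobStep();                   // called once; the value becomes the step return code
    }

    /** Example step: reads account records and applies POJO business logic to each one. */
    class InterestCalculationStep implements BatchJobStepSketch {

        static final int CONTINUE = 0;   // illustrative return codes
        static final int COMPLETE = 1;

        private BatchDataStreamSketch accounts;

        public void createJobStep(Properties props) {
            // In Compute Grid the streams are declared in the xJCL and supplied by the
            // container; here we pretend one was handed to us through the properties.
            accounts = (BatchDataStreamSketch) props.get("accountStream");
        }

        public int processJobStep() {
            String record = accounts.getNextRecord();
            if (record == null) {
                return COMPLETE;         // no more data: the step is finished
            }
            calculateInterest(record);   // plain POJO business logic
            return CONTINUE;             // ask the container to call processJobStep() again
        }

        public int destroyJobStep() {
            return 0;                    // overall step return code
        }

        private void calculateInterest(String accountRecord) {
            // business logic omitted
        }
    }

The essential point is that the container, not the application, decides when to invoke the step again, when to take a checkpoint, and when the step is complete.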

The LRS, according to a variety of criteria and rules whose descriptions fall beyond the scope of this whitepaper, will dispatch that job to a cluster of Long-Running Execution Environments (LREEs). The LREEs, multi-threaded batch execution runtimes, will then enforce qualities of service including recoverability. As jobs execute, the LREEs will notify the LRS of the state of the job. The following diagram depicts this behavior:

To ensure availability of the batch execution environment, both the LRS and LREE are clustered. Furthermore, the scalability of both components is based on the underlying application server infrastructure. Compute Grid provides its own scaling mechanisms, but can build on mechanisms provided by other middleware products as well. For example, with WebSphere XD Operations Optimization, dynamic clusters coupled with intelligent load balancing can provide scalability on distributed platforms; with WebSphere z/OS, the inherent multi-process architecture delivers scalability.

Compute Grid, an emerging enterprise grid and Java batch execution runtime, addresses both the technical and business challenges faced by IT infrastructures. The dispatcher-worker pattern, coupled with multi-threaded execution runtimes, addresses the challenge of performance. The checkpointing mechanisms, together with the qualities of service delivered by the underlying WebSphere runtime, assure both recoverability and availability.

4. WebSphere XD Compute Grid Use-cases

4.1 Use-case 1: Batch Modernization

Pursuing batch modernization projects has been of particular interest to z/OS customers. The purpose of these projects is to migrate from a native z/OS batch runtime, typically developed in programming languages like C, C++, PL/I, and COBOL, to Java. These projects have three primary goals:

1. The new system should perform as well as the existing system.
2. The new system must maintain the same qualities of service as the old system.
3. The operational costs for the new system must be reduced.

Furthermore, these projects have two secondary goals:

4. A more agile batch infrastructure, tolerant to future changes.

5. Common development, testing, deployment, security, archiving, and production management processes and tooling across OLTP and batch.

Modernization projects can be seen as risky because of an all-or-nothing approach. Typically, the new Java environment would be built in parallel to the existing infrastructure; the original system would be maintained while the new system was developed. Executives see these projects as risky primarily because of the long period of time between the investments made into the project and the returns of cost savings. With WebSphere XD Compute Grid, however, the risks are alleviated. Compute Grid enables the incremental migration of traditional batch applications to Java on z/OS. Both application types can coexist within the same execution container, sharing global resources such as transactions, connections to backend resources, and so on. The advantage of the incremental migration is a quick and quantifiable return on investment; the all-or-nothing approach is therefore no longer the sole option for modernization. As each native batch module is migrated to Java, the off-loading of processing cycles to zAAP processors increases.

Customers pursuing batch modernization projects based on Compute Grid have taken a similar approach. For the first phase, the new Compute Grid infrastructure is built and all new batch applications are deployed there. For the second phase, the following steps are executed:

1. Identify all of the modules to be migrated.
2. Determine the dependencies among the jobs as well as the resources consumed by each.
3. Migrate jobs with few dependencies but significant resource consumption to Java.

For the third and final phase, incrementally migrate the remaining modules to Java. In this phase, both native and Java modules will be co-located in the same virtual machine; resources such as transactions, connections, and so on will be shared among them. The following figure depicts the overall strategy:

4.2 Use-case 2: Highly Parallel Batch Jobs

The dispatcher-worker architecture has been well established within the realm of high-performance computing. This architecture enables a highly scalable execution environment, one in which parallel processing of single jobs is possible. WebSphere XD Compute Grid builds on the dispatcher-worker architecture, but takes the pattern a step further: the dispatcher, and more importantly the workers, are multi-threaded. In Compute Grid, a single batch job is dispatched to a single worker and, for the life of the job, executes on a single managed thread. Managed threads are those that are visible to the underlying workload managers and have associated with them a transaction context, security context, and other such execution metadata. The Compute Grid dispatcher, the Long-Running Scheduler, manages the batch jobs dispatched across the many workers, the Long-Running Execution Environments, in the grid of resources.

The Compute Grid architecture, the behavior of the dispatcher, and the qualities of service provided by the workers are conducive to highly parallel batch jobs. A highly parallel batch job is defined as a single, large batch job that can be broken into discrete chunks; each chunk can be executed concurrently across a grid of resources. With Compute Grid, a single job to process 100,000,000 bank account records can be broken down into smaller jobs, each processing a separate partition of the original data. For example, that job of 100,000,000 records can be broken into 100 jobs of 1,000,000 records each. The jobs, each executing on its own managed thread, would be dispatched across the workers in the grid for execution. The Compute Grid infrastructure will execute each chunk concurrently with no sacrifice of qualities of service such as checkpointing, security, job restart, transactional integrity, and so on. There are two techniques for creating work partitions: the first is to create a proxy service, custom business logic that knows how to break the job into smaller pieces; the second is to leverage the Parallel Job Manager component of Compute Grid, which provides callbacks and rules for creating job partitions. The following diagram illustrates the Parallel Job Manager and its role within the Compute Grid infrastructure.

Regardless of how the partitions are created, the many partitions must be managed and monitored as a single cohesive logical job. Compute Grid's Parallel Job Manager delivers a set of features for managing these parallel batch jobs, such as:

- stop/cancel the logical job, which will stop/cancel all of the underlying job partitions; for example:

- start/restart the logical job, which will determine which job partitions must be started or restarted;

- coordinate transactions among the job partitions to assure data integrity;

- monitor the logical job, which represents the status of all of the job partitions; for example:

- manage and aggregate the logs of the parallel jobs; for example:

The parallel execution of batch jobs can dramatically improve overall performance and reduce the elapsed execution time of a batch job. The execution time for processing 100,000,000 records serially is a function of the total records to be processed. The execution time for processing the job in parallel, however, is a function of the size of the largest job partition. For example, breaking that single large job into 100 partitions of 1,000,000 records yields an execution time that is a function of 1,000,000 records. The management of the overall parallel execution, especially at the enterprise level with thousands of parallel jobs, is the real technical challenge. Compute Grid delivers both the parallel execution and the management infrastructure.
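The arithmetic behind such partitioning is simple; the following sketch shows one hypothetical way a partitioning rule (for example, one supplied to the Parallel Job Manager) might split a contiguous key range of 100,000,000 records into 100 equal partitions. The class and method names are illustrative and do not represent the Parallel Job Manager's actual callback API.

    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative partitioning rule; not the actual Parallel Job Manager callback API. */
    public class RangePartitionSketch {

        /** A contiguous range of record positions handled by one job partition. */
        public static class Partition {
            public final long firstRecord;   // inclusive
            public final long lastRecord;    // inclusive

            Partition(long firstRecord, long lastRecord) {
                this.firstRecord = firstRecord;
                this.lastRecord = lastRecord;
            }
        }

        /** Splits the range [0, totalRecords) into partitionCount roughly equal, contiguous ranges. */
        public static List<Partition> partition(long totalRecords, int partitionCount) {
            List<Partition> partitions = new ArrayList<Partition>();
            long baseSize = totalRecords / partitionCount;
            long remainder = totalRecords % partitionCount;
            long next = 0;
            for (int i = 0; i < partitionCount; i++) {
                long size = baseSize + (i < remainder ? 1 : 0);  // spread any remainder evenly
                partitions.add(new Partition(next, next + size - 1));
                next += size;
            }
            return partitions;
        }

        public static void main(String[] args) {
            // 100,000,000 records split into 100 partitions of 1,000,000 records each;
            // elapsed time then becomes a function of the largest partition, not the whole job.
            List<Partition> parts = partition(100000000L, 100);
            Partition first = parts.get(0);
            System.out.println(parts.size() + " partitions, first covers records "
                    + first.firstRecord + " through " + first.lastRecord);
        }
    }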

4.3 Use-case 3: Dynamic Online and Batch Infrastructure

Virtualization technologies, such as those available on System Z, System P, and other platforms, have long been sought after by IT operations centers in an effort to reduce operations costs. With virtualization, the hardware and software footprints of an infrastructure can be reduced; hardware and software can be consolidated and shared across a common set of resources. On decentralized, distributed platforms such as UNIX, Linux, Windows, and so on, virtualization becomes especially important because it alleviates the problem of traditionally low hardware utilization levels. As the execution of workloads shifts from statically provisioned hardware and software infrastructures to virtualized runtimes, new technical challenges must be addressed. For example, when sharing hardware among a set of virtualized application servers hosting different applications, the infrastructure must ensure that the applications of higher priority are given more resources when the system is under load. On System Z, z/OS workload management technologies address this challenge by way of service policies, relative application priorities, and goals-oriented execution. WebSphere XD Operations Optimization brings conceptually similar technologies to distributed platforms. There are numerous advantages and challenges to adopting a goals-oriented runtime such as z/OS, XD Operations Optimization, and others, the descriptions of which fall beyond the scope of this paper.

In terms of batch processing, however, several issues become important. For example, virtualized runtimes create a highly scalable infrastructure, one that can react to increases in load, whereas statically provisioned runtimes are rigid and predefined. This ability to dynamically increase and decrease the size of the runtime infrastructure is especially valuable within the batch domain because demand for batch is non-uniform. Batch jobs traditionally run within some predefined window of time, after business hours for example. With a dynamic runtime, the batch window can trigger additional hardware and software resources to be provisioned to address the increase in demand; upon completion of the batch window, those provisioned resources can be released back to the pool of available resources for use by other workloads and sub-systems. In addition to dynamically provisioning resources, other challenges must be addressed: ensuring that higher priority workloads are given more resources; accounting for the resources consumed by various workloads for billing purposes; and ensuring that workloads execute within their expected execution targets.

Conceptually, this dynamic behavior and the solutions to these challenges are similar across platforms – distributed and z/OS. Technically, however, the mechanisms for achieving this behavior are very different. On z/OS, XD Compute Grid integrates with WebSphere for z/OS and the underlying z/OS operating system and leverages its inherent dynamic runtime. WebSphere for z/OS, for example, executes with a multi-process server architecture where workload management can dynamically increase and decrease the capacity of the server given some execution criteria. The following set of diagrams illustrates how this mechanism would behave.

In this first diagram, OLTP workloads are classified as gold priority. The underlying hardware virtualization mechanisms of System Z and z/OS therefore provision CPU and other resources to ensure that these work items are meeting their specified workload goals. As the runtime enters the batch window, the infrastructure can dynamically expand to process the additional workload.

In this diagram, the enterprise scheduler – Tivoli Workload Scheduler, for example – submits the batch jobs into the XD Compute Grid infrastructure. The Long-Running Scheduler queues batch requests as the batch execution runtimes initialize.

In this diagram, WebSphere for z/OS, working with z/OS workload management, dynamically starts additional Servant processes to execute the workloads of different priorities – gold, silver, and bronze. System Z, working with z/OS workload management, provisions the underlying hardware, CPU for example, to ensure that the workloads of varying priority execute per their stated performance targets. As the batch window completes, the unnecessary WebSphere z/OS Servant processes and the underlying hardware resources are de-allocated to ensure efficient resource usage.

With WebSphere XD Operations Optimization (XDOO), this behavior is emulated in distributed runtime infrastructures. The On-Demand Router, the principal component of XDOO, delivers a dynamically scalable, goals-oriented runtime to distributed platforms. With this technology, the utilization of the underlying server resources is monitored along with the overall execution of the prioritized workloads. Conceptually similar to WebSphere z/OS, the On-Demand Router will dynamically start additional instances of the application within the grid of hardware resources in an attempt to meet the stated execution targets of the workload. If the On-Demand Router detects that the runtime is over-provisioned – more application instances are started across the grid than needed – the unnecessary resources are de-allocated and made available for use by other workloads.

4.4 Use-case 4: Batch as a Service

WebSphere XD delivers features for resource usage accounting and reporting; on z/OS, these features integrate with the facilities already available on the platform (workload management, RMF, SMF, and so on). Usage accounting allows detailed reports – for example, the number of CPU seconds used by a given user or job submitter – to be compiled and used to charge the responsible consumer of those resources.

The following is a customer example for chargeback. A datacenter hosts a batch service for producing customer bills. This service is given a list of accounts and the type of statement that must be produced and mailed. The consumers of this service are lines of business, including mortgage, credit cards, and so on. Each consumer of the service must be charged accurately, ideally by some detailed and quantifiable mechanism such as the number of CPU seconds consumed. With WebSphere XD Compute Grid, usage reporting classes can be associated with batch jobs and the detailed resource usage can be collected. Data centers can use this feature to deliver batch as a service, a potentially new revenue model whereby they are able to charge consumers of the service for exactly the amount of resources consumed. The following diagram depicts a simple scenario with batch as a service:

4.5 Use-case 5: Replacing Existing Java Batch Solutions

IT operations, prior to the introduction of enterprise batch runtimes like WebSphere XD Compute Grid, took it upon themselves to build their own homegrown batch solutions. The investment in such a homegrown system was justified by the potential long-term operational and maintenance cost savings. Oftentimes, homegrown solutions solve a narrow problem and, more importantly, can present a significant maintenance cost over time. Compute Grid can be used as the future enterprise grid and batch computing environment. In order to adopt the technology as the long-term strategic batch platform, the existing systems must be replaced through some well-structured phase-out project. The following scenario provides an example of a pre-Compute-Grid and post-Compute-Grid infrastructure.

Example: Replace a homegrown, queue-based batch infrastructure with XD Compute Grid. In this example, the existing batch infrastructure is a queue-based system where batch records are delivered via some message queue. Listening on this queue is a message-driven enterprise bean (MDB); this MDB is notified when a new record must be processed. As records arrive in the queue, the MDB retrieves each record, processes it, and delivers the output to some results queue for post-processing. All of this work is done under the scope of at least one transaction, but oftentimes many transactions are involved in the processing of a single record, which has performance implications. The following diagram illustrates this type of architecture:
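As a minimal illustration of the pattern just described, the sketch below shows such a record-per-message MDB, assuming EJB 3 style annotations; the queue name and record format are hypothetical, and a full implementation would also write its output to a results queue.

    import javax.ejb.MessageDriven;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    /**
     * Minimal sketch of the homegrown, queue-based batch pattern described above.
     * One batch record arrives per JMS message; the container wraps each delivery
     * in (at least) one transaction. The queue name and record format are hypothetical.
     */
    @MessageDriven(mappedName = "jms/BatchRecordQueue")
    public class BatchRecordMDB implements MessageListener {

        public void onMessage(Message message) {
            try {
                String record = ((TextMessage) message).getText();
                process(record);
                // A full implementation would place the result on a results queue here.
            } catch (Exception e) {
                // Throwing typically triggers redelivery or routes the message to a dead-letter queue.
                throw new RuntimeException("Failed to process batch record", e);
            }
        }

        private void process(String record) {
            // business logic for one record
        }
    }

Notice that checkpointing, restart, and operational monitoring appear nowhere in this picture; those are exactly the questions raised below.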

To reach enterprise quality with a queue-based batch infrastructure, some important questions must be answered.

1. How will the lifecycle of the batch application be managed? Will a homegrown application management system be built as well?

2. How will the operational staff monitor the progress and health of the batch system? Will a homegrown monitoring infrastructure be built?

3. How will the batch workloads recover if there are failures? Will a homegrown checkpoint and recovery system be implemented?

4. When running batch and OLTP workloads concurrently, how will the operational staff ensure that one workload is not adversely impacted by the execution of the other?

These four challenges are not insurmountable; they can be solved. The issue, from an operational point of view, is how much the solution will cost to develop and manage, and whether it is cheaper to purchase a solution than to build one. Compute Grid can be applied to such an infrastructure and address the aforementioned operational and management challenges. Furthermore, the agility of Compute Grid enables many of the existing assets to be reused. For example, the following architecture illustrates how Compute Grid could be applied to the queue-based batch system. There are several ways to apply Compute Grid to this problem; this is just one approach, where the goal is the reuse of existing assets while solving the operational and management challenges posed.

Applying XD Compute Grid to the solution addresses the operational and management issues associated with the queue-based batch solution. The Long-Running Scheduler (LRS) component of Compute Grid provides operational control over the job lifecycle (start, stop, cancel, and so on). The LRS also provides monitoring and health features for the operational staff, as well as workload management features to help ensure that higher priority workloads such as OLTP are not adversely impacted during batch execution. The Long-Running Execution Environment (LREE), the worker component of Compute Grid, applies qualities of service such as workload checkpoints, job restart, and recoverability to the batch applications.

The operational and management concerns stated for this example are not unique to queue-based batch runtimes. They are relevant for any custom batch infrastructure, whether it be a homegrown batch container, a queue-based system, stored procedures within the database, and so on. XD Compute Grid provides a flexible programming model and infrastructure that can be adapted to existing infrastructures and bring more robust operational and management controls to the solution.

4.6 Use-case 6: Sharing Reusable Business Services Across Batch and OLTP

Business services, defined as discrete tasks responsible for executing some business function, can be shared across both the batch and OLTP runtimes. Batch execution environments are typically independent of the OLTP runtime, especially within the context of J2EE middleware like WebSphere. WebSphere XD Compute Grid delivers a batch execution runtime built on J2EE application servers, and this technology therefore brings together two traditionally disparate worlds, the effect of which is the ability to share business services across them. The following diagram depicts this. The reuse of these business services aligns well with the overall strategies dictated by SOA.

Several questions come about when investigating the reuse of services across these two domains. The two most prominent are: first, how can one pursue this strategy without sacrificing performance; and second, what percentage of services can actually be reused, given the inherently different workloads that would be executed.

For the first question, the performance of batch processing is arguably one of the most critical requirements; therefore, when describing "services" in terms of batch, local services – most likely plain old Java object (POJO) based services – should be implied. The overhead of invoking remote services deters the use of that technology given the performance and time constraints of batch. When describing POJO-based services, technologists will be quick to point out the loss of authorization checks, transaction demarcation, and other such qualities of service provided by J2EE, web services, and so on. In terms of batch processing, authorization checks are typically not required for each record to be processed; rather, an initial authorization check is executed and encompasses the life of the job. Therefore, authorization checks are executed at the outermost layer, the layer managed by the batch container. WebSphere XD Compute Grid does provide the choice of using the security identity of the application server or of the batch-job submitter when processing each record, in case those types of business requirements exist. Within the context of WebSphere XD Compute Grid, transactions should not be managed directly by the business logic processing a specific record; instead, XD Compute Grid leverages transactions via the batch container to deliver checkpointing.

As mentioned, the requirements for authorization checks, transaction demarcation, and other typical J2EE qualities of service will differ across the two execution paradigms; in order to reuse services, however, the requirements of both domains must be satisfied. The following diagram depicts an application architecture that enables a POJO service to be reused across OLTP and batch.

In that figure, POJO services are private and therefore cannot be invoked directly by application code that is not part of the "kernel". Traditional OLTP clients would only access a POJO service via an EJB wrapper; this wrapper is used to achieve the expected OLTP qualities of service. From the batch side, the services are invoked by batch applications whose qualities of service are enforced by the XD batch container.
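A minimal sketch of that wrapping pattern follows, assuming EJB 3 style annotations; the service and interface names are hypothetical.

    import javax.ejb.Local;
    import javax.ejb.Stateless;

    /** Hypothetical "kernel" POJO service, shared by both OLTP and batch callers. */
    class InterestService {
        /** Pure business logic: the caller supplies the data, so its source is irrelevant. */
        public double calculateInterest(double balance, double annualRate) {
            return balance * annualRate;
        }
    }

    /** Local business interface exposed to OLTP clients. */
    @Local
    interface InterestServiceLocal {
        double calculateInterest(double balance, double annualRate);
    }

    /**
     * EJB wrapper used by traditional OLTP clients; the J2EE container adds the
     * expected OLTP qualities of service (transaction demarcation, authorization)
     * around the delegated call. Batch applications invoke the InterestService
     * POJO directly, with their qualities of service enforced by the XD batch container.
     */
    @Stateless
    public class InterestServiceBean implements InterestServiceLocal {

        private final InterestService kernel = new InterestService();

        public double calculateInterest(double balance, double annualRate) {
            return kernel.calculateInterest(balance, annualRate);
        }
    }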

Another key performance issue that arises from blending the batch and OLTP domains is the design of the data access layer. Methods for accessing data in OLTP will be very different from those used in batch. For instance, data access in OLTP tends to be random; therefore pre-fetching records to be processed would not make much sense. In batch processing, however, data is typically accessed sequentially; an application will execute some business logic on all records in a database table or file, for example, so pre-fetching can improve overall performance. Another common technique employed by OLTP applications is the retrieval and construction of an entire object from a database row – all columns from that row are selected and used to construct an account record object, for instance. Since performance is so critical in batch processing, selecting only the columns of interest rather than all columns can significantly improve performance.

Context-based data access layers can address the issue of ensuring performance across multiple domains. This type of data access layer is based on the factory-impl pattern and, upon initialization, will load the persistence classes relevant to the context specified (batch or OLTP). This technique can solve the problem when batch and OLTP execute within their own virtual machines (but the business logic is shared via common libraries). A slightly modified version of this approach, namely a factory-factory-impl pattern, can solve the problem when both batch and OLTP run within the same virtual machine.
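A minimal sketch of such a context-based data access layer follows; all names are hypothetical, and the only point is that the factory hands out a persistence implementation appropriate to the context declared at initialization.

    import java.util.Collections;

    /**
     * Illustrative context-based data access layer (the factory-impl pattern
     * described above); all class and method names are hypothetical.
     */
    public class AccountDaoFactory {

        public enum Context { BATCH, OLTP }

        /** Data access contract shared by both execution paradigms. */
        public interface AccountDao {
            Iterable<String> fetchAccounts();
        }

        /** Batch-oriented implementation: sequential access, pre-fetching, narrow column selection. */
        static class BatchAccountDao implements AccountDao {
            public Iterable<String> fetchAccounts() {
                // e.g. a forward-only cursor with a large fetch size, selecting only needed columns
                return Collections.<String>emptyList();
            }
        }

        /** OLTP-oriented implementation: random, keyed access constructing whole account objects. */
        static class OnlineAccountDao implements AccountDao {
            public Iterable<String> fetchAccounts() {
                // e.g. keyed lookups that build complete account objects from entire rows
                return Collections.<String>emptyList();
            }
        }

        private final Context context;

        /** The context is fixed at initialization, for example from deployment configuration. */
        public AccountDaoFactory(Context context) {
            this.context = context;
        }

        public AccountDao createAccountDao() {
            return context == Context.BATCH ? new BatchAccountDao() : new OnlineAccountDao();
        }
    }

When batch and OLTP run within the same virtual machine, the factory-factory-impl variant mentioned above simply adds one more level, handing each caller a factory already bound to its own context.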

For the second question, it is difficult to state generally the percentage of services that could be reusable across both the OLTP and batch paradigms; it mostly depends on the overall application architecture, which is customer- and use-case-specific. Generally, however, cross-cutting functions such as logging, auditing, the collection of usage or performance metrics, and so on are easily shareable. Services that do not care about the source of the record to be processed tend to be the easiest to share across the two domains. For example, at one particular customer the function of calculating interest was reusable. The service was well-defined: it did not care how a record was obtained; its method signature required that a record be passed to the function, so the source of the record – whether a batch application, an enterprise Java bean (EJB), a web service, or something else – was irrelevant. Another customer was able to reuse a complex credit scoring service. The service exhibited characteristics similar to the interest calculator: it did not retrieve the record to be processed but rather expected the record to be passed to it.

The benefits of sharing business logic across both domains lie in a few areas: first, the reduction of development costs, since the business logic must only be written once; and second, the reduction of maintenance costs, since only one version of the service must be fixed; furthermore, the shared service should stabilize faster given multiple users and a greater variety of access patterns.
