high availability


Upload: srinimudaragadda

Post on 07-Oct-2015


High Availability

Challenge

An increasing number of customers find their Data Integration implementation must be available 24x7 without interruption or failure. This Best Practice describes the High Availability (HA) capabilities incorporated in PowerCenter and explains why it is critical to address both architectural (i.e., systems, hardware, firmware) and procedural (i.e., application design, code implementation, session/workflow features) recovery with HA.

Description

When considering HA recovery, be sure to explore the following two components of HA that exist on all enterprise systems:

External Resilience

External resilience has to do with the integration and specification of domain name servers, database servers, FTP servers, and network access servers in a defined, tested 24x7 configuration. The nature of Informatica's DI setup places it at many interface points in system integration. Before placing and configuring PowerCenter within an infrastructure that has an HA expectation, the following questions should be answered:

Is the pre-existing set of servers already in a sustained HA configuration? Is there a schematic with applicable settings to use for reference? If so, is there a unit test or system test to exercise before installing PowerCenter products? It is important to remember that the external systems must be HA before the PowerCenter architecture they support can be.

What are the bottlenecks or perceived failure points of the existing system? Are these bottlenecks likely to be exposed or heightened by placing PowerCenter in the infrastructure (e.g., five times the amount of Oracle traffic, ten times the amount of DB2 traffic, a UNIX server that always shows 10% idle may now have twice as many processes running)?

Finally, if a proprietary solution (such as IBM HACMP or Veritas Storage Foundation for Windows) has been implemented with success at a customer site, this sets a different expectation. The customer may merely want the grid capability of multiple PowerCenter nodes to splay/recover Informatica tasks, and expect their back-end system (such as those listed above) to provide file system or server bootstrap recovery upon a fundamental failure of those back-end systems. If these back-end systems have a script/command capability to, for example, restart a repository service, PowerCenter can be installed in this fashion. However, PowerCenter's HA capability extends only as far as the PowerCenter components.

Internal Resilience

In an HA PowerCenter environment, the key elements to keep in mind are:

Rapid and constant connectivity to the repository metadata.

Rapid and constant network connectivity between all gateway and worker nodes in the PowerCenter domain.

A common highly-available storage system accessible to all PowerCenter domain nodes with one service name and one file protocol. Only domain nodes on the same operating system can share gateway and log files (see Admin Console->Domain->Properties->Log and Gateway Configuration).

Internal resilience occurs within the PowerCenter environment among PowerCenter services, the PowerCenter Client tools, and other client applications such as pmrep and pmcmd. Internal resilience can be configured at the following levels:

Domain. Configure service connection resilience at the domain level in the general properties for the domain. The domain resilience timeout determines how long services attempt to connect as clients to application services or the Service Manager. The domain resilience properties are the default values for all services in the domain.

Service. You can configure service connection resilience in the advanced properties for an application service. Connection resilience values configured for an application service override the resilience values from the domain settings.

Gateway. The master gateway node maintains a connection to the domain configuration database. If the domain configuration database becomes unavailable, the master gateway node tries to reconnect. The resilience timeout period depends on user activity and whether the domain has one or multiple gateway nodes:

Single gateway node. If the domain has one gateway node, the gateway node tries to reconnect until a user or service tries to perform a domain operation. When a user tries to perform a domain operation, the master gateway node shuts down.

Multiple gateway nodes. If the domain has multiple gateway nodes and the master gateway node cannot reconnect, then the master gateway node shuts down. If a user tries to perform a domain operation while the master gateway node is trying to connect, the master gateway node shuts down. If another gateway node is available, the domain elects a new master gateway node. The domain tries to connect to the domain configuration database with each gateway node. If none of the gateway nodes can connect, the domain shuts down and all domain operations fail.
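
The gateway rules above can be restated as a small decision function. The following Python sketch is only a model of the behavior described in this section, not PowerCenter code: the GatewayNode class, the can_reach_config_db flag, and the next_master function are hypothetical names introduced for illustration, and the first entry in the list is assumed to be the current master gateway.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GatewayNode:
    name: str
    # Hypothetical flag standing in for a real connection test against
    # the domain configuration database.
    can_reach_config_db: bool


def next_master(gateways: List[GatewayNode],
                user_operation_pending: bool) -> Optional[str]:
    """Return the gateway acting as master, or None if the domain shuts down.

    gateways[0] is treated as the current master gateway.
    """
    master, others = gateways[0], gateways[1:]
    if master.can_reach_config_db:
        return master.name  # connection is healthy, nothing to recover
    if not others:
        # Single gateway: it keeps retrying until a user or service
        # attempts a domain operation, at which point it shuts down.
        return None if user_operation_pending else master.name
    # Multiple gateways: the master shuts down and the domain tries
    # each remaining gateway node in turn.
    for node in others:
        if node.can_reach_config_db:
            return node.name
    return None  # no gateway can connect: the domain shuts down
```

For example, next_master([GatewayNode("gw1", False), GatewayNode("gw2", True)], False) returns "gw2", which corresponds to the election of a new master gateway node.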

Process

Be aware that your implementation has a dependency on the installation environment. For example, you may want to combine multiple disparate ETL repositories onto a single upgraded PowerCenter platform. This has the following benefits:

Single point of access/administration from the Admin Console.

A group of repositories that now can become a repository domain.

A group of repositories that can be shaped into common processing/backup/schedule patterns for optimal performance and administration.

HA items of concern are now:

Single point of failure of one PowerCenter domain.

One repository, possibly heavy in processing or poorly designed, degrading that entire PowerCenter domain.

Common Elements of Concern in an HA Configuration

Restart and Failover

Restart and failover concern the Domain Services (Integration and Repository). If these services are not highly available, the scheduling, dependencies (e.g., touch files, FTP) and artifacts of your ETL cannot be highly available.

If a service process becomes unavailable, the Service Manager can restart the process or fail it over to a backup node based on the availability of the node. When a service process restarts or fails over, the service restores the state of operation and begins recovery from the point of interruption.

You can configure backup nodes for services if you have the high availability option. If you configure an application service to run on primary and backup nodes, one service process can run at a time. The following situations describe restart and failover for an application service:

If the primary node running the service process becomes unavailable, the service fails over to a backup node. The primary node may be unavailable if it shuts down or if the connection to the node becomes unavailable.

If the primary node running the service process is available, the domain tries to restart the process based on the restart options configured in the domain properties. If the process does not restart, the Service Manager can mark the process as failed. The service then fails over to a backup node and starts another process. If the Service Manager marks the process as failed, the administrator must enable the process after addressing any configuration problem.

If a service process fails over to a backup node, it does not fail back to the primary node when the node becomes available. You can disable the service process on the backup node to cause it to fail back to the primary node.
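
As a way to reason about these rules, the following Python sketch models one application service with a primary and a backup node. It is an illustration under stated assumptions rather than a description of how the Service Manager is implemented: the Service class and its method names are hypothetical, and the restart outcome is passed in as a flag instead of being derived from the restart options in the domain properties.

```python
from dataclasses import dataclass


@dataclass
class Service:
    primary: str
    backup: str
    running_on: str
    process_marked_failed: bool = False

    def process_unavailable(self, node_available: bool,
                            restart_succeeds: bool) -> str:
        """Handle loss of the process on the primary; return the node now running it."""
        if node_available and restart_succeeds:
            return self.running_on  # restarted in place on the same node
        if node_available:
            # A restart was attempted and did not succeed. The process is
            # marked failed and an administrator must enable it again later.
            self.process_marked_failed = True
        self.running_on = self.backup  # fail over to the backup node
        return self.running_on

    def primary_recovered(self) -> str:
        """The primary coming back does not move the service (no automatic failback)."""
        return self.running_on

    def disable_backup_process(self) -> str:
        """Disabling the process on the backup node sends the service back to the primary."""
        if self.running_on == self.backup:
            self.running_on = self.primary
        return self.running_on
```

The point the sketch makes explicit is that failback is a deliberate administrative action: primary_recovered leaves the service where it is, and only disable_backup_process moves it.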

Recovery

Recovery is the completion of operations after an interrupted service is restored. When a service recovers, it restores the state of operation and continues processing the job from the point of interruption.

The state of operation for a service contains information about the service process. The PowerCenter services include the following states of operation:

Service Manager. The Service Manager for each node in the domain maintains the state of service processes running on that node. If the master gateway shuts down, the newly elected master gateway collects the state information from each node to restore the state of the domain.

Repository Service. The Repository Service maintains the state of operation in the repository. This includes information about repository locks, requests in progress, and connected clients.

Integration Service. The Integration Service maintains the state of operation in the shared storage configured for the service. This includes information about scheduled, running, and completed tasks for the service. The Integration Service maintains session and workflow state of operation based on the recovery strategy you configure for the session and workflow.

When designing a system that has HA recovery as a core component, be sure to include architectural and procedural recovery.

Architectural recovery for a PowerCenter domain involves the three services described above (Service Manager, Repository Service, and Integration Service) restarting in a complete, sustainable, and traceable manner. If the Service Manager and Repository Service recover but the Integration Service cannot, the restart is not successful and has little value to a production environment.

Field experience with PowerCenter has yielded these key items in planning a proper recovery upon a systemic failure:

A PowerCenter domain cannot be established without at least one gateway node running. Even if you have established a domain with ten worker nodes and one gateway node, none of the worker nodes can run ETL jobs without a gateway node managing the domain.

An Integration Service cannot run without its associated Repository Service being started and connected to its metadata repository.

A Repository Service cannot run without its metadata repository DBMS being started and accepting database connections. Often database connections are established on periodic windows that expire, which puts the repository offline.

If the installed domain configuration is running from Authentication Module Configuration and the LDAP Principal User account becomes corrupt or inactive, all PowerCenter repository access is lost. If the installation uses any additional authentication outside PowerCenter (such as LDAP), an additional recovery and restart plan is required.
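
These dependencies form a chain that a recovery plan has to respect. The following Python sketch, which needs Python 3.9 or later for graphlib, encodes them as a dependency map and derives a start order and the set of components blocked by a given failure. The component names and the DEPENDS_ON map are illustrative assumptions drawn from the items above, not an inventory produced by PowerCenter; in particular, making the Repository Service depend on the gateway node is a modeling choice.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each component maps to the components that must be running before it.
# Hypothetical model based on the field-experience items above.
DEPENDS_ON: Dict[str, Set[str]] = {
    "gateway_node": set(),
    "worker_node": {"gateway_node"},
    "repository_dbms": set(),
    "repository_service": {"gateway_node", "repository_dbms"},
    "integration_service": {"repository_service"},
}


def startup_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every component starts after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def blocked_by(failed: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return every component that cannot run while `failed` is down."""
    blocked: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for component, deps in depends_on.items():
            if component == failed or component in blocked:
                continue
            if failed in deps or deps & blocked:
                blocked.add(component)
                changed = True
    return blocked
```

Calling blocked_by("repository_dbms", DEPENDS_ON) yields the Repository Service and the Integration Service, which mirrors the point that an expired database connection window takes more than the repository offline. An external authentication source such as LDAP could be added to the map in the same way.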

Procedural recovery is supported with many features of PowerCenter 8. Consider the following very simple mapping that might run in production for many ETL applications:

Suppose there is a situation where the FTP server sending this ff_customer file is inconsistent. Many times the file is not there, but the processes depending on this must always run. The process is always insert only. You do not want the succession of ETL that follows this small process to fail - they can run to customer_stg with current records only. This setting in the Workflow Manager, Session, Properties would fit your need:

Since it is not critical that the ff_customer records load each time, record the failure but continue the process.
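
The intent of "record the failure but continue" can be shown outside PowerCenter with a minimal Python sketch. It does not reproduce the Workflow Manager setting referred to above; the run_workflow function, the critical flag, and the task names are hypothetical and exist only to illustrate the behavior.

```python
from typing import Callable, List, Tuple

Task = Tuple[str, Callable[[], None], bool]  # (name, action, critical)


def run_workflow(tasks: List[Task]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run tasks in order; return (completed task names, recorded failures)."""
    completed: List[str] = []
    failures: List[Tuple[str, str]] = []
    for name, action, critical in tasks:
        try:
            action()
            completed.append(name)
        except Exception as exc:
            failures.append((name, str(exc)))  # always record the failure
            if critical:
                # A critical task stops the whole workflow.
                raise RuntimeError(f"workflow stopped at {name}") from exc
    return completed, failures


def load_ff_customer() -> None:
    # Simulates the flat file not having arrived from the FTP server.
    raise FileNotFoundError("ff_customer not delivered")


def load_customer_stg() -> None:
    pass  # stands in for the downstream load that must always run
```

Running run_workflow([("load_ff_customer", load_ff_customer, False), ("load_customer_stg", load_customer_stg, True)]) completes the staging load and returns the missing-file failure for later review instead of aborting.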

Now say the situation has changed. Sessions are failing on a PowerCenter server due to target database timeouts. A requirement is given that the session must recover from this:

Resuming from the last checkpoint restarts the process from its prior commit, so no ETL work is lost.
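
The idea behind resuming from a checkpoint can be illustrated with a short Python sketch. This is a generic commit-and-checkpoint loop, not the mechanism PowerCenter uses internally: the load_rows function, the JSON checkpoint file, and the fail_at parameter that simulates a target timeout are all assumptions made for the example. A real implementation would also need the commit and the checkpoint write to be atomic.

```python
import json
import os
import tempfile
from typing import List, Optional


def load_rows(rows: List[int], target: List[int], checkpoint_path: str,
              commit_interval: int = 2, fail_at: Optional[int] = None) -> int:
    """Append rows to target in batches, resuming after the last recorded commit."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            start = json.load(fh)["committed"]  # rows already committed
    for begin in range(start, len(rows), commit_interval):
        batch = rows[begin:begin + commit_interval]
        if fail_at is not None and begin <= fail_at < begin + len(batch):
            raise TimeoutError("simulated target database timeout")
        target.extend(batch)  # "commit" the batch to the target
        with open(checkpoint_path, "w") as fh:
            # Record progress only after the batch has been committed.
            json.dump({"committed": begin + len(batch)}, fh)
    return len(target)


if __name__ == "__main__":
    checkpoint = os.path.join(tempfile.mkdtemp(), "session.checkpoint")
    rows, target = list(range(7)), []
    try:
        load_rows(rows, target, checkpoint, fail_at=4)  # first run is interrupted
    except TimeoutError:
        pass
    load_rows(rows, target, checkpoint)  # second run resumes, no rows repeated
    print(target)
```

The second call starts at the recorded commit point, so rows committed before the interruption are neither lost nor loaded twice.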

To finish this second case, consider three basic items on the workflow side when HA is incorporated in your environment:

An Integration Service in an HA environment can only recover those workflows marked with Enable HA recovery. This should be considered for all critical workflows.

For a mature set of ETL code running in QA or Production, you may consider the following workflow property:

This would automatically recover tasks from where they failed in a workflow upon an application or system-wide failure. Consider carefully the use of this feature, however. Remember, automated restart of critical ETL processes without interaction can have vast unintended side effects. For instance, if a database alias or synonym was dropped, all ETL targets may now refer to different objects than the original intent. Only PowerCenter environments with HA, mature production support practices, and a complete operations manual per Velocity should expect complete recovery with this feature.

In an HA environment, certain components of the Domain can go offline while the Domain stays up to execute ETL jobs. This is a time to use the Suspend On Error feature from the General tab of Workflow settings. The backup Integration Server would then pick up this workflow and resume processing based on the resume settings of this workflow:

Features

A variety of HA features exist in PowerCenter 8. Specifically, they include:

Integration Service HA option

Integration Service Grid option

Repository Service HA option

First, proceed from an assumption that nodes have been provided to you such that a basic HA configuration of PowerCenter 8 can take place. A lab-tested version completed by Informatica is configured as below with an HP solution. Your solution can be completed with any reliable clustered file system. Your first step would always be implementing and thoroughly exercising a clustered file system:

Now, let's address the options in order:

Integration Service HA Option

You must have the HA option on the license key for this to be available on install. Note that once the base PowerCenter 8 install is configured, all nodes are available from the Admin Console->Domain->Integration Services->Grid/Node Assignments. From the above example, you would see Node 1, Node 2, Node 3 as dropdown options on that browse page. With the HA (Primary/Backup) install complete, Integration Services are then displayed with both P and B in a configuration, with the current operating node highlighted:

If a failure were to occur on this HA configuration, the Integration Service INT_SVCS_DEV would poll the Domain: Domain_Corp_RD for another Gateway Node, then assign INT_SVCS_DEV over to that Node, in this case Node_Corp_RD02. Then the B button would highlight, showing this Node as providing INT_SVCS_DEV.

A vital component of configuring the Integration Service for HA is making sure the Integration Service files are stored in a shared persistent environment. You must specify the paths for Integration Service files for each Integration Service process. Examples of Integration Service files include run-time files, state of operation files, and session log files.

Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

State of operation files must be accessible by all Integration Service processes. When you enable an Integration Service, it creates files to store the state of operations for the service. The state of operations includes information such as the active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption.

All Integration Service processes associated with an Integration Service must use the same shared location. However, each Integration Service can use a separate location.

By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. You can set the shared location for these directories by configuring the process variable $PMRootDir to point to the same location for each Integration Service process. The key HA concern here is that $PMRootDir should be on the highly-available clustered file system mentioned above.
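
A quick consistency check on the configured values can catch a mismatch before a failover exposes it. The following Python sketch is an assumption-laden illustration: the check_shared_root function and the node-to-path mapping are hypothetical, the mapping would have to be gathered by hand or by your own tooling, and the check only proves that the paths agree and exist on the machine where it runs, so it should be repeated on every node.

```python
import os
from typing import Dict, List


def check_shared_root(process_roots: Dict[str, str]) -> List[str]:
    """Return a list of problems found in the $PMRootDir values per service process.

    process_roots maps a node name to the $PMRootDir configured for the
    Integration Service process on that node (hypothetical input).
    """
    problems: List[str] = []
    resolved = {node: os.path.realpath(path)
                for node, path in process_roots.items()}
    distinct = sorted(set(resolved.values()))
    if len(distinct) > 1:
        problems.append(f"processes do not share one root directory: {distinct}")
    for node, path in sorted(resolved.items()):
        if not os.path.isdir(path):
            problems.append(f"{node}: {path} is not an existing directory")
    return problems
```

An empty result means every process points at the same existing directory; it does not confirm that the directory sits on a clustered file system, which still has to be verified at the storage level.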

Integration Service Grid Option

You must have the Server Grid option on the license key for this to be available on install. In configuring the $PMRootDir files for the Integration Service, retain the methodology described above. Also, in Admin Console->Domain->Properties->Log and Gateway Configuration, the log and directory paths should be on the clustered file system mentioned above. A grid must be created before it can be used in a PowerCenter 8 domain. It is essential to remember that a grid can only be created from machines running the same operating system.

Be sure to remember these key points:

PowerCenter supports nodes from heterogeneous operating systems, bit modes, and others to be used within the same domain. However, if there are heterogeneous nodes for a grid, then you can only run a workflow on the Grid, not a session.

A session on grid does not support heterogeneous operating systems. This is because a session may have a sharing cache file and other objects that may not be compatible with all of the operating systems. For session on a grid, you need a homogeneous grid.

In short, scenarios such as a production failure are the worst possible time to find out that a multi-OS grid does not meet your needs.

If you have a large volume of disparate hardware, it is certainly possible to make perhaps two grids centered on two different operating systems. In either case, the performance of your clustered file system is going to affect the performance of your server grid, and should be considered as part of your performance/maintenance strategy.
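
The homogeneity rule for session on grid, and the suggestion to build one grid per operating system, can be checked with a few lines of Python before a grid is defined. The function names and the node inventory below are hypothetical; the operating system labels would come from your own records of the hardware.

```python
from collections import defaultdict
from typing import Dict, List


def session_on_grid_supported(node_os: Dict[str, str]) -> bool:
    """A session can run on a grid only if every node has the same operating system."""
    if not node_os:
        raise ValueError("a grid needs at least one node")
    return len(set(node_os.values())) == 1


def split_into_homogeneous_grids(node_os: Dict[str, str]) -> Dict[str, List[str]]:
    """Group nodes by operating system so each group can form its own grid."""
    grids: Dict[str, List[str]] = defaultdict(list)
    for node, os_name in sorted(node_os.items()):
        grids[os_name].append(node)
    return dict(grids)


# Hypothetical inventory of disparate hardware.
inventory = {"node1": "AIX", "node2": "AIX", "node3": "Linux", "node4": "Linux"}
```

With this inventory, session_on_grid_supported(inventory) is False, while split_into_homogeneous_grids(inventory) returns two groups that each satisfy the rule.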

Repository Service HA Option

You must have the HA option on the license key for this to be available on install. There are two ways to include the Repository Service HA capability when configuring PowerCenter 8:

The first is during install. When the Install Program prompts for your nodes to do a Repository install (after answering Yes to Create Repository), you can enter a second node where the Install Program can create and invoke the PowerCenter service and Repository Service for a backup repository node. Keep in mind that all of the database, OS, and server preparation steps referred to in the PowerCenter Installation and Configuration Guide still hold true for this backup node. When the install is complete, the Repository Service displays a P/B link similar to that illustrated above for the INT_SVCS_DEV example Integration Service.

A second method for configuring Repository Service HA allows for measured, incremental implementation of HA from a tested base configuration. After ensuring that your initial Repository Service settings (e.g., resilience timeout, codepage, connection timeout) and the DBMS repository containing the metadata are running and stable, you can add a second node and make it the Repository Backup. Install the PowerCenter Service on this second server following the PowerCenter Installation and Configuration Guide. In particular, skip creating Repository Content or an Integration Service on the node. Following this, go to Admin Console->Domain and select:

Create->Node. The server to contain this node should be of the exact same configuration/clustered file system/OS as the Primary Repository Service.

The following dialog should appear:

Assign a logical name to the node to describe its place, and select Create. The node should now be running as part of your domain, but if it isn't, refer to the PowerCenter Command Line Reference with the infaservice and infacmd commands to ensure the node is running on the domain. When it is running, go to Domain->Repository->Properties->Node Assignments->Edit and the browser window displays:

Click OK and the Repository Service is now configured in a Primary/Backup setup for the domain. To ensure the P/B setting, test the following elements of the configuration:

1. Be certain the same version of the DBMS client is installed on the server and can access the metadata.

2. Both nodes must be on the same clustered file system.

3. Log onto the OS for the Backup Repository Service and ping the Domain Master Gateway Node. Be sure a reasonable response time is being given at an OS level (i.e., less than 5 seconds).

4. Take the Primary Repository Service Node offline and validate that the polling, failover, restart process takes place in a methodical, traceable manner for the Repository Service on the Domain. This should be clearly visible from the node logs on the Primary and Secondary Repository Service boxes [$INFA_HOME/server/tomcat/logs] or from Admin Console->Repository->Logs.
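
The response-time check in step 3 can be scripted so that it is repeatable. The Python sketch below swaps the OS-level ping for a timed TCP connection, because a TCP connect needs no special privileges and also confirms that a port is listening. The connect_time function is a hypothetical helper, and the host name and port you pass in are assumptions that must be replaced with the values for your own master gateway node.

```python
import socket
import time
from typing import Optional


def connect_time(host: str, port: int, timeout: float = 5.0) -> Optional[float]:
    """Return the seconds taken to open a TCP connection, or None on failure.

    A result of None means the connection was refused, unreachable, or did
    not complete within `timeout` seconds (5 by default, matching step 3).
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Run it from the backup Repository Service node against the master gateway node, and treat a None result or a value close to the timeout as a reason to investigate the network before relying on failover.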

Note: Remember that when a node is taken offline, you cannot access Admin Console from it.