Oracle RAC Troubleshooting


Log Files for Troubleshooting Oracle RAC Issues

The cluster has a number of log files that can be examined to gain insight into problems as they occur. A good place to start diagnosing cluster problems is $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log.

All clusterware log files are stored under $ORA_CRS_HOME/log/ directory.

1. alert<hostname>.log : Important clusterware alerts are stored in this log file, located at $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log.

2. crsd.log : CRS logs are stored in $ORA_CRS_HOME/log/<hostname>/crsd/ directory. The crsd.log file is archived every 10MB as crsd.101, crsd.102 ...

3. cssd.log : CSS logs are stored in $ORA_CRS_HOME/log/<hostname>/cssd/ directory. The cssd.log file is archived every 20MB as cssd.101, cssd.102....

4. evmd.log : EVM logs are stored in $ORA_CRS_HOME/log/<hostname>/evmd/ directory.

5. OCR logs : OCR logs (ocrdump, ocrconfig, ocrcheck) log files are stored in $ORA_CRS_HOME/log/<hostname>/client/ directory.

6. SRVCTL logs: srvctl logs are stored in two locations, $ORA_CRS_HOME/log/<hostname>/client/ and in $ORACLE_HOME/log/<hostname>/client/ directories.

7. RACG logs : The high availability trace files are stored in two locations, $ORA_CRS_HOME/log/<hostname>/racg/ and $ORACLE_HOME/log/<hostname>/racg/.

RACG contains log files for node applications such as VIP, ONS, etc.

ONS log filename = ora.<hostname>.ons.log
VIP log filename = ora.<hostname>.vip.log

Each RACG executable has a sub-directory assigned exclusively for that executable.

racgeut : $ORA_CRS_HOME/log/<hostname>/racg/racgeut/
racgevtf : $ORA_CRS_HOME/log/<hostname>/racg/racgevtf/
racgmain : $ORA_CRS_HOME/log/<hostname>/racg/racgmain/

racgeut : $ORACLE_HOME/log/<hostname>/racg/racgeut/
racgmain : $ORACLE_HOME/log/<hostname>/racg/racgmain/
racgmdb : $ORACLE_HOME/log/<hostname>/racg/racgmdb/
racgimon : $ORACLE_HOME/log/<hostname>/racg/racgimon/

As in a normal Oracle single instance environment, a RAC environment contains the standard RDBMS log files:

These locations are controlled by the following initialization parameters:

background_dump_dest contains the alert log and background process trace files.
user_dump_dest contains any trace file generated by a user process.
core_dump_dest contains core files that are generated due to a core dump in a user process.
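For example, you can confirm where these destinations point on a given instance from SQL*Plus (the locations vary by environment):

% sqlplus / as sysdba
SQL> show parameter background_dump_dest
SQL> show parameter user_dump_dest
SQL> show parameter core_dump_dest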

RAC - Issues & Troubleshooting


Whenever a node has issues rejoining the cluster after a reboot, here is a quick checklist I would suggest (a sample run of a few of these checks follows the list):

/var/log/messages

ifconfig

ip route

/etc/hosts

/etc/sysconfig/network-scripts/ifcfg-eth* 

ethtool

mii-tool

cluvfy

$ORA_CRS_HOME/log 
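As a rough sketch, a few of these checks might look like the following (eth1 is assumed to be the interconnect interface; adjust names for your environment):

# tail -100 /var/log/messages                              (recent OS-level errors around the reboot)
# ethtool eth1                                             (link status and speed of the assumed interconnect NIC)
% $ORA_CRS_HOME/bin/cluvfy comp nodecon -n all -verbose    (node connectivity check)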

Let us now take a closer look at specific issues, with examples and the steps taken for their resolution.

These were all tested against an Oracle 10.2.0.4 database on RHEL4 U8 x86-64.

1. srvctl unable to start the Oracle instance, but SQL*Plus can start it

a. Check racg log for actual error message.

% more $ORACLE_HOME/log/`hostname -s`/racg/ora.{DBNAME}.{INSTANCENAME}.inst.log

b. Check if srvctl is configured to use the correct parameter file (pfile/spfile)

% srvctl config database -d {DBNAME} -a

You can also validate the parameter file by starting the instance with SQL*Plus to see the exact error message.
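A minimal sketch of that validation, assuming instance name TESTDB1 and a pfile path taken from the srvctl config output (both placeholders):

% export ORACLE_SID=TESTDB1
% sqlplus / as sysdba
SQL> startup pfile='/oracle/product/10.2/dbs/inittestdb1.ora'

If the parameter file is bad, the exact error is reported here rather than being masked by srvctl.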


c. Check ownership for $ORACLE_HOME/log

If this is owned by root, srvctl won't be able to start instance as oracle user.

# chown -R oracle:dba $ORACLE_HOME/log

2. VIP has failed over to another node but is not coming back to the original node

Fix: On the node to which the VIP has failed over, bring the VIP down manually as root.

Example: ifconfig eth0:2 down

PS: Be careful to bring down only the VIP. A small typo may bring down your public interface. :)
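A hedged sketch of the sequence; the interface alias and node name below are placeholders, so confirm which alias actually holds the VIP address first:

# ifconfig -a | grep -A 1 "eth0:"          (identify which eth0:<n> alias carries the failed-over VIP)
# ifconfig eth0:2 down                     (bring down only that VIP alias)

If the VIP does not relocate back on its own, one option is to restart nodeapps on the original node:

% srvctl start nodeapps -n test-server1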

3. Moving OCR to a different location

PS: This can be done as root while CRS is up.

While trying to move the OCR or the OCR mirror to a new location, ocrconfig complains.

The fix is to touch the new file.

Example:

# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile

PROT-21: Invalid parameter

# touch /crs_new/cludata/ocrfile

# chown root:dba /crs_new/cludata/ocrfile

# ocrconfig -replace ocrmirror /crs_new/cludata/ocrfile

Verify:


a. Validate using "ocrcheck". Device/File Name should point to the new one with integrity check succeeded.

b. Ensure OCR inventory is updated correctly

# cat /etc/oracle/ocr.loc

ocrconfig_loc and ocrmirrorconfig_loc should point to correct locations.
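For illustration, a healthy ocr.loc after the move might look like this (the primary OCR path shown here is a placeholder):

ocrconfig_loc=/crs/cludata/ocrfile
ocrmirrorconfig_loc=/crs_new/cludata/ocrfile
local_only=FALSE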

4. Moving Voting Disk to a different location

PS: CRS must be down while moving the voting disk.

The idea is to add new voting disks and delete the older ones.

Below are sample errors and their fixes.

# crsctl add css votedisk /crs_new/cludata/cssfile_new

Cluster is not in a ready state for online disk addition

We need to use the force option. However, before using the force option, ensure CRS is down.

If CRS is up, DO NOT use the force option, else it may corrupt your OCR.

# crsctl add css votedisk /crs_new/cludata/cssfile_new -force

Now formatting voting disk: /crs_new/cludata/cssfile_new

successful addition of votedisk /crs_new/cludata/cssfile_new.

Verify using "crsctl query css votedisk" and then delete the old votedisks.

While deleting too, you'll need to use the force option, as shown below.
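A sketch of that cleanup, with the old voting disk path as a placeholder (again, only while CRS is down):

# crsctl query css votedisk                                      (list current voting disks)
# crsctl delete css votedisk /crs_old/cludata/cssfile -force     (remove the old voting disk)
# crsctl query css votedisk                                      (confirm only the intended disks remain)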


Also verify the permissions of the voting disk files; they should be oracle:dba.

If the voting disks were added as root, the ownership should be changed to oracle:dba.

5. Manually registering listener resource to OCR

Listener was registered manually with OCR but srvctl was unable to bring up the listener

Let us first see example of how to manually do this.

From an existing available node, print the listener resource

% crs_stat -p ora.test-server2.LISTENER_TEST-SERVER2.lsnr > /tmp/res

% cat /tmp/res

NAME=ora.test-server2.LISTENER_TEST-SERVER2.lsnr

TYPE=application

ACTION_SCRIPT=/orahome/ora10g/product/10.2.0/db_1/bin/racgwrap

ACTIVE_PLACEMENT=0

AUTO_START=1

CHECK_INTERVAL=600

DESCRIPTION=CRS application for listener on node

FAILOVER_DELAY=0

FAILURE_INTERVAL=0

FAILURE_THRESHOLD=0

HOSTING_MEMBERS=test-server2

OPTIONAL_RESOURCES=

PLACEMENT=restricted

REQUIRED_RESOURCES=ora.test-server2.vip

RESTART_ATTEMPTS=5


SCRIPT_TIMEOUT=600

START_TIMEOUT=0

STOP_TIMEOUT=0

UPTIME_THRESHOLD=7d

USR_ORA_ALERT_NAME=

USR_ORA_CHECK_TIMEOUT=0

USR_ORA_CONNECT_STR=/ as sysdba

USR_ORA_DEBUG=0

USR_ORA_DISCONNECT=false

USR_ORA_FLAGS=

USR_ORA_IF=

USR_ORA_INST_NOT_SHUTDOWN=

USR_ORA_LANG=

USR_ORA_NETMASK=

USR_ORA_OPEN_MODE=

USR_ORA_OPI=false

USR_ORA_PFILE=

USR_ORA_PRECONNECT=none

USR_ORA_SRV=

USR_ORA_START_TIMEOUT=0

USR_ORA_STOP_MODE=immediate

USR_ORA_STOP_TIMEOUT=0

USR_ORA_VIP=


Modify relevant parameters in the resource file to point to correct instance.

Rename as resourcename.cap

% mv /tmp/res /tmp/ora.test-server1.LISTENER_TEST-SERVER1.lsnr.cap

Register with OCR

% crs_register ora.test-server1.LISTENER_TEST-SERVER1.lsnr -dir /tmp/

Start listener

% srvctl start listener -n test-server1

While trying to start listener, srvctl is throwing errors like "Unable to read from listener log file"

The listener log file exists.

If the resource is registered as root, then srvctl won't be able to start it as the oracle user.

So all the aforementioned operations for registering the listener manually should be done as the oracle user.

6. Services

While checking the status of a service, it says "not running".

If we try to start it using srvctl, the error message is "No such service exists" or "already running".

If we try to add a service with the same name, it says "already exists".

This happens because the service is in an "Unknown" state in the OCR.

Using crs_stat, check whether any related resources for the service (resource names ending with .srv and .cs) are still lying around.

srvctl remove service -f has been tried and the issue persists.


Here is the fix:

# crs_stop -f {resourcename}

# crs_unregister {resourcename}

Now service can be added and started correctly.
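Once the stale resources are gone, re-adding and starting the service might look like this (service and instance names are placeholders):

% srvctl add service -d testdb -s oltp_svc -r testdb1 -a testdb2
% srvctl start service -d testdb -s oltp_svc
% srvctl status service -d testdb -s oltp_svc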

7. Post host reboot, CRS is not starting

After host reboot, CRS was not coming up. No CRS logs in $ORA_CRS_HOME

Check /var/log/messages

"Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.9559"

No logs seen in /tmp/crsctl.*

Run cluvfy to identify the issue

$ORA_CRS_HOME/bin/cluvfy stage -post crsinst -n {nodename}

In this case, /tmp was not writable; /etc/fstab was incorrect and was fixed to make /tmp available.

If you see messages like "Shutdown CacheLocal. my hash ids don't match" in the CRS log, then

check if /etc/oracle/ocr.loc is same across all nodes of the cluster.
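A quick way to compare the file across nodes, for example (node names are placeholders):

# for h in test-server1 test-server2; do echo "== $h =="; ssh $h cat /etc/oracle/ocr.loc; done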

8. CRS binary restored by copying from existing node in the cluster

CRS does not start, with the following messages in /var/log/messages:

"Id "h1" respawning too fast: disabled for 5 minutes"


CRSD log showing "no listener"

If CRS binary is restored by copying from existing node in the cluster, then you need to ensure:

a. Hostnames are modified correctly in $ORA_CRS_HOME/log

b. You may need to cleanup socket files from /var/tmp/.oracle

PS: Exercise caution while working with the socket files. If CRS is up, you should never touch those files, otherwise a reboot may be inevitable.
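With CRS confirmed down on the node, the cleanup might be as simple as the following sketch:

# crsctl check crs                     (make sure CRS is NOT running on this node first)
# ls -l /var/tmp/.oracle               (review the stale socket files)
# rm -f /var/tmp/.oracle/*             (remove them only while CRS is down on this node)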

9. Node rebooting frequently, initiated by oprocd

Check /etc/oracle/oprocd/ and grep for "Rebooting".

Check /var/log/messages and grep for "restart"
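For example (log file names under /etc/oracle/oprocd/ vary by release, so the glob below is an assumption):

# grep -i "Rebooting" /etc/oracle/oprocd/*.log*
# grep -i "restart" /var/log/messages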

If the timestamps match, this confirms the reboots are being initiated by the oprocd process.

% ps -ef | grep oprocd

root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 500 -f

-t 1000 means oprocd would wake up every 1000ms

-m 500 means allow up to a 500 ms margin of error

Basically, with these options, if oprocd wakes up after more than 1.5 seconds it is going to force a reboot.

This is conceptually analogous to what the hangcheck-timer module used to do in pre-10.2.0.4 Oracle releases on Linux.


The fix is to set CSS diagwait to 13.

# crsctl set css diagwait 13 -force

# /oracle/product/crs/bin/crsctl get css diagwait

13

This actually changes what parameters oprocd runs with

% ps -ef | grep oprocd

root 10409 9937 0 Feb27 ? 00:00:00 /oracle/product/crs/bin/oprocd.bin run -t 1000 -m 10000 -hsi 5:10:50:75:90 -f

Note that the margin has now changed to 10000 ms, i.e. 10 seconds, in place of the default 500 ms (0.5 seconds).

PS: Setting diagwait requires a full shutdown of Oracle Clusterware on ALL nodes.
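A hedged outline of the full procedure:

# crsctl stop crs                       (run on every node of the cluster)
# crsctl set css diagwait 13 -force     (run once, on one node)
# crsctl start crs                      (run on every node of the cluster)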

10. Cluster hung. All SQL queries on GV$ views are hanging.

Alert logs from all instances have messages like the following:

INST1: IPC Send timeout detected. Receiver ospid 1650

INST2: IPC Send timeout detected. Sender: ospid 24692

Receiver: inst 1 binc 150 ospid 1650


INST3: IPC Send timeout detected. Sender: ospid 12955

Receiver: inst 1 binc 150 ospid 1650

The ospids on all instances belong to LCK0, the Lock Process.
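You can confirm the mapping on the problem node, for example:

% ps -fp 1650          (on node 1; this should show an ora_lck0_<SID> process)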

In case of inter-instance lock issues, it's important to identify the instance from where it's initiating.

As seen from above, INST1 is the one that needs to be fixed.

Identify the process that is causing the row cache lock and kill it; otherwise, reboot node 1.

11. Inconsistent OCR with invalid permissions

% srvctl add db -d testdb -o /oracle/product/10.2

PRKR-1005 : adding of cluster database testdb configuration failed, PROC-5: User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]

crs_stat doesn't have any trace of it so utilities like crs_setperm/crs_unregister/crs_stop won't work in this case.

ocrdump shows:

[DATABASE.LOG.testdb]

UNDEF :

SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}

[DATABASE.LOG.testdb.INSTANCE]

UNDEF :


SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME : root, GROUP_NAME : root}

These keys are owned by root, and that's the problem.

This means that the resource was perhaps added into OCR using root.

Though it has been removed by root, it now cannot be added by the oracle user unless we get rid of the aforementioned keys.

Shut down the entire cluster and restore from a previous good backup of the OCR using:

ocrconfig -restore backupfilename

You can get a list of backups using:

ocrconfig -showbackup

If you are not sure of the last good backup, you can also do the following:

Take an export backup of the OCR using:

ocrconfig -export /tmp/export -s online

Edit /tmp/export and remove the two entries pointing to DATABASE.LOG.testdb and DATABASE.LOG.testdb.INSTANCE owned by root.

Import it back now

ocrconfig -import /tmp/export

After starting the cluster, verify using ocrdump.


The OCRDUMPFILE should not have any trace of those leftover log entries owned by root.

Troubleshooting Oracle Clusters and Oracle RAC

Oracle Clusterware has its moments. There are times when it does not want to start for various reasons. In this section we talk about diagnosing the health of the cluster, collecting diagnostic information, and trying to correct problems with Oracle Clusterware.  

Checking the Health of the Cluster

Once you have configured a new cluster you will probably want to check its health. Additionally, you might want to check the health of the cluster after you have added or removed a node, or if something about the cluster is causing you to suspect a problem. Follow these steps to give your cluster a full health check:

1. Use the crsctl check has command to check if OHASD is running on the local node and that it’s healthy:

[oracle@rac1 admin]$ crsctl check has
CRS-4638: Oracle High Availability Services is online

2. Use the crsctl check crs command to check OHASD, CRSD, ocssd and EVM daemons.

[oracle@rac1 admin]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

3. Use the crsctl check cluster -all command to check all daemons on all nodes of the cluster.

[oracle@rac1 admin]$ crsctl check cluster -all
********************************************************
rac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

4. Check the cluster logs for any error messages that might have been logged.

If these steps do not indicate a problem and you feel there is a problem with the cluster, you can do the following:

1. Stop the cluster (crsctl stop cluster).

2. Start the cluster (crsctl start cluster), monitoring the startup messages to see if any errors occur (a sample sequence follows this list).

3. If errors do occur or you still feel there is a problem, check the cluster logs for error messages.
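A sample sequence (the -all option acts on every node; drop it to act on the local node only):

# crsctl stop cluster -all
# crsctl start cluster -all
# crsctl check cluster -all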

 


 

Log Files, Collecting Diagnostic Information and Trouble Resolution

Because Oracle Clusterware 11g Release 2 consists of a number of different components, it follows that there are a number of different log files associated with these processes. In this section we will first document the log files associated with the various Clusterware processes. We will then discuss a method of collecting the data in these logs into a single source that you can reference when doing trouble diagnosis.

Oracle Clusterware 11g Release 2 generates a number of different log files that can be used to troubleshoot Clusterware problems. Oracle Clusterware 11g adds a new environment variable called GRID_HOME to reference the base of the Oracle Clusterware software home. The Clusterware log files are typically stored under the GRID_HOME directory in a sub-directory called log. Under that directory is another directory with the host name, and then a directory that indicates the Clusterware component that the specific logs are associated with. For example, GRID_HOME/log/myrac1/crsd stores the log files associated with CRSD for the host myrac1. The following table lists the log file directories and the contents of those directories:

Directory Path Contents

GRID_HOME/log/<host>/alert<host>.log Clusterware alert log

GRID_HOME/log/<host>/diskmon Disk Monitor Daemon

GRID_HOME/log/<host>/client OCRDUMP, OCRCHECK, OCRCONFIG, CRSCTL

GRID_HOME/log/<host>/ctssd Cluster Time Synchronization Service

GRID_HOME/log/<host>/gipcd Grid Interprocess Communication Daemon

GRID_HOME/log/<host>/ohasd Oracle High Availability Services Daemon

GRID_HOME/log/<host>/crsd Cluster Ready Services Daemon

GRID_HOME/log/<host>/gpnpd Grid Plug and Play Daemon

GRID_HOME/log/<host>/mdnsd Multicast Domain Name Service Daemon

GRID_HOME/log/<host>/evmd Event Manager Daemon

GRID_HOME/log/<host>/racg/racgmain RAC RACG

GRID_HOME/log/<host>/racg/racgeut RAC RACG

GRID_HOME/log/<host>/racg/racgevtf RAC RACG

GRID_HOME/log/<host>/racg RAC RACG (only used if pre-11.1 database is installed)

GRID_HOME/log/<host>/cssd Cluster Synchronization Service Daemon

GRID_HOME/log/<host>/srvm Server Manager

GRID_HOME/log/<host>/agent/ohasd/oraagent_oracle11 HA Service Daemon Agent

GRID_HOME/log/<host>/agent/ohasd/oracssdagent_root HA Service Daemon CSS Agent

GRID_HOME/log/<host>/agent/ohasd/oracssdmonitor_root HA Service Daemon ocssdMonitor Agent

GRID_HOME/log/<host>/agent/ohasd/orarootagent_root HA Service Daemon Oracle Root Agent

GRID_HOME/log/<host>/agent/crsd/oraagent_oracle11 CRS Daemon Oracle Agent

GRID_HOME/log/<host>/agent/crsd/orarootagent_root CRS Daemon Oracle Root Agent

GRID_HOME/log/<host>/agent/crsd/ora_oc4j_type_oracle11g CRS Daemon Oracle OC4J Agent

GRID_HOME/log/<host>/gnsd Grid Naming Service Daemon


Oracle Clusterware will rotate logs over time. This is known as a rollover of the log. Rollover log files typically have the same name as the log file but with a version number attached to the end. This helps to maintain control of space utilization in the GRID_HOME directory. Each log file type has its own rotation time frame. An example of log file rolling can be seen in this listing of the GRID_HOME/log/rac1/agent/crsd/oraagent_oracle directory. In the listing, note that there is the current oraagent_oracle log file with an extension of .log, and then there are additional oraagent_oracle log files with extensions from l01 to l10. These latter files are the backup log files, of which 10 are maintained.


[oracle@rac1 oraagent_oracle]$ pwd
/ora01/app/11.2.0/grid/log/rac1/agent/crsd/oraagent_oracle
[oracle@rac1 oraagent_oracle]$ ls -al
total 109320
drwxr-xr-t 2 oracle oinstall     4096 Jun 10 20:02 .
drwxrwxrwt 5 root   oinstall     4096 Jun  8 10:29 ..
-rw-r--r-- 1 oracle oinstall 10565073 Jun 10 20:02 oraagent_oracle.l01
-rw-r--r-- 1 oracle oinstall 10583355 Jun 10 13:35 oraagent_oracle.l02
-rw-r--r-- 1 oracle oinstall 10583346 Jun 10 07:13 oraagent_oracle.l03
-rw-r--r-- 1 oracle oinstall 10583397 Jun 10 00:51 oraagent_oracle.l04
-rw-r--r-- 1 oracle oinstall 10583902 Jun  9 18:29 oraagent_oracle.l05
-rw-r--r-- 1 oracle oinstall 10584515 Jun  9 13:17 oraagent_oracle.l06
-rw-r--r-- 1 oracle oinstall 10584397 Jun  9 09:26 oraagent_oracle.l07
-rw-r--r-- 1 oracle oinstall 10584344 Jun  9 05:37 oraagent_oracle.l08
-rw-r--r-- 1 oracle oinstall 10584126 Jun  9 01:50 oraagent_oracle.l09
-rw-r--r-- 1 oracle oinstall 10539847 Jun  8 21:09 oraagent_oracle.l10
-rw-r--r-- 1 oracle oinstall  5955542 Jun 10 23:38 oraagent_oracle.log
-rw-r--r-- 1 oracle oinstall        0 Jun  8 10:29 oraagent_oracleOUT.log
-rw-r--r-- 1 oracle oinstall        6 Jun 10 21:20 oraagent_oracle.pid

Collecting Clusterware Diagnostic Data

Oracle provides utilities that make it easier to determine the status of the cluster and to collect the Clusterware log files for problem diagnosis. In this section we will review the diagcollection.pl script, which is used to collect log file information. We will then look at the Cluster Verification Utility (CVU).

Using diagcollection.pl

Clearly Oracle Clusterware has a number of log files. Often when troubleshooting problems you will want to review several of them, which can involve traversing directories and be tedious at best. Additionally, Oracle Support might well ask that you collect all the Clusterware log files so they can diagnose the problem that you are having. The diagcollection.pl script comes with the following options:

--collect : Collects diagnostic information.
--clean : Cleans the directory of files created by previous runs of diagcollection.pl.

Options to collect only specific information: these include --crs, --core, or --all; --all is the default setting.

To make collection of the Clusterware log data easier, Oracle provides a program called diagcollection.pl, which is contained in $GRID_HOME/bin. This script will collect Clusterware log files and other helpful diagnostic information. The script has a --collect option that you invoke to collect the diagnostic information. When invoked, the script creates four files in the local directory. These four files are gzipped tarballs and are listed in the following table (a sample invocation appears after the table):

File Name Contains

Coredata*tar.gz Core files and related analysis files.

crsData*tar*gz Contains log files from GRID_HOME/log/<host> directory structure.

ocrData*tar*gz Contains the results of an execution of ocrdump and ocrcheck . Current OCR backups are also listed.

osData*tar*gz Contains /var/log/messages and other related files.
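A minimal invocation sketch, assuming it is run as root from a scratch directory with enough free space:

# mkdir /tmp/diag && cd /tmp/diag
# $GRID_HOME/bin/diagcollection.pl --collect

The four gzipped tarballs listed above are written to the current directory.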


Oracle Cluster Verification Utility (CVU)

The CVU is used to verify that there are no configuration issues with the cluster. CVU is located in the GRID_HOME/bin directory and also in $ORACLE_HOME/bin. CVU supports Oracle Clusterware versions from 10g Release 1 onwards. You can also run CVU from the Oracle 11g Release 2 install media; in this case, call the program runcluvfy.sh, which calls CVU. Prior to Oracle Clusterware 11g Release 2 you would need to download CVU from OTN; in Oracle Clusterware 11g Release 2 it is installed as part of Oracle Clusterware. CVU can be run in various situations, including:

During various phases of the initial cluster install, to confirm that key components (such as SSH) are in place and operational. For example, OUI makes calls to CVU during the creation of the cluster to ensure that prerequisites were met.

After you have completed the initial creation of the cluster

After you add or remove a node from the cluster

If you suspect there is a problem with the cluster.

CVU diagnoses and verifies specific components. Components are groupings based on functionality; examples include free space, cluster integrity, OCR integrity, clock synchronization, and so on. You can use CVU to check one or all components of the cluster. In some cases, when problems are detected, CVU can create fixup scripts designed to correct them.
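For example, a couple of the component checks mentioned above might be run like this (as the Grid Infrastructure owner):

% cluvfy comp ocr -n all -verbose              (OCR integrity across all nodes)
% cluvfy comp clocksync -n all -verbose        (clock synchronization check)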

Checking the Oracle Cluster Registry

Node evictions or other problems can be caused by corruption in the OCR. The ocrcheck program provides a way to check the integrity of the OCR: it performs checksum operations on the blocks within the OCR to ensure they are not corrupt. Here is an example of running the ocrcheck program:

[oracle@rac1 admin]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       2580
         Available space (kbytes) :     259540
         ID                       :  749518627
         Device/File Name         :      +DATA
         Device/File integrity check succeeded
         Cluster registry integrity check succeeded
         Logical corruption check bypassed due to non-privileged user

Oracle Clusterware Trouble Resolution

When dealing with difficult Clusterware issues that befuddle you, there are some basic first steps to perform. These steps are:

1. Check and double-check that your RAC database backups are current. If they are not and at least one node survives, back up your database. If you have a good backup, backing up the archived redo logs is also a very good idea. The bottom line is that you have an unstable environment; make sure you have protected your data should the whole thing go belly up.

2. Open an SR with Oracle Support.

3. After opening the SR, search My Oracle Support (MOS, formerly MetaLink) for the problem you are experiencing.

4. If you find nothing on MOS, do a Google search for the problem you are experiencing.

5. Using the diagcollection.pl script, collect the Clusterware logs. Review the logs for error messages that might give you some insight into the problem at hand.


The truth is that Oracle Clusterware is a very complex beast. For the DBA who does not deal with solving Clusterware problems on a day-in, day-out basis, determining the nature and resolution of a problem can be an overwhelming challenge. In your attempts to solve the problem, you can cause additional problems and damage to the cluster. It is far better, if you do not know the solution, to let Oracle Support work with you on a solution. That's what you pay them for.

Note that the number one step is a backup of any RAC databases on the cluster. Keep in mind that one possible problem your cluster could be starting to experience is issues with the storage infrastructure. Consider this carefully when performing a backup on an unstable cluster. It may be that you will want to try to back up to some other storage medium (NAS, for example) that uses a different hardware path (for example, one that does not use your HBAs) if possible. If the cluster is starting to have issues, there is a lot that can go wrong and a lot of damage that can occur (this is true of a non-clustered database too).

Dealing With Node Evictions

Node evictions can be hard to diagnose. There are many possible causes of node evictions, some of which might be obvious and some of which might not be. In this section we address dealing with node evictions. First we ask what can cause a node eviction; we then discuss finding out what actually caused it.

What Can Cause an Eviction?

A common problem that DBAs have to face with Clusterware is node evictions, which usually lead to a reboot of the node that was evicted. With Oracle Clusterware 11g Release 2 there are two main processes that can cause node evictions:

Oracle Clusterware Kill Daemon (Oclskd) – Used by CSS to reboot a node when the reboot is requested by one or more other nodes.

CSSDMONITOR – OCSSD daemon is monitored by cssdmonitor. If a hang is indicated (say the ocssd daemon is lost) then the node will be rebooted.

Previous to Oracle Clusterware 11g Release 2 the hangcheck-timer module was configured and could also be a cause of nodes rebooting. As of Oracle Clusterware 11g Release 2 this module is no longer needed and should not be enabled.

Finding What Caused the Eviction

Very often with node evictions you will need to engage Oracle Support. Clusterware is complex enough that it will take the support tools that Oracle Support has available to diagnose the problem. However, there are some initial things you can do that might help to solve basic problems, like node misconfigurations. Some things you might want to do are:

1. Determine the time the node rebooted, using the uptime UNIX command for example. This will help you determine where in the various logs to look for additional information.

2. Check the following logfiles to begin with: 

a. /var/log/messages
b. GRID_HOME/log/<host>/cssd/ocssd.log
c. GRID_HOME/log/<host>/alert<host>.log

Perhaps the biggest causes of node evictions are:

1. Node time coordination – We have found that even though Oracle Clusterware 11g Release 2 does not indicate that NTP is a requirement, Clusterware does seem to be more stable when it is enabled and working correctly. We recommend that you configure NTP on all nodes.

2. Interconnect issues – A common cause of node eviction issues is that the interconnect is not completely isolated from other network traffic. Ensure that the interconnect is completely isolated from all other network traffic. This includes the switches that the interconnect is attached to. 


3. Configuration/certification issues – Ensure that the hardware and software you are using is certified by Oracle. This includes the specific version and even patchset number of each component. 

4. Patches – Ensure that all patch sets are installed as required. 

5. OS software components – Ensure that all OS software components have been installed as directed by Oracle. Don't decide not to install a component just because you don't think you are going to need it. Make sure that you are installing the correct revision of those components. The Oracle documentation and MOS provide a complete list of all required patch sets that must be installed for Clusterware and RAC to work correctly.

6. Software bugs

The biggest piece of advice that can be given to avoid instability within a cluster is to get the setup and configuration of that cluster right the first time. Take the time to ensure that you have the correct patch sets installed and that you have followed the install directions carefully. If in doubt about any step of the installation, contact Oracle for support.