AOS Alarms Guide v2


DESCRIPTION

Troubleshooting guide for Avaya Communication Manager (CM) and Media Gateway (MG) alarms


  • Avaya Operational Services BACKBONE

    ALARMS - GUIDE Version: 2.0

    - Prashanth Burugula. Copyright © 2014, all rights reserved. The information in this document is subject to change without notice and should not be construed as a commitment by Avaya Corporation. Every effort has been made to ensure the accuracy of this document; however, Avaya Corporation assumes no responsibility for any errors that may appear. All trademarks mentioned herein are the property of their respective owners.

  • Media Gateways Alarms

    MED-GTWY
    CMG_EventID_14
    CMG_EventID_16
    CMG-19
    CMG-34
    CMG-21/cmgIccMissing
    CMG-22/cmgIccAutoReset
    CMG-23
    CMG-35/VoIPOccFault
    CMG-36/VoIPStats AppFault Trap
    CMG-25/26
    CMG-47/48/49
    CMG-50/51/52

  • Platform Alarms

    SME
    Tripwire
    GW_ENV_EventID_10
    DAL2/DAL1/DAJ
    DUPLICATION LINK ALARMS
    FILESYNC ALARMS
    ARBITER ALARMS/A_EventID_X
    BKP_EventID_10
    WD_EventID_22
    WD_EventID_26
    PE HEALTHCHECK/PE_EventID_1
    Malformed_INADS
    LOGIN_EventID_X
    USB1_EventID_X
    UPD_EventID_X
    UPG_EventID_X
    UPS_EventID_X
    STD_EventID_X
    ENV_EventID_X
    SVC_MON_EventID_X

  • TN Circuit Packs Alarms

    PKT-INT
    PKT-BUS
    G3_Cabinet-Down/G3_CircuitPack-Down
    SYS-LINK
    TONE-BD
    ETH-PT
    CLAN-BD
    IPMEDPRO
    MEDPROPT
    VAL-PT
    VAL-BD
    SNI-BD
    SNI-PEER
    SN-CONF
    SNC-LINK/SNC-BD/SNC-REF
    EXP-INTF
    EXP-PN
    FIBER-LINK
    DS1C-BD
    TDM-BUS
    POW-SUP
    M/T-BD / M/T-ANL / M/T-DIG / M/T-PKT
    PS-RGEN/RING-GEN
    NR-CONN

  • Survivable Processor Alarms

    LIC-ERR
    ESS_LOCATION_C000
    ESS_EventID_1
    ESS_EventID_2
    ESS_EventID_3
    ESS_EventID_4
    ESS_EventID_5
    ESS_EventID_6

  • Adjuncts Associated Alarms

    PRI-CDR/SEC-CDR
    ASAI-PT/BD
    ADJ-IP/AESV-SES/ASAI-IP

  • Trunk/Trunk Board Associated Alarms

    MG-IAMM
    UDS1-BD/MG-DS1
    MG-ANA
    ANL-BD
    BRI-BD/MG-BRI/TBRI-BD
    BRI-PT/TBRI-PT
    CO-TRK
    ISDN-SGR/ISDN-TRK
    H323-SGR
    1009: Total Processing exceeds 70% on \\LSP-LAPG45CC03PV
    HARD DISK,S,77

  • Switch Alarms

    PowerSupply_Fault
    pethPsePortOnOffNotification
    Interface_Fault_MIB2
    ExceededMaximumUptime
    HighErrorRate
    Switch Down/Interface Down/Host Down

  • Alarm Description:

    alarmMinor: ExternalName=J201C004-cm1-virtual-mdc: Type=MIN: MaintName=MED-GTWY: On Board=n: AlarmIPAddress=XX.XX.XX.XX: AlarmPort=XXX: AlarmCategory=

    Understanding: The MED-GTWY maintenance object monitors the H.248 link to the Media Gateway. It logs errors when there are H.248 link problems or when hyperactive H.248 link bounce occurs.

    Solution:

    almdisplay v / almdisplay res |more

    If the alarm is active:

    Check whether the reported MED-GTWY is registered with the main server (via autosat), then run display media-gateway.

    If the gateway is not registered (the Registered? field shows n), try to ping the gateway's IP address from the main server. If it is not pingable, contact the customer and check for a network outage or a scheduled power outage at the site. If it is pingable:

    Run traceroute from the main server. Any errors or "*" hops indicate a problem on the path.

    Log in to the MED-GTWY with ssh init@<mg-ip>.

    Run show event-log and check for issues around the time stamp of the alarm.

    In this case, the H.248 link was down, which caused the alarm.

    Contact the customer and check for any network issue, scheduled power outage, or other activity at the site.

    If so, monitor the alarm and confirm that the MED-GTWY registered back to the main server after the activity (log in to the MED-GTWY and run show mgc). A consolidated sketch of these checks follows below.
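    A compressed version of the reachability check above, run from the main server's Linux shell, could look like the following sketch. <mg-ip> is a placeholder for the gateway address shown on the display media-gateway form; the gateway-side commands are shown as comments because they run on the MG CLI, not in the server shell.

        # main server shell: is the gateway reachable?
        ping -c 4 <mg-ip>          # no replies -> suspect LAN/power outage at the site
        traceroute <mg-ip>         # "*" hops show where the path is breaking
        # if reachable, inspect the gateway's own view of the H.248 link
        ssh init@<mg-ip>
        #   show event-log         # compare entries with the alarm time stamp
        #   show mgc               # confirm the gateway re-registered with the main server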

  • Check for any faults before proceeding to case closure.

    For example: in this case there was a network fluctuation that brought the H.248 link down; the Customer identified it, and once the network was restored the H.248 link came back up. The event log showed the link up, which cleared the alarm, and the case was then closed.

    In a network-flap scenario where the H.248 link goes down, the event log records the link going down and then coming back up.

    If there is a reset on the MG, the event log records the reset instead.

  • Probable cause: The alarm can be reported due to:

    a LAN issue / power outage at the site, OR

    the ESS/LSP server reloading after getting translations from the main server, OR the main server being down due to bad health or a scheduled activity at the Customer's site.

    Alarm Description:

    Alarm->MaintObject_CMG_EventID_14 cmgSyncSignalFault / cmgSyncSignalClear

    Understanding: If the Avaya G700 Media Gateway contains an MM710 T1/E1 Media Module, it is usually advisable to set the MM710 up as the primary synchronization source for the G700. In so doing, clock sync signals from the Central Office (CO) are used by the MM710 to synchronize all operations of the G700. If no MM710 is present, it is not necessary to set synchronization. If neither primary nor secondary sources are identified, then the local clock becomes active. By setting the clock source to primary, normal failover will occur. Setting the source to secondary overrides normal failover, generates a trap, and asserts a fault.

    Probable cause: The alarm can be reported due to:

    a LAN issue / power outage at the site, OR

    the ESS/LSP server reloading after getting translations from the main server, OR

    the ESS/LSP being down due to bad health.

    Solution:

    almdisplay v / almdisplay res|more (displays resolved alarms page by page; use the spacebar to go to the next page)

  • Telnet to the media-gateway and collect the following output (a consolidated sketch follows this list):

    show faults (verify any sync faults / DS1 board faults, if present)

    show sync timing (check for any synchronization errors, if present)

    show events (check for loss-of-signal or signal-fault-clear statements in the logs)

    test board <DS1 board location>

    status trunk X (check whether the trunk is in service or out of service)

    list measurement ds1 log (check for any slip errors, if the trunk is in service)
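    The same checks can be grouped into one pass on the gateway CLI followed by SAT verification. This is only an illustrative ordering of the commands already listed above; <DS1 board location> and X (the trunk group) are placeholders.

        # Media Gateway CLI
        show faults                # any sync faults / DS1 media-module faults?
        show sync timing           # primary/secondary/local source and error counters
        show events                # look for loss-of-signal / signal-fault-clear entries
        test board <DS1 board location>
        # SAT on the main server
        #   status trunk X                   # trunk in service or out of service?
        #   list measurement ds1 log         # slip/error counters, if the trunk is in service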

    Alarm Description: Alarm->MaintObject_CMG_EventID_16 Understanding: This trap indicates that one of the media-modules (possibly the VoIP module) has undergone a change. The change could be a new media-module being inserted or reseated, a busyout-release, a configuration file being uploaded, or firmware being downloaded. Solution:

    almdisplay v / almdisplay res |more

    show faults (login to media-gateway and check for any active faults)

    show event-log (check logs for the exact event happened on the media-module)

    list configuration board (check for the board detection)

    test board (check that all tests pass for the board)

    Probable Cause: The common reason for the alarm is administration work on one of the media-modules:

    a firmware update activity on the media-module, OR

    bad health of the media-module, because of which it was reset/reseated.

  • Alarm Description: Alarm->MaintObject_CMG_EventID_19/34 Description: The alarm indicates that an attempt to download a software module (CMG 19) or to upload a configuration file (CMG 34) has failed. Procedure:

    almdisplay v / almdisplay res |more

    show faults (login to media-gateway and check for any active faults)

    show event-log (check logs for the exact event happened on the media-module)

    list configuration board (check for the board detection)

    test board (check that all tests pass for the board). Try downloading the software module again (for the CMG 19 alarm) or uploading the configuration file again (for the CMG 34 alarm); if this fails, follow the step below with the required Customer permission:

    reset the board (i.e. busyout-release the board), followed by a reseat of the board and then, if required, replacement of the board.

    Probable Cause: The most common reason is failure of an update activity.

    Alarm Description: Alarm->MaintObject_CMG_EventID_21/22 Alarm->cmgIccMissing / cmgIccAutoReset Understanding: The alarm indicates that an ICC, expected in Slot 1, is missing (CMG 21) and/or that the Media Gateway automatically reset the ICC (CMG 22). Procedure:

    almdisplay v / almdisplay res |more

    restartcause (to check what initialized the server)

    list survivable-processor (to check the time of the saved translation file, in case the ICC is an LSP)

    show event-log (check logs of media-gateway for finding the exact event happened)

    * If CM version is 5.2.x check for PCN 1690Pu.*

  • Probable Cause: The alarm may be reported due to:

    the CM release being 5.x (which has a software limitation), OR

    translations being pushed onto the LSP by the main server, in which case CM was reloaded, OR

    bad health of the S8300 server.

    Alarm Description: Alarm->MaintObject_CMG_EventID_23 Description: Telephone services on a Media Gateway are controlled by a Media Gateway Controller (MGC). All media-gateways integrate seamlessly with Avaya Media Servers. A media gateway can be configured with up to 4 controllers, so that if the primary controller goes down, the second controller in the list takes control and keeps telephony services active. The alarm indicates that the Media Gateway cannot contact the first controller defined in its controller list. Procedure: A. If the first controller is a C-LAN board

    almdisplay v / almdisplay res |more

    cd /var/log/ecs

    grep -R MG (to identify Media-gateway if alarm is in resolved state)

    show mgc list (to identify the ip-address of the first controller, i.e. the C-LAN board)

    list ip-interface clan (to find the Clan board location)

  • status clan-port <port location> (note: the 17th port is the required Ethernet port here)

    display errors

    If the Ethernet link is down, inform the Customer and ask them to check the physical LAN connectivity to the C-LAN board. If no connectivity issue is found then, with the required permission, reset the C-LAN board (i.e. busyout-release the board), followed by a reseat and then replacement of the board, if required. (A consolidated sketch of the C-LAN checks appears at the end of this alarm's section.) Probable Cause: The alarm may have been reported due to:

    LAN Issue OR

    Bad health of the C-LAN board. B. If the first controller is an ICC

    almdisplay v / almdisplay res |more

    restartcause (to check how and when ICC was rebooted)

    statapp (check whether all services are up and fine)

    If the ICC is not accessible/down, inform the Customer and ask them to reseat the ICC board. If the ICC still does not come up, try replacing the board. Probable Cause: The alarm may get reported due to:

    a LAN issue / power outage, OR

    bad health of the S8300 main server, or the server was rebooted.
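    For branch A (the first controller is a C-LAN board), the checks above reduce to roughly the sequence below. SAT commands are shown as comments; <clan-location> is a placeholder read from list ip-interface clan, and the 17th port on that board is its Ethernet port.

        # Media Gateway CLI
        show mgc list              # which controller is listed first, and is it reachable?
        # SAT on the main server
        #   list ip-interface clan              # find <clan-location> for the first controller's IP
        #   status clan-port <clan-location>    # check the board's Ethernet (17th) port
        #   display errors                      # errors logged against the C-LAN board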

  • Alarm Description: Alarm->MaintObject_CMG_EventID_35/36 Alarm->cmgVoipOccFault / VoIPStats AppFault Trap Description: One, several, or all of the VoIP engines in the media gateway are over their occupancy threshold (Channels In Use / Total Channels) (CMG 35), or back below the occupancy threshold after having exceeded it (CMG 36). Procedure:

    show faults

    show voip-parameters

    show event-log (confirm whether occupancy is back to normal after exceeding the threshold value)

    Typically no other action is required here. Probable Cause: The most common cause is that VoIP occupancy exceeded its threshold value. Alarm Description: Alarm->MaintObject_CMG_EventID_25/26/47-52 Description: Telephone services on a Media Gateway are controlled by a Media Gateway Controller (MGC). All media-gateways integrate seamlessly with the Avaya Media Server. For the MGC to control a media-gateway, the latter needs to be registered with the media server. If an S87xx is the primary controller, the MG has to register with a C-LAN board. For an S85xx, the MG registers with either a C-LAN board or the Processor Ethernet port, if enabled. For an S8300, it registers with the Processor Ethernet port. These alarms simply mean that the Media-Gateway is not registered to its controller.

  • Procedure:

    almdisplay v / almdisplay res |more

    cd /var/log/ecs

    grep -R MG (to identify the Media-gateway, if alarm is in resolved state)

    ping <mg-ip> (check whether the MG is reachable from the main server)

    display media-gateway X (if the MG is pingable but not registered, check for a recovery rule)

    display system-parameters mg-recovery-rule X (check the configuration)

    reset media-gateway X level 2 (if required, with the Customer's permission)

    traceroute <mg-ip> (if the MG is not pingable, check at which hop the trace fails).

    Inform customer and ask to check the network integrity at the site.

    show mgc list (login to MG, If no issues have been found with MG and/or network integrity check for Clan board which is defined in controller list)

    ping

    list ip-interface clan (to find the Clan board location)

    status clan-port <port location> (note: the 17th port is the required Ethernet port here)

    display errors

    If the Ethernet link is down, inform the Customer and ask them to check the physical LAN connectivity to the C-LAN board. If no connectivity issue is found then, with the required permission, reset the C-LAN board (i.e. busyout-release the board), followed by a reseat and then replacement of the board, if required. Probable Cause:

    a LAN issue or power outage at the Customer site, OR

    bad health of the C-LAN / Media-gateway. (A condensed sketch of the server-side checks follows below.)
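    A condensed sweep of the server-side checks above might look like the sketch below; <mg-ip>, X and <log-file> are placeholders, and the reset step still needs the Customer's permission.

        cd /var/log/ecs
        grep -R MG <log-file>      # identify the gateway if the alarm is already resolved
        ping -c 4 <mg-ip>          # reachability from the main server
        traceroute <mg-ip>         # if not pingable, see at which hop the path fails
        # SAT on the main server
        #   display media-gateway X                       # registered? recovery rule assigned?
        #   display system-parameters mg-recovery-rule X  # check the rule configuration
        #   reset media-gateway X level 2                 # only with the Customer's permission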

  • Platform Alarms

    Alarm Description:

    Alarm->MaintObject_SME_EventID_1

    Description: The Server Maintenance Engine (SME) is a Linux process which provides error analysis, periodic testing, and demand testing for the server. This SME alarm means that alarms are not being reported by the other server in a duplex configuration, due to a failure of either the GMM or the administered reporting mechanism. Procedure:

    testinads / testcustalm (if testinads or testcustalm replies affirmatively, then the cause for which the alarm was reported no longer exists)

    statapp (check whether all required processes are up)

    grep GMM /var/log/ecs/wdlog

    start -s GMM (if a GMM failure is found, inform the customer and, with their permission, restart GMM; note that restarting GMM may cause a server interchange)

    If no GMM failure is found:

    if testinads is failing on either server, kill the almindsagt process (with the customer's permission). Example:

    b) init@pacehqs8720b> kill -9 5303
    c) init@pacehqs8720b> ps -ef |grep alm
       root 26897  3790 0 07:27 ?     00:00:00 /opt/ws/almindsagt
       root 27044 26878 0 07:28 pts/0 00:00:00 grep alm

    Note: the almindsagt process restarts automatically after it is killed. (A consolidated recovery sketch appears after this procedure.)

  • testinads

    stop -s SME and stop -s MVSubAgent, followed by

    start -s SME and start -s MVSubAgent (if testinads is still failing, restart the SME and MVSubAgent processes, which send traps from the server, but only with the customer's permission)

    testinads / testcustalm (if testinads still fails, a warm reboot may be needed (stop -a followed by start -a), followed by a cold reboot (reboot) if required, but only with the customer's permission)

    logger -t svc_mon[2343] atd could not be restarted (raises a false test alarm; check whether it gets reported to Remedy)

    once the alarm is resolved: almclear -a
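    Putting the SME recovery steps above together, a typical pass on the CM shell looks roughly like the sketch below. The kill step assumes the Customer has agreed and that the PID is read from the ps output, as in the example above.

        testinads                        # does the alarm call-out path work?
        statapp                          # are all required processes UP?
        grep GMM /var/log/ecs/wdlog      # any GMM failures logged?
        # if a GMM failure is found (may cause a server interchange - permission first):
        #   start -s GMM
        # if testinads fails, restart the INADS agent (it respawns automatically):
        ps -ef | grep alm                # note the almindsagt PID
        kill -9 <almindsagt-pid>
        testinads                        # re-test after the agent restarts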

    Probable Causes: The most common cause is that one of the duplex servers could not call out an alarm, and the other server raises this alarm to inform the Administrator of that. This could be due to:

    a GMM failure, OR

    failure of a sub-process essential to the administered reporting mechanism, such as the sme or mvsubagent process, OR

    a scheduled activity at the Customer site, which may also affect the reporting mechanism.

    Alarm Description:

    Alarm->MaintObject_TRIPWIRE_EventID_7 Description: Tripwire is an intrusion detection system (IDS) which constantly and automatically monitors your critical system files and reports whether they have been destroyed or modified by a cracker (or by mistake). It allows the system administrator to know immediately what was compromised and fix it. The first time Tripwire is run, it stores checksums, exact sizes and other data of all the selected files in a database. Successive runs check whether every file still matches the information in the database and report all changes. Procedure:

    grep -R tripwire /var/log/messages (the output identifies the run that flagged the change)

    cd /var/lib/tripwire/report (go to this directory, then)

    ls -ltr (to identify the *file-name.twr* report that was generated when the files were modified)

    twprint --print-report --report-level N --twrfile /var/lib/tripwire/report/*file-name.twr* --- where "N" is a level from 0 to 4. Run this command to identify the sub-files which were modified. (A condensed sketch of this inspection follows below.)
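    In practice the report inspection above is only a few commands; the report file name varies per system and the report level runs from 0 to 4.

        grep -R tripwire /var/log/messages       # which run flagged the change?
        cd /var/lib/tripwire/report
        ls -ltr                                  # the newest *.twr report is listed last
        twprint --print-report --report-level 4 --twrfile /var/lib/tripwire/report/<file-name>.twr | more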

  • Probable Cause: When any of the critical system files and reports is changed or modified, we get this alarm.

  • Alarm Description:

    Alarm->MaintObject_GW_ENV_EventID_10 Description: This environment alarm is raised in case of power-supply faults on the gateway. Procedure: Log in to the Media Gateway and run the following commands.

    show faults

    show platform

    show voltage

    show event-log

    show system

    Ask the customer to check the following: a) Verify that the power cord is firmly inserted and that power is being supplied to the power unit reporting the event. b) Reinsert the power supply and monitor the Event Log. If the customer replies that everything is fine but the alarm is still present, send a technician to confirm the same and then to replace the power supply unit.

  • Alarm Description:

    Alarm->MaintObject_DAL2/DAL1/DAJ1 Understanding: This MO supports each S8700 media server's Duplication Memory board, a NIC (Network Interface Card) serving as the physical and data-link interface for an Ethernet-based duplication link between the servers. This link provides a call-status data path carrying:

    TCP-based communication between each server's Process Manager

    UDP-based communication between each server's Arbiter to:

    enable arbitration between the active and standby servers

    provide status signaling for memory refreshes. Procedure:

    almdisplay v / almdisplay res |more

    server (check for server, if it is in curbs in mode and status of standby shadowing & duplication link)

    testdupboard

    (Note: If a cable has become unplugged from either of the DAJ1 boards both boards will test ok. The dup link will show down/not refreshed but both DAJ1 boards will test ok.)

    restartcause (if the alarm is in resolved state and the output of steps 2 and 3 is fine)

    testdupboard -t localloop (only on the standby server, and only if the standby server is in a busied-out state)

    reboot (if the test continues to fail; with the Customer's permission)

    If the test still fails, replace the DAL/DAJ card. Probable Cause: The alarm may be due to:

    bad health of any of the duplication board OR

    duplication link got refreshed because of periodic/scheduled maintenance activity OR

    the CM server got reloaded because of a save-translation activity, OR

    Server got rebooted/reloaded

  • Alarm Description:

    Alarm->MaintObject_DUP_EventID_X Understanding: The Duplication Link is a 10/100BaseT Ethernet link which is used by the Duplication Manager (ndm) process to communicate with the other server's ndm process. The Duplication Manager process (via coordination of the Arbiter process) runs on each S8700 Multi-Connect server to control data shadowing between them. Meanwhile, at the physical and data-link layers, an Ethernet duplication link provides a TCP communication path between each server's Duplication Manager to enable their control of data shadowing. The dupmgr is responsible for monitoring the status of this link. It raises a major alarm in the event that the Duplication Link is non-functional, by logging an entry into syslog that the Global Maintenance Monitor (GMM) uses to report alarms. Procedure: For the Duplication Link on an S87xx server

    almdisplay v / almdisplay res |more

    server (check for curbs in mode and status of standby shadowing & duplication link)

    testdupboard

    filesync -Q dup (check the status of filesync on duplication link)

    pingall d (check whether dup-ip is pingable from each server)

    cat /proc/mdd (check for crc errors)

  • cd /var/log/ecs (run ls -ltr to find the log file with the latest date tag, e.g. 2014-0203-070101.log)

    cat <log-file> (check the logs for when the duplication link went down or was refreshed; was there any scheduled/periodic maintenance running at that time, or some other activity which may affect functioning of the duplication link?)

    restartcause (to check whether CM on either server was reloaded or there was a server interchange). Probable Cause: The alarm is reported if the Dup-Link is non-functional, possibly due to:

    bad health of any of the server or duplication board OR

    duplication link got refreshed because of periodic/scheduled maintenance activity OR

    the CM server got reloaded because of a save-translation activity, OR

    Server got rebooted/reloaded OR

    due to any scheduled activity at the Customer site
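    A minimal health pass for the duplication link, using only the commands already listed above (run on either server):

        server                     # mode, standby shadowing, duplication link state
        testdupboard               # note: an unplugged cable can still test ok
        filesync -Q dup            # filesync status over the duplication link
        cat /proc/mdd              # CRC errors on the duplication memory driver
        cd /var/log/ecs
        ls -ltr                    # open the newest log and look for dup-link down/refresh entries
        restartcause               # was CM reloaded, or was there a server interchange?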

  • Alarm Description:

    Alarm->MaintObject_FSY_EventID_X Description: When multiple servers (i.e. processors) are present in a network, the active server shares configuration information (translations) with all the other servers (the standby server and LSP/ESS servers) so that in the event of failure, a surviving processor can take over and have the latest information. Sharing occurs in a process known as file synchronization (filesync) and can happen once per day or whenever the translation file is changed. The system must be operated in a manner, and the network connectivity designed, to accommodate this activity. Procedure:

    For ESS/LSP server (Note: Click for Dup-FSY alarm)

    almdisplay v / almdisplay res |more

    filesync -Q all (check the status of File Synchronisation)

    statapp (check for filesync process is running on both main and ess/lsp server)

    filesync -w -a lsp trans (Manually saving translations to lsp.)

  • filesync -w -a ess trans (in case of an ess server) (Check whether manual push is successful or else check for the error-reason code)

    list survivable-processor (check the connectivity of the ESS/LSP with the main server and whether translations were saved on the ESS/LSP; CM reloads on the ESS/LSP after getting the translation file, and that event causes the alarm to be reported on the main server)

    restartcause

    In case the alarm is active and the above steps do not identify the cause, then:

    date (run this command on the main server as well as the LSP/ESS, because a time mismatch could be the cause)

    ip_fw -q -s 21874/tcp service (check whether the TCP ports defined for filesync are open in both directions, on each server)

    cat /etc/sysconfig/network-scripts/ifcfg-ethX (check whether the Ethernet ports are locked to 100 Mbps Full Duplex on each server; ethX is the Ethernet port defined for the Customer LAN)

    /sbin/ifconfig ethX (check whether the Ethernet port is seeing errors; ethX is the Ethernet port defined for the Customer LAN)

    Probable Causes: The alarm can be reported due to:

    CM (on the ESS/LSP) being reloaded, by design, after getting the translations from the main server, OR

    Network integrity issue between ESS/LSP and main server OR

    Scheduled activity at the Customer site or some recent changes made on any of the server OR

    Date/Time is mismatched on main server and ESS/LSP OR

    Tcp/IP ports are blocked in any direction on any of the server OR

    Issue with Ethernet port where Customer Lan is defined
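    For an LSP that is reachable but missing translations, the manual push and its sanity checks above condense to roughly the following sketch (swap "lsp" for "ess" on an ESS; ethX is whichever port carries the Customer LAN):

        filesync -Q all                                  # current filesync status for all targets
        statapp                                          # filesync process running on main and LSP/ESS?
        filesync -w -a lsp trans                         # manual translation push ("ess" for an ESS)
        date                                             # run on both servers - a clock mismatch breaks filesync
        cat /etc/sysconfig/network-scripts/ifcfg-ethX    # port locked to 100 Mbps Full Duplex?
        /sbin/ifconfig ethX                              # interface errors on the Customer-LAN port?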

  • Alarm Description: Alarm->MaintObject_A_EventID_X Description: The alarm indicates malfunctioning of the Arbiter process, used on duplex servers to determine the health of the server. The Arbiter process runs on each S87xx server to: decide which server is healthier (more able to be active), and coordinate data shadowing between them (under the Duplication Manager's control). Meanwhile, at the physical and data-link layers, an Ethernet-based duplication link provides an inter-arbiter UDP communication path to: enable this arbitration between the active and standby servers, and provide the necessary status signaling for memory refreshes.

    Procedure: Need to follow DUP Alarm and then ...

    server

    (If the output indicates a corrupt/failed state, then inform the Customer and, with the required permission, restart the Arbiter process by executing the following commands):

    stop SF -s arbiter

    start -s arbiter

    server c

    cd /var/log/ecs

    grep -R Arbiter

  • verify that the host name and corresponding IP address are identical in the hosts file and the configuration file: a) more /etc/hosts b) more /etc/opt/ecs/servers.conf c) ifconfig -a (verify that the IP address matches the hosts and configuration files and that all Ethernet ports have an IP address assigned) d) /sbin/arp -a (verify the MAC address is complete)

    verify whether the Arbiter port is still active using the following command

    netstat -a | grep 1332

    Probable Cause: The alarm is reported whenever the Arbiter process detects bad health of either duplex server, OR an issue with data shadowing between the duplex servers.

    Alarm Description:

    Alarm->MaintObject_BKP_EventID_10

    Understanding: Backups are designed to preserve off-server copies of translations, configuration files, security files, logs, and other important information. The backup command is used for both backup and restore of data sets. The above alarm is reported when a scheduled backup has failed. Procedure: A. Backup failing to an FTP server

    almdisplay v / almdisplay res |more

    sudo backup t |more (gives a history of successful and failed backups)

    ping <ip-address of the FTP server>

    traceroute <ip-address of the FTP server> (if the FTP server is not pingable, check at which hop it is failing and ask the customer to check the network integrity)

    If the FTP server is pingable, take a manual backup: a) from sroot, cd /etc/cron.d, ls, and then open the relevant file using cat (to find the destination where the backup should be written)

  • Then copy the backup string (as shown above) into a notepad, and add --verbose -d to the string after -b, as shown below.

    b) Or cat web* or cat back* (to get the login/password for the FTP server, if required) c) sudo backup -b --verbose -d ftp://'login':'paswd'@/ -c full d) A backup can also be taken on the server itself using: sudo backup -b --verbose -d /var/home/ftp/pub/ -n 3 -c xln os security. Once the backup is successful, check backup t |more to capture the backup logs and then proceed to case closure. (A cleaned-up sketch of this manual backup follows below.)
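    Cleaning up the fragments above, a manual backup attempt toward the FTP server has roughly the shape below. The login, password, host and path are placeholders taken from the scheduled-backup entry under /etc/cron.d; this is an illustration of the command shape only, not a fixed command line.

        cd /etc/cron.d && ls                 # find the scheduled backup entry
        cat <backup-cron-file>               # copy the backup string and the FTP login/password
        # re-run the backup by hand with verbose output:
        sudo backup -b --verbose -d ftp://<login>:<password>@<ftp-host>/<path> -c full
        sudo backup t |more                  # confirm the manual backup now shows as successful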

    Probable cause: Alarm can be reported either due:

    LAN issue / Power outage at the site OR

    ESS/LSP server reloads after getting translations from main server OR

    The main server is down due to bad health or a scheduled activity at the Customer's site. B. Backup failing to a PC Card/Flash Card

    almdisplay v / almdisplay res |more

    sudo backup t |more

    take manual backup a) cd /etc/cron.d

  • b) cat web* or cat back* c) backup -b --verbose -w -d usb-flash:// -n 2 -c xln os security

    search_scsi v t CF (check whether device is present and at which location)

    df -h /mnt/flash (to check where PC Card was originally mounted) Before mounting PC Card or formatting it, take Customers permission (below steps)

    mount /dev/cciss/"device location" /mnt/flash (to mount PC card)

    /sbin/mkfs.ext2 / (format the device if it is present but not being detected while backup is ran)

    Try manual backup command again as in step 3-c

    sudo backup t PC card (to verify the contents of PC card)

    Probable Cause: The backup to the PC Card may fail due to:

    the PC Card having been unmounted, OR

    the PC Card not being detected by CM when the scheduled backup script was executed. Alarm Description:

    Alarm->MaintObject_WD_EventID_22 Understanding: The watchdog keeps an eye on all processes in the system, maintaining heartbeats with both Communication Manager and platform processes. The watchdog is responsible for stopping and starting processes when necessary. This process watches over the entire system. Event ID 22 indicates that one of the watchdog-monitored processes was terminated. Procedure:

    almdisplay v / almdisplay res |more

    restartcause (check whether CM was reloaded )

    grep -R terminated /var/log/messages (identify the Application that was terminated and corresponding time-stamp)

    Run statapp to confirm that the watchdog application, and all the other applications, are UP.

  • grep -R Application /var/log/messages (to confirm WD restarted the application)

    start -s <application> (if necessary, restart the application using this command)

    If the application still does not come up, inform the Customer and, with their permission, go for a Warm Reboot and then a Cold Reboot, if required.

    Probable Cause: The alarm gets reported either due to:

    Malfunctioning of the server caused termination of a watchdog-monitored process, OR

    CM reloaded after getting the translation file from the main server (in the case of an ESS/LSP), OR

    CM was rebooted, possibly due to some scheduled activity at the Customer site. Alarm Description:

    Alarm->MaintObject_WD_EventID_26 Description: Watchdog handshake error. If USB alarms are also present, this strongly points to a global SAMP or networking problem. This error implies a malfunctioning or missing SAMP, a SAMP configuration/firmware mismatch, or else a malfunctioning USB modem. Procedure:

    almdisplay v / almdisplay res |more

    sampdiag v (gives status of SAMP)

    grep -R SampEth /etc/opt/ecs/ecs.conf (to check detection of the SAMP card)

    sampcmd date (to check synchronization of SAMP with host)

    restartcause

    testmodem

    testmodem -t reset_usb (soft reset of USB modem, if any of the test is getting failed)

  • stop -s ModemMtty followed by start -s ModemMtty (if the soft reset does not work, restart the ModemMtty process, but take the Customer's permission before doing that)

    If testmodem still fails, ask the Customer to get the Modem reseated, followed by the telephone cable (inserted into the modem) reseated.

    If testmodem still fails then get the modem replaced. Probable Cause: The alarm gets reported either due to:

    Malfunctioning of SAMP OR

    Malfunctioning of Modem OR

    The server was rebooted. Alarm Description:

    Event Name: PE Health Check device is not responding to ARP request / Event Name: MaintObject_PE_EventID_1 Understanding/Description: The Processor Ethernet (procr) feature was added to duplicated Main servers in CM 5.2. This allows configurations without Port Networks. In addition, a weight relative to the IPSIs is assigned to the PE interface. The reason is that if only one adjunct is connected to the system using procr, but everything else is still IPSI connected, you wouldn't want the servers to interchange simply because the procr interface went down. This priority can be seen in the output of the server command, and may be set to HIGH, LOW, or IGNORE using the server web pages. If set to HIGH, the PE is favored over IPSIs; if LOW, the IPSIs are favored over the PE. If set to IGNORE, the SOH of the PE is not used in interchange decisions; if the PE fails on the active server, it has no effect on the server SOH and does not cause a server interchange. Procedure:

    almdisplay v / almdisplay res |more

    cd /var/log/ecs

    grep -R arping 2010* (i.e. search for arping in the /var/log/ecs log files)

    switch to the sroot login and then run arping -I ethX -f -c 1 -w 1 (check whether the arping passes; the value of X is identified in step 3; see the sketch after this procedure)
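    A quick way to reproduce the PE health check by hand is shown below. The interface name ethX is the one found in the /var/log/ecs arping entries, <target-ip> stands for the address being probed (not specified in this transcript), and sroot privileges are needed.

        cd /var/log/ecs
        grep -R arping *                     # which interface and target failed the ARP probe?
        # as sroot, repeat the probe once against the same target:
        arping -I ethX -f -c 1 -w 1 <target-ip>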

  • Note: If this issue occurs 3 times in a row, it could lead to an interchange if only one server sees the failure. Many of the examples seen have been chronic issues that occur many times over a week or two. In such cases additional analysis should be done to determine whether there is an underlying issue. Probable Cause: When arping fails for any Ethernet port on the server, this alarm is reported. Alarm Description:

    Event Name: Malformed_INADS_alarm-2000000000 31/01:07,EOF,ACT|AUDIT,VM,101,MIN Description: It indicates a messaging alarm for IA 770. Procedure:

    almdisplay v

    statapp (check whether the messaging / INADS AlarmAgent service is down)

    almdisplay res |more. Probable Cause: CM reports this alarm when it detects malfunctioning of the IA 770.

  • Alarm Description:

    Alarm->MaintObject_USB1_EventID_X Understanding: Modems are used for their ability to call out alarms to an external Alarm Monitoring System and also to access Avaya servers remotely by dialing through the Modem (e.g. from toolsa we can access the Customer's network only through the Modem). The modems in the system are tested every 15 minutes to verify that dial tone can be achieved. If dial tone is not achieved, the Watchdog reports an alarm. Procedure:

    almdisplay v /almdisplay res |more

    testmodem

    restartcause

    testmodem

    testmodem -t reset_usb (soft reset of USB modem, if any of the test is getting failed)

    stop -s ModemMtty followed by start -s ModemMtty (If soft reset doesnt work then restart ModemMtty process but do take Customer before doing that)

    If testmodem still fails, ask the Customer to get the Modem reseated, followed by the telephone cable (inserted into the modem) reseated. Note: If the Handshake Test is failing, reseat the modem; if the Off-Hook Test is failing, get the telephone cable (inserted into the modem) reseated.

    If testmodem still fails, get the modem replaced. (See the recovery sketch after this section.) Probable Cause: The alarm is reported mostly due to:

    Malfunctioning of modem OR

    The ModemMtty service is hung or stopped on the server, OR

    Telephone line connected to modem is not functioning properly
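    The modem recovery path above, in order, looks like this; each escalation step needs the Customer's permission first.

        testmodem                            # the 15-minute dial-tone test, run on demand
        testmodem -t reset_usb               # soft reset of the USB modem if a test fails
        stop -s ModemMtty
        start -s ModemMtty                   # restart the modem service if the soft reset did not help
        testmodem                            # Handshake Test still failing -> reseat the modem;
                                             # Off-Hook Test failing -> reseat the telephone cable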

  • Alarm Description:

    Alarm->MaintObject_UPD_EventID_X Description: The kernel update is activated but the activation is not committed. Procedure:

    almdisplay v

    swversion a (gives the date when the update was executed)

    swversion r (gives the info about previous CM load)

    update_show

    update_commit (making the update permanent)

    almclear a Probable Cause: The kernel update was not committed and hence this alarm is reported.

  • Alarm Description:

    Alarm->MaintObject_UPG_EventID_1 Description: The UPG MO raises an alarm if the upgrade was not made permanent within a certain amount of time after the upgrade. Procedure:

    almdisplay v / almdisplay res |more

    swversion a ( verify the date and time of upgrade)

    cat /var/log/ecs/commandhistory (check the command used for the upgrade activity)

    commit (if the upgrade was not made permanent).

    almclear a (clear the alarm) Probable Cause: The alarm is mostly due to an upgrade activity scheduled at the customer side where the upgrade was not made permanent within the specified time after the upgrade. Alarm Description:

    Alarm->MaintObject_UPS_EventID_X Description: The UPS process is for monitoring the status of the UPS for each 8700 server. An alarm will be raised when there is a loss of commercial power or there is some other power problem such as a spike, sag, brownout or blackout. Procedure:

    almdisplay v /almdisplay res |more

    pingall u

    (verify if switch is pingable or else ask customer to check the network integrity)

  • snmpwalk -c public 33 | more (to verify if the system is currently on backup power)

    If alarm is active inform Customer and ask to verify AC Power is being supplied to UPS and coordinate with the vendor, if required.

    Probable Cause: The alarm is reported maybe due to:

    Network issue between server and UPS OR

    Mal-functioning of UPS OR

    AC power is not being supplied to the UPS properly. Alarm Description: Alarm->MaintObject_STD_EventID_X Description: These are standard SNMP traps (i.e. SNMPv2 protocol) sent by an entity to the Media Server, indicating that either it (the entity) has undergone a reboot or it has recognized a failure in one of its communication links.

    Procedure: If Event ID is 1 or 2

    almdisplay v / almdisplay res |more

    ping (ip-address can be identified in alarms itself)

    Log in to the entity having the above IP address and check whether it has undergone a cold or warm reboot. Note: Event ID 1 corresponds to a Cold Reboot and 2 to a Warm Reboot of the entity. If the Event ID is 3:

    almdisplay v / almdisplay res |more

    ping (ip-address can be identified in alarms itself)

    Identify the entity; the alarm indicates that the communication link between the media server and that particular entity is either down or has come back up after a failure.

    Note: STD_EventID_X alarms are generally in resolved state and can be closed either by stating a cold/warm reboot of the IP entity or a communication-link flap between the media server and the IP entity, according to the X value.

  • Alarm Description:

    Alarm->MaintObject_ENV_EventID_X Description: The ENV MO monitors environmental variables (including temperature, voltages, and fans) within the server. The alarm indicates that one of these variables has deviated from its nominal value. Procedure:

    almdisplay v / almdisplay res |more

    environment (command can be executed only on root shell prompt. Normal o/p is shown below)

    If the alarm is in active state and you find one of the below parameters deviated from its normal value, inform the Customer and ask them to check the room temperature and power supply. Monitor the alarm for a couple of hours and, if the alarm is still active, get the Motherboard replaced.

    Normal o/p of environment:

    root@S8700_A> environment
    *** Hardware Health ***
    Feature     Value    Status
    CPU IO:      1.50    Normal
    CPU CORE:    1.74    Normal
    3.3V:        3.33    Normal
    5V:          5.03    Normal
    +12V:       11.94    Normal
    -12V:      -11.93    Normal
    Fan:         9000    Normal
    Fan:         3277    Normal
    Fan:         9507    Normal
    Fan:         9507    Normal
    Temp:       37.00    Normal

    Probable Cause: The most common reasons are:

    room temperature or power supply not being proper, possibly due to a power-supply fluctuation or a physical connection issue, OR

    the Motherboard having gone faulty.

    Alarm Description: Alarm->MaintObject_ SVC_MON_EventID_X

    Description: MO-SVC_MON is a media server process, started by the Watchdog, to monitor Linux services and daemons. It also starts up threads to communicate with a hardware-sanity device. This alarm indicates that one of the Linux daemons is down. Procedure:

    almdisplay v /almdisplay res |more

    cd /var/log

    grep svc_mon messages (to check which daemon was affected)

    service <daemon-name> status (to check whether the daemon is running)

    service <daemon-name> start (if the service is not running)

    service <daemon-name> status (to confirm the service is running)

  • If the daemon still does not come up, inform the Customer and get permission for a Warm Reboot followed by a Cold Reboot (only if required), during a lean period.

    Probable Cause: One of the monitored daemons was stopped or restarted, possibly due to a server reboot, a CM reload, or degrading health of the server. (An example check-and-restart pass follows below.)
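    For example, if the messages log shows that a daemon monitored by svc_mon has stopped, the check-and-restart pass above is simply (the daemon name is whatever grep reports, shown here as a placeholder):

        cd /var/log
        grep svc_mon messages                # which daemon did svc_mon report as down?
        service <daemon-name> status         # is it running now?
        service <daemon-name> start          # start it if it is not
        service <daemon-name> status         # confirm it stays up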

  • TN Circuit Packs Alarms

    Alarm Description:

    Alarm->MaintObject _PKT-INT_Location_

    Alarm->MaintObject_PKT-BUS

    G3_Cabinet-Down / G3_CircuitPack-Down Description: An IPSI board contains several different functions, one of them being the PKT-INT. This is the resource on the IPSI board that is the manager for the LAPD links travelling through the packet bus. These links include RSCLs, EALs, INLs, etc. The packet bus consists of a single bus, and one such bus appears in each port network. The packet bus in each port network is physically independent from those in other port networks, so each port network has a separate PKT-BUS MO. In addition to affecting telephone service, a Packet Interface/Packet Bus failure affects the service provided by circuit packs, e.g. ISDN-signaling service, or the service provided by the C-LAN, VAL or IPMEDPRO boards. Procedure:

    almdisplay v / almdisplay res |more

    pingall i (check whether all IPSIs are pingable)

    list ipserver-interface (check whether ipsi is up or down)

    status packet-interface X (X is cabinet number)

  • test packet-interface X (X is cabinet number)

    status cabinet X

    status port-network Y (Y is port-network to which cabinet X belongs to)

    status sys-link

  • test sys-link

    test board

    cd /var/log/ecs

    grep sanity (check for sanity failures)

    grep WARM (check for Warm reboot of Port-Network)

    grep COLD (check for Cold reset of Port-Network)

    display errors

    If the alarm is active and any IPSI is down, inform the Customer and, with the required permission, go for a reset of the IPSI, followed by a reseat and then, if required, replacement. Probable Cause: The alarm may get reported due to:

    a LAN issue / power outage at the site, OR

    a reboot of the port-network, possibly due to too many sanity failures, OR

    bad health of the IPSI. Alarm Description:

    Alarm->MaintObject_SYS-LINK Understanding: System links are packet links that originate at the Packet Interface board and traverse various hardware components to specific endpoints. The hardware components involved on the forward and reverse routes can be different, depending upon the configuration and switch administration. Various types of links are defined by their endpoints: EAL, PRI, RSCL, RSL, MBL etc. The state of a system link is dependent on the state of the various hardware components that it travels over. Hence, when analyzing any system link problem, look for other active alarms present for corresponding hardware component. If so then follow the maintenance procedures for the alarmed components to clear those alarm first.

    Note: All the above links originate from the Pkt-Int, i.e. from an IPSI board, and terminate on the corresponding circuit packs. If no alarms other than the sys-link alarm are present for the hardware components listed above, execute the steps below to clear the alarm.

  • Procedure:

    almdisplay v /almdisplay res |more

    list sys-link (to identify the sys-link)

    status sys-link (check whether current path is up or down)

    test sys-link long clear (to clear the dead alarm and/or to identify any test that is failing)

    Probable Cause: The alarm may get reported due to:

    Lan Issue / Power outage at Customer site OR

    Bad health of any of the hardware component of the sys-link

  • Alarm Description: Alarm->MaintObject_TONE-BD Description: For IPSI-equipped EPNs, the TONE-BD MO consists of a module located on the IPSI circuit pack and provides tone generation, tone detection, call classification, clock generation, and synchronization. For non-IPSI EPNs, the TN2182B Tone-Clock circuit pack provides these functions. Note: Check for any other IPSI-related alarms; if present, follow the corresponding procedure to resolve those alarms. If there are no other alarms, follow the procedure below. Procedure:

    almdisplay v /almdisplay res |more

    test tone-clock

    display errors If alarm is still present, then inform Customer and with required permission proceed with following steps:

    busyout-release of the board, followed by a reseat and then, if required, replacement. Probable Cause: Malfunctioning of the Tone-Clock board. Alarm Description: Alarm->MaintObject_ETH-PT Description: The TN799DP Control LAN (C-LAN) circuit pack provides TCP/IP connections to adjunct applications such as CMS, INTUITY, and DCS Networking. The C-LAN circuit pack has one 100BASE-T Ethernet connection and up to 16 DS0 physical interfaces for PPP connections. The C-LAN also acts as a gatekeeper for IP endpoint registration. Procedure:

    almdisplay v /almdisplay res |more

    display port ( identify data-module/link number, say X)

    status link X/ status data-module X (verify the current status of the link/data-module ie it is in-service or not)

    get ethernet-options (check the Ethernet port settings; Avaya recommends having the Ethernet port at 100 Mbps Full Duplex with auto-negotiation off)

    test port (check whether all tests pass)

    display errors (to identify the cause of the alarm)

    test port long r 3 (to clear any warnings regarding the link integrity test)

    ping (check whether the server is able to ping the C-LAN board). If the alarm is active, inform the Customer and ask them to check and confirm the network integrity to the C-LAN Ethernet port; if the Customer replies that everything is fine, then follow the procedure below with the required permission:

    busyout port and then release port (for the C-LAN board, the 17th port is always the Ethernet port and the other 16 are PPP ports; the 32nd port is the RSCL link port)

    busyout board followed by reset board and then release board (If alarm is still active, try resetting the C-Lan board)

    Get the board re-seated, either with the help of the Customer or by sending a technician on-site. If the alarm still does not clear, try inserting the C-LAN board into some other slot. If the alarm then clears, replace the Carrier; otherwise replace the Circuit Pack.

    Note: While resetting an ip-interface board through the SAT prompt, first busyout the board and then disable the Ethernet interface using change ip-interface. After resetting the board, re-enable the interface and only then release the board (see the SAT sketch below). Probable Cause: The alarm may get reported due to:

    Lan Issue OR

    Bad health of any of the Clan board
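    The reset order described in the note above, written out as SAT commands, is sketched below; <board-location> is the C-LAN slot, and the change ip-interface steps are where the Ethernet side is disabled and re-enabled.

        busyout board <board-location>
        change ip-interface <board-location>     # set the interface to disabled before the reset
        reset board <board-location>
        change ip-interface <board-location>     # re-enable the interface after the reset
        release board <board-location>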

  • Alarm Description: Alarm->MaintObject_CLAN-BD Description: The TN799DP Control LAN (C-LAN) circuit pack provides TCP/IP connections to adjunct applications such as CMS, INTUITY, and DCS Networking. The C-LAN circuit pack has one 100BASE-T Ethernet connection and up to 16 DS0 physical interfaces for PPP connections. The C-LAN also acts as a gatekeeper for IP endpoint registration. Procedure:

    almdisplay v /almdisplay res |more

    test board (to check whether all test are getting passed)

    display port ( identify data-module/link number, say X)

    status link X/ status data-module X (verify the current status of the link/data-module ie it is in-service or not)

    get ethernet-options (check for Ethernet port settings-Avaya recommends to have Ethernet-port on 100Mbps-Full Duplex and Autonegotiation off)

    display errors (to identify the cause of the alarm)

    ping (check whether the server is able to ping the C-LAN board). If the alarm is active, inform the Customer and ask them to check and confirm the network integrity to the C-LAN Ethernet port; if the Customer replies that everything is fine, then follow the procedure below with the required permission:

    busyout board followed by reset board and then release board (ie If alarm is active, try resetting the C-Lan board)

    Get the board re-seated either with the help of Customer or else by sending a technician on-site.

  • If the alarm still does not clear, try inserting the C-LAN board into some other slot and check. If the alarm then clears, replace the Carrier; otherwise replace the Circuit Pack.

    Note: While resetting an ip-interface board through the SAT prompt, first busyout the board and then disable the Ethernet interface using change ip-interface. After resetting the board, re-enable the interface and only then release the board. Probable Cause: The alarm may get reported due to:

    Lan Issue OR

    Bad health of any of the Clan board OR

    Wrong configuration of the C-LAN board. Alarm Description: Alarm->MaintObject_IPMEDPRO Description: In an IP telephony solution, digital signal processing (DSP) resources are used for handling media streams. DSP resources inter-work audio between the media gateway's time division multiplex (TDM) bus and the IP network, as well as transcoding (i.e. converting one codec to another when needed). DSP resources are dynamically allocated on a call-by-call basis and are provided by the IP Media Processor (IPMEDPRO) circuit pack for solutions using an S8100, S8500 or S8700 Media Server with G600 or G650 Media Gateways (or traditional SCC1/MCC1 gateways). There are 2 types of IPMEDPRO circuit packs:

    the TN2302AP IP Media Processor and the TN2602AP IP Media Processor.

    The TN2302/TN2602 includes a 10/100 BaseT Ethernet interface to support IP audio for IP trunks and H.323 endpoints and also for adjuncts such as Voice Recording Logger.The IPMEDPRO circuit pack acts as a service circuit to terminate generic RTP streams used to carry packetized audio over an IP network. Procedure:

    almdisplay v /almdisplay res |more

    list configuration board (to identify the TN2302 / TN2602 circuit pack)

    display ip-interface (verify that the Ethernet port is enabled and is set to 100 Mbps speed, Full Duplex, with auto-negotiation disabled)

  • test board (check that all tests pass)

    display errors (check for any errors against the IPMEDPRO circuit pack)

    ping (check whether the server is able to ping the Medpro board). If the alarm is active, inform the Customer and ask them to check and confirm the network integrity to the Medpro's Ethernet port; if the Customer replies that everything is fine, then follow the procedure below with the required permission:

    busyout board followed by reset board and then release board (i.e. if the alarm is active, try resetting the Medpro board)

    Get the board re-seated, either with the help of the Customer or by sending a technician on-site. If the alarm still does not clear, try inserting the Medpro board into some other slot and check. If the alarm then clears, replace the Carrier; otherwise replace the Circuit Pack. Note: While resetting an ip-interface board through the SAT prompt, first busyout the board and then disable the Ethernet interface using change ip-interface. After resetting the board, re-enable the interface and only then release the board. Probable Cause: The alarm may get reported due to:

    Lan Issue OR

    Wrong Configuration of the Medpro board OR

    Bad health of any of the Medpro board. Alarm Description: Alarm->MaintObject_MEDPROPT_Location_X_OnBoard_Y Description: The Media Processor Port (MEDPROPT) MO monitors the health of the Media Processor (MEDPRO) digital signal processors (DSPs). This maintenance object resides on the TN2302/TN2602 Media Processor circuit packs which provide audio bearer channels for H.323 voice over IP calls. One TN2302AP has 8 MEDPROPTs; each TN2302 MEDPROPT has the processing capacity to handle 8 G.711 coded Channels, for a total of 64 channels per TN2302. The capacity provided by the TN2602 is controlled by the Avaya Communication Manager license file and may be set at either 80 G.711 channels or 320 G.711 channels. If individual DSPs on the TN2302AP or TN2602 fail, the board remains in-service at lower capacity. The MEDPROPT is a shared service circuit. It is shared between H.323 trunk channels and H.323 stations. An idle channel is allocated to an H.323 trunk/station on a call-by-call basis. Note: If any Medpro-board/TDM/Pkt-Int alarm is present along with Medpropt, follow corresponding procedure to proceed further or else follow below procedure Procedure:

    almdisplay v / almdisplay res |more

    status media-processor board (check whether the Ethernet and MPCL links are up and whether all DSP channels are in in-service/idle or busy state)

  • test port (check that all tests against the port pass)

    test board (check that all tests for the board pass)

    display errors (check for any errors against the Medpropt to identify the cause)

    test board long r 5 (to execute the test board command five times)

    busyout-release port (soft reset of Medpropt)

    Probable Cause: Bad health of the Medpro board Alarm Description: Alarm->MaintObject_VAL-PT Description: Alarm indicates that CM has sensed some fault in either playback or recording of an announcement through a particular port/board Note: If any Val-board/TDM/Pkt-Int alarm is present along with Val-Pt, follow corresponding procedure to proceed further or else follow below procedure Procedure:

    almdisplay v /almdisplay res |more

    test port (check that all tests pass for the port)

    test board (check that all tests pass for the board)

    display errors (check for errors against the board to identify the cause)

    busyout and release (i.e. soft reset of the VAL-PT). Note: If the error type is 1 and the firmware of the VAL board is 20, the board needs to be upgraded to firmware 21, because firmware 20 has a certain software limitation. Probable Cause: The alarm may get reported due to:

    Bad health of Val-port or

    usage of Val port has exceeded its threshold. Alarm Description: Alarm->MaintObject_VAL-BD Description: The Voice Announcements over the LAN (VAL) TN2501AP provides per-pack announcement storage time of up to one hour, up to 31 playback ports, and allows for announcement file portability over a LAN. The VAL circuit pack also allows for LAN backup and restore of announcement files and the use of user provided (.WAV) files.

  • Procedure:

    almdisplay v /almdisplay res |more

    display ip-interface (verify that the Ethernet port is enabled and is set to 100 Mbps speed, Full Duplex, with auto-negotiation disabled)

    test board (check that all tests pass)

    display errors (check for any errors against the Val board to identify cause )

    ping (check whether server is able to ping the Val-board)

    If the alarm is active, inform the Customer and ask them to check and confirm the network integrity to the VAL board's Ethernet port; if the Customer replies that everything is fine, then follow the procedure below with the required permission:

    busyout board followed by reset board and then release board (ie If alarm is active , try resetting the Val board)

    Get the board re-seated either with the help of Customer or else by sending a technician on-site. If still alarm doesnt clear off then try inserting Val board into some other slot and check. If alarm clears off, then replace the Carrier or else replace the Circuit Pack.

    Note 1: While resetting an ip-interface board through the SAT prompt, first busyout the board and then disable the Ethernet interface using change ip-interface. After resetting the board, re-enable the interface and only then release the board. Note 2: Before resetting the VAL board or getting it reseated, it is recommended to take a backup of the announcements present on the VAL board, because announcement files can sometimes be erased. This can be confirmed with the Customer, since they may have a scheduled VAL backup in place. In case we are required to take the announcements backup, follow the procedure below (a consolidated session sketch follows the procedure): VAL Backup Procedure:

    list directory board (Val board location) ( this command runs on Sat-prompt---here you get all the announcement files present in Val board)

    enable filexfer (this command needs to be run on the SAT prompt; here you define a login/password, set secure to no, and mention the VAL board location)

    sudo ftpserv on (this command needs to be run on the shell prompt; it turns on the FTP service so that we can FTP to the VAL board from the server)

    ftp

    bin (to get the binary version)

    hash (to get the file transfer status)

    prompt (to get more than one file with one command)

    mget .* (gets all the files from Val board to the server)
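    Put together, the announcement backup above is one SAT step plus one FTP session from the server shell. The filexfer login/password are whatever was defined on the enable filexfer form, <val-board-ip> is the VAL board's IP address, and the wildcard on mget is illustrative.

        # SAT: list directory board <val-board-location>   (note the announcement files)
        # SAT: enable filexfer                              (define a temporary login/password, secure = n)
        sudo ftpserv on                      # allow FTP from the server shell
        ftp <val-board-ip>                   # log in with the filexfer credentials
        #   bin                              # binary transfer mode
        #   hash                             # show transfer progress
        #   prompt                           # disable per-file prompting
        #   mget *                           # pull all announcement files to the server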

  • Probable Cause: The alarm may get reported due to:

    Lan issue OR

    Bad health of Val-board OR

    Wrong Configuration of the board Alarm Description: Alarm->MaintObject_SNI-BD _Location_X Alarm->MaintObject_SNI-PEER_Location_X Description: The SNI circuit pack reporting the error indicates that it has a problem with the control path, circuit path, or packet path to the SNI peer in the slot indicated. Procedure:

    almdisplay v / almdisplay res |more

    display errors (check for any errors against the board; in the case of an SNI-PEER alarm, the failed SNI board can be identified from the error type given in the table below)

    test board X (check for any test, if getting failed)

    status switch-node-clock (to identify the active SNC and standby SNC)

    set switch-node-clock (to make the standby SNC active, but only with the Customer's permission; if the alarm then clears, replace the board that is now the standby SNC, or else revert the action)

    If the alarm does not clear, then with the Customer's permission go ahead with a soft reset (i.e. busyout, reset, and then release the board), followed by a reseat of the board (i.e. removing the board from the slot and re-inserting it) and then, if the alarm still does not go off, replace the board.

    Probable Cause: Bad health of the SNI board / SNC board, or of the fiber link associated with the SNI-BD.

  • Alarm Description: Alarm->MaintObject_SN-CONF _Location_X Description: A switch node carrier contains: Up to 16 Switch Node Interface (SNI) TN573 circuit packs in slots 2 through 9 and slots 13 through 20 One or two Switch Node Clock (SNC) TN572 circuit packs in slots 10 and 12 An Expansion Interface (EI) TN570 circuit pack, a DS1 Converter (DS1C) TN574 circuit pack, or no circuit pack in slot 1 An optional DS1 converter circuit pack in slot 21 Procedure:

    almdisplay v / almdisplay res |more

    test board X (check whether any test fails)

    list fiber-link (to identify the fiber link and the other endpoint to which the board is connected)

    test fiber-link Y (check whether any test fails)

    display errors (check for any errors against the board)

    clear firmware-counters location (SNC firmware generates error reports independently of demand tests, so test board X does not affect the firmware error status; this command needs to be executed to clear any firmware-generated errors unconditionally.)

    Inform the customer and verify that the fiber-link physically connected to the SNI board, and the other endpoint of that fiber-link, are properly administered (a sketch of these checks follows this section)

    If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft reset (i.e. busyout, then reset, then release the board), followed by a reseat of the board (i.e. removing the board from the slot and re-inserting it); if the alarm still does not clear, replace the board. Probable Cause: SN-CONF errors and alarms are generated for the following types of failures:

    Failure of an SNI or SNC board OR

    A fiber-link that is administered on CM but has no physical connectivity between its two endpoints (i.e. between 2 SNIs, 2 EIs, an SNI & an EI, or a DS1C & an SNI/EI) OR

    Two endpoints that are physically connected but not administered in CM software.
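    A short SAT sketch of the SN-CONF checks described above, assuming fiber-link number 3 and an SNI in slot 02A02 (both values are hypothetical placeholders):

    list fiber-link (confirm which fiber-link number terminates on the alarmed board)
    display fiber-link 3 (verify that both administered endpoints match the physical cabling)
    test fiber-link 3 (look for failing tests)
    test board 02A02 (test the SNI endpoint itself)
    clear firmware-counters 02A02 (clear firmware-generated error reports unconditionally)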

    Alarm Description: Alarm->MaintObject_SNC-LINK _Location_X Alarm->MaintObject_SNC-BD _Location_X Alarm->MaintObject_SNC-REF _Location_X Description: The Switch Node Clock (SNC) TN572 circuit pack is part of the Center Stage Switch (CSS) configuration. It resides in a switch node carrier that, alone or with other switch nodes, makes up a CSS. In a high-reliability system (duplicated server and control network, unduplicated PNC), each SNC is duplicated such that there are two SNCs in each switch node carrier. In a critical-reliability system (duplicated server, control network, and PNC), each switch node is fully duplicated, and there is one SNC in each switch node carrier. SNCs are placed in slots 10 and 12 of the switch node carrier. These are the alarms associated with the SNC circuit pack:
    - The SNC-LINK MO reports errors in communications between the active Switch Node Clock and Switch Node Interfaces over the serial channel (Aux Data 1) and the TPN link (Aux Data 2).
    - The SNC-BD MO covers general SNC board errors and errors with the serial communication channel between the active and standby SNCs.

  • - The SNC-REF MO reports errors in SNI reference signals detected by the active Switch Node Clock. Note: If any alarm related to SNI-BD, SNI-PEER, fiber-link or DS1C-BD is present, then follow the corresponding repair procedures first. Procedure:

    almdisplay v / almdisplay res |more

    test board X (check whether all tests pass)

    display errors (check for any errors against the board)

    clear firmware-counters location (SNC firmware generates error reports independently of demand tests, so test board X does not affect the firmware error status; this command needs to be executed to clear any firmware-generated errors unconditionally.)

    If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft reset (i.e. busyout, then reset, then release the board), followed by a reseat of the board (i.e. removing the board from the slot and re-inserting it); if the alarm still does not clear, replace the board. Probable Cause: The alarm may get reported due to:

    Bad health of any of the hardware components mentioned in the above note OR

    Bad health of the SNC board OR

    Configuration issue for a new installation or some change activity at the customer site.

    Alarm Description: Alarm->MaintObject_EXP-INTF_Location_X Description: The TN570 or the TN776 Expansion Interface (EI) circuit pack provides a TDM- and packet-bus-to-fiber interface for the communication of signaling information, circuit-switched connections, and packet-switched connections between endpoints residing in separate PNs. EI circuit packs are connected via optical fiber links. Note: If any alarm related to the IPSI which is acting as archangel, the fiber-link, the TDM bus or Tone-Clk is present, then follow the corresponding repair procedures first to resolve the alarm. Procedure:

    almdisplay v /almdisplay res |more

    status cabinet X (to check status of connectivity of EPN)

    status synchronization (to confirm that there is no issue with synchronization)

  • display errors (check for errors to identify cause)

    test board (check whether all tests pass for the board)

    test board long r 3 (to clear the minor errors)

    If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft reset (i.e. busyout, then reset, then release the board), followed by a reseat of the board (i.e. removing the board from the slot and re-inserting it); if the alarm still does not clear, replace the board. Probable Cause: The alarm may get reported due to:

    Bad health of any of the hardware component mentioned in above note OR

    Bad health of the EI board OR

    Configuration issue for new installation or some change activity at the customer site.
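    As an illustration of the EXP-INTF checks above, assuming the EI board is in slot 03A01 and the EPN is cabinet 3 (both values are hypothetical placeholders):

    status cabinet 3 (confirm the EPN's connectivity state)
    status synchronization (confirm there is no synchronization problem)
    display errors (restrict the error report to board 03A01 to identify the error types)
    test board 03A01 long r 3 (long test repeated 3 times to clear minor errors)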

    Alarm Description:

    Alarm->MaintObject_EXP-PN_Location_PN X Description: The EXP-PN MO is responsible for the overall maintenance of an Expansion Port Network (EPN) and monitors cross-cabinet administration for compatible companding across the circuit-switched connection. The focus of EPN maintenance is on the EI or IPSI circuit pack that is acting as the Expansion Archangel link in an EPN. Note: If alarms involving the EI board, the IPSI board acting as Expansion Archangel, or any of the hardware involved with the CSS (such as SNI-BD, SNC-BD, DS1C-BD or the fiber link) are present, then these alarms need to be repaired first. Procedure:

    almdisplay v /almdisplay res |more

    status port-network X (check status of EPN )

    status sys-link (to identify which IPSI is controlling the EPN and to check whether any other alarm is present for the identified IPSI and/or for the EI circuit pack. If yes, follow the corresponding procedure to resolve the alarm)

  • display errors (to identify the cause of the alarm)

    grep sanity (check for sanity failures of the IPSI acting as archangel for the EPN)

    grep WARM (check for a Warm reboot of the Port-Network)

    grep COLD (check for a Cold reset of the Port-Network) If the alarm is still active and the corresponding IPSI and EI circuit packs are fine, inform the Customer about the alarm and with the required permission follow the procedure below:

    reset port-network X level 1 (ie perform Warm Restart of EPN)

    reset port-network X level 2 (i.e. perform a Cold Restart of the EPN) Probable Cause: The alarm may get reported due to:

    Bad health of any of the hardware component as stated in above Note. OR

    Network Issue/Power outage at the customer site OR

    Port Network had undergone a reboot OR

    Configuration issue due to a new installation or some change activity at the customer site

    Alarm Description: Alarm->MaintObject_FIBER-LK_Location_X Description: A fiber link consists of the endpoint boards that are connected via the optical fiber, the lightwave transceivers or metallic connections on the endpoint boards, and, if administered, the DS1 Converter (DS1C) complex that exists between the two fiber endpoints. The fiber endpoints are EI and/or SNI circuit packs. Fiber-link errors and alarms are generated only on fibers that have at least one SNI endpoint. Fiber errors for fibers that have EIs as both endpoints are detected by the EI circuit pack, thus generating off-board EXP-INTF errors and alarms. Fiber errors and alarms on EI-SNI fiber links generate FIBER-LK and/or off-board EXP-INTF errors and alarms. Note: If any active alarm is also present for either endpoint of the fiber-link, then follow the corresponding repair procedures first. Procedure:

    almdisplay v /almdisplay res |more

    status synchronization (to confirm that there is no issue with synchronization of the system)

    display fiber-link X (to identify the endpoints and check whether any alarm is present for either endpoint. If yes, follow the corresponding repair procedure to resolve that alarm first)

  • test fiber-link X (check whether all tests pass for the fiber-link)

    display errors (to identify the cause of the alarm)

    busyout and release the fiber-link (i.e. a soft reset of the fiber-link, with the Customer's permission)

    If the alarm is still active and the corresponding endpoints are fine, ask the Customer to check the physical connectivity, i.e. that the fiber-link is properly terminated onto the endpoints, and also to check the fiber cable to ensure there are no cuts on it. Probable Cause: The alarm may get reported due to:

    Bad health of end-points that are connected through fiber-link (ie either SNI/EI/DS1C board as per the solution deployed at customer site)

    Physical Connectivity issue (ie either fiber-link is not properly terminated onto the end-points or else fiber-link is broken in between.)

    Configuration issue for a new installation or due to some change activity at the customer site.

    Alarm Description: Alarm->MaintObject_DS1C-BD_Location_X Description: The DS1 converter complex is part of the port-network connectivity (PNC), consisting of two TN574 DS1 Converter or two TN1654 DS1 Converter circuit packs connected by one to four DS1 facilities. It is used to extend the range of the 32-Mbps fiber links that connect PNs to the Center Stage Switch, allowing PNs to be located at remote sites. The DS1 converter complex can extend a fiber link between two EIs or between a PN EI and an SNI. Fiber links between two SNIs or between a PN and the Center Stage Switch (CSS) cannot be extended. Note: If SYNC, TDM-CLK, SNC-BD, SNI-BD, Fiber-Lk or DS1-FAC alarms are present, then follow the corresponding repair procedures first. If only a DS1C-BD alarm is present, follow the procedure below. Procedure: A. If alarm is off board

    almdisplay v /almdisplay res |more

    display errors (to identify the cause of alarm ie either TDM-Clk/SYNC/SNC-BD or fiber-link or ds1 facility alarms)

    If the errors are associated with the ds1-facility, then follow steps 3 & 9 (with the customer's permission)

    busyout & release ds1-facility (if errors associated with the DS1 facility are present, perform a soft reset of the ds1-facility)

    If the errors are associated with synchronization or the fiber-link, then follow steps 4 & 9

    status synchronization (to check the synchronization status; the fiber-link could be the source of a synchronization issue in an EPN)

    list fiber-link (to identify corresponding fiber-link no and also the extreme end-points of that fiber-link.)

    test fiber-link X (check for any failing test; if a test fails, follow the corresponding repair procedure)

    test board (check for any failing test)

    If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft reset (i.e. busyout, then reset, then release the board), followed by a reseat of the board (i.e. removing the board from the slot and re-inserting it); if the alarm still does not clear, replace the board.
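    For reference, a sketch of the off-board isolation checks above, assuming the DS1 converter is in slot 04A01 and the associated fiber-link is number 2 (both values are hypothetical placeholders):

    display errors (see whether the errors point at SYNC, the fiber-link or a DS1 facility)
    status synchronization (check the synchronization source for the EPN)
    list fiber-link (find the fiber-link number and its far-end endpoint)
    test fiber-link 2 (run the fiber-link tests; follow the repair procedure for any failure)
    test board 04A01 (test the DS1 converter board itself)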

    Probable Cause: The alarm may get reported due to:

    A synchronization issue, or malfunctioning of the TDM-Clk or fiber-link for an EPN OR

    Issue has been encountered with ds1-facility provided by DS1C board OR

    Bad health of DS1C board OR

    Configuration Issue for a new installation or due to some change activity at the customer site.

  • B. If alarm is on board

    almdisplay v /almdisplay res |more

    test board (check for any test, if failing)

    display errors (to identify cause)

    If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft reset (i.e. busyout, then reset, then release the board), followed by a reseat of the board (i.e. removing the board from the slot and re-inserting it); if the alarm still does not clear, replace the board.

    Probable Cause: The alarm may get reported due to:

    Bad health of the DS1C board OR Configuration issue for a new installation or due to some change activity at the customer site.

    Alarm Description: Alarm->MaintObject_TDM-BUS_Location_PN X Description: Each port network has a pair of TDM buses, designated TDM bus A and TDM bus B, each with 256 time slots. This division allows for duplication of control channels and dedicated tone time slots. The first five time slots on each bus are reserved for the control channel, which is active on only one bus at a time in each port network. The next 17 time slots are reserved for system tones such as dial tone, busy tone and so on. As with the control channel, these time slots are active on only one bus, A or B, at a time. The rest of the time slots on each bus are for general system use, such as carrying call-associated voice data. The 17 dedicated tone time slots that are inactive can also be used for call processing when every other available time slot is in use. When the system initializes, the control channel is on TDM bus A and the dedicated tones are on TDM bus B in each port network. If a failure occurs on one of the two buses, the system will switch any control, tone and traffic channels to the other bus. Service will still be provided, though at a reduced capacity. TDM-bus faults are usually caused by one of the following:

    A defective circuit pack connected to the backplane

    Bent pins on the backplane

    Defective bus cables or terminators

    Procedure:

    almdisplay v /almdisplay res |more

    status port-network X

    test tdm port-network X (to check that all tests pass for the TDM bus in the port network)

  • disp errors (to identify the cause of the issue)

    If the alarm is active, follow the procedure below to isolate and detect the TDM-bus fault. Always inform the Customer about the issue and the plan of action stated below before proceeding further, and always have an SFM or TM involved in the execution:

    Step 1: Check for any active alarms for the Tone-Clock/Detectors board, the Expansion Interface (EI) board and the Packet Interface (IPSI) board, or for any other TN circuit pack. Follow the corresponding procedure to resolve the respective alarms and then check the TDM-Bus alarm; if it has cleared, close the case.

    Step 2: If no active alarm is present for any Tone board, EI board, IPSI board or any other circuit pack, then

    a) If a duplicated circuit pack is present, switch the standby circuit pack to active and check the alarm. If the alarm is resolved, remove the now-standby circuit pack and check the backplane pins. If they are bent, switch off the power to this Carrier, straighten or replace the pins, re-insert the circuit pack and restore the power.

    b) Try removing all the circuit packs in the Port-Network one by one, depending upon the criticality of the function of each circuit pack. This means the IPSI/EI board should be removed last and the Tone-Clock board second to last (removing these circuit packs will result in disconnection of the corresponding Port-Network).

    c) When any circuit pack is removed, determine whether the backplane pins in the slot appear to be bent. If yes, switch off the power to this Carrier, straighten or replace the pins, then re-insert the circuit pack and restore the power. If the backplane pins are not bent, re-insert the circuit pack.

    d) If all the circuit packs have been checked as mentioned above and the alarm is still active, try replacing the TDM cable assemblies and TDM bus terminators, and then, if required, replace the carrier itself. Probable Cause: The alarm may get reported due to:

    When control of system tones is switched from one bus to the other OR

    Bad health of the Circuit-Pack providing Tone-Clock functions OR

    Physical connectivity issue, i.e. TDM cable assemblies, TDM bus terminators or the backplane pins which connect to the Circuit-Pack inside the slot.
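    For reference, the TDM-bus checks above as a short SAT sequence, assuming the affected port network is number 2 (a hypothetical value):

    status port-network 2 (confirm which TDM bus currently carries the control channel and tones)
    test tdm port-network 2 (run the TDM bus tests for that port network)
    display errors (restrict the report to TDM-BUS to see the error types and aux data)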

    Alarm Description: Alarm->MaintObject_POW-SUP_Location_X Description: This MO verifies the physical presence of each power supply in a G650 and that its output voltage is within tolerance. Procedure:

    almdisplay v / almdisplay res | more

    test board (check for any failing test)

    status environment (check the environment of the cabinet)

  • test environment (check whether all tests pass)

    display errors (and select the board) (check for errors to find the cause of the alarm)

    Inform Customer and with required permission go ahead with soft reset followed by reseat of the board and then, if required, replace it.
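    A minimal SAT sketch of the checks and the soft reset, assuming the alarmed power supply is reported at location 02A00 (a hypothetical placeholder):

    status environment (all entries for the cabinet should read OK)
    test board 02A00 (check for failing tests against the supply)
    busyout board 02A00 (only with the Customer's permission)
    reset board 02A00
    release board 02A00
    test board 02A00 (confirm the tests pass and the alarm clears)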

    Probable Cause: Bad health of the Power Supply board OR a problem with the power being delivered to the board.

    Alarm Description: Alarm->MaintObject_M/T-BD / MT-ANL/ M/T-DIG/ M/T-PKT Description: The Maintenance/Test circuit pack (TN771D) supports packet-bus fault detection and bus reconfiguration for the port network where it is installed. The circuit pack also provides Analog Trunk testing and data loop-back testing of DCP Mode 2 endpoints and Digital (ISDN) Trunk facilities via the TDM bus. Port 1 of the Maintenance/Test board is the Analog Test port, which provides the Analog Trunk testing function for the Automatic Transmission Measurement System (ATMS); M/T-ANL maintenance ensures that the analog trunk testing function is operating correctly. Ports 2 and 3 are the digital ports, which provide the digital (ISDN) trunk-testing functions; M/T-DIG maintenance ensures that the digital trunk testing function is operating correctly. Port 4 is the packet port, which provides the packet-bus maintenance functions: packet-bus fault detection & packet-bus re-configuration. Procedure:

    almdisplay v /almdisplay res |more

    test board (check for any test, if failing)

    display errors

    test board long

    If the alarm does not clear, then inform the Customer and with the required permission go ahead with a soft reset (i.e. busyout, then reset, then release the board), followed by a reseat of the board (i.e. removing the board from the slot and re-inserting it); if the alarm still does not clear, replace the board.

    Probable Cause: Bad health of Maintenance port/ board.

  • Alarm Description:

    Alarm->MaintObject_PS-RGEN_Location_X Alarm->MaintObject_RING-GEN_Location_X Understanding: The PS-RGEN maintenance object monitors the ringing voltage of each 655A power supply. The TN2312BP IPSI uses the ring detection circuit on the 655A to monitor ring voltage for the G650. Failure of the ring generator results in loss of ringing on analog phones. Ringing on digital and hybrid phones is not affected. Procedure:

    almdisplay v / almdisplay res | more

    test board (check for any failing test)

    status environment (check the environment of the cabinet; all entries in the result should read OK)

    test environment (check whether all tests pass)

    display errors (and select the board) (check for errors to find the cause of the alarm)

    Inform the Customer and with the required permission go ahead with a soft reset followed by a reseat of the board and then, if required, replace it.

    Probable Cause: Bad health of the Power Supply board OR a problem with the power being delivered to the board

  • Alarm Description:

    Alarm->MaintObject_NR-CONN Understanding: The Network-Region Connect (NR-CONN) MO monitors VoIP connectivity between network regions by running a
    test between IP endpoints in separate network regions. It raises a Minor alarm for multiple failures: once a single failure is detected, Test #1417 is re-executed between different IP endpoints in the same pair of network regions. Procedure:

    almdisplay v /almdisplay res |more

    display failed-ip-network-region (check which ip-network-regions are alarmed; in a healthy system no IP network regions are listed)

    ping ip-address A board B (where A is the ip-address of an endpoint in one NR and B is an ip board in the other NR; these endpoints need to be from the alarmed ip-network-regions)

    status ip-network-region X

    test failed-ip-network-region X (to clear the alarms and/or to check whether all tests pass for the ip-network-region)

    display errors (to identify the cause of the alarms)

    display ip-network-map (to identify and confirm an entry, as required, against the failed ip-network-region, because values may get modified here after any change/update activity)
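    An illustrative SAT sequence for the NR-CONN checks, assuming regions 1 and 3 are the alarmed pair, 01A03 is an IP board in region 1, and 10.1.3.20 is an endpoint address in region 3 (all values are hypothetical placeholders):

    display failed-ip-network-region (list the region pairs that are currently alarmed)
    status ip-network-region 1 (check the state of one of the alarmed regions)
    ping ip-address 10.1.3.20 board 01A03 (verify connectivity from region 1 toward region 3)
    test failed-ip-network-region 1 (re-run the test and clear the alarm once connectivity is restored)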

    Probable Cause: The alarm may get reported due to:

    A network issue between the two network regions OR

    A configuration issue on the ip-network-region / ip-network-map forms

  • Survivable Processor Alarms Alarm Description: Alarm->MaintObject_LIC-ERR_Location__OnBoard_

    Description: The license is either missing or has become corrupted, or the alarm is on an LSP/ESS that is controlling one of the Media Gateways/Port Networks. Procedure: If the license is either missing or corrupted

    almdisplay v / almdisplay res |more

    statuslicense -v (check whether the license is corrupted, missing or normal)

    Download a license copy from https://rfa.avaya.com onto your laptop (for CM 5.2)

    For CM versions later than CM 5.2, download the license from PLDS.

    stage it onto the server through the SIG command line. On SIG:

    ssh init@:/var/home/ftp/pub

    loadlicense (on the shell prompt of the server; you will have to wait up to 30 minutes until the license comes to Normal status)
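    A sketch of the shell-side license check, assuming the license file has already been staged under /var/home/ftp/pub as described above:

    statuslicense -v (shows the current license mode)
    loadlicense (installs the staged license file)
    statuslicense -v (re-check; the mode should return to Normal after the install completes)
    almdisplay -v (confirm that the LIC-ERR alarm clears)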

    Installing the Certificate through Citrix:

    Open the Drop Box option from the Citrix Web page.

    Now, open the folder and then drag and drop the downloaded license file into this location.

  • Now, open the Desktop folder and then drag and drop the License file to the Notes folder under FTP od-va.w.ag.60\TSH (this may be different for some sites).

    Now, log in to the WebLM with the default user name and password (admin/admin01), or check with the customer if it has been changed. Then click on the Install License option and select Browse.

  • Now, select the License file which was earlier saved on the FTP server.

    Now, click on Install. This will install the required License on the CM.

    Now, click on Communication Manager option and see the new license details.

    Note: If you receive a conflict error with the old existing Certificate, try uninstalling the old License via the Uninstall License option and install the new one. Probable Cause: The alarm could have been reported because: after an update/upgrade the license file was either not installed or a corrupt license file was installed, OR the license file got corrupted due to bad health of the server. Solution: LSP/ESS has become active

    almdisplay v / almdisplay res |more

    list survivable-processor (check whether the LSP/ESS is registered with the main server)

    list media-gateway (to check whether any Media Gateway is not registered to the main server and is registered to the LSP instead)

  • ping

    traceroute

    Check with the customer for further updates on network issues/power outages, if any, at the site. Probable Cause: The alarm could have been reported due to:

    Lan Issue /Power outage at the site OR

    Main server is down and hence PNs and/or MGs got registered to the ESS or an LSP.

    Alarm Description: Alarm->MaintObject_ESS_Location_CL 000_OnBoard_N Description: One or more IPSIs are not pingable from the ESS server, or the ESS server is not able to detect the serial number of an IPSI. Procedure:

    almdisplay v / almdisplay res |more

    pingall -i (check whether all IPSIs are pingable)

    cd /var/log/ecs

    grep -R sanity

    serialnumber (check whether serialnumber of all ipsis are being detected by the server)

  • netstat -v |grep "5010" (check whether tcp link is established between server and ipsi)

    ipsiversion -a (check the firmware version of the IPSIs and its compatibility with the CM load on the server)

    Probable Cause: The common cause is that the ESS is either not able to ping one of the IPSIs or not able to detect the serial number of an IPSI, possibly because of:

    LAN issue OR

    IPSI firmware mismatch with the CM release OR

    Bad health of an IPSI

    Alarm Description: Alarm->MaintObject_ESS_EventID_X (where X = 1, 2, 3 or 4) Description:

    Procedure:

    almdisplay v / almdisplay res |more

    status ess port-networks (check whether any port-network is being controlled by an ESS)

    cat /etc/ecs.conf (get the ip-address of the main server)

    pingall -i (check whether all IPSIs are pingable from the main server)

    traceroute (traceroute, from the main server, only those IPSIs which are not pingable)

    get forced-takeover ipserver-interface port-network X (if the IPSI is pingable from the main server but is being controlled by the ESS, then with the customer's permission force control of the IPSI back to the main server)

  • cd /var/log/ecs

    grep -R sanity (If IPSI is pingable and being controlled by Main Server, then check for sanity failures if any which could be the cause for the alarm)

    If the IPSI is not pingable, check with the customer for any network issue. Note: For EventID_3/EventID_4, either the IPSI controlling the EI link of the EPN is registered to the ESS server or there could be some issue with the fiber link. In case no issue has been found with the IPSI, check for fiber-link issues and continue with the steps below; these steps are only to be followed for EventID_3 or 4.

    cat /var/log/messages |more (to check for any fiber-link issues trace)

    list fiber-link (get the details of fiber-links)

    test fiber-link X (check for any test, if getting failed)

    status sys-link (check which IPSI is controlling the alarmed EPN)

    list ipserver-interface (check for any errors on ipsi ie CPEG and ip-address of ipsi. )

    display errors

    restartcause
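    A compact sketch of the main-server checks above, assuming port network 4 is the one being controlled by an ESS and 10.2.4.11 is its IPSI address (both values are hypothetical placeholders):

    status ess port-networks (see which server is controlling each port network)
    pingall -i (ping every administered IPSI from the main server)
    traceroute 10.2.4.11 (trace toward an IPSI that does not answer the ping)
    get forced-takeover ipserver-interface port-network 4 (with the customer's permission, force control of PN 4 back to the main server)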

    Probable Cause: LAN issue OR bad health of the main server OR an issue with the physical connections of the fiber link (only for Event-Id 3 & 4)

    Alarm Description: Alarm->MaintObject_ESS_EventID_5 Alarm->MaintObject_ESS_EventID_6 Description: Enterprise Survivable Server cluster not registered, i.e. the ESS is not registered to the main server (EventID_5), or it has registered back to the main server (EventID_6). Procedure:

    almdisplay v / almdisplay res |more

    status ess cluster

    list survivable-processor

    cd /var/log

    grep -R register messages (to check which cluster is/was not registered)

  • ping (ping LSP from main server)

    traceroute (if the LSP is not pingable, traceroute to check at which hop the ping fails)

    Inform Customer to check the network issue, if any at the site

    Restartcause
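    For the registration check itself, a short sketch from the main server, where 10.2.0.50 stands in for the ESS/LSP address (a hypothetical value):

    status ess cluster (show the registration state of the ESS clusters)
    list survivable-processor (confirm the ESS/LSP entry and whether it is registered)
    ping 10.2.0.50 (verify basic reachability of the survivable server)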

    Probable Cause: The alarm could have been reported because of:

    Lan Issue / Power outage at site OR

    ESS server is down due to its bad health.

    Adjuncts Associated Alarms Alarm->MaintObject_PRI-CDR Description: The CDR feature records detailed call information about every incoming and outgoing call on specified trunk groups and sends this information to a CDR output device. Two physical links can be administered for connecting external CDR output devices to the system; they are identified as the primary CDR (PRI-CDR) link and the secondary CDR (SEC-CDR) link. Procedure:

    almdisplay v / almdisplay res |more

    status cdr-link

    display ip-services (to find the node-name used for CDR and associated Clan board)

    display node-names ip (to get the ip-addresses of the CDR and the CLAN board)

  • test board (check for any failing test and/or any active alarm for the CLAN board. If yes, proceed further with the investigation on the CLAN board, as discussed in the respective section)

    ping board (Check any issues with the lan connectivity)

    display errors (to identify the cause) Probable Cause: The alarm may get reported due to:

    Network issue between the CDR device and the CLAN board it is connected to OR

    Bad health of Clan / CDR OR

    Scheduled activity at the Customer site

    Alarm Description: Alarm->MaintObject_ASAI-PT/BD_Location_X Description: ASAI-PT corresponds to fault detection of a port which is connected to an adjunct that is not of Avaya make. Procedure:

    almdisplay v / almdisplay res |more

    display port (check the CTI-Link number)

    test board X (X is the board location)

    test cti-link Y (Y is cti-link number)

    display errors (to identify the cause)

    busyout/release cti-link Y / board X (reset the link and/or the board to which that Adjunct is connected; do inform the customer about the same)

    If alarm is still active, need to check with customer for the functioning of the Adjunct.
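    A minimal SAT sketch of this reset, assuming CTI link 1 rides on the board in slot 01A04 (both values are hypothetical placeholders):

    test cti-link 1 (check whether the link tests pass)
    busyout cti-link 1 (with the Customer's permission)
    release cti-link 1
    test board 01A04 (verify the CLAN/MAPD board carrying the link)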

    Probable Cause: The alarm may get reported due to:

    Malfunctioning of Adjunct OR

    Bad health of the CLAN board/MAPD board to which that Adjunct is connected OR

    Network Issue /Physical connectivity issue of the link