Alarm and Event Audit and Correlation
Root cause analysis (RCA) is key to incident management. RCA can be accelerated by following a steady alarm and event analysis practice. The practice can be developed through the following basic exercises:
- Rigorously study the alarms and alarm groupings for all the network elements (NEs). The result should be a key alarm summary list for every product.
- Introduce a periodic collection of alarm history, and then monitor it actively through a dashboard.
- Use the dashboard for history trending of alarms by market and by functional group. As an example, the two functional groups of SIP-related and PSTN-related alarms should each have their own visual monitoring group.
- In a VoIP network, the voice switch (or its corresponding EMS) sees all the protocol anomalies and most of the anomalies reported by the NEs, so it can be a good starting point for creating an alarm summary report.
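The collection and trending steps in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`market`, `group`, `explanation`) and the function names are hypothetical, and a real alarm history export from a switch or EMS will have its own fields and format.

```python
from collections import Counter

# Hypothetical alarm-history records. The field names are assumptions made
# for this sketch, not the export format of any particular switch or EMS.
alarm_history = [
    {"market": "east", "group": "SIGNALING", "explanation": "Trunk remotely blocked"},
    {"market": "east", "group": "SIGNALING", "explanation": "Trunk remotely blocked"},
    {"market": "west", "group": "SIGNALING", "explanation": "Trunk locally blocked"},
    {"market": "east", "group": "CALLP", "explanation": "Call Failure"},
    {"market": "west", "group": "SIGNALING", "explanation": "Trunk remotely blocked"},
]


def summarize_alarms(records):
    """Return (count, group, explanation) rows sorted by ascending count."""
    counts = Counter((r["group"], r["explanation"]) for r in records)
    return sorted(
        (count, group, explanation)
        for (group, explanation), count in counts.items()
    )


def count_by_market_and_group(records):
    """Count alarms per (market, functional group) for dashboard trending."""
    return Counter((r["market"], r["group"]) for r in records)


if __name__ == "__main__":
    for count, group, explanation in summarize_alarms(alarm_history):
        print(f"{count:>6}  {group:<12} {explanation}")
    for (market, group), count in sorted(count_by_market_and_group(alarm_history).items()):
        print(f"{market:<6} {group:<12} {count}")
```

Running the same summary on each periodic collection, and storing the per-market counts, would give the dashboard the data points it needs for trending.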
Figure 6-5 shows a summary table of alarms for a particular customer site. The table was generated from the alarm history report for a Cisco BTS 10200. The same table is generated periodically for many customer sites, which makes it possible to identify both the hot spots and the infrequent anomalies.
The figure shows the general group buckets and the detailed breakdown of the alarm frequency within the groups.
The most prevalent alarm buckets surface in the report. The periodic report can be used first to identify the problem, and the results can then be fed into the systematic RCA. In the case of Table 6-8, the signaling alarms clearly have the highest counts. After the report was analyzed, alarms such as “Trunk remotely blocked” were ignored in this case, because a large number of trunks were being turned up and were therefore down. The other alarms in the same category, and those in other categories, still need to be examined to root out system issues.
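As a rough illustration of this analysis step, the following Python sketch totals the alarm counts per group and optionally leaves out alarms that have a known, benign cause. The six rows are copied from the summary table; the function name `rank_groups` and the `EXPECTED` set are hypothetical names introduced only for this example.

```python
from collections import defaultdict

# A few rows copied from the summary table: (count, group, explanation).
summary_rows = [
    (5495, "SIGNALING", "Trunk remotely blocked"),
    (5257, "SIGNALING", "Trunk locally blocked"),
    (1391, "SIGNALING", "General MGCP Signaling Error between MGW and CA."),
    (1267, "BILLING", "Message content error"),
    (1166, "CALLP", "Call Failure"),
    (803, "CALLP", "Invalid Call"),
]

# Alarms with a known cause in this reporting period (here, trunk turn-up).
# Which alarms belong in this set is an operational judgment, not a fixed rule.
EXPECTED = {"Trunk remotely blocked"}


def rank_groups(rows, expected=frozenset()):
    """Total the counts per alarm group, skipping expected alarms,
    and return (group, total) pairs sorted from highest to lowest."""
    totals = defaultdict(int)
    for count, group, explanation in rows:
        if explanation not in expected:
            totals[group] += count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(rank_groups(summary_rows))
    print(rank_groups(summary_rows, EXPECTED))
```

For these six rows, the signaling group stays at the top of the ranking even after the expected alarm is excluded, which matches the point that the remaining alarms in the category still deserve attention.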
Summary Alarm Table from CISCO Softswitch
| Alarm Count | Alarm Group Type | Alarm Explanation |
|---|---|---|
| 1 | CALLP | Country Code Dialing Plan Error |
| 1 | DATABASE | EMS database alert.log alerts. |
| 1 | OSS | SNMP Authentication error |
| 1 | SIGNALING | SS7 Message Decoding Failure |
| 1 | SIGNALING | Unanswered REL |
| 2 | BILLING | FTP/SFTP transfer failed |
| 2 | DATABASE | Daily database backup completed successfully |
| 2 | SIGNALING | AGGR Connection Down |
| 2 | SIGNALING | Feature Server is not up or is not responding to Call Agent |
| 2 | SIGNALING | Continuity Recheck Successful |
| 3 | SIGNALING | AGGR Gate Set Failed |
| 3 | SIGNALING | Continuity Recheck is performed on specified CIC |
| 6 | SIGNALING | RLC received in response to RSC message on the specified CIC |
| 10 | AUDIT | Start or Stop of SS7-CIC audit |
| 11 | CALLP | No Route Available for Carrier Dialed |
| 14 | DATABASE | There are errors in EMS database DefError queue |
| 27 | MAINTENANCE | Admin State Change Failure |
| 56 | MAINTENANCE | Admin State Change Successful with Warning |
| 76 | AUDIT | Call exceeds a long-duration threshold |
| 112 | SIGNALING | Continuity Recheck Failed |
| 215 | SIGNALING | Media gateway/termination down |
| 379 | MAINTENANCE | Admin State Change |
| 469 | SIGNALING | Timeout on Remote Instance |
| 803 | CALLP | Invalid Call |
| 981 | SIGNALING | Unexpected Message for the Call State is received : Clear Ca |
| 1166 | CALLP | Call Failure |
| 1194 | SIGNALING | COT message received on the specified CIC |
| 1267 | BILLING | Message content error |
| 1391 | SIGNALING | General MGCP Signaling Error between MGW and CA. |
| 5257 | SIGNALING | Trunk locally blocked |
| 5495 | SIGNALING | Trunk remotely blocked |
The benefit of performing these audits periodically, and analyzing them through a dashboard at the same time, is clear: the trouble segments surface and can then be prioritized for resolution.


