VoIP Performance Management and Optimization: Managing VoIP Networks

Chapter Description

To ensure voice quality and to optimize media delivery over IP, it is crucial to properly plan, design, implement, and manage the underlying network. This chapter discusses best practices for planning media deployment over IP networks.

How to effectively poll the Network

To restate, a VOIP network is a large and complicated solution that encompasses many integrated technologies. This presents a problem for VOIP infrastructure managers, since each technology brings its own network management challenges. A device polling strategy needs to be implemented for the network such that all VOIP functional segments have coverage.

Polling involves tapping into existing device mechanisms, such as the CMS, to perform device audits for VOIP segments. In practice, custom probes need to be implemented through scripts to achieve full polling coverage. A best practice is to dedicate an in-house Linux or UNIX based system to developing these probes. Open source tools like Nagios accommodate these types of probes very easily.

The following sections describe processes that should be in place to allow a VOIP service provider to effectively manage its network. The key fact is that the development and deployment of specialized tracking systems require up-front cost and man-hours. This work can be carried out with the assistance of vendor services groups. The impact is immediate, and the return on investment (ROI) is typically high.

Polling Strategy

VOIP segments are driven by certain protocols, and all of these protocols ride over IP. An organized, layered approach therefore needs to be developed to effectively poll the VOIP network for key information. A layered approach implies that polling is done in such a way that all protocols riding over IP are covered. Figure 6-1 reflects this concept: base connectivity to the network components is tested at the IP layer, and the next set of layers covers the various segments and functional components. All the VOIP related segments need to be polled.

Base connectivity is verified through an IP ping. The entire network should be mapped for this basic connectivity, which allows a knowledge base of the IP layer to be created. If ping is disabled, custom TCP probes should be developed to create this map. There are also open source alternatives to the classical ping based probes, such as echoping, which do not use ICMP_ECHO_REQUEST or ECHO_REPLY packets but can communicate using other protocols like HTTP.
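The custom TCP probe mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a production probe: the function names `tcp_probe` and `map_reachability`, the two-second timeout, and the idea of probing a management port such as SSH are assumptions made for the example.

```python
import socket

def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def map_reachability(targets, timeout=2.0):
    """Probe each (name, host, port) tuple and return {name: reachable}."""
    return {name: tcp_probe(host, port, timeout) for name, host, port in targets}
```

Calling `map_reachability` with a list such as `[("edge-router", "192.0.2.1", 22)]` (a documentation address used purely as a placeholder) yields a dictionary that can seed the IP layer knowledge base.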

SNMP connectivity needs to be verified. At a minimum, the trap functionality needs to be verified periodically. Traps are critical because they map to the alarms and events being generated by the NEs. Each NE should be configured to send an informational trap to the SNMP manager on a periodic basis. This SNMP connectivity validation ensures that the alarm and key event stream is flowing.
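On the manager side, the periodic informational trap can be treated as a heartbeat. The sketch below only shows the bookkeeping; it assumes that some trap receiver (not shown) calls `record_heartbeat` whenever the informational trap arrives. The function names and the use of plain numeric timestamps in seconds are illustrative choices.

```python
def record_heartbeat(last_seen, ne_name, timestamp):
    """Store the arrival time (in seconds) of the latest heartbeat trap from an NE."""
    last_seen[ne_name] = timestamp

def stale_elements(last_seen, now, max_age_seconds):
    """Return the sorted names of NEs whose last heartbeat is older than max_age_seconds."""
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_age_seconds
    )
```

An NE that appears in the result of `stale_elements` has stopped delivering traps, which suggests that its alarm stream may also be broken.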

MGCP polling needs to be in place. Its scope depends on the VOIP deployment architecture; in a cable environment it covers several segments (PSTN, CPE/MTA, Announcements). A periodic audit-endpoint operation can be performed to validate connectivity to the MGCP based devices.

SIP polling needs to be in place to make sure SIP supporting devices are functional. The devices could be SIP based Voice Mail servers or Session Border Controllers (SBC) that terminate SIP trunks. Custom probes can be used to periodically test the SIP functionality to these devices. A custom SIP probe example is included in Appendix A.
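As a rough illustration of what such a probe can look like (this is not the Appendix A example), the sketch below sends a SIP OPTIONS request over UDP and reads the status code of the reply. The user part `probe`, the default port 5060, the timeout, and the function names are assumptions; a real probe would also need to handle authentication, TCP or TLS transports, and retransmissions.

```python
import socket
import uuid

def build_options(target_host, local_host, local_port, target_port=5060):
    """Build a minimal SIP OPTIONS request as bytes."""
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]  # RFC 3261 branch magic cookie prefix
    lines = [
        f"OPTIONS sip:{target_host}:{target_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}:{target_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        f"Contact: <sip:probe@{local_host}:{local_port}>",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def parse_status_code(response):
    """Return the numeric status code of a SIP response, or None if it is not one."""
    first_line = response.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = first_line.split(" ", 2)
    if len(parts) >= 2 and parts[0].startswith("SIP/") and parts[1].isdigit():
        return int(parts[1])
    return None

def sip_probe(target_host, target_port=5060, timeout=2.0):
    """Send OPTIONS over UDP; return the status code, or None on timeout or error."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.connect((target_host, target_port))
            local_host, local_port = sock.getsockname()
            sock.send(build_options(target_host, local_host, local_port, target_port))
            return parse_status_code(sock.recv(4096))
    except OSError:
        return None
```

Any final response (for example 200) shows that the SIP stack on the device is answering; `None` indicates that the device did not reply within the timeout.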

A periodic test of the SS7 link needs to be performed, along with verifying that the ISUP functionality is up. ISUP is used to backhaul the SS7 information to the CMS. ISUP KPI tracking is covered in Chapter 7. The reported alarms can also be used to track SS7 and ISUP functionality.

Management connectivity needs to be verified to make sure the devices can always be reached through Telnet or SSH. In some cases, out-of-band connectivity through modem dial-up needs to be verified as well.

Periodic polling of DNS and DHCP functionality needs to be in place. All the DNS servers, primary and secondary, need to be polled with DNS queries. Similarly, the DHCP functionality needs to be verified on a periodic basis. This can be done by tracking the DHCP statistics (Discovers sent, leases granted) and creating a visual dashboard for the stats.
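A DNS poll aimed at one specific server can be written with only the Python standard library by building the query packet by hand. The following is a minimal sketch under several assumptions: it sends a single A-record query over UDP port 53, inspects only the response header, and treats "matching ID, response code 0, at least one answer" as success. The function names are illustrative.

```python
import random
import socket
import struct

def build_query(query_id, hostname):
    """Build a DNS query for the A record of hostname (recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN

def parse_header(response):
    """Return (query_id, rcode, answer_count) from a DNS response header."""
    query_id, flags, _qd, answers, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return query_id, flags & 0x000F, answers

def dns_probe(server, hostname, timeout=2.0):
    """Return True if server answers the query for hostname without error."""
    query_id = random.randint(0, 0xFFFF)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_query(query_id, hostname), (server, 53))
            response, _ = sock.recvfrom(512)
    except OSError:
        return False
    if len(response) < 12:
        return False
    reply_id, rcode, answers = parse_header(response)
    return reply_id == query_id and rcode == 0 and answers > 0
```

Running `dns_probe` against every primary and secondary server in turn, rather than relying on the resolver of the polling host, makes it possible to tell which individual server has stopped answering.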

In summary, the idea behind this approach is twofold. First, make sure that the base or core IP connectivity is up. Second, track connectivity to each of the segments through a polling mechanism at the protocol layer. This makes it easy to isolate the failed segment when issues arise.
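The isolation logic of the layered approach can be expressed as a small classification step. In this sketch, each segment is assumed to have two probe results, one for the IP layer and one for its protocol layer (SIP, MGCP, SNMP, and so on); the state names and the function names are hypothetical.

```python
def classify_segment(ip_ok, protocol_ok):
    """Classify one segment from its IP-layer and protocol-layer probe results."""
    if not ip_ok:
        return "ip-down"        # base connectivity failed; protocol result is moot
    if not protocol_ok:
        return "protocol-down"  # reachable over IP, but the protocol probe failed
    return "ok"

def isolate(results):
    """Given {segment: (ip_ok, protocol_ok)}, return only the failing segments."""
    states = {seg: classify_segment(ip, proto) for seg, (ip, proto) in results.items()}
    return {seg: state for seg, state in states.items() if state != "ok"}
```

A segment reported as "protocol-down" points at the application or signaling layer, while "ip-down" points at the underlying network, which narrows the troubleshooting scope before anyone logs in to a device.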

Key Alarms and Event Monitoring

The key alarms and events should be identified across the VOIP network. It is imperative that this exercise be repeated on a periodic basis, especially when major software upgrades are performed on the network. New software may deprecate some old key alarms and introduce new critical and more verbose alarms to better track the network. This information is available in the release notes of the new software.

The table in Figure 6-2 lists the general buckets of alarms as collected from the Cisco BTS 10200 product. It can be used as a high-level guideline for making sure that operations centers include these alarm categories in their network monitoring tools.

Alarm/Event Groups

Alarm/Event Type

  • OSS
  • DATABASE
  • AUDIT
  • MAINTENANCE
  • SYSTEM
  • BILLING
  • CALLP
  • SIGNALING

The general buckets do include alarms that are reported so frequently that the Network Operations Center (NOC) may drop them into an ignore bucket. I have come across situations where a major VOIP provider was ignoring even critical alarms due to this behavior. Typically, end devices going in and out of service trigger these audit alarms. One scenario is DNS functionality failing for a particular market, causing a flood of audit end-point failure notifications. This should be tracked immediately and cannot be ignored.

A best practice is to work with the vendor services groups to classify key and chatty alarms on each new major software release. At the same time, understand the behavior of the chatty alarms so that the critical ones do not get ignored.

SNMP Configuration and Setting

SNMP configuration and connectivity are at the core of network operations. SNMP settings are among the first configurations pushed to the NEs. This section gives an overview of some basic configurations and then describes key SNMP trap related configuration settings.

Basic configuration

The basic SNMP configurations involve the setting up of:

  1. SNMP community strings (read and write), which act as passwords. The default is typically set to “public” and should be overwritten.
  2. SNMP trap destination configurations. These tell the SNMP agent running on the NE the location (IP address) of the NMS to send the traps to. In many cases the default port also needs to be overridden for security reasons. There can be multiple NMS devices listening for the traps, so all of these devices need to be accounted for.

SNMP Trap settings

It is very important to configure the SNMP trap settings correctly. Monitoring of the VOIP network will be affected if the key traps are not generated.

The following categories of traps need to be generated at a minimum:

  • CRITICAL, MAJOR, MINOR and WARNING. These categories constitute the alarms. The next category is events, which includes the INFO and DEBUG types. Events are often critical as well, since they can include audit information.

The NMS needs to be tuned to handle all alarms effectively, such that chatty alarms and even events go into other buckets but can still be tracked. The chatty alarms can be given a custom threshold-triggered alarm mechanism on the NMS, thus catching a systemic issue like the DNS failure example mentioned earlier.
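One way to picture such a threshold trigger is a sliding-window counter: individual chatty alarms are bucketed and ignored, but a burst of them within a short window raises a new alarm. The sketch below is illustrative only; the class name `ChattyAlarmMonitor` and the threshold and window values are assumptions, and a real NMS would normally provide this through its own rule engine.

```python
from collections import deque

class ChattyAlarmMonitor:
    """Raise a flag when a chatty alarm occurs too often within a time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one occurrence; return True if the window count reaches the threshold."""
        self._timestamps.append(timestamp)
        # Drop occurrences that have fallen out of the sliding window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

With, for example, `ChattyAlarmMonitor(threshold=50, window_seconds=300)` attached to audit end-point failures, routine device flaps stay quiet while a market-wide DNS failure trips the threshold within minutes.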

Traps – use case: Cisco BTS 10200 soft switch

Soft switches allow for even greater flexibility of tracking the type of traps. In a centralized model where they front all VOIP segments, it becomes critical to generate and subscribe to relevant types of traps. It may very well be that all types need to be included. The BTS 10200 allows the flexibility of the following detail types:

  • BILLING, CALLPROCESSING, CONFIGURATION, DATABASE, MAINTENANCE, OSS, SECURITY, SIGNALING, STATISTICS, SYSTEM and AUDIT

Thus subscribing to the traps related to all these verbose types, and then in turn effectively monitoring them would be key to VOIP network management success. We will expand on the usefulness of this extensive subscription in the upcoming section Alarm and Event Correlation.

Standard Polling Intervals and Traps

The minimum polling interval depends on the type of SNMP object(s) being polled, the number of devices being polled, and how much network bandwidth you want to devote to network management. Most critical SNMP objects (e.g., ifOperStatus, ifInErrors, etc.) should be polled every 5 minutes. Other SNMP objects may require more frequent polling (e.g., nl-ping-response).

Most performance SNMP objects should be polled at 30 minute intervals. This is a fairly conservative polling interval that provides 48 data points per 24 hour reporting period, which is enough granularity to establish general performance baselines.
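The arithmetic behind these intervals is simple but worth making explicit when sizing management traffic. The sketch below computes the data points per day for a given interval and the resulting number of poll requests; the device and object counts in the usage note are made-up figures for illustration.

```python
def data_points_per_day(interval_minutes):
    """Number of samples collected per 24 hour period at the given polling interval."""
    return (24 * 60) // interval_minutes

def daily_poll_requests(devices, objects_per_device, interval_minutes):
    """Total poll requests per day, assuming one request per object per poll."""
    return devices * objects_per_device * data_points_per_day(interval_minutes)
```

A 30 minute interval gives 48 points per day and a 5 minute interval gives 288. For a hypothetical network of 100 devices with 10 polled objects each, that is 48,000 versus 288,000 requests per day, which shows how quickly tighter intervals multiply management traffic.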

Traps are sent unsolicited from the managed devices to the Network Management/Monitoring Systems (NMS), on a reactive basis, as a problem occurs. Traps report problems such as a link going down during an outage, but there is no trap to identify link congestion. For that we have to rely on polling using SNMP object identifiers (OIDs). Specific OIDs in the MIBs gather network statistics such as QoS traffic shaping packet discards, FECNs, or BECNs. This poses a challenge, because polling more frequently to improve the time resolution of problem notification may increase management traffic on the production network.

Challenge

Since polled data arrives only at fixed intervals, problems that last for a short duration between polls may go unnoticed. Moreover, the traps need to be correlated with the polled results or with other traps that may be related. Also, some tools and certain SNMP OID tables calculate and store values that represent averages rather than instantaneous values. Because such counters show an average over a period of time, they may not give an accurate idea of the severity of the problem.

This challenge is illustrated by the following two scenarios:
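The effect of averaging can be seen with a handful of numbers. In the sketch below, the sample values are invented link utilization percentages, and the 80 percent threshold is an arbitrary assumption; the point is only that the same data crosses the threshold when viewed as a peak but not when viewed as an average.

```python
from statistics import mean

def average_view(samples, threshold):
    """True if the average over the period exceeds the threshold."""
    return mean(samples) > threshold

def peak_view(samples, threshold):
    """True if any instantaneous sample exceeds the threshold."""
    return max(samples) > threshold

# Hypothetical utilization samples (percent) with one short burst.
samples = [10, 10, 10, 90, 10, 10, 10, 10, 10, 10]
```

Here the average is 18 percent, so `average_view(samples, 80)` reports no problem, while `peak_view(samples, 80)` catches the burst.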

Scenario 1: Phones Un-registering from Unified CM and Re-registering to SRST Router Due to WAN Link Outage

In this scenario, as shown in Figure 4, the NMS is polling the WAN for Frame Relay congestion (monitoring for FECN or BECN or QoS TS discards or packet drops) at an interval of 30 minutes. Assume that the polls are 30 minutes apart: a poll occurs at 9:00AM, and the successive polls occur at 9:30AM, 10:00AM and 10:30AM. Around 9:35AM, the Frame Relay network close to the aggregation site encounters an outage, which causes the Frame Relay link connecting the branch to go down. This causes the IP phones to un-register from the CallManager cluster and register with the SRST gateway local to the branch. A trap from the CallManager originally hosting those IP phones reports this event. At the same time, the aggregation router sends an ifDown (link down) trap to the NMS. Because these traps arrive close together in time, a NOC staffer is able to correlate the IP phone registration with the SRST gateway to the Frame Relay link outage.

Scenario 2: Phones Un-registering from Unified CM and Re-registering to SRST Router Due to WAN Congestion

In this scenario, the NMS is polling the WAN for Frame Relay congestion (monitoring for FECN or BECN or QoS TS discards or packet drops) at an interval of 30 minutes. Assume that the polls are 30 minutes apart: a poll occurs at 9:00AM, and the successive polls occur at 9:30AM, 10:00AM and 10:30AM. Around 9:35AM, the Frame Relay network close to the branch starts to experience congestion, and at about 9:40AM the IP phones can no longer get their keepalives serviced by the CallManager cluster. By 9:42AM, they un-register from the CallManager cluster and register with the SRST gateway local to the branch. A trap from the CallManager originally hosting those IP phones reports this event. The underlying problem in the Frame Relay network, however, will not be reported until the next polling cycle at 10:00AM. It is possible that the Frame Relay congestion is relieved right around 9:45AM. In that case, the network administrator will not be able to correlate the phones un-registering and re-registering with the actual cause. The troubleshooting efforts may then be directed towards the source of the trap, that is, the CallManager.
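The timing in Scenario 2 can be replayed in a few lines to show why the congestion window falls between polls. This is a small simulation, not a monitoring tool; the calendar date is arbitrary and the function names are illustrative.

```python
from datetime import datetime, timedelta

def poll_times(start, end, interval):
    """Return the list of poll instants from start to end (inclusive)."""
    polls, current = [], start
    while current <= end:
        polls.append(current)
        current += interval
    return polls

def sampled_during(event_start, event_end, polls):
    """True if at least one poll falls inside the event window."""
    return any(event_start <= p <= event_end for p in polls)

def first_poll_after(event_start, polls):
    """Return the first poll at or after the event start, or None."""
    return next((p for p in polls if p >= event_start), None)

day = datetime(2024, 1, 1)  # arbitrary date, only the times of day matter
congestion_start = day.replace(hour=9, minute=35)
congestion_end = day.replace(hour=9, minute=45)
polls_30 = poll_times(day.replace(hour=9), day.replace(hour=10, minute=30),
                      timedelta(minutes=30))
```

With the 30 minute schedule, `sampled_during(congestion_start, congestion_end, polls_30)` is false and `first_poll_after` returns 10:00AM, 25 minutes after the congestion began and 15 minutes after it cleared. Rebuilding the schedule with a 5 minute interval places three polls inside the window, at the cost of six times the polling traffic.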

These challenges can be addressed by adopting a layered approach, as discussed earlier. Chapter 7 elaborates further on collecting statistical data that is instantaneous rather than averaged, to get a more accurate profile of the network, as well as on using Syslog analysis to bridge gaps in polled data.

Using Extensible Markup Language (XML) for Polling and extraction of key information

Extensible Markup Language (XML) is used extensively in the industry these days to address a variety of needs. The key ones in scope of our discussion are that XML simplifies data sharing, XML simplifies data transport, and XML is used to create new Internet languages. We will cover these aspects briefly and then describe how they are applied in VOIP network management and polling.

XML overview

NEs, soft switches and EMSs provide communication and reporting interfaces through XML. Figure 6-3 depicts most of these interfaces. This XML capability allows for generic integration with third-party reporting, provisioning, and monitoring systems.

Third-party flow-through provisioning systems can be integrated with a specific vendor EMS via the generic XML interface. There are many reporting engines that take in XML data reports and transform them into manageable information. These reporting engines facilitate trouble ticket tracking, managing billing records, and mining data for capacity planning through performance measurement reports. In some cases there are explicit XML agents present on the NEs and EMS that listen for XML based queries over a TCP socket. The XML agent handles requests from the client; typically these are requests that would otherwise be made through the CLI. The returned result is a well-formed XML report.

In short, XML usage is crucial in VOIP network operations. It allows the VOIP provider to scale and integrate with third-party vendors.

XML APIs available

Generally, the XML APIs can be broken down into three categories: data export, communication payload, and agent interface. These capabilities reduce interoperability challenges and facilitate adaptability to a changing industry.

Data exported as XML reports

A VOIP network EMS, in particular the BTS 10200, provides interfaces to periodically generate Call Detail Record reports, Billing Record reports and Performance measurement reports in XML format. Each of these reports can be imported into a third-party system.
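Importing such a report into another system usually starts with parsing it into a native structure. The sketch below uses Python's standard `xml.etree.ElementTree` module. The report layout, including the `report` and `counter` elements and the counter names, is entirely hypothetical and does not reflect the actual BTS 10200 report schema; the real element names must be taken from the vendor documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical performance measurement report; not a real vendor schema.
SAMPLE_REPORT = """
<report type="measurement" interval-minutes="30">
  <counter name="call_attempts" value="1200"/>
  <counter name="call_completions" value="1164"/>
</report>
"""

def parse_counters(xml_text):
    """Return {counter name: integer value} from a measurement report."""
    root = ET.fromstring(xml_text)
    return {c.get("name"): int(c.get("value")) for c in root.iter("counter")}

counters = parse_counters(SAMPLE_REPORT)
```

Because the parser rejects documents that are not well formed (it raises `ET.ParseError`), a truncated or corrupted export is detected at import time rather than silently producing partial data.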

SOAP/CORBA Communication utilizing XML

In VOIP networks, provisioning applications are used for flow-through configuration. These systems communicate through the NMS and EMS and utilize XML. The well-formedness requirement of XML allows for strict syntax checking. The ability of XML to incorporate new tags helps third-party provisioning systems adapt to changes in a VOIP vendor’s provisioning API interfaces. This makes interoperability across a slew of VOIP related applications manageable.

Thus XML based communication (which simply means that the communication protocol uses XML to format its payload) allows a VOIP service provider to better manage its flow-through provisioning systems.

XML queries to XML Agent for retrieving information

Vendors are heading towards providing an XML based agent interface on their products. Companies that use web services extensively can thus easily integrate their strategic applications with those of their partners, both internally and over the Internet.

The XML Agent is introduced in the NE or an EMS. This Agent would allow the following sample feature set:

  1. Provide a mechanism using to transfer, configure, and monitor objects in the
  2. This XML capability allows you to easily shape or extend the CLI
  3. Query and reply data in XML format to meet different specific business needs.
  4. Transfer show command output from the CLI interface in XML format for statistics and status monitoring. This show command output transfer capability allows you to query and extract data from the NE/EMS.
  5. Utilize the NE/EMS XML Document Type Definition (DTD) schema for formatting CLI queries or parsing the XML results from the NE/EMS to enable third-party software development through XML communications.
  6. Provide remote user authentication through AAA.
  7. Allow for communication to happen through HTTP or HTTPS
  8. Provide a set of return error codes to easily diagnose the issue, which can be as simple as wrong XML form or syntax issue with the XML NE/EMS query.
  9. And other NE/EMS specific feature support through the DTD.
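The request and reply cycle implied by this feature list can be sketched as follows. The XML layout here (`request`, `command`, `response`, `output`, `error`, and the `code` attribute with 0 meaning success) is a hypothetical stand-in for whatever the NE/EMS DTD actually defines, and the HTTP or HTTPS transport and AAA authentication are deliberately left out.

```python
import xml.etree.ElementTree as ET

def build_request(cli_command):
    """Wrap a CLI show command in a (hypothetical) XML request document."""
    root = ET.Element("request")
    ET.SubElement(root, "command").text = cli_command
    return ET.tostring(root, encoding="unicode")

def parse_reply(xml_text):
    """Return (error_code, text) from a (hypothetical) XML reply document."""
    root = ET.fromstring(xml_text)
    code = int(root.get("code", "-1"))
    # Code 0 is assumed to mean success; any other code carries an error message.
    text = root.findtext("output") if code == 0 else root.findtext("error")
    return code, text
```

Building the request with an XML library rather than string concatenation ensures that special characters in the command are escaped, which avoids the "wrong XML form" errors mentioned in item 8.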

To summarize, a VOIP service provider needs to understand the need for XML based applications and deploy devices that support XML interfaces. This provides the flexibility to adapt to change and to accommodate growth easily.
