To ensure voice quality and to optimize media delivery over IP, it is crucial to properly plan, design, implement, and manage the underlying network. This chapter discusses best practices for planning media deployment over IP networks, starting with how to assess the readiness of the network, traffic engineering, high availability, and management of the IP network and of its integrated components that process voice and other media transmissions. Managing a VoIP network starts with a discovery process that maps out the entire network by identifying all the network elements and their roles, selecting key performance indicators (KPIs) to track, establishing a baseline, and monitoring the network effectively on a regular basis. This chapter covers the monitoring mechanisms available to network administrators and their scope and effectiveness in managing VoIP. All networks are bound to experience outages; managing outages and correlating network problems with end-user problems is critical. This chapter also covers how incident management systems, including trouble ticketing systems, can be integrated with network management. In essence, this chapter establishes the baseline for managing VoIP networks.
Requirements for Enabling Voice in IP Networks
The task of managing a VoIP network begins even before the implementation phase, when the decision is made to deploy voice over an IP network. Traditionally, voice communications have been regarded as a mission-critical service, supporting public safety answering points for emergency services and critical business functions over TDM networks. The same level of service is expected from VoIP networks as they become the primary method of voice communications.
Network Readiness Assessment
In order to ensure the same level of service, the underlying IP network must:
- Have the bandwidth and performance to handle converged services.
- Meet the demands of high-availability voice services by providing resiliency to mitigate the effect of network outages.
- Be modular, hierarchical, and consistent in design to promote manageability.
This pre-deployment infrastructure assessment should address each of these areas. The infrastructure assessment is accomplished by gathering the needed information from network engineering staff to evaluate the current or planned network implementation including hardware, software, network design, security baseline, network links, and power/environment.
The following areas of the infrastructure assessment must be evaluated:
Network Design
- Hierarchy and modularity: Network hierarchy is perhaps the single most important aspect of network design resiliency. A hierarchical network is easier to understand and support because expected data flows for all applications traverse the network over similar access, distribution, and backbone layers. This reduces the overall management requirements of the network, increases understanding and supportability, and often results in fewer traffic flow problems, reduced troubleshooting requirements, and improved IP routing convergence times. Network hierarchy also improves scalability by allowing the network to grow without major changes, and it promotes address summarization, which is important in larger IP routing environments.
Network modularity can be defined as a consistent building block for each hierarchical layer of the network and should include like devices, like configurations, and identical software versions. By using a consistent "model" for each layer of the network, supportability is improved because it becomes much easier to properly test modules, create troubleshooting procedures, document network components, train support staff, and quickly replace broken components. Each defined hierarchical layer should have a basic solution that is used repeatedly throughout the network. If enhancements or special requirements are added to specific modules, special attention should be paid to testing, documentation, and supportability.
- IP routing: IP routing is a design issue for all larger network environments. The primary concerns are whether the routing protocol converges quickly following various failure scenarios and how well it scales in the particular network environment. The readiness assessment should include IP routing protocol selection, configuration, IP summarization, and routing protocol safeguards to limit IP overhead and prevent undesirable routing loops. In general, Cisco recommends OSPF or EIGRP for improved convergence with VLSM (variable length subnet mask) support. In environments that anticipate more than 100 routes, IP summarization into the core is recommended, generally configured at the distribution layer. For WAN environments, additional summarization may be needed at the edge to reduce overhead on slower WAN links. Routing protocol safeguards are also recommended to reduce overhead and prevent unexpected routing behavior, such as routing across user or server VLANs. In larger WAN environments, it is also recommended that only routes from a particular area be advertised into the core and that a particular site not become a potential reroute point for core or other major traffic.
- IP addressing: The evaluation of the IP addressing scheme should investigate specifically how the current allocation of IP address space will affect the allocation of addresses for IP phones and other communications devices. In most cases, an organization will not have enough address space available within the existing user VLANs or subnets for an IP phone rollout. Several strategies exist for allocating additional space, including:
- Increasing subnet size.
- Providing additional VLANS and subnets for phones using either additional access ports or 802.1Q trunking on user ports.
- Use secondary interfaces to allocate additional address space where 802.1q trunking is not supported.
- Using Network Address Translation (NAT) with caution. The VoIP network architect should explore the impact on voice signaling protocols when packets from endpoints with NAT'ed IP addresses traverse firewalls and proxy servers.
- HSRP: HSRP, the Hot Standby Router Protocol, is a software feature that permits redundant IP default gateways on server and client subnets. HSRP is configurable for router prioritization, preempt capability to return the gateway to the higher-priority router, and interface tracking to monitor the availability of a backbone or WAN interface on the router. On user or server subnets that require default gateway support, HSRP increases resiliency by providing a redundant Layer 3 IP default gateway.
Other redundancy schemes for application servers may include DNS or server load balancing.
- Quality of Service: Voice quality on a converged network is ensured by the use of Quality of Service (QoS) features. These features must be enabled and available end-to-end to provide high-quality voice services.
Voice quality is affected by two major factors: lost packets and delayed packets. Packet loss causes voice clipping and skips. The industry-standard codec algorithms used in Cisco Digital Signal Processors (DSPs) can correct for up to 30 ms of lost voice. Cisco VoIP technology uses 20 ms samples of voice payload per VoIP packet, so for the codec correction algorithms to be effective, only a single consecutive packet can be lost at any given time. Packet delay can cause either voice quality degradation due to end-to-end voice latency or packet loss if the delay is variable. If the delay is variable, such as queuing delay in bursty data environments, there is a risk of jitter buffer overruns at the receiving end. To provide consistent voice latency and minimal packet loss, QoS is normally needed (except in cases where bandwidth is always available). The following major rules apply to QoS in campus LAN and WAN environments:
- Use 802.1Q/p connections for the IP phones and use the auxiliary VLAN for voice
- Classify voice RTP streams as EF or IP precedence 5 and place them into a second queue (preferably a priority queue) on all network elements
- Classify voice control traffic as AF31 or IP precedence 3 and place it into a second queue on all network elements.
- Enable QoS within the campus if LAN buffers are reaching 100% utilization
- Always provision the WAN properly allowing 25% of the bandwidth for overhead including routing protocols, network management and Layer 2 link information.
- Use Low Latency Queuing (LLQ) on all WAN interfaces
- Use Link Fragmentation & Interleaving (LFI) techniques for all link speeds below 768 kbps
Service providers may employ advanced QoS features such as DQoS or RSVP to manage the dynamic nature of voice-enabled endpoints.
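The marking rules above can be captured in a small reference map. The following sketch assumes the standard DSCP values for EF (46) and AF31 (26) and the 768 kbps LFI threshold mentioned earlier; it is an illustration of the classification scheme, not a device configuration.

```python
# Illustrative sketch: map the QoS classes described above to standard
# DSCP / IP precedence values and flag slow links that need LFI.
QOS_CLASSES = {
    "voice_rtp":     {"dscp": 46, "precedence": 5, "queue": "priority (LLQ)"},  # EF
    "voice_control": {"dscp": 26, "precedence": 3, "queue": "second queue"},    # AF31
}

LFI_THRESHOLD_BPS = 768_000  # links below 768 kbps need fragmentation/interleaving

def needs_lfi(link_speed_bps: int) -> bool:
    """Return True if the WAN link is slow enough to require LFI."""
    return link_speed_bps < LFI_THRESHOLD_BPS

if __name__ == "__main__":
    for name, marking in QOS_CLASSES.items():
        print(name, marking)
    print("512 kbps link needs LFI:", needs_lfi(512_000))  # True
```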
Network Infrastructure Services
- DNS: Domain Name Service (DNS) is a critical naming service within almost all IP networks. Network clients and servers request connections to other devices by specifying a name; to resolve the name to an IP address, a request is sent to a configured DNS server, and from that point on the client uses the returned IP address. DNS servers should be located centrally within core areas of the network, with backup servers available. DNS should also be set up as a hierarchy with master and secondary servers so that the secondary servers are updated in an appropriate time frame. Devices should be named consistently, and routers should have DNS entries for all ports to avoid IP address conflicts. Network devices should be configured with backup DNS servers in case the first server is down. DNS servers should be treated as critical services within the organization, requiring the highest level of security, power backup, and potentially redundancy. DNS may not be strictly required in enterprise IP Communications environments, because IP addresses are configured in Call Manager and IP phones for access, but it is useful for managing the overall network environment. In SP networks, DNS functionality is a must: it plays a crucial role in provisioning voice endpoints such as MTAs, and it is required for establishing voice calls since the endpoints rely on DNS to resolve the Call Agent's FQDN to an IP address.
- DHCP: DHCP is typically used for client IP addressing. This allows mobility and improves IP address manageability. The main drawback is that critical IP allocation state for many nodes is kept within one file or database on the DHCP server. An organization should have adequate support for DHCP services and treat the DHCP data as highly critical. DHCP databases should generally be mirrored or backed up on a continual basis. In addition, an organization should have a plan in case DHCP services fail; this can be mitigated by implementing redundant DHCP servers and distributing the load among them, or by configuring backup DHCP servers in case the primary server fails. DHCP servers should also support option 150 for IP phone provisioning, which lets the DHCP server point the phone to the Call Manager so that it can download its configuration following IP address allocation. In SP environments, the DHCP server should additionally support option 43 (vendor-specific information), option 60 (vendor class identifier), option 122 (CableLabs client configuration), and similar options for provisioning voice endpoints.
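As an illustration of how option 150 carries provisioning information, the following sketch encodes a list of TFTP server addresses as the raw option bytes a DHCP server would place in its reply. The server addresses are hypothetical, and the sketch is not tied to any particular DHCP server implementation.

```python
import ipaddress
import struct

def encode_option_150(tftp_servers):
    """Encode DHCP option 150 (TFTP server addresses used by Cisco IP phones)
    as raw option bytes: option code, length, then one 4-byte IPv4 address per server."""
    payload = b"".join(ipaddress.IPv4Address(ip).packed for ip in tftp_servers)
    return struct.pack("BB", 150, len(payload)) + payload

# Hypothetical TFTP/Call Manager addresses, for illustration only.
print(encode_option_150(["10.1.1.10", "10.1.1.11"]).hex())
```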
Network Links
- WAN link redundancy/diversity: WAN link redundancy and diversity may be a consideration for distributed IP telephony deployments where WAN links are required for call setup and RTP voice traffic. The organization should determine the backup strategy for when the primary WAN link is down; distributed call processing and gateways, or WAN redundancy, may accomplish this. WAN redundancy and diversity may include local loop providers as well as long distance providers.
- Twisted-pair installation: Copper installation standards and testing help ensure that the intra-building copper plant meets quality and performance expectations. The installation should follow standards and quality guidelines for signal attenuation, near-end crosstalk, bend radius, cable routing, distance, termination standards and components, labeling standards, patch cord routing, and building conduit requirements. The current documented standard for Category 5 testing requirements is TIA/EIA TSB-67; all verification and testing should follow this guideline, which specifies required values for attenuation and NEXT (near-end crosstalk).
- Fiber cabling installation: Campus/MAN fiber cabling installation standards and testing help ensure that the inter-building fiber plant meets quality and performance expectations. The installation should follow standards and quality guidelines, which include parameters such as dB loss per connection, bend radius, cable routing, termination components or trays, labeling standards, patch cord routing and organization, and building conduit requirements. All fibers should be tested after termination to ensure high quality and minimal signal loss. Campus cabling should generally also offer path diversity to prevent outages caused by cable cuts.
Hardware and Software Considerations
- Device selection: Network infrastructure devices identified for an IP telephony infrastructure should have the recommended features for IP telephony and improved network convergence for high availability. IP telephony features include inline power, 802.1Q/p support, hardware priority queuing, and QoS. Availability features include spanning tree convergence features such as UplinkFast and BackboneFast. Devices should also generally have improved backplane capacity, lower latency, and increased bandwidth. This part of the assessment also looks at the Mean Time Between Failures (MTBF) of the chosen devices to determine theoretical availability; if the total number of devices is known, Cisco can also provide an expected annual failure rate for the devices.
- Hardware resiliency: Redundant modules and chassis are a major contributor to network resiliency. They allow normal or frequent maintenance on network equipment without service-affecting outages, in addition to minimizing the impact of power, hardware, or software failures. Redundant chassis can also provide load-sharing capability when used in conjunction with routing protocols, and default gateway chassis can be made redundant using HSRP. Many organizations have redundant backbone chassis and redundant distribution models. Redundant modules include power modules, Supervisor modules, and interface modules, and they help ensure that an individual module failure does not affect network availability. Although redundant chassis are often cost-prohibitive at the access layer, they may become a necessity there because of the port density introduced with IP telephony.
- Software resiliency: The software chosen must support the required features for IP telephony and provide the highest software reliability. Software reliability is a function of software configuration and software version control. For the most part, software reliability is the responsibility of software development and testing groups within Cisco; however, the organization must still validate whether the software is appropriate for its environment by testing or piloting the intended versions. Where possible, Cisco will recommend general-deployment versions. The network architect should ensure that these target versions of code have been widely deployed in many customer environments and that critical and major bugs are believed to have been resolved. Software version control is another factor in software reliability: organizations running a limited number of tested or piloted software versions generally experience higher availability due to lower complexity and fewer newly identified bugs.
- Software version control: Software version control is the process of testing, validating, and maintaining authorized software releases within the network. Most organizations will require a handful of versions because of differing platform and feature requirements. A process should be in place to choose release candidates, review potentially impacting bugs, test or pilot release-candidate software, deploy authorized software, and review version accounting information to ensure that software version control is being maintained as expected. Large organizations without software version control processes often have well over 70 software versions in the network, resulting in a higher number of software bugs, unexpected behaviors, and hardware/software incompatibility problems. Organizations requiring high availability should also weigh feature requirements against the known stability of general-deployment software. Another consideration is software age: older general-deployment software is considered more reliable than recently released versions with little production history.
Power and Environment
- Power protection: Power protection is often a concern in IP telephony environments and may be needed to provide parity with legacy telephone systems. Power protection for IP telephony includes the use of inline power from UPS-protected LAN switching gear to provide backup power to phones, as well as power protection of all critical networking components. In addition, key networking equipment should have redundant power supplies connected to separate power distribution units to prevent power loss due to a tripped circuit or PDU failure. Protection may range from a 10-minute UPS, which guards against the more common short-term power outages, to UPS arrays with backup generators, which guard against even long-term outages. Availability estimates for the various power protection strategies are available from UPS vendors. The organization should also consider monitoring and servicing of UPS equipment.
- Environmental conditioning: Environment is a major factor in hardware resiliency, because the location and temperature of network equipment affect its MTBF (mean time between failures). The standard operating temperature of most Cisco equipment is approximately 40 degrees Celsius (104 degrees Fahrenheit). When the operating temperature fluctuates by more than 10 degrees Celsius, the MTBF, or hardware reliability, of the component can be reduced significantly. To maintain documented MTBF estimates, the organization should ensure that proper HVAC (heating, ventilation, and air conditioning) is maintained for critical equipment.
- Physical security: Physical device security ensures that unauthorized personnel do not have access to network equipment. Physical access allows unauthorized personnel to make unauthorized changes, obtain passwords from equipment, and even perform malicious activities. Equipment should be kept in locked rooms, preferably with card access so that entry is logged.
An Audit approach for VoIP Network Readiness
A comprehensive audit of the IP network is always recommended before deployment, and even after deployment for network optimization. The auditing methodology consists of analyzing network standards for resiliency, modularity, QoS, high availability, IP addressing, security, and software version control, as described earlier. The auditor should also analyze the operational practices for FCAPS (Fault, Configuration, Accounting, Performance, and Security) adopted by the organization maintaining the VoIP network.
The preliminary step of this analysis is interviewing all the stakeholders, such as network administrators, architects, and network operations center (NOC) staff. The interview process typically reveals high-level information, including network topology and the ITIL, ISO, or corporate standards used for FCAPS in enterprise or commercial deployments. Service providers may have additional standards mandated by regulatory bodies. However, interviews alone may not present an accurate and complete picture of the state of the network, because compliance with standards is an ongoing process and large networks constantly undergo change. For example, a corporation or service provider may have a comprehensive strategy for QoS and security but still be in the initial phases of implementing it at the time of the audit. Appendix C contains a questionnaire that can be used as a starting point to gather information regarding VoIP network readiness.
Therefore, a subsequent step should include verification of the actual status of the network as it is deployed. Network Auditing tools are leveraged for this step. These tools are available from Cisco Advanced Services such as Unified Communication Audit Tool (UCAT), Cisco Network Collector – NetAudit, or Cisco Unified Readiness Assessment Manager (CURAM). Similar tools are also available from other vendors, for example, Vivinet Assessor and AppManager from NetIQ.
For large-scale networks it becomes necessary to use a controlled sample for expediency. It is recommended to divide the network into logical models during the interviews in the first step. For example, a large international bank may have various lines of business, thousands of branch locations, tens of campus locations, and several international subsidiaries. The auditor should attempt to categorize them logically, such as:
- Campus (with MAN, OC3 uplinks, 3-tiered network, 6000+ employees, contact center presence, special considerations for emergency calling)
- Small campus (with DS3 links, 2-tiered network, up to 1000 employees, single building)
- Data center (with redundant OC12 links, server farms, load-balancing servers, UPS, physical security)
- Branch type A (small branch with only consumer banking in grocery chains, Frame Relay circuits)
- Branch type B (mid-size branch with locker rooms, loan officers, commercial banking services, up to T1 links)
- Branch type C (large branch with high-touch services such as brokerage, network redundancy, up to 50 employees, and multiple high-speed links)
The auditor may choose to audit in detail one campus location, two small campuses (one of which could be an international location), all the data centers, and three branches of each type to understand the state of the network. Throughout this process the auditor must keep the dialogue open with the network administrator to ensure coverage and to rectify the access issues commonly faced by assessment tools.
Analyzing Configurations, Versions, and Topology
The audit tool examines the configuration of all the devices, looking for the required features described earlier, including QoS settings, redundancy in the form of HSRP or backup links, and the capability of access switches to provide Power over Ethernet to the IP phones in the future. The configuration data also helps validate the topology of the network as it is actually implemented. Where such features are absent, the audit tool must be able to evaluate whether the current software version will allow the network architect or administrator to enable them prior to deployment. The auditor should highlight the gaps and flag instances where either the hardware or the software cannot fulfill the fundamental requirements for VoIP deployment. Cisco auditing tools are capable of analyzing configurations based on the device role, according to established network design best practices.
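A full-featured audit tool performs role-aware analysis, but the basic idea can be illustrated with a simple sketch that scans saved device configurations for the features discussed above. The keywords, file names, and directory layout here are assumptions made for illustration; they are not the logic of any Cisco tool.

```python
import pathlib

# Keywords are illustrative stand-ins for the checks a real audit tool performs.
CHECKS = {
    "qos_marking":  ("mls qos", "class-map", "policy-map"),
    "hsrp":         ("standby ",),
    "inline_power": ("power inline",),
    "llq":          ("priority ",),
}

def audit_config(path: pathlib.Path) -> dict:
    """Return a per-feature True/False map for one saved device configuration."""
    text = path.read_text(errors="ignore").lower()
    return {feature: any(keyword in text for keyword in keywords)
            for feature, keywords in CHECKS.items()}

if __name__ == "__main__":
    for cfg in pathlib.Path("configs").glob("*.txt"):  # assumed directory of saved configs
        gaps = [f for f, present in audit_config(cfg).items() if not present]
        print(cfg.name, "missing:", gaps or "none")
```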
Traffic Simulations
Media traffic simulation on IP networks provides engineers a clear picture of how voice packets (RTP streams) will behave on the target network that is being prepared to carry voice. Test probes used to simulate and analyze streams use two methods:
- Deploying traffic generators at the edge of the network where the endpoints are intended to be installed. These traffic generators are controlled by a central console, which instructs them to use the appropriate codec, the number of media sessions and streams, and the target destination. The central console then collects metrics including delay, jitter, packet loss, and MOS/PESQ score estimates for these calls. The console should attempt to run the same number of calls expected during the busiest hour of the day, and all codec variations should be explored for this test.
- Traffic emulation can also be performed using a simpler method that employs a central traffic generator and designates network devices, including switches or access routers, as "reflectors." These reflectors should be as close to the edge of the network as possible to cover the entire path taken by the RTP packets containing media. The central traffic generator simulates media streams toward all of these reflectors and records the same parameters, including delay, jitter, packet loss, and other voice metrics, for analysis. Cisco IOS has a feature called IP SLA that can configure a central device to generate traffic toward multiple reflectors. This method can generate a large amount of traffic to accurately emulate the expected network load, and the IP SLA based method compensates for its internal serialization delays to provide accurate statistics.
In both methods, the transmission statistics, including voice quality metrics, are collected by means of the RTP Control Protocol (RTCP), which provides information about the RTP streams related to packet statistics, reception quality, network delays, and synchronization. The collection method may involve SNMP MIBs, XML, or even simple commands issued on the command line interface (CLI) of the traffic generator by the central console.
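RTCP reports interarrival jitter using the estimator defined in RFC 3550. The sketch below applies that calculation to a list of (sent timestamp, arrival timestamp) pairs, which is essentially what a probe or central console does when it derives jitter from captured RTP streams; the sample stream values are hypothetical.

```python
def interarrival_jitter(packets):
    """Estimate RTP interarrival jitter per RFC 3550.
    `packets` is a list of (sent_timestamp, arrival_timestamp) tuples,
    both expressed in the same time units (e.g., milliseconds)."""
    jitter = 0.0
    for (s_prev, r_prev), (s_cur, r_cur) in zip(packets, packets[1:]):
        # D(i-1, i): difference in relative transit time between consecutive packets
        d = (r_cur - r_prev) - (s_cur - s_prev)
        jitter += (abs(d) - jitter) / 16.0  # exponential smoothing per the RFC
    return jitter

# Hypothetical 20 ms voice stream with one late packet.
stream = [(0, 50), (20, 70), (40, 95), (60, 110), (80, 130)]
print(round(interarrival_jitter(stream), 2))
```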
Managing Network Capacity Requirements
In order to perform meaningful capacity planning, it is necessary to understand traffic engineering theory and to know the expected traffic flows in the VoIP network. Call Detail Records (CDRs) can provide this information accurately. This data can be helpful when migrating TDM voice networks to VoIP or when expanding the capacity of existing VoIP networks so that the proper service level agreement (SLA) is preserved.
Traffic Engineering Theory
In traffic engineering theory, you measure traffic load. Traffic load is the product of the rate of call arrivals in a specified period of time and the average amount of time taken to service each call during that period. These measurements are based on Average Hold Time (AHT). AHT is the total time of all calls in a specified period divided by the number of calls in that period, as shown in the following example:
- (3976 total call seconds)/(23 calls) = 172.87 sec per call = AHT of 172.87 seconds
The two main measurement units used today to measure traffic load are erlangs and centum call seconds (CCS). One erlang is 3600 seconds of calls on the same circuit, or enough traffic load to keep one circuit busy for 1 hour. Traffic in erlangs is the product of the number of calls times AHT divided by 3600, as shown in the following example:
- (23 calls * 172.87 AHT)/3600 = 1.104 erlangs
One CCS is 100 seconds of calls on the same circuit. Voice switches generally measure the amount of traffic in CCS. Traffic in CCS is the product of the number of calls times the AHT divided by 100, as shown in the following example:
- (23 calls * 172.87 AHT)/100 = 39.76 CCS
Which unit you use depends largely on the equipment you use and the unit of measurement it records in. Many switches use CCS because it is easier to work with increments of 100 than of 3600. Both units are recognized standards in the field, and they relate as follows: 1 erlang = 36 CCS. In this chapter we will use CCS.
Capacity planning in voice networks is based on the busiest hour of the day. In order to determine the traffic load, we need to know how many calls each station makes during the busy hour. This number is known as the number of busy hour call attempts (BHCA). Once BHCA and AHT are known, the resulting traffic load in CCS can be calculated using the following formula:
- CCS = BHCA × AHT/100
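These relationships are straightforward to compute. The following sketch implements the AHT, erlang, and CCS formulas above so they can be applied directly to CDR exports; the worked example reuses the 23-call sample from the text.

```python
def average_hold_time(total_call_seconds: float, calls: int) -> float:
    """AHT = total call seconds / number of calls."""
    return total_call_seconds / calls

def erlangs(calls: int, aht_seconds: float) -> float:
    """Traffic in erlangs = (calls * AHT) / 3600."""
    return calls * aht_seconds / 3600.0

def ccs(calls: int, aht_seconds: float) -> float:
    """Traffic in centum call seconds = (calls * AHT) / 100 (1 erlang = 36 CCS)."""
    return calls * aht_seconds / 100.0

def busy_hour_ccs(bhca: float, aht_seconds: float) -> float:
    """Busy-hour traffic in CCS = BHCA * AHT / 100."""
    return bhca * aht_seconds / 100.0

# Worked example from the text: 23 calls totalling 3976 call seconds.
aht = average_hold_time(3976, 23)                      # 172.87 s
print(round(aht, 2), round(erlangs(23, aht), 3), round(ccs(23, aht), 2))
```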
Example of Estimating Capacity Requirements
The following section describes what one might call a traffic reference model for a retail chain's IP Communications network supporting its nationwide retail stores. Currently this retail chain has 225 stores and plans to add 275 additional sites to this network. The numbers are based on the Call Detail Records provided by the retailer's network administrator for the week of Thanksgiving in the United States, when they expected to do the most business. This period was chosen because it represents the busiest time of the year for this retailer. The following calculations are based on this CDR data:
Call Detail Record Analysis
- Number of sites: 225
- Calls rejected due to out of bandwidth: None
- Calls rejected due to no circuit/channel available: None
- Total call seconds: 1804730
- Total calls: 13077
- Sample size: 6 days (weekly report extracted from the IP PBX covering about 148 hours)
- Average Hold Time (AHT): (total call seconds 1804730 / total calls 13077) = 138 sec
- CCS: (13077 * 138)/100/6 = 3007.7
(6 is the number of sampled days)
- Cumulative call attempts: CCS/AHT * 100 = 3007.7/138 * 100 ≈ 2179
- Total call attempts during the busiest hour (Friday after Thanksgiving, between 12 PM and 1 PM) = 778
- Per-device BHCA: 2.8
Network Bandwidth Estimate
Total call attempts during the busiest hour (Friday after Thanksgiving, between 12 PM and 1 PM) = 778.
This was for 225 sites.
For 500 sites, simple extrapolation suggests this number can potentially reach about 1600 for voice calls alone.
Each G.729 call takes about 28.6 kbps; we round it up to 32 kbps for calculations.
A G.711 call consumes about 80 kbps. The G.711 codec was used for calls from fax machines and PoS devices, constituting 18% of all calls (932 out of 4831).
(1600 * 32 kbps * 0.82) + (1600 * 80 kbps * 0.18) = 65.024 Mbps
We then projected this figure for future expansion of the network to support more sites; the estimate works out to around 98 Mbps.
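The bandwidth estimate above can be reproduced with a short calculation. The sketch below uses the same inputs (1600 busy-hour calls, 82% G.729 at a rounded 32 kbps, 18% G.711 at 80 kbps) and leaves the growth factor as an explicit parameter, since the exact expansion assumption is not spelled out in the text.

```python
def voice_bandwidth_kbps(calls: int, codec_mix: dict) -> float:
    """codec_mix maps codec name -> (fraction of calls, per-call kbps incl. overhead)."""
    return sum(calls * fraction * kbps for fraction, kbps in codec_mix.values())

mix = {
    "g729": (0.82, 32.0),   # rounded up from ~28.6 kbps per the text
    "g711": (0.18, 80.0),   # fax and point-of-sale devices
}

busy_hour_calls = 1600
estimate = voice_bandwidth_kbps(busy_hour_calls, mix)
print(f"{estimate / 1000:.3f} Mbps")                    # 65.024 Mbps, matching the text

growth_factor = 1.5                                      # assumed expansion headroom
print(f"{estimate * growth_factor / 1000:.1f} Mbps with growth headroom")
```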
The network architect or administrator must plan for this amount of bandwidth and include any packet overhead implied by the choice of protocols and compression methods. All the links that this traffic may traverse should be provisioned accordingly, with sufficient headroom to support redundancy. The QoS scheme must take this capacity into account so that the priority queues used by RTP packets carrying media are sized correctly and the traffic shaping mechanism is applied to the correct bandwidth amount.
PSTN Trunk Requirement Calculations
If all of these calls had to egress to the PSTN through a gateway at the edge of the VoIP network, Erlang formulas would be used to calculate the trunk requirements. This example assumes 23-bearer-channel ISDN PRI circuits.
Cumulative erlangs (3150 endpoints): 268
Per-endpoint erlangs: 0.085
Projected erlangs for 500 stores (7000 endpoints): 572
Assuming the desired blocking rate is no more than 1%, the Erlang B formula tells us that we need 587 DS0 lines. Erlang B calculators that perform this computation are freely available online.
Here is a trunk sizing table based on different Erlangs and service level (blocking rate):
Erlang Calculations for Determining PSTN Trunk Capacity Requirement
| Erlangs per phone | Total Erlangs | BHCA (7000 users) | Blocking rate (Service level) | Number of Trunks needed |
|---|---|---|---|---|
| 0.085 | 572 | 19,600 | 0.05 | 557 DS0 or 24 PRIs |
| 0.085 | 572 | 19,600 | 0.01 | 599 DS0 or 26 PRIs |
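The trunk counts in the table come from the Erlang B formula. The sketch below uses the standard iterative form of Erlang B and searches for the smallest number of DS0s that keeps blocking at or below the target grade of service; the rounding to PRIs is simple ceiling division by 23 bearer channels.

```python
def erlang_b(offered_erlangs: float, trunks: int) -> float:
    """Blocking probability from the Erlang B formula, computed iteratively."""
    b = 1.0
    for n in range(1, trunks + 1):
        b = (offered_erlangs * b) / (n + offered_erlangs * b)
    return b

def trunks_required(offered_erlangs: float, target_blocking: float) -> int:
    """Smallest trunk count whose Erlang B blocking is <= the target."""
    n = 1
    while erlang_b(offered_erlangs, n) > target_blocking:
        n += 1
    return n

offered = 572.0                                   # projected erlangs for 500 stores
for gos in (0.05, 0.01):
    n = trunks_required(offered, gos)
    print(f"GoS {gos:.2f}: {n} DS0s (~{-(-n // 23)} PRIs)")
```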
Monitoring Network Resources
It is crucial to proactively monitor network resources. All call processing systems offer methods to report resource utilization; in Cisco Unified Communications Manager, specific cause codes are defined that can be monitored for this purpose. The following is a minimum set of recommended parameters to monitor regularly:
- Trunk Resources (cause code 34 – Channel not available)
- Network Availability (cause code 38 – Network out of order)
- Conferencing Resources (Cause code 124 – Conference full)
- Bandwidth starvation (cause code 125 – Out of bandwidth)
- Other resources such as transcoders (cause code 47)
The table below lists the number of calls disconnected for abnormal reasons that need further investigation. The reasons may include oversubscribed PSTN trunks, oversubscribed conference bridges, or lack of bandwidth. The analysis is done in one-hour blocks of time.
Critical Disconnect Causes by time of the day as reported in CDR
| Time of Day | Cause code 34 Channel not available | Cause code 38 Network out of order | Cause code 124 Conference full | Cause code 125 Out of bandwidth | Cause code 47 Resource unavailable, unspecified |
|---|---|---|---|---|---|
| 1-2 | 1 | 1 | 2 | 0 | 1 |
| 2-3 | 3 | 4 | 2 | 2 | 2 |
| 3-4 | 4 | 4 | 4 | 4 | 2 |
| 4-5 | 4 | 2 | 5 | 5 | 5 |
| 5-6 | 1 | 1 | 1 | 1 | 1 |
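Producing a table like the one above is mostly a matter of bucketing CDRs by hour and disconnect cause. The sketch below shows the aggregation; the CSV field names are assumptions, since CDR schemas differ by call agent and must be mapped to the actual export format.

```python
import csv
from collections import Counter

WATCHED_CAUSES = {34, 38, 47, 124, 125}   # the cause codes recommended above

def abnormal_disconnects_by_hour(cdr_csv_path: str) -> Counter:
    """Count watched disconnect cause codes per hour of day from a CDR export.
    Assumes columns named 'disconnect_cause' and 'hour' (0-23); real CDR field
    names vary by platform and must be mapped accordingly."""
    counts = Counter()
    with open(cdr_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            cause = int(row["disconnect_cause"])
            if cause in WATCHED_CAUSES:
                counts[(int(row["hour"]), cause)] += 1
    return counts

# Example: flag hours where any watched cause exceeds a simple threshold.
if __name__ == "__main__":
    for (hour, cause), n in sorted(abnormal_disconnects_by_hour("cdr.csv").items()):
        if n >= 4:
            print(f"hour {hour:02d}: cause {cause} occurred {n} times")
```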
An Audit for Gauging the Current VoIP Network Utilization
The VoIP network utilization can be tracked by looking at device and link utilization as described in this section.
Device Utilization
A baseline identifying current device resource utilization is recommended to determine the potential impact of IP Telephony traffic. A baseline is normally done by monitoring peak (5 minute) utilization using SNMP over an extended period of several days to a week. This is normally done for CPU, memory, backplane utilization and LAN buffers. Most SNMP tools will allow the organization to collect and graph the utilization over this period. The result should identify peak utilization. If peak values exceed 50%, the organization should consult with their Cisco representative to determine the potential impact.
Router Readiness Ratings
The table below shows how router measurements are rated. Ratings are based on the following result ranges for this assessment:
Router Utilization Measurements
| Measurement | Good | Acceptable | Poor |
|---|---|---|---|
| Average CPU Utilization (%) | Less than or equal to 30.0 | Less than or equal to 50.0 | Any higher value |
| Average Memory Utilization (%) | Less than or equal to 30.0 | Less than or equal to 50.0 | Any higher value |
| Input Queue Drops (%) | Less than or equal to 2.0 | Less than or equal to 8.0 | Any higher value |
| Output Queue Drops (%) | Less than or equal to 2.0 | Less than or equal to 8.0 | Any higher value |
| Buffer Errors | Less than or equal to 0 | Less than or equal to 1 | Any higher value |
| CRC Errors (%) | Less than or equal to 2.0 | Less than or equal to 8.0 | Any higher value |
Switch Readiness Ratings
The table below shows how switch measurements are rated. Ratings are based on the following result ranges for this assessment:
Switch Utilization Measurements
| Measurement | Good | Acceptable | Poor |
|---|---|---|---|
| Average Backplane Utilization (%) | Less than or equal to 50.0 | Less than or equal to 75.0 | Any higher value |
| Average CPU Utilization (%) | Less than or equal to 30.0 | Less than or equal to 50.0 | Any higher value |
Link Utilization
A baseline identifying current trunk link utilization is recommended to determine the potential impact of IP Telephony traffic. A baseline is normally done by monitoring peak (5 minute) utilization using SNMP over an extended period of several days to a week. Most SNMP tools will allow the organization to collect and graph the utilization over this period. The result should identify peak utilization and busy hour data traffic requirements. The organization can then estimate overall requirements based on estimated peak IP Telephony traffic.
Link Readiness Ratings
The table below shows how link measurements are rated. Ratings are based on the following result ranges for this assessment:
Link Utilization Measurements
| Measurement | Good | Acceptable | Poor |
|---|---|---|---|
| Average Bandwidth Utilization (%) | Less than or equal to 30.0 | Less than or equal to 50.0 | Any higher value |
| Latency | 0-20 ms | 21-120 ms | Above 150 ms |
| Jitter | 0-20 ms | 21-120 ms | Above 150 ms |
| Packet Loss (per-hour basis) | Under 15 | 15-30 | Any higher value |
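The thresholds in these tables lend themselves to a simple classifier. The sketch below applies the link ratings above to a set of sample readings; note that the 121-150 ms band is left undefined in the source table, and treating it as "Acceptable" here is an assumption.

```python
def rate(value: float, good_max: float, acceptable_max: float) -> str:
    """Classify a measurement against Good/Acceptable/Poor thresholds."""
    if value <= good_max:
        return "Good"
    if value <= acceptable_max:
        return "Acceptable"
    return "Poor"

# Thresholds from the link readiness table; the undefined 121-150 ms band is
# folded into "Acceptable" as a simplifying assumption.
LINK_THRESHOLDS = {
    "bandwidth_utilization_pct": (30.0, 50.0),
    "latency_ms":                (20.0, 150.0),
    "jitter_ms":                 (20.0, 150.0),
    "packet_loss_per_hour":      (15.0, 30.0),
}

measured = {"bandwidth_utilization_pct": 42.0, "latency_ms": 35.0,
            "jitter_ms": 12.0, "packet_loss_per_hour": 40.0}   # sample readings

for metric, value in measured.items():
    good, acceptable = LINK_THRESHOLDS[metric]
    print(metric, rate(value, good, acceptable))
```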
Measurements for Network Transmission Loss Plan
Every VoIP network still has to interface with the PSTN or PLMN for greater reachability to communication devices worldwide. This exposes the hybrid TDM-IP network to even more sources of echo and other voice-signal quality issues arising from power loss, impedance mismatches, and excessive delay. Therefore, a network transmission loss plan (NTLP) must be established during the pilot phase of the VoIP implementation. The NTLP maps a complete picture of signal loss along the entire path, identifying areas of improvement, including signal strength adjustments at various points to help the echo cancellers (ECANs) on media/voice gateways. There are two primary reasons to establish a loss plan:
- Desire to have the received speech loudness at a comfortable listening level
- Minimize the effect of echo due to signal reflections that are caused by impedance mismatches at the 2-to-4 wire conversions in the transmission path
The NTLP uses the following concepts to make loss calculations in the voice network:
- Send Loudness Rating (SLR) is defined as the loudness between the Mouth Reference Point (MRP) and the electrical interface
- Receive Loudness Rating (RLR) is defined as the loudness between the electrical interface and the Ear Reference Point (ERP)
- The Overall Loudness Rating (OLR) of a connection is the sum of the sending terminal's SLR, any system or network loss, and the receiving terminal's RLR. The long-term goal for OLR is 8-12 dB, but because of the mix of technologies, the short-term goal is 8-21 dB. The difference between the OLR in the two directions should be no more than 8 dB.
OLR = SLR(talker) + sum of attenuations + RLR(listener)
- Talker Echo Loudness Rating (TELR) is the loudness loss between the talker's mouth and ear via the echo path. TELR is calculated as follows:
TELR(A) = SLR(A) + loss in top path + ERL(B) or TCLw(B) + loss in bottom path + RLR(A), where ERL is the echo return loss of the hybrid or echo canceller, and TCLw is the weighted terminal coupling loss of the digital phone set.
The degree of annoyance of talker echo depends on both the amount of delay and the level difference between the voice and echo signals.
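The loudness rating arithmetic is simple addition of the terms above. The following sketch computes OLR and TELR for a hypothetical call leg so that the TELR result can be compared, together with the measured one-way delay, against the G.131 echo tolerance curve; all dB values in the example are made up for illustration.

```python
def olr(slr_talker: float, attenuations: list, rlr_listener: float) -> float:
    """Overall Loudness Rating: OLR = SLR(talker) + sum of attenuations + RLR(listener)."""
    return slr_talker + sum(attenuations) + rlr_listener

def telr(slr_a: float, loss_top: float, erl_or_tclw_b: float,
         loss_bottom: float, rlr_a: float) -> float:
    """Talker Echo Loudness Rating along the echo path, per the formula above."""
    return slr_a + loss_top + erl_or_tclw_b + loss_bottom + rlr_a

# Hypothetical dB figures for illustration only.
print("OLR :", olr(slr_talker=8.0, attenuations=[0.0, 3.0], rlr_listener=2.0))
print("TELR:", telr(slr_a=8.0, loss_top=0.0, erl_or_tclw_b=40.0,
                    loss_bottom=0.0, rlr_a=2.0))
```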
The following figure shows sample calculations for on-net calls.
The following figure shows sample calculations for off-net calls.
Any voice quality test set should be able to measure the above-mentioned parameters, which are necessary to develop an accurate network loss plan prior to a wide-scale IP telephony deployment. Sage Instruments probes were used to conduct the tests in the two examples above.
The value of TELR is plotted against the one-way delay measured for the test calls on the echo tolerance curve specified by the ITU-T G.131 standard for echo, as illustrated in Figure 6-3.
An attempt should be made to bring these coordinates into the acceptable range by modifying the gain settings on the TDM voice gateway. The Digital Signal Processors (DSPs) on Cisco gateways are equipped with echo cancellers and echo cancellation enhancements (ECAN enhancements) to conceal residual echo in the speech path.
Effectively Monitoring the Network
Effective monitoring of the network is based on the philosophy of proactive, or preventative, maintenance. Preventative maintenance, as described in this chapter, consists of performance monitoring of voice quality metrics.
Every network is prone to faults and problems. Faults may be detected by operations personnel, by users of the service, by performance monitoring and testing within network elements, by trend analysis, and so on. Paper or electronic trouble reports may be generated and sent, and packet-based or internal messages may be passed between administrative layers.
- Bottom-up troubleshooting process: Fault indications and performance information generated at the endpoint flow upward through a hierarchy of levels, beginning at the bearer element level, through the signaling element level, to the network management level, and finally to the system operations level.
Performance and fault information is stored in local log files, remote databases, Call Detail Records, Call Maintenance Records, performance servers, and so on. This stored information may later be analyzed by a top-down troubleshooting process originating from the system operations or network management level. The endpoint may be capable of preliminary troubleshooting of the fault autonomously, or it may always expect top-down assistance from one or more of its controlling entities.
- Top-down troubleshooting process: Faults may be presented to personnel or automatic systems at the system operations level, arising either from lower-level indications or from user or maintenance personnel trouble reports, trend analysis reports, and so on. At the system operations level, service maintenance personnel, and in some cases automatic systems, once presented with an indication of a system fault, will attempt to further categorize, correlate, isolate, and otherwise troubleshoot the fault symptom from the top down.
Discovery: A Complete Picture
Network management starts with network discovery. Completeness of coverage and accuracy of device identification are key to effective network management.
Network discovery must be performed only on stable networks!
The length of time required to discover a network depends on various factors that are not necessarily tied to the size of the device population. For example, one cannot assume that 200 devices can be discovered in 2 hours simply because the initial 100 devices detected in a session were discovered in 1 hour. The underlying threads and processes of a discovery session may require 1 hour for the initial 100 devices and only 10 minutes for the remaining 100.
Typical factors that impact the amount of time to run a discovery process on a network include the following:
- Link speed: When a discovery session runs across WAN links, link speed and latency result in slower discovery.
- Bandwidth controls: Where discovery is set for low speeds, discovery throttles how much traffic it creates on the network.
- Device focus: Looking only for specific network elements results in slower discovery.
- Firewalls and ACLs: These may block SNMP traffic. This can be mitigated by ensuring that the security control points allow SNMP and ICMP traffic through and by including a seed device from the other side of the security control point.
Seed Devices for Network Discovery
When selecting seed devices, it is helpful to remember that the start of the discovery process relies on complete Cisco Discovery Protocol (CDP) and/or routing table information from one or several key devices. These devices need not be core devices but should be devices that contain complete routing tables and/or CDP neighbor tables. Aggregate devices, rather than core devices, may be the wiser choice for use as seed devices.
Concerns about the effect of SNMP queries on the seed device can be settled by observing the effect of the SNMP process on a test device running the same software version, with its SNMP tables loaded using a test tool that generates IP route or ARP entries.
A good option for a seed device may be a redundant core device, one that is in hot standby with full knowledge of the network routing tables but not actually handling the majority of the network traffic; this reduces the impact of the additional SNMP traffic.
CDP (Cisco Discovery Protocol) Discovery
A Cisco device that supports this protocol both transmits and listens for CDP messages. As a result each Cisco device is aware of its immediately connected neighbors. The CNC discovery engine collects the CDP information from devices by using SNMP queries to form a list of all neighbors of the queried device. This list contains all devices that have advertised their presence on the network and provides clues about other devices to be discovered.
Limitations of this method include the following: not all Cisco devices support CDP (for example, Content Networking devices); some transmission media, such as ATM, do not support CDP; and the network administrator may have disabled CDP on some parts of the network, or even on the entire network.
Otherwise CDP is the best single method of discovery. CDP does not have to be enabled on all the devices in order for it to work.
Routing Table Discovery
The Routing Table method uses the routing table from seed devices, retrieving the subnet address and subnet mask from the Routing Table MIB. It then compares each subnet against the list of subnets already discovered. If the connection point for the subnet is not found, it uses SNMP to retrieve the next-hop address and compares that address with the IP addresses already discovered. If the next-hop address has not been discovered, it is added to the list of devices to be discovered. Routing protocol neighbor lookup is also very effective, but it will not find Layer 2 devices and currently supports only the OSPF and BGP protocols.
ARP Discovery
The ARP method looks at the devices already discovered and retrieves their ARP tables using the "at" (address translation) table MIB. This method retrieves the list of all IP addresses the device has in its cache and compares each MAC address against a list of known Cisco MAC address prefixes. If the MAC address prefix matches, the IP address is added to the list of devices to be discovered.
ARP discovery is not very efficient as a primary discovery tool, and devices whose ARP entries have timed out will not be discovered. However, it is useful for finding devices that do not route, do not support CDP, or have CDP disabled.
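The prefix comparison is simply a lookup of the first three octets (the OUI) of each MAC address. The sketch below shows the idea with a deliberately tiny, illustrative OUI set; a real tool would consult the full IEEE OUI registry rather than a hard-coded list.

```python
# Illustrative OUI set only (00:00:0C is one well-known Cisco OUI);
# a real tool consults the full IEEE OUI registry.
CISCO_OUIS = {"00:00:0c"}

def candidate_cisco_addresses(arp_cache):
    """From (ip, mac) pairs in an ARP cache, return IPs whose MAC OUI matches."""
    candidates = []
    for ip, mac in arp_cache:
        oui = ":".join(mac.lower().split(":")[:3])
        if oui in CISCO_OUIS:
            candidates.append(ip)
    return candidates

arp_cache = [("10.1.1.1", "00:00:0C:12:34:56"), ("10.1.1.50", "AA:BB:CC:00:11:22")]
print(candidate_cisco_addresses(arp_cache))   # ['10.1.1.1']
```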
Routing Protocol - OSPF Discovery
OSPF (Open Shortest Path First) is an Interior Gateway Protocol. If OSPF is active in a network, OSPF discovery is the preferred method of determining neighbor information. Discovery uses the OSPF MIB, which maintains a list of all neighbors; this list contains clues for further device discovery.
Routing protocol neighbor lookup is also very effective, but it will not find Layer 2 devices and currently supports only the OSPF and BGP protocols (the EIGRP protocol does not have a neighbor MIB). Of course, this discovery method will be ineffective if OSPF is not running on the network; in that case it is recommended to use the generic routing-table method instead of a protocol-specific method.
Ping Sweep Discovery
Ping sweep discovery generally provides two types of sweep, using either a specified IP address range or a specified starting IP address. Both methods issue sequential pings to the IP addresses. If an address responds, discovery then attempts to communicate with the device at that IP address via SNMP. If SNMP communication is successful, the device is considered manageable.
The Cisco Network management tools provide the following two Ping Sweep methods:
- Pingsweep with hop: This method starts from an IP address and continues pinging new addresses up to the given hop count.
- Pingsweep range: This method uses a range of IP addresses, from the starting IP address to the ending IP address, for the given IP address and netmask.
This is not a very efficient method, but given enough time, and provided that ICMP messaging is not blocked, it will find everything on the network. Depending on how network devices have been configured, and on the addressing scheme in the network, the network may start responding with ICMP unreachable and redirect messages. Also, proxy ARP may cause problems by falsely representing certain IP addresses.
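A minimal range-based ping sweep can be sketched as follows, using the operating system's ping command (shown with Linux-style options) and leaving the follow-up SNMP reachability check to the management tool's own SNMP stack. The subnet used in the example is a documentation prefix, not a real network.

```python
import ipaddress
import subprocess

def ping_sweep(network_cidr: str, timeout_s: int = 1):
    """Ping every host address in the given subnet and return the responders."""
    responders = []
    for host in ipaddress.ip_network(network_cidr).hosts():
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), str(host)],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if result.returncode == 0:
            responders.append(str(host))
    return responders

# A device that answers ping would then be probed with SNMP to confirm manageability.
if __name__ == "__main__":
    print(ping_sweep("192.0.2.0/29"))   # documentation prefix used as an example
```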
Seed Files
Seed files contain explicit device credentials, including IP addresses, SNMP community strings, passwords for logging into the command line interface (CLI), or credentials for other APIs such as XML. Seed files guarantee complete device access, leading to full discovery and manageability of the network. Even though loading a seed file results in quick network discovery, the creation of a seed file (generally by a network administrator) can itself be a time-consuming process and is prone to human error.
It is recommended that multiple methods for discovery are employed for greater coverage. The results of the discovery process should be verified with the network administrator to ensure accuracy of the network topology as discovered by the network management system (NMS).
Voice Quality Metrics
Media has always been predisposed to quality degradation as it traverses any network, whether circuit switched or packet switched. IP networks offer greater flexibility to manage the media stream and supplementary applications, in addition to economic advantages over dedicated circuit-switched networks. However, packet-switched (IP) networks may further exacerbate some problems or introduce new issues that need to be managed. Multiple factors work independently and in concert to degrade the perceived quality of the voice signal as the RTP (Real-time Transport Protocol) packets that carry it traverse from point A to point B on an IP network.
- Environmental issues: These include acoustic problems caused by the handset or headset, analog-to-digital conversion, and impedance mismatches. Other issues may result from poor cabling or network clock synchronization, producing crackling, clicking sounds, or crosstalk.
- Signal processing: This includes speech compression, Voice Activity Detection (VAD), silence suppression, and signal gain variations. VAD-related issues include front-end clipping, incorrect comfort noise levels, static, hissing, and often an "underwater" sound.
- VoIP network issues: An IP network introduces significant propagation and serialization delays compared to circuit-switched networks. This creates network jitter and packet loss, often requiring error concealment processing. Delay also makes the echo already present in the path more perceivable to human ears. Delay, jitter, and packet loss are manageable through proper end-to-end implementation of QoS in the IP network; if they are not managed properly, the result may be robotic or synthetic-sounding voice due to periods of silence (caused by packet drops) or choppy voice.
Metrics are needed to sustain a toll-quality voice network. There are three comprehensive groupings of quality metrics: the Mean Opinion Score (MOS), the Perceptual Speech Quality Measurement (PSQM), and the Perceptual Evaluation of Speech Quality (PESQ).
MOS or K-factor
MOS is a subjective measure of voice quality. An MOS score is generated when listeners evaluate prerecorded sentences that are subject to varying conditions, such as compression algorithms. Listeners assign scores to the received voice signal on a scale from 1 to 5, where 1 is the worst and 5 is the best, and the test scores are then averaged to a composite score. The tests are also relative: a score of 3.8 from one test cannot be directly compared to a score of 3.8 from another test. Therefore, a baseline, such as G.711, needs to be established for all tests so that the scores can be normalized and compared directly.
In order for a Call Agent or IP PBX, such as the BTS 10200 or Cisco Unified Communications Manager, to calculate an equivalent of the MOS score, Cisco engineering has adopted a computerized method called K-factor. K-factor (klirrfaktor) is a clarity, or MOS-LQ (listening quality), estimator. It is a predicted MOS score based entirely on impairments due to frame loss and the codec; K-factor does not include any impairment due to delay or channel factors (echo, levels). K-factor MOS scores are produced on a running basis, with each new MOS estimate based on the previous 8-10 seconds of frame loss data; that is, each K-factor MOS score is valid over the past 8 seconds. The computation of new scores can be performed at any rate (every second, for example, with each score based on the past 8 seconds), but the computation window of the MOS is constant. In this way the Call Agent or Unified Communications Manager is able to provide a meaningful, fairly objective value for voice quality in call detail records.
PSQM
PSQM is an automated method of measuring speech quality “in service,” or as the speech happens. The PSQM measurement is made by comparing the original transmitted speech to the resulting speech at the far end of the transmission channel. PSQM systems are deployed as in-service components. The PSQM measurements are made during real conversation on the network. This automated testing algorithm has over 90 percent accuracy compared to subjective listening tests, such as MOS. Scoring is based on a scale from 0 to 6.5, where 0 is the best and 6.5 is the worst. Because it was originally designed for circuit-switched voice, PSQM does not take into account the jitter or delay problems that are experienced in packet-switched voice systems.
PSQM software usually resides with IP call-management systems, which are sometimes integrated into Simple Network Management Protocol (SNMP) systems.
PESQ
PESQ is the current standard for voice quality measurement and is documented in ITU-T Recommendation P.862. PESQ is the most comprehensive voice quality metric because it can take into account codec errors, filtering errors, jitter problems, and delay problems that are typical in a VoIP network. PESQ combines the best of the PSQM method with a method called the Perceptual Analysis Measurement System (PAMS). PESQ scores range from 1 (worst) to 4.5 (best), with 3.8 considered "toll quality" (that is, acceptable quality in a traditional telephony network). PESQ is meant to measure only one aspect of voice quality; the effects of two-way communication, such as loudness loss, delay, echo, and sidetone, are not reflected in PESQ scores.
Many equipment vendors offer PESQ measurement systems. Such systems are either stand-alone or they plug into existing network management systems. PESQ was designed to mirror the MOS measurement system. So, if a score of 3.2 is measured by PESQ, a score of 3.2 should be achieved using MOS methods.
PESQ measures the effect of end-to-end network conditions, including codec processing, jitter, and packet loss. Therefore, PESQ is the preferred method of testing voice quality in an IP network. When this metric is available on a call processing system via XML, SNMP, or CDRs, it should be used for monitoring voice quality.
Approach to Measuring Jitter, Latency, and Packet Loss in the Network
The above-mentioned voice quality metrics provide an overall picture of the perceived voice quality. It is still important to look at the individual factors that affect voice quality, including jitter, latency, packet loss, and various aspects of the voice signal such as signal strength and bandwidth. This section explains approaches to measuring these parameters in a VoIP network that interfaces with the TDM network (PLMN).
Round-trip Delay Measurement
The MOS readings indicate the speech transmission quality, or listening clarity. Delay has no impact on the MOS readings, although it affects a real phone conversation in the following ways:
- Long delay affects the natural interactivity of conversation and causes hesitation and over-talk. A caller starts noticing delay when the round-trip delay exceeds 150 ms. ITU-T G.114 [9] specifies the maximum desired round-trip delay as 300 ms; a delay over 500 ms makes phone conversation impractical.
- Long delay exacerbates echo problems. An echo at a level of -30 dB would not be audible if the delay is less than 30 ms, but if the delay is over 300 ms, even a -50 dB echo is audible. The echo delay and level requirements are specified in ITU-T G.131 [10].
Voice Jitter/Frame Slip Measurements
A frame slip or voice jitter is defined as a sudden delay variation at the audio signal side. The audio signal requires continuous and synchronous play out. The packet-switched network is inherently jittery: each packet arrives asynchronously and may be out of order. To compensate for the jittery nature of the packet-switched (IP) network, jitter buffers are used on voice gateways or MTAs. A large jitter buffer can minimize packet loss, but will induce longer delay. To balance the conflicting needs for shorter delay and less packet loss, the jitter buffer may be dynamically re-sized depending on the network traffic situation. Whenever the jitter buffer re-sizes, the audio signal will experience a sudden delay variation (jitter or frame slip) in an amount (in ms) that matches the voice frame size (6, 10 or 30 ms). This test should measure two types of frame slips:
- Positive (+) frame slip: the total amount of compressive jitters (shortening of delays) that correspond to the down-sizing of jitter buffer or the deletion of packets.
- Negative (−) frame slip: the total amount of expansive jitters (lengthening of delays) that correspond to the up-sizing of the jitter buffer or the insertion of packets.
A good system should maintain a total amount of jitter less than 3% of the test duration. For a 10-second test, the total amount of positive and negative slips measured by the SMOS test should be within [-300, 300] milliseconds (a sketch of this budget check follows). If the SMOS test measures a higher amount of jitter, then the network should be re-configured for better traffic engineering and prioritization.
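As an illustration of the 3% budget described above, the following sketch sums positive and negative frame slips reported during a test run; the per-event slip values are illustrative, not taken from real test equipment.

```python
# Check measured frame slips against the 3% jitter budget for a test run.
test_duration_ms = 10_000                      # a 10-second test
slips_ms = [+30, -10, +10, -30, +20]           # per-event slips: + compressive, - expansive

positive = sum(s for s in slips_ms if s > 0)
negative = sum(s for s in slips_ms if s < 0)
budget_ms = 0.03 * test_duration_ms            # 300 ms for a 10 s test

within_budget = positive <= budget_ms and abs(negative) <= budget_ms
print(f"+slip={positive} ms, -slip={negative} ms, within budget: {within_budget}")
```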
Effective Bandwidth Measurement
A test should measure the attenuation distortion by analyzing the frequency response of the system under test (analyzing the 300-3400 Hz band). For PCM (G.711) and ADPCM (G.726) waveform coders, the effective bandwidth largely reflects the attenuation distortion caused by analog or digital filtering. If a system under test uses PCM or ADPCM waveform coders, its measured effective bandwidth should be higher than 0.9. Anything below 0.85 signifies either excessive loop attenuation distortion (for analog circuits) or excessive band-limiting digital filtering. This test may be run during quarterly audits and need not be enabled on a regular basis.
Voice Band Gain Measurement
A test should measure the overall voice band (300 to 3400 Hz) signal level change (attenuation or gain). A flat gain change is not reflected in the MOS reading, but excessive level change (too loud or too faint) does affect human perception. A VoIP network with a balanced network loss plan should maintain the change in voice level (gain) in the range of [−10, −3] dB.
Silence Noise level Measurement
The silence noise level in a VoIP network measures the comfort noise level generated by the CNG (Comfort Noise Generator). The level should be neither too high (sounds too noisy) nor too low (sounds like a dead line). The noise level is expressed in dBrnC. An ideal system should maintain a silence noise level between [10, 30] dBrnC. Above 30 dBrnC sounds too “noisy” and below 10 dBrnC may sound too “quiet”.
Voice Clipping
The intention of the voice clipping measurement is to quantify the voice quality degradation caused by VADs (Voice Activity Detectors). VADs help reduce bandwidth requirements through the silence suppression scheme. An overly aggressive VAD, however, can cause the leading or trailing edges of an active signal burst to be clipped. Voice clipping will also affect modem and fax tone transmission over the VoIP network.
Echo Measurements
In a VoIP network, echo is an inherent issue because of the presence of the analog 2-wire loop, which causes an impedance mismatch at the hybrid junction (linking the 2-wire analog loop with the 4-wire trunk). The echo becomes perceivable due to network delay: the higher the level of the echo signal and the more significant the network delay, the more perceivable the echo is to the human ear. Echo cancellers are employed to cancel the echo; they sample the original signal over the configured tail length and subtract it from the reflected signal, thereby suppressing the echo.
Figure 6-3 shows the minimum requirements for TELR as a function of the mean one-way transmission time T (half the value of the total round-trip delay from the talker’s mouth to the talker’s ear). In general, the “acceptable” curve is the one to follow. Only in exceptional circumstances should values for the “limiting case” be allowed; otherwise all such cases should be compensated for by enabling echo cancellers and properly adjusting tail coverage.
Test equipment should be able to measure the echo (level against delay) to characterize the echo present in the network and to evaluate the effectiveness of the echo cancellers (ECAN) after enabling them with the appropriate tail coverage.
Voice Signalling Protocol Impairments in IP Networks
Signalling connections are implemented with protocols allowing for the detection of packet losses and the re-transmission of lost packets. As such, they are better equipped than voice media connections to survive packet losses. For example, SCCP, used by IP telephones, uses TCP as a transport protocol. MGCP implements its own re-transmission scheme because its underlying protocol (UDP) does not provide retransmission services for lost packets. SIP can use either TCP or UDP; in either case, similar to MGCP, it has its own retransmission mechanism.
Even though lost packets are retransmitted, it is the network conditions that determine the success or failure of the retransmission attempts. Even successful retransmissions can have negative effects if the period needed to complete the transaction (from the initial attempt, through the retransmission attempts, to final success) delays system response by a user-perceivable amount of time. We can classify the IP Communications system behavior according to the relative severity of the interruption to the signalling link connectivity, as follows:
- Light packet drops, with short duration and low frequency of drops: In this case, the system appears to be generally unresponsive to user input. The user may experience effects such as delayed dial tone, delayed ringer silence upon answer, and double dialing of digits due to the user’s belief that the first attempt was not effective (thus requiring hang-up and redial).
- More frequent, longer-duration packet drops: In this case, the system alternates between seemingly normal and deteriorated operation. Packet drops cause endpoints to activate link failure measures, including re-initialization. These link interruptions, emulated by continuous packet drops of long duration, can reach the point of causing a phone or gateway reset, resulting in media tear-down as well. Users might experience SRST activation, whereby all active calls are dropped when the link is interrupted and again when the link is reestablished. Phones may also appear unresponsive for several minutes.
- Complete link interruption: Although most likely caused by an actual network failure, link blackouts could be the result of a congested network where end-to-end QoS is not configured. For instance, a very high degree of packet loss can occur if a signaling link traverses a network path experiencing large, over-provisioned, sustained traffic flows such as network-based storage/disk access, file download, file sharing, or software backup operations. In such cases, the IP Communications system will interrupt calls, and the initiation of a backup mechanism, for example Survivable Remote Site Telephony (SRST) in enterprise networks, will provide continued telephony service for the duration of the link failure. However, the switchover to the backup system may be associated with delay, where the endpoints may have to re-register to the alternate system or advanced telephony features may become unavailable.
These effects apply to all deployment models. However, single-site (campus) deployments tend to be less likely to experience the conditions caused by sustained link interruptions because the larger quantity of bandwidth typically deployed in LAN environments (minimum links of 100 Mbps) allows for some residual bandwidth to be available for the IP Communications system.
In any WAN-based deployment model and any Service Provider managed residential services model (see Figure 2), traffic congestion is more likely to produce sustained and/or more frequent link interruptions because the available bandwidth is much less than in a LAN (typically less than 2 Mbps), so the link is more easily saturated. The effects of link interruptions impact the users, whether or not the voice media traverses the packet network.
How to effectively poll the Network
I would like to re-state that a VOIP network is a large and complicated solution which encompasses many integrated technologies. This presents a problem for VOIP infrastructure managers since each technology brings its own network management challenges. A device polling strategy needs to be implemented for the network such that all VOIP functional segments have coverage.
Polling involves tapping into existing device mechanisms, such as the CMS performing device audits for VOIP segments. In reality, custom probes need to be implemented through scripts to achieve full polling coverage. A best practice is to have a dedicated in-house Linux or UNIX based system on which to develop these probes. Open source tools like Nagios accommodate these types of probes very easily.
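As a minimal sketch of such a custom probe, the script below follows the standard Nagios plugin exit-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) and simply checks that a VOIP component answers on a TCP port. The host name and port are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Nagios-style probe: verify that a VOIP component answers on a TCP port."""
import socket
import sys
import time

HOST, PORT, WARN_SECS = "cms.example.net", 2443, 1.0   # placeholder values

try:
    start = time.monotonic()
    with socket.create_connection((HOST, PORT), timeout=5):
        elapsed = time.monotonic() - start
except OSError as exc:
    print(f"CRITICAL - cannot connect to {HOST}:{PORT} ({exc})")
    sys.exit(2)

if elapsed > WARN_SECS:
    print(f"WARNING - {HOST}:{PORT} answered in {elapsed:.2f}s")
    sys.exit(1)

print(f"OK - {HOST}:{PORT} answered in {elapsed:.2f}s")
sys.exit(0)
```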
The following sections describe processes that should be in place to allow a VOIP service provider to effectively manage their network. The key fact is that the development and deployment of specialized tracking systems require upfront cost and man-hours. This can be implemented with the assistance of vendor services groups. The impacts are immediate, with a large return on investment (ROI).
Polling Strategy
VOIP segments are driven by certain protocols, and all these protocols ride over IP. Thus, in essence, an organized and layered approach needs to be developed to effectively poll the VOIP network for key information. A layered approach implies that the polling is done in a way such that all protocols riding over IP are covered. Figure 6-1 reflects this concept. All the VOIP related segments need to be polled. Figure 6-1 shows that the base connectivity to the network components is tested at the IP layer, and then the next set of layers is depicted for the various segments and functional components.
The base connectivity is achieved through an IP ping. The entire network should be mapped for this basic connectivity, which allows for creating a knowledge base of the IP layer. If ping is disabled, then other custom TCP probes should be developed to create this map. There are open source alternatives to the classical ping based probes, like echoping, which do not use the ICMP_ECHO_REQUEST or ECHO_REPLY packets but can communicate using other protocols like HTTP.
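A minimal sketch of building such an IP-layer connectivity map is shown below. It uses the Linux ping flags and falls back to a plain TCP connect (in the spirit of echoping) when ICMP is blocked; the subnet and fallback port are assumptions.

```python
import ipaddress
import socket
import subprocess

def reachable(ip: str) -> bool:
    """One ICMP echo with a 1-second wait; fall back to a TCP probe if ICMP is blocked."""
    ping = subprocess.run(["ping", "-c", "1", "-W", "1", ip], capture_output=True)
    if ping.returncode == 0:
        return True
    try:
        with socket.create_connection((ip, 22), timeout=1):   # fallback port is an assumption
            return True
    except OSError:
        return False

# Map a hypothetical management subnet at the IP layer.
connectivity_map = {str(ip): reachable(str(ip))
                    for ip in ipaddress.ip_network("192.0.2.0/28").hosts()}
print(connectivity_map)
```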
The SNMP connectivity needs to be verified. At a minimum, the trap functionality needs to be verified periodically. Traps are critical as they map to alarms and events being generated by the NEs. The NE should be configured to trigger an informational trap to the SNMP Manager on a periodic basis. The SNMP connectivity validation will ensure the alarm and key event stream is flowing.
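A simple way to validate basic SNMP reachability from a probe host is to request sysUpTime (OID 1.3.6.1.2.1.1.3.0) with the net-snmp snmpget utility, as in the sketch below; the community string and host address are placeholders.

```python
import subprocess

def snmp_reachable(host: str, community: str = "public", timeout_secs: int = 5) -> bool:
    """Return True if the NE answers an SNMP GET for sysUpTime (1.3.6.1.2.1.1.3.0)."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-t", str(timeout_secs), "-c", community,
         host, "1.3.6.1.2.1.1.3.0"],
        capture_output=True, text=True)
    return result.returncode == 0

print(snmp_reachable("192.0.2.15", community="mycommunity"))   # placeholder values
```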
MGCP polling needs to be in place. Its scope depends on the VOIP deployment architecture; for a cable environment it covers several segments (PSTN, CPE/MTA, Announcements). A periodic audit end-point can be performed to validate the MGCP based device connectivity.
SIP polling needs to be in place to make sure SIP supporting devices are functional. The devices could be SIP based Voice Mail servers or Session Border Controllers (SBC) that terminate SIP trunks. Custom probes can be used to periodically test the SIP functionality to these devices. A custom SIP probe example is included in Appendix A.
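The probe in Appendix A is the reference example; as a rough illustration of the idea, the sketch below sends a SIP OPTIONS request over UDP and reports the first response line (for example, "SIP/2.0 200 OK"). The header values are minimal and the target host is a placeholder.

```python
import socket
import uuid

def sip_options_probe(host: str, port: int = 5060, timeout: float = 2.0):
    """Send a SIP OPTIONS request over UDP and return the first response line, or None."""
    local_ip = socket.gethostbyname(socket.gethostname())   # may need adjusting on multi-homed hosts
    call_id = uuid.uuid4().hex
    msg = (
        f"OPTIONS sip:probe@{host} SIP/2.0\r\n"
        f"Via: SIP/2.0/UDP {local_ip}:5060;branch=z9hG4bK{call_id[:10]}\r\n"
        "Max-Forwards: 70\r\n"
        f"From: <sip:monitor@{local_ip}>;tag={call_id[:8]}\r\n"
        f"To: <sip:probe@{host}>\r\n"
        f"Call-ID: {call_id}@{local_ip}\r\n"
        "CSeq: 1 OPTIONS\r\n"
        f"Contact: <sip:monitor@{local_ip}:5060>\r\n"
        "Content-Length: 0\r\n\r\n"
    )
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(msg.encode(), (host, port))
        data, _ = sock.recvfrom(4096)
        return data.decode(errors="replace").splitlines()[0]
    except socket.timeout:
        return None
    finally:
        sock.close()

print(sip_options_probe("sbc.example.net"))   # placeholder SBC address
```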
A periodic test of the SS7 link needs to be performed, along with verifying that the ISUP functionality is up. ISUP is used to backhaul the SS7 information to the CMS. We will cover ISUP KPI tracking in the upcoming chapter 7. The reported alarms can also be used to track SS7 and ISUP functionality.
Management connectivity needs to be verified, making sure the devices can always be reached through Telnet or SSH. In some cases out-of-band connectivity through modem dial-ups needs to be verified as well.
Periodic polling of DNS and DHCP functionality needs to be in place. All the DNS servers, primary and secondary, need to be polled for DNS queries. Similarly, the DHCP functionality needs to be verified on a periodic basis; this can be done by tracking the DHCP statistics (Discovers sent, leases granted) and creating a visual dashboard for the stats.
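A minimal sketch of polling each configured DNS server directly (using the dig utility so a specific server can be targeted) is shown below; the server addresses and test record are assumptions.

```python
import subprocess

DNS_SERVERS = ["192.0.2.53", "192.0.2.54"]   # hypothetical primary and secondary
TEST_NAME = "cms.example.net"                # a record every resolver should answer

def dns_ok(server: str, name: str = TEST_NAME, timeout_secs: int = 3) -> bool:
    """Query a specific DNS server with dig and report success or failure."""
    result = subprocess.run(
        ["dig", f"@{server}", name, "+short", f"+time={timeout_secs}", "+tries=1"],
        capture_output=True, text=True)
    return result.returncode == 0 and result.stdout.strip() != ""

for server in DNS_SERVERS:
    print(f"DNS {server}: {'OK' if dns_ok(server) else 'FAILING'}")
```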
In summary, the idea behind this approach is twofold: first, make sure that the base or core IP connectivity is up; then, track connectivity to each of the segments in some fashion by polling at the protocol layer. This allows for easily isolating the down segment in case of issues.
Key Alarms and Event Monitoring
An exercise of identifying the key alarms and events should be done across the VOIP network. It is imperative that this exercise be done on a periodic basis especially when major software upgrades are performed on the network. The new software would deprecate some old key alarms and introduce potentially new critical and more verbose alarms to better track the network. This information is available in the release notes of the new software.
The following table in Figure 6-2 depicts the general buckets of alarms as collected from the CISCO BTS 10200 product. This can be used as a high level guideline for making sure that the Operation centers are including such alarm categories in their Network monitoring tools.
Alarm/Event Groups

| Alarm/Event Type |
|---|
| OSS |
| DATABASE |
| AUDIT |
| MAINTENANCE |
| SYSTEM |
| BILLING |
| CALLP |
| SIGNALING |
The general buckets do contain alarms that get reported on a frequent basis, which can cause the Network Operations Center (NOC) to drop them into an ignore bucket. I have come across situations where a major VOIP provider was ignoring even critical alarms due to this behavior. Typically, end devices going in and out of service trigger these audit alarms. One scenario is DNS functionality failing for a particular market, causing a flood of audit end-point failure notifications. This should be tracked immediately and cannot be ignored.
A best practice is to go through the vendor services groups and classify key and chatty alarms on new major software releases, and at the same time understand the behavior of the chatty alarms so that the critical ones do not get ignored.
SNMP Configuration and Setting
SNMP configuration and connectivity are at the core of network operations. They are among the first set of configurations pushed to the NEs. We give an overview of some basic configurations and then describe key SNMP trap related configuration settings.
Basic configuration
The basic SNMP configurations involve the setting up of:
- SNMP community strings (read and write), which act as passwords. Typically the default is set to “public” and should be changed.
- SNMP Trap destination configurations. This basically tells the SNMP agent running on NE the location (IP address) of the NMS to send the traps to. Most of the time the default port also needs to be overridden for security reasons. There can be multiple NMS devices listening for the traps, so all of these devices need to be accounted for.
SNMP Trap settings
It is very important to configure the SNMP trap settings correctly. The monitoring of the VOIP network will be affected if the key traps are not generated.
The following categories of traps need to be generated at a minimum:
- CRITICAL, MAJOR, MINOR and WARNING. These categories constitute the alarms. The next category is events, which includes the INFO and DEBUG types. Often the events are also critical, as they can include audit information.
The NMS needs to be optimized to handle all alarms effectively, such that the chatty alarms and events go into other buckets but can still be tracked. The chatty alarms can have a custom threshold-triggered alarm mechanism on the NMS, thus catching a systemic issue like the DNS failure example mentioned earlier.
Traps – use case BTS 10200 CISCO soft switch
Soft switches allow for even greater flexibility of tracking the type of traps. In a centralized model where they front all VOIP segments, it becomes critical to generate and subscribe to relevant types of traps. It may very well be that all types need to be included. The BTS 10200 allows the flexibility of the following detail types:
- BILLING, CALLPROCESSING, CONFIGURATION, DATABASE, MAINTENANCE, OSS, SECURITY, SIGNALING, STATISTICS, SYSTEM and AUDIT
Thus subscribing to the traps related to all these verbose types, and then in turn effectively monitoring them would be key to VOIP network management success. We will expand on the usefulness of this extensive subscription in the upcoming section Alarm and Event Correlation.
Standard Polling Intervals and Traps
The minimum polling interval depends on the type of SNMP object(s) being polled, the number of devices being polled, and how much network bandwidth you want to devote to network management. Most critical SNMP objects (e.g., ifOperStatus, ifInErrors, etc.) should be polled every 5 minutes. Other SNMP objects may require more frequent polling (e.g., nl-ping-response).
Most performance SNMP objects should be polled at 30 minute intervals. This is a fairly conservative polling interval, providing 48 data points per 24 hour reporting period. 48 data points provides enough granularity to establish general performance baselines.
Traps from the managed devices are sent to Network Management/Monitoring Systems (NMS) unsolicited, on a reactive basis, as problems occur. Traps notify of problems such as link down when there is an outage, but there is no trap to identify link congestion. For that we have to rely on polling using SNMP object identifiers (OIDs). There is a specific set of OIDs or MIBs for gathering network statistics such as QoS traffic shaping packet discards, FECNs or BECNs, and so on. This poses a challenge because excessive polling, used to increase the time resolution of problem notification, may increase management traffic on the production network.
Challenge
Since the polling responses arrive with a certain time delay, problems of short duration that occur between polling intervals may go unnoticed. Moreover, the traps will need to be correlated to the polled results or to other traps that may be related. Also, some tools and certain SNMP OID tables calculate and store values that represent an average rather than instantaneous values. This may not give an accurate idea of the severity of the problem, as the counters show an average value over a spread of time.
This challenge is illustrated by following two scenarios:
Scenario 1: Phones Un-registering, from Unified CM and re-registering to SRST Router Due to WAN Link Outage
In this scenario, as shown in Figure 4, the NMS is polling the WAN for Frame Relay congestion (monitoring for FECN or BECN or QoS TS discards or packet drops) with an interval of 30 minutes. Assume that the polls are 30 minutes apart: a poll occurs at 9:00 AM, and successive polls occur at 9:30 AM, 10:00 AM and 10:30 AM. Around 9:35 AM, the Frame Relay network close to the aggregation site encounters an outage which causes the Frame Relay link connecting the branch to go down. This will cause the IP phones to un-register from the CallManager cluster and register with the SRST gateway local to the branch. This will be notified by a trap from the CallManager originally hosting those IP phones. At the same time the aggregation router will send an ifDown (link down) trap to the NMS. Because of the close time proximity of these traps, a NOC staffer will be able to correlate the IP phone registration with the SRST gateway to the Frame Relay link outage.
Scenario 2: Phones Un-registering, from Unified CM and re-registering to SRST Router Due to WAN Congestion
In this scenario, the NMS is polling the WAN for Frame Relay congestion (monitoring for FECN or BECN or QoS TS discards or packet drops) with an interval of 30 minutes. Assume that the polls are 30 minutes apart: a poll occurs at 9:00 AM, and successive polls occur at 9:30 AM, 10:00 AM and 10:30 AM. Around 9:35 AM, the Frame Relay network close to the branch starts to experience network congestion, and at about 9:40 AM the IP phones cannot get their keepalives serviced by the CallManager cluster. By 9:42 AM, they will un-register themselves from the CallManager cluster and register with the SRST gateway local to the branch. This will be notified by a trap from the CallManager originally hosting those IP phones. But the underlying problem in the Frame Relay network will only be reported by the next polling cycle occurring at 10:00 AM. There is a possibility that the Frame Relay network congestion is relieved right around 9:45 AM. In that case the network administrator will not be able to correlate the phone un-registering and re-registering problem with the actual cause. The troubleshooting efforts may be directed towards the source of the trap, which is the CallManager.
These challenges can be addressed by adopting a layered approach as discussed earlier. Chapter 7 will further elaborate on using statistical data collection that is instantaneous rather than averaged to get a more accurate profile of the network, as well as Syslog analysis to bridge gaps in polled data.
Using Extensible Markup Language (XML) for Polling and extraction of key information
Extensible Markup Language (XML) is used extensively in the industry these days to address a variety of needs. Some key ones that are in scope for our discussion are: XML simplifies data sharing, XML simplifies data transport, and XML is used to create new Internet languages. We will cover these aspects briefly and then describe how they are applied in VOIP network management and polling.
XML overview
NEs, soft switches and EMSs provide communication and reporting interfaces through XML. Figure 6-3 depicts most of these interfaces. This XML capability allows for generic integration with third-party reporting, provisioning and monitoring systems.
Third-party flow-through provisioning systems can be integrated with a specific vendor EMS via the generic XML interface. There are many reporting engines which take in XML data reports and transform them into manageable information. These reporting engines facilitate trouble ticket tracking, managing billing records, and data mining information for capacity planning through performance measurement reports. In some cases there are explicit XML agents present on the NEs and EMS that listen for XML based queries over a TCP socket. The XML agents handle requests from the client, typically requests that would otherwise be made through the CLI. The returned result is a well-formed XML report.
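As a rough sketch of querying such an XML agent, the snippet below opens a TCP socket, sends a query, and parses the well-formed XML reply. The agent address, port, query format, and element names are all hypothetical; the real transport (often HTTP/HTTPS) and schema are defined by the vendor DTD described later in this section.

```python
import socket
import xml.etree.ElementTree as ET

AGENT = ("192.0.2.20", 9010)                                       # hypothetical XML agent
QUERY = b'<request verb="show" table="subscriber" limit="10"/>'    # hypothetical query format

with socket.create_connection(AGENT, timeout=10) as sock:
    sock.sendall(QUERY)
    sock.shutdown(socket.SHUT_WR)          # signal end of request
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)

reply = ET.fromstring(b"".join(chunks))    # expect a well-formed XML report
for row in reply.iter("row"):              # hypothetical element name
    print(row.attrib)
```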
In short, the use of XML in VOIP network operations is crucial. It allows the VOIP provider to scale and integrate with third-party vendors.
XML APIs available
Generally the XML APIs can be broken down into the categories of exporting data, communication payload, and agent interface. These aspects reduce interoperability challenges and facilitate adaptability in a changing industry.
Data exported as XML reports
A VOIP network EMS, in particular the BTS 10200, provides interfaces to periodically generate Call Detail Record reports, Billing Record reports and Performance measurement reports in XML format. Each of these reports can be imported into a third party system.
SOAP/CORBA Communication utilizing XML
In VOIP networks, Provisioning Applications are used for flow through configuration. These systems communicate through the NMS and EMS and utilized XML. The well formedness aspect of XML allows for strict syntax checking. The XML capability of incorporating new tags facilitates third party Provisioning systems to adapt to changes with VOIP vendor’s provisioning API interfaces. This makes the interoperability within a slew of VOIP related applications manageable.
Thus XML based communication (which basically implies that the communication protocol is utilizing XML to format its payload) allows for VOIP service provider to better manage their flow through provisioning systems.
XML queries to XML Agent for retrieving information
The vendor industry is heading towards providing an XML based agent interface on their products. Companies that utilize web services extensively can thus easily integrate their strategic applications with those of their partners, both internally and over the Internet.
The XML Agent is introduced in the NE or an EMS. This Agent would allow the following sample feature set:
- Provide a mechanism to transfer, configure, and monitor objects in the NE/EMS.
- This XML capability allows you to easily shape or extend the CLI.
- Query and reply with data in XML format to meet different specific business needs.
- Transfer show command output from the CLI interface in XML format for statistics and status monitoring. This show command output transfer capability allows you to query and extract data from the NE/EMS.
- Utilize the NE/EMS XML Document Type Definition (DTD) schema for formatting CLI queries or parsing the XML results from the NE/EMS, enabling third-party software development through XML communications.
- Provide remote user authentication through AAA.
- Allow for communication to happen through HTTP or HTTPS.
- Provide a set of return error codes to easily diagnose issues, which can be as simple as a malformed XML request or a syntax issue in the XML NE/EMS query.
- Other NE/EMS specific feature support through the DTD.
To summarize, a VOIP service provider needs to understand and implement XML based applications and devices that support XML interfaces. This allows the flexibility to adapt to change and to accommodate growth easily.
Using Trace/Syslog Logs for Deep Analysis
Syslogs or trace logs contain information that has typically been used during deep debugging and root cause analysis sessions. There is a lot of valuable information embedded in the Syslogs which, if tapped on a real-time basis, can take VOIP network monitoring to another level. It allows for creating an in-depth view of the network elements.
The basic idea is to stream the Syslogs to a dedicated server for deep analysis, where a continuous process should be introduced to parse the Syslogs for key trends. Key exercises that need to be performed are:
- Identify systemic metrics that can be tracked through the logs.
- Identify frequent service affecting issues which can be tracked through logs.
- Capture frequency of the systemic metrics.
- Lastly group the metrics into functional buckets, thus allowing for easier functional debugging and tracking.
Figure 6-4 captures a table highlighting some key metrics that are captured from the CISCO Softswitch BTS 10200. These metrics are derived from the log text. Each piece of text acts as a signature for the metric. The Syslog is periodically parsed for these signatures. The signatures then map to event types, based on their criticality; they could fall into an ERROR or a WARNING category. These events can then trigger traps or other notifications to monitoring systems.
Sample key metrics from CISCO Softswitch Trace logs
| Log text | Possible Severity | Explanation |
|---|---|---|
| MGW admin state not allow subscriber maint request | Error | Cannot perform maintenance request (administrative and diagnostic module related) |
| ANM_process state_waitcrxresp Connection failed event type | Error | Announcement connection failed (Announcement Manager) |
| Failed SRV lookup and A record lookup while attempting add port and Domain name | Error | DNS issue for softsw_tsap_addr. BTS could not resolve the domain name during an audit of the SIP table. |
| KA timer expired for aggrIdx | Error | COPS protocol (BTS to CMTS) related error. Keep-alive timer expired for AGGR. |
These metrics should be used for tracking overall system health through periodic summary reports alongside triggered event notifications.
Smart analysis of the Syslogs allows the service provider to develop an extra layer of monitoring which might not be covered through the typical alarm notification functionality, and may even drive improvement of the reported alarms.
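Building on the signature table above, the sketch below shows one way a continuous parsing process could count signature hits in a trace log. The log path is an assumption and the signatures are abbreviated versions of the table entries; exact message text varies by software release.

```python
import re
from collections import Counter

# Signature-to-severity mapping derived from the table above.
SIGNATURES = {
    r"MGW admin state not allow subscriber maint request": "ERROR",
    r"Connection failed event type": "ERROR",
    r"Failed SRV lookup and A record lookup": "ERROR",
    r"KA timer expired for aggrIdx": "ERROR",
}

def scan_syslog(path: str = "/var/log/bts/trace.log") -> Counter:
    """Count occurrences of each known signature in the trace log."""
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            for pattern, severity in SIGNATURES.items():
                if re.search(pattern, line):
                    counts[(pattern, severity)] += 1
    return counts

for (pattern, severity), count in scan_syslog().items():
    print(f"{severity}: '{pattern}' seen {count} times")
```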
Alarm and Event Audit and Correlation
Root cause analysis (RCA) is key to incident management. RCA can be accelerated by following a steady alarm and event analysis practice. The practice can be developed through the following basic exercises:
- Have a rigorous exercise of understanding the alarms and alarm groupings for all the network elements. As a result, a key alarm summary list should be produced for all the products.
- A periodic collection of alarm history should be introduced and then monitored actively through a dashboard.
- The dashboard should be used for history trending of alarms by market and functional groups. As an example, the two functional groups of SIP and PSTN related alarms should each have their own visual monitoring groups.
- In a VOIP network, the voice switch (or corresponding EMS) sees all the protocol anomalies and most NE reported anomalies. It can be a good starting point for creating an alarm summary report.
Figure 6-5 shows a summary table of alarms for a particular customer site. The table has been generated from the alarm history report for the CISCO BTS 10200 and is periodically generated for many customer sites. It allows for identifying the hot spots and infrequent anomalies.
The figure shows the general group buckets and the detailed breakdown of the alarm frequency within the groups.
The most prevalent alarm buckets surface to the top. The periodic report can easily be used to first identify the problem, and the results can then be fed into the systemic RCA. In the case of Table 6-8 we clearly see the signaling alarms at the top; after analyzing the report, alarms like “Trunk remotely blocked” were ignored in this case because a large number of trunks were being turned up and were down. But other alarms within the same category and other categories also need to be looked at for rooting out systemic issues.
Summary Alarm Table from CISCO Softswitch
| Alarm Count | Alarm Group Type | Alarm Explanation |
|---|---|---|
| 1 | CALLP | Country Code Dialing Plan Error |
| 1 | DATABASE | EMS database alert.log alerts. |
| 1 | OSS | SNMP Authentication error |
| 1 | SIGNALING | SS7 Message Decoding Failure |
| 1 | SIGNALING | Unanswered REL |
| 2 | BILLING | FTP/SFTP transfer failed |
| 2 | DATABASE | Daily database backup completed successfully |
| 2 | SIGNALING | AGGR Connection Down |
| 2 | SIGNALING | Feature Server is not up or is not responding to Call Agent |
| 2 | SIGNALING | Continuity Recheck Successful |
| 3 | SIGNALING | AGGR Gate Set Failed |
| 3 | SIGNALING | Continuity Recheck is performed on specified CIC |
| 6 | SIGNALING | RLC received in response to RSC message on the specified CIC |
| 10 | AUDIT | Start or Stop of SS7-CIC audit |
| 11 | CALLP | No Route Available for Carrier Dialed |
| 14 | DATABASE | There are errors in EMS database DefError queue |
| 27 | MAINTENANCE | Admin State Change Failure |
| 56 | MAINTENANCE | Admin State Change Successful with Warning |
| 76 | AUDIT | Call exceeds a long-duration threshold |
| 112 | SIGNALING | Continuity Recheck Failed |
| 215 | SIGNALING | Media gateway/termination down |
| 379 | MAINTENANCE | Admin State Change |
| 469 | SIGNALING | Timeout on Remote Instance |
| 803 | CALLP | Invalid Call |
| 981 | SIGNALING | Unexpected Message for the Call State is received : Clear Ca |
| 1166 | CALLP | Call Failure |
| 1194 | SIGNALING | COT message received on the specified CIC |
| 1267 | BILLING | Message content error |
| 1391 | SIGNALING | General MGCP Signaling Error between MGW and CA. |
| 5257 | SIGNALING | Trunk locally blocked |
| 5495 | SIGNALING | Trunk remotely blocked |
The benefits of performing these audits on a periodic basis, and analyzing them through a dashboard, are obvious: the trouble segments surface and can then be prioritized for resolution.
Effectively monitoring the PSTN Bearer traffic
The need to track a VOIP Service Provider’s PSTN bearer trunk utilization grows rapidly with the fast growth of the customer base. Trunks represent T1s, each of which basically carries 24 DS0s/CICs.
These CICs represent the state of the voice channels and, in essence, the voice capacity. If for whatever reason these CICs are not available to carry voice traffic, the Service Provider’s voice network would be critically impaired as OFFNET or PSTN calls will not go through. Again, this depends on the percentage of CICs affected.
Another important reason to monitor these CICs is that the Service Provider routes certain categories of OFFNET calls, like emergency, long distance, toll free, directory assistance and so on, over specific sets of trunks. So if those particular trunks are affected, then that whole category of service is down.
Summary email reports are one form of output for the abnormal CIC states; the same data can be used to generate web/HTML reports which are updated on a pseudo real-time basis. These reports can be used to create a trunk monitoring dashboard. This monitoring dashboard would greatly improve voice network operations by giving a ready view into the health of the voice trunks. Figure 6-9 shows this report.
An alerting mechanism can be set up to send an email/page when a configured threshold of trunks goes into local block (LBLK) or remote block (RBLK) state between two monitoring periods, as sketched below.
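A minimal sketch of that threshold alert follows; the threshold, CIC state counts, SMTP host, and addresses are all placeholders, and in practice the counts would come from the parsed CIC status reports of the previous and current monitoring cycles.

```python
import smtplib
from email.message import EmailMessage

THRESHOLD = 50                            # CICs newly blocked between two polls
previous = {"LBLK": 120, "RBLK": 300}     # placeholder counts from the last cycle
current = {"LBLK": 190, "RBLK": 310}      # placeholder counts from this cycle

def alert_if_needed(prev, curr, smtp_host="mail.example.net"):
    for state in ("LBLK", "RBLK"):
        delta = curr[state] - prev[state]
        if delta >= THRESHOLD:
            msg = EmailMessage()
            msg["Subject"] = f"Trunk alert: {delta} CICs went {state} since last poll"
            msg["From"] = "trunk-monitor@example.net"
            msg["To"] = "noc@example.net"
            msg.set_content(f"{state} count changed from {prev[state]} to {curr[state]}.")
            with smtplib.SMTP(smtp_host) as smtp:
                smtp.send_message(msg)

alert_if_needed(previous, current)
```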
This captured data can also be used to generate XML reports that plug into web browsers or other aggregation tools to show the monitoring information.
The ideas presented so far have the same key theme: create specialized reports and monitor them through dashboards. This specialization costs time and money, but the positive impacts on the overall VOIP service are immediate.
Quality of Service (QoS) in VoIP Networks
As discussed earlier, VoIP is most commonly deployed over converged IP networks carrying data, voice and video traffic. When network resources are congested, the congestion can severely affect the quality of VoIP traffic, causing a poor user experience for the subscribers. This can result in increased customer calls (trouble tickets) for the Voice SP and loss of revenue due to customer turnover.
Therefore, it is very important for the Voice SP or an Enterprise to implement QoS for VoIP traffic in their networks. This can help guarantee good voice quality when network resources are congested.
There are a number of factors that can affect the quality of VoIP traffic as perceived by the end user. Some of the common factors include delay, jitter and packet loss. These factors can be key indicators of the overall health of the voice network and are defined as follows:
- Delay: The time it takes the VoIP traffic to travel from one endpoint to another is typically referred to as the end-to-end delay. Delay can be measured as either one-way or round-trip delay. The ITU G.114 recommendation states that the acceptable one-way delay for voice is 150 ms. Any delay > 150 ms can result in degraded voice quality and a poor user experience.
- Jitter: The variation in delay over time from one endpoint to another. If the delay of transmissions varies too widely in a VoIP call, the call quality is greatly degraded. VoIP networks typically compensate for this by having jitter buffers at the endpoints to deliver the VoIP traffic to the end user at a constant rate. If the jitter is too high it can overflow the jitter buffer at the endpoints, resulting in packet loss and poor voice quality.
- Packet loss: The number of packets dropped in the data path while carrying the VoIP traffic from one endpoint to another. A 3 percent packet loss is typically regarded as the maximum tolerable limit for good voice quality. The VoIP network should be designed for < 1.5% packet loss in order to guarantee good voice quality.
This section does not cover the various methods of configuring and troubleshooting QoS in order to prevent delay, jitter and packet loss in VoIP networks. It describes (at a high level) the methodology of how to use these key indicators to implement and manage a QoS policy in the network. This can help the Voice SP or an Enterprise isolate problems in the network more effectively and prevent them from happening in the future.
Defining a QoS Methodology
The QoS policy implemented for VoIP traffic should encompass the end-to-end voice network. It is recommended to take a layered QoS approach which makes it easier to implement and manage the QoS policy for VoIP.
The QoS policy for VoIP traffic should cover Layer 2, Layer 3 as well as the application layer. This will help guarantee that the VoIP traffic is given preferential treatment as it is transported from one endpoint to another. QoS at the application layer is especially useful when end users are using PC-based VoIP applications to place and receive voice calls. In this case, the VoIP traffic may receive the desired QoS as it traverses the network but the end user’s PC-based application may not prioritize VoIP over other applications demanding CPU resources. This can result in poor voice quality due to delay, jitter or packet loss as described above.
One thing to keep in mind is that QoS may only help when resources are congested. If there is no contention for bandwidth or other network resources then applying QoS may not provide any additional benefits.
Differentiated Services (Diff Serv) for Applying QoS
A good QoS policy involves marking or classifying the VoIP traffic at the edge of the network so that intermediate devices in the network can differentiate voice traffic from other traffic and process them according to the defined policy. This marking or classification can be done using Differentiated Services Code Point (DSCP) values or by using the IP Precedence bits in the Type of Service (ToS) byte in the IP header.
Diff Serv defines the required behavior in the forwarding path to provide quality of service for different classes of traffic. A very important aspect in the definition of forwarding path behavior for QoS is the method of packet classification. Packet classification is required for quality of service in order to determine which treatment a particular packet will get for shared resource allocation.
The Diff Serv model also defines boundaries of trust in a network and the associated functions that occur at the edges of a region of trust. A DSCP specifies a Per Hop Behavior (PHB) for forwarding treatment. A PHB specifies a scheduling treatment that packets marked with the DSCP will receive. A PHB can also include a specification for traffic conditioning. Traffic conditioning functions include traffic shaping and policing. Traffic shaping conditions traffic to meet a particular average rate and burst requirement. Policing enforces an average rate and burst requirement. Actions to take when traffic exceeds a policing specification can include remarking or drop.
The PHB commonly used for voice bearer traffic is DSCP 46, also known as the Expedited Forwarding (EF) PHB. The PHB commonly used for call signaling is DSCP 26, also known as the Assured Forwarding 31 (AF31) PHB.
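For illustration, the sketch below shows how an application-generated test stream could be marked with the EF DSCP by setting the (former) ToS byte on a UDP socket; the destination address is a placeholder, and whether the marking is honored end to end depends on the trust boundary configuration discussed next.

```python
import socket

EF_DSCP = 46     # voice bearer (Expedited Forwarding)
AF31_DSCP = 26   # call signaling (Assured Forwarding 31)

def dscp_to_tos(dscp: int) -> int:
    # The 6-bit DSCP occupies the upper bits of the (former) ToS byte.
    return dscp << 2

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Mark outgoing test packets with EF (ToS 0xB8). OS policy may restrict this.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(EF_DSCP))
sock.sendto(b"rtp-probe", ("192.0.2.10", 20000))   # placeholder destination
```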
Figure 6-11 illustrates a Differentiated Services based QoS model.
In SP environments, endpoints are typically untrusted devices. This means that endpoints may not mark or classify the VoIP traffic correctly; therefore this traffic needs to be re-marked and re-classified at the edge of the network. Once the VoIP traffic is re-classified at the edge, it can be scheduled into the appropriate queues and receive the desired QoS.
In Enterprise networks, endpoints such as IP phones are considered trusted devices while PC-based soft clients are conditionally trusted. Trusted devices are expected to classify the VoIP traffic correctly, while traffic from conditionally trusted devices is only trusted if it meets defined criteria. These criteria are typically defined at access-layer switches which are directly connected to the conditionally trusted devices. If a device is compromised and starts sending mis-classified VoIP traffic, it can be policed at the edge of the trust boundary and put into a scavenger queue. This queue can be monitored periodically to discover any undesired network activity, and the data can be used for trending to predict failures as well as linked to the trouble ticketing system to correlate with any network issues.
Using Bandwidth / Resource Reservation and Call Admission Control (CAC) for Providing QoS
Another approach to providing QoS for VoIP traffic is to reserve the required network resources before setting up the voice call and to use CAC to reject calls which may not be able to receive the desired QoS due to congestion or high utilization of network resources. While this approach can definitely guarantee QoS to VoIP traffic, it does have its disadvantages.
One of the problems with this approach is that network resources need to be reserved end-to-end to guarantee QoS to VoIP traffic from one endpoint to another. This can be very challenging since resources may not be available on certain network segments due to congestion, which will cause the call setup to fail. It also means that once the resources are reserved, they cannot be used for any other traffic, hence network resources may not be efficiently utilized.
Even with the downsides mentioned above, this approach is still used in some deployment models in SP environments. The approach is slightly modified, though, to make better use of network resources: instead of reserving network resources ahead of time, they are only reserved when a voice call needs to be set up and are released once the voice call is torn down. This enables more efficient use of network resources, as they can be used for other traffic when not being utilized for VoIP. This approach is preferred especially in cases where VoIP is deployed in converged networks.
Managing QoS
QoS management helps to set and evaluate QoS policies and goals. A common methodology entails the following steps:
- Establishing network baseline. This helps in determining the traffic characteristics of the network.
- Deploying QoS techniques when the traffic characteristics have been obtained and an application(s) has been targeted for QoS.
- Evaluating the results by testing the response of the targeted applications to see whether the QoS goals have been reached.
In order to effectively manage QoS policies in a VoIP network, it is important to use a layered approach. Information needs to be gathered from different points in the network and at various layers (Layer 1, 2, 3 and the application). This information needs to be correlated to different events occurring in the network such as degraded voice service in certain network segments or complete voice outage in a specific location.
It is very important to establish a baseline for the voice endpoints as well. For instance, a baseline can be established for PacketCable MTAs based on their state (In- service, Out-of-Service etc.), registration status (registered, unregistered), and so on. So if a mass de-registration occurs this can be correlated to a provisioning server failure or if a large number of MTAs go into Out-of-Service state this event can be correlated to a CMS failure.
For monitoring QoS, look at the PHBs as defined in the Diff Serv model. Look for QoS policy violations, queue drops, interface statistics, errors and resource over-utilization (memory, CPU) on routers, switches, voice gateways and endpoints. This information can be correlated to alarms and syslog messages stored on management servers.
The above-mentioned information can be gathered using the command line interface (CLI) or by polling via SNMP or XML, as mentioned in earlier sections of this chapter. In order to poll information from various network devices, different MIBs can be used. An example of the QoS MIB is given below; a polling sketch using one of these objects follows the list.
CISCO-CLASS-BASED-QOS-MIB
- cbQosPoliceExceededBitRate (1.3.6.1.4.1.9.9.166.1.17.1.1.14): The bit rate of the non-conforming traffic.
- cbQosQueueingDiscardByteOverflow (1.3.6.1.4.1.9.9.166.1.18.1.1.3): The upper 32-bit count of octets, associated with this class, that were dropped by queueing.
- cbQosQueueingDiscardPkt (1.3.6.1.4.1.9.9.166.1.18.1.1.7): The number of packets, associated with this class, that were dropped by queueing.
- cbQosTSStatsDropPktOverflow (1.3.6.1.4.1.9.9.166.1.19.1.1.10): The upper 32-bit counter of packets that have been dropped during shaping.
- cbQosTSStatsDropPkt (1.3.6.1.4.1.9.9.166.1.19.1.1.11): The lower 32-bit counter of packets that have been dropped during shaping.
If the problem is occurring due to network congestion, this can be diagnosed by monitoring QoS at different network elements and different layers. In order to explain this concept, we take an example of a PacketCable network as discussed in chapter 3.
PacketCable Use Case
In a PacketCable environment, quality of service is provided using the DQoS architecture, which focuses on the access part of the network between the MTA and the CMTS. Resources are assigned to the MTA at the time of call setup after performing admission control, and QoS is assigned based on the information received from the CMS (via gate messaging). If the call setup fails, it can be caused by any of the following reasons:
- Lack of resources on the MTA.
- Layer 2 messaging getting dropped between the MTA and the CMTS. This can be caused by Layer 1 events such as noise on the cable plant or Layer 2 events such as DOCSIS queues filling up.
- Lack of resources on the CMTS. This can be either at the DOCSIS layer (Layer 2), at the IP layer (Layer 3) or at upper layer protocols like COPS (used for carrying DQoS messages between the CMTS and the CMS).
- Call signaling failure due to network congestion, causing delayed or dropped packets by intermediate devices between the MTA and the CMS.
Similarly, if the quality of the voice call is degraded after being setup, the problem could be related to the following issues:
- Packet drops between the MTA and the CMTS due to physical layer (Layer 1) issues (degraded SNR, Uncorrectable errors, etc.)
- Proper QoS not assigned to the voice call. The voice call may be set up over Best Effort service flows instead of a dedicated service flow with guaranteed QoS for voice.
- The voice service flows may be getting impacted due to resource over-utilization (high CPU utilization, DOCSIS scheduler issues etc.) on the CMTS. This can cause voice packets to get delayed or dropped on the service flows.
- Packets getting dropped by intermediate devices between the two VoIP endpoints (Layer 3).
The layered approach for monitoring the above mentioned issues is illustrated in Figure 6-12.
In the approach mentioned above, we start at Layer 1 by monitoring the physical parameters of the cable plant like Signal-to-Noise Ratio (SNR), power levels, correctable and uncorrectable errors caused due to noise. These parameters can be monitored by using the DOCS-IF-MIB. If there are issues at the physical layer that can affect VoIP traffic (degraded SNR, power levels, errors etc.) we correlate this data to network events or alarms to see if they are causing any VoIP related issues.
Next we look at the DOCSIS layer (Layer 2) to see if the DOCSIS layer messaging between the MTA and CMTS is working as expected. We would need to look at the DSX messages (Dynamic Service Add – DSA, Dynamic Service Change – DSC and Dynamic Service Delete – DSD) to ensure that requests being sent by the MTA are not being rejected or dropped by the CMTS. Failure in DSX messaging would also need to be correlated to any VoIP events in the network to make sure service is not getting impacted. The DSX messaging on the CMTS can be monitored by using the DOCS-QOS-MIB.
Next we look at the DOCSIS QoS parameters on the CMTS to make sure the VoIP traffic is getting the appropriate QoS when it is transported over the cable network. This information can be monitored using the DOCS-QOS-MIB.
The next thing we look at is the IP layer (Layer 3) information to make sure that packet drops under the Cable (RF) interfaces or the WAN links are not affecting VoIP traffic. We would also look at the queues under these interfaces to make sure packets are not backing up in the queues which can cause delay and jitter for VoIP traffic. The interface statistics can be monitored using the IF-MIB.
Another thing to check on the intermediate devices between the MTA and the other VoIP endpoint is the QoS policy defined to make sure it is operating as designed. This information can be monitored using the CISCO-CLASS-BASED-QOS-MIB as described above.
Additional checkpoints could include any other devices such as firewalls, layer 2 switches etc. to make sure they are not interfering with the quality of the VoIP traffic. One thing to check on layer 2 devices is to make sure they are configured with the appropriate Class of Service (CoS) for the VoIP signaling and bearer traffic.
Lastly, we also need to look at the signaling protocol (MGCP/NCS) counters from the endpoints to track delay, jitter and packet loss. This can also help explain issues contributing to degradation of voice quality. These performance counters can also be used for trend analysis and capacity planning, as explained in chapter 8.
So far we have explained the high-level methodology to monitor QoS and correlate this information to network events to help isolate problems more effectively. The details of this approach are explained in chapter 7.
As mentioned earlier in this chapter, it is important to group the different network elements in the VoIP network (endpoints, aggregation and core devices, provisioning and management servers, CMS and voice mail servers, etc.) to isolate problems caused by certain device types. This also helps in establishing a baseline for each device type and in collecting periodic information from these devices, which can be used for analysis and trending. This will be discussed in more detail in chapter 8.
Once issues are categorized by device type or grouping, they can be correlated in the trouble ticketing system so that problems can be tied back to specific vendors. This is discussed in more detail in the next section.
Trouble Ticketing (TT) systems
A Trouble Ticket system is one of the initial problem tracking systems deployed in a VOIP network. It is crucial that the system be utilized to categorize issues so that they can be tied back to specific vendors and partner service providers. A logging and alerting mechanism should be in place so that systemic issues can be identified and rectified. The system should also help drive better customer satisfaction and uptime.
Identifying and streamlining the categories of trouble tickets
The trouble ticketing system should be developed such that it captures the relationship between user-reported problems or proactive trouble tickets and service uptime.
The geographic location should be captured in the trouble ticket along with detailed information about the type of problem being reported. If it is a voicemail related problem being reported for market A or a particular campus, then the location and the type of problem should be captured as tags. This tagging facilitates post-analysis for correlation to reported alarms and also to the vendor product.
Correlating the TT back to Service Uptime
The customer experience can be deduced through effective tracking of trouble tickets. The following key practices, centered on the Trouble Ticket system, will help reduce downtime and improve customer satisfaction.
- Continuous network availability tracking in the trouble ticketing system, by logging user downtime.
- Proactive alarm creation and association to track trouble ticket volume.
- Tracking reported faults against trouble tickets on a periodic basis, thus allowing diagnosis of resolution time for reported customer issues. As an example, trending this relationship would map increasing faults to customer downtime. Similarly, if there is fast resolution of issues, then faults could be high but the downtime low, which is a good indication.
- The TT systems should be polled to generate reports that tie them back to vendor products and partner service providers. This can only be achieved through a detailed, tagged trouble ticketing system, which facilitates a query mechanism based on these tags. This data can then be correlated back to the faults reported during the same periods (see the sketch after this list).
- Based on effective correlation of TTs to network uptime, a process should be introduced for improving service availability. Basically, this should be a dynamic process derived through automated querying of the TT systems and fault reporting systems.
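As a rough sketch of such automated querying, the snippet below joins ticket counts and fault counts by reporting period and market using tags; the exported records and tag values are hypothetical.

```python
from collections import Counter

# Hypothetical exports: each record is tagged with (period, market, category).
tickets = [("2023-W14", "market-A", "voicemail"),
           ("2023-W14", "market-A", "voicemail"),
           ("2023-W14", "market-B", "dialtone")]
faults = [("2023-W14", "market-A", "SIGNALING"),
          ("2023-W14", "market-A", "SIGNALING")]

ticket_counts = Counter((period, market) for period, market, _ in tickets)
fault_counts = Counter((period, market) for period, market, _ in faults)

# Side-by-side view of tickets versus reported faults for the same period/market.
for period, market in sorted(set(ticket_counts) | set(fault_counts)):
    key = (period, market)
    print(f"{period} {market}: {ticket_counts[key]} tickets vs {fault_counts[key]} faults")
```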
Summary
Network management starts even before VoIP deployment. Prior to VoIP deployment the transport network is assessed for its resiliency, high availability, performance and capacity. This process involves tool-based analysis of the IP network, traffic engineering for capacity planning, verification using voice traffic simulation, and network transmission loss planning to ensure voice quality will be preserved throughout the IP network and its interfaces with the public land mobile network (PLMN) through TDM voice gateways.
Voice quality metrics including MOS/K-factor, PSQM, and PESQ should be monitored for proactive management on the call processing entities. All the contributing factors, including latency, jitter, packet drops and signal levels, should be analyzed on the network devices through which media traffic traverses. This is only possible if the network management systems (NMS) are able to accurately and completely discover all the devices in the network using seed devices, CDP, routing tables, ARP cache analysis, or ping sweeps. All of the managed devices must be synchronized with a common time source using Network Time Protocol (NTP) so information from individual devices can be correlated accurately in the time context.
Network management systems can employ various methods, including SNMP polling, subscribing to alarms/traps, and Syslog analysis, to track key performance indicators (KPIs). All of this information is correlated and tracked on customized dashboards to provide meaningful metrics in the proper context.
In essence, VoIP guidelines include transmitting voice the fastest way possible by keeping the delay below 150 ms. Excessive delay will worsen echo and cause awkward conversations. VoIP packets must be transmitted as a steady, smooth stream to minimize jitter without dropping any packets. This requires an end-to-end QoS implementation covering all the network layers.
The proactive management approach may still have some fallout, which will be covered by trouble ticketing systems. These TT systems should be tied to the NMS so that problems can be resolved quickly by virtue of correlation with the underlying cause(s).
References
Voice over IP (CVoice), Second Edition. Copyright 2006, Cisco Systems, Inc. Reproduced by permission of Pearson Education, Inc., 800 East 96th Street, Indianapolis, IN 46240.
http://www.cisco.com/en/US/tech/tk652/tk701/technologies_white_paper09186a00800d6b74.shtml
TIA 912 - IP Telephony Equipment: Voice Gateway Transmission Requirements
ITU-T G.107 - The E-model, a computational model for use in transmission planning
ITU-T G.131 - Control of Talker Echo
http://www.cisco.com/en/US/tech/tk652/tk701/technologies_white_paper09186a00800d6b68.shtml