Understanding High Availability of IP and MPLS Networks

Chapter Description

This chapter describes major sources of network failures and provides an overview of techniques that are commonly used to improve availability of IP/MPLS networks. In particular, this chapter outlines mechanisms for reducing network downtime due to control-plane failures.

Network and Service Outages

A service is the set of tasks that the network performs at a user's request, such as a voice call, Internet access, or e-mail. A service outage is the users' inability to request a new service or to continue using an existing service because that service is either no longer available or impaired. As discussed previously, the availability of a network depends strongly on the frequency of service outages and the recovery time for each outage. A network outage is the loss of network resources, including routers, switches, and transport facilities, because of the following:

  • Complete or partial failure of hardware and software components

  • Power outages

  • Scheduled maintenance such as software or hardware upgrades

  • Operational errors such as configuration errors

  • Acts of nature such as floods, tornadoes, and earthquakes

Planned and Unplanned Outages

Each network outage can be broadly categorized as either "unplanned" or "planned." An unplanned network outage occurs because of unforeseen failures of network elements. These failures include faults internal to the router, such as control-plane software crashes and failures of line cards, link transceivers, or the power supply, as well as faults external to the router, such as fiber cuts and loss of power in a carrier facility. A planned network outage occurs when a network element such as a router is taken out of service because of scheduled events (for example, a software upgrade).

Main Causes of Network Outages

What are the main causes of network outages? As it turns out, several culprits contribute to network downtime. According to a University of Michigan one-year reliability study of IP core routers conducted in a regional IP service provider network, router interface downtime averaged about 955 minutes per year, which translates to an interface availability of only 0.998 [3]. As a reference point, a carrier-class router is expected to have a downtime of only 5.2 minutes per year, which corresponds to an availability of about 0.99999. The same study attributed total network downtime to the following causes:

  • 23 percent for router failure (software/hardware faults, denial-of-service attack)

  • 32 percent for link failures (fiber cuts, network congestion)

  • 36 percent for router maintenance (software and hardware upgrades, configuration errors)

  • The remaining 9 percent for other miscellaneous reasons
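The availability figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The function and variable names are hypothetical, and a 365-day year is assumed. Applying the study's percentage breakdown to the 955-minute interface figure is an illustrative assumption only, because the study reports those shares for total network downtime.

```python
# Minimal sketch: convert yearly downtime into availability and split
# a downtime budget by cause. All names here are illustrative.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year

# Cause shares (percent of total network downtime) as quoted in the text.
CAUSE_SHARES = {
    "router failure": 23,
    "link failure": 32,
    "router maintenance": 36,
    "other": 9,
}


def availability(downtime_minutes_per_year):
    """Return the fraction of the year during which the element is up."""
    return 1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR


def downtime_by_cause(total_minutes, shares):
    """Split a total downtime figure according to percentage shares."""
    return {cause: total_minutes * pct / 100.0 for cause, pct in shares.items()}


if __name__ == "__main__":
    # Measured interface downtime versus the carrier-class target.
    print(f"measured interface availability: {availability(955):.4f}")
    print(f"carrier-class availability:      {availability(5.2):.6f}")

    # Assumption: the cause shares are applied to the 955-minute figure
    # purely to illustrate the relative size of each contribution.
    for cause, minutes in downtime_by_cause(955, CAUSE_SHARES).items():
        print(f"{cause:>18}: {minutes:6.1f} minutes per year")
```

Under these assumptions, the measured figure rounds to roughly "three nines" of availability, while the carrier-class target of 5.2 minutes per year corresponds to roughly "five nines."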

According to another study, router software failures are the single biggest (25 percent) cause of all router outages [4]. Moreover, within software-related outages, router control-plane failure is the biggest (60 percent) cause of software failures. The following section provides a brief overview of various node- and network-level fault-tolerance approaches that can help to improve network availability.

