NX-OS Troubleshooting Tools

Chapter Description

In this sample chapter from Troubleshooting Cisco Nexus Switches and NX-OS, you will review the various tools available on the Nexus platform that can help in troubleshooting and day-to-day operation.

Embedded Event Manager

Embedded Event Manager (EEM) is a powerful device- and system-management technology integrated into NX-OS. EEM helps customers harness the network intelligence intrinsic to Cisco’s software and gives them the capability to customize behavior based on network events as they happen. EEM is an event-driven tool that takes various types of trigger input and lets the user define the actions to take in response. These actions include capturing the output of show commands or executing a Tool Command Language (TCL) or Python script when the event is triggered.

EEM consists of two major components:

  • Event: Defines the event to be monitored from another NX-OS component

  • Action: Defines the action to be taken when the event is triggered

Another component of EEM is the EEM policy, which pairs an event with one or more actions to help troubleshoot or recover from that event. Some system-defined policies watch for system-level events, such as a line card reload or a supervisor switchover, and then perform predefined actions. These system policies are viewed using the command show event manager system-policy, whose output also indicates whether each policy can be overridden. System policies help prevent a larger impact on the device or the network. For instance, if a faulty module keeps crashing, it can severely impact services and cause major outages. A system policy that powers down the module after N crashes reduces that impact.

Example 2-24 lists some of the system policies and describes the action each one takes. The command show event manager policy-state system-policy-name shows how many times the event has occurred and how many more occurrences are needed before the policy triggers.

Example 2-24 EEM System Policy

NX-1# show event manager system-policy
           Name : __lcm_module_failure
    Description : Power-cycle 2 times then power-down
    Overridable : Yes
 
           Name : __pfm_fanabsent_any_singlefan
    Description : Shutdown if any fanabsent for 5 minute(s)
    Overridable : Yes
 
           Name : __pfm_fanbad_any_singlefan
    Description : Syslog when fan goes bad 
    Overridable : Yes
 
           Name : __pfm_power_over_budget
    Description : Syslog warning for insufficient power overbudget
    Overridable : Yes
 
           Name : __pfm_tempev_major
    Description : TempSensor Major Threshold.  Action: Shutdown
    Overridable : Yes

           Name : __pfm_tempev_minor
    Description : TempSensor Minor Threshold.  Action: Syslog.
    Overridable : Yes
NX-1# show event manager policy-state __lcm_module_failure
Policy __lcm_module_failure  
  Cfg count :   3
    Hash        Count       Policy will trigger if
----------------------------------------------------------------
  default         0        3 more event(s) occur
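
The policy-state output in Example 2-24 can be read as a counter compared against a configured threshold. The following Python sketch is only a conceptual model of that bookkeeping, not how NX-OS implements it; the class name PolicyState and its methods are hypothetical.

```python
class PolicyState:
    """Toy model of an EEM policy-state counter (hypothetical, not NX-OS code)."""

    def __init__(self, name, cfg_count):
        self.name = name
        self.cfg_count = cfg_count  # events required before the policy triggers
        self.count = 0              # events seen so far

    def remaining(self):
        """Number of further events needed before the policy triggers."""
        return max(self.cfg_count - self.count, 0)

    def record_event(self):
        """Count one event and report whether the threshold has been reached."""
        self.count += 1
        return self.count >= self.cfg_count


state = PolicyState("__lcm_module_failure", cfg_count=3)
print(state.remaining())                          # 3
print([state.record_event() for _ in range(3)])   # [False, False, True]
```

With a configured count of 3 and no events recorded, the model reports three events remaining, which matches the "3 more event(s) occur" line in the output.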

An event can be either a system event or a user-triggered event, such as a configuration change. Actions define the workaround or notification to run when the event occurs. EEM supports the following actions, which are defined in the action statement:

  • Executing CLI commands (configuration or show commands)

  • Updating the counter

  • Logging exceptions

  • Reloading devices

  • Printing a syslog message

  • Sending an SNMP notification

  • Setting the default action policy for the system policy

  • Executing a TCL or Python script

For example, an action can be taken when high CPU utilization is seen on the switch, or logs can be collected when a BGP session has flapped. Example 2-25 shows an EEM configuration on a Nexus platform. The applet's trigger event is a high CPU condition (CPU utilization of 70% or higher); the actions capture BGP show commands when that condition is detected. The policy is viewed using the command show event manager policy internal policy-name.

Example 2-25 EEM Configuration and Verification

event manager applet HIGH-CPU
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.6.1 get-type exact entry-op ge entry-val 70 exit-val 30 poll-interval 1
 action 1.0 syslog msg High CPU hit $_event_pub_time
 action 2.0 cli command enable
 action 3.0 cli command "show clock >> bootflash:high-cpu.txt"
 action 4.0 cli command "show processes cpu sort >> bootflash:high-cpu.txt"
 action 5.0 cli command "show bgp vrf all all summary >> bootflash:high-cpu.txt"
 action 6.0 cli command "show clock >> bootflash:high-cpu.txt"
 action 7.0 cli command "show bgp vrf all all summary >> bootflash:high-cpu.txt"

NX-1# show event manager policy internal HIGH-CPU
                          Name : HIGH-CPU    
                   Policy Type : applet
   action 1.0 syslog msg "High CPU hit $_event_pub_time"
   action 1.1 cli command "enable" 
   action 3.0 cli command "show clock >> bootflash:high-cpu.txt" 
   action 4.0 cli command "show processes cpu sort >> bootflash:high-cpu.txt" 
   action 5.0 cli command "show bgp vrf all all summary >> bootflash:high-cpu.txt" 
   action 6.0 cli command "show clock >> bootflash:high-cpu.txt"
   action 7.0 cli command "show bgp vrf all all summary >> bootflash:high-cpu.txt"
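
The entry-val and exit-val pair in Example 2-25 keeps the applet from firing on every poll while the CPU stays high. The Python sketch below models that behavior under a stated assumption: the event fires when a sample is at or above the entry value and is then suppressed until a later sample falls to the exit value or below. The function name and the sample data are hypothetical.

```python
def high_cpu_events(samples, entry_val=70, exit_val=30):
    """Yield the index of each polled sample that would trigger the event.

    Assumption: after firing, the event stays suppressed until a sample
    drops to exit_val or below, at which point it is armed again.
    """
    armed = True
    for index, cpu in enumerate(samples):
        if armed and cpu >= entry_val:
            armed = False
            yield index
        elif not armed and cpu <= exit_val:
            armed = True


# One CPU reading per poll interval (made-up values for illustration).
samples = [20, 75, 90, 60, 25, 80]
print(list(high_cpu_events(samples)))  # [1, 5]
```

The readings of 90 and 60 do not fire a second time because the CPU has not yet dropped to 30 or below; the applet's show commands would therefore be captured once per high-CPU episode rather than once per poll.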

In some instances, repetitive configuration or show commands must be issued when an event is triggered. An external script is a poor fit for this because it must continuously monitor the device for the event before it can act. A better solution is to use the automation capabilities available in NX-OS: TCL and Python scripts can be referenced from EEM itself, so they run only when the event is triggered.

Consider an example software problem in which any link shutdown on the switch causes switching to be disabled on all the VLANs present on the switch. Example 2-26 demonstrates triggering a TCL script on a link shutdown. The TCL script is saved on the bootflash with the .tcl extension. It iterates over VLAN IDs 1 through 9 and performs a no shutdown under the VLAN configuration mode for each one.

Example 2-26 EEM with TCL Script

! Save the file in bootflash with the .tcl extension
# Re-enable VLANs 1 through 9 (the range is hard-coded in this example)
set i 1
while {$i < 10} {
    cli configure terminal
    cli vlan $i
    cli no shutdown
    cli exit
    incr i
}

! EEM Configuration referencing TCL Script
event manager applet TCL
 event cli match "shutdown"
 action 1.0 syslog msg "Triggering TCL Script on Link Shutdown Event"
 action 2.0 cli local tclsh EEM.tcl
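
To preview what the TCL loop in Example 2-26 sends to the switch, the same sequence can be built as a list of strings. This is a minimal Python sketch that runs anywhere and does not talk to a device; the function name vlan_no_shutdown_commands is hypothetical.

```python
def vlan_no_shutdown_commands(first=1, last=9):
    """Return the CLI lines the TCL loop issues for VLANs first..last."""
    commands = []
    for vlan_id in range(first, last + 1):
        commands.extend([
            "configure terminal",
            "vlan {}".format(vlan_id),
            "no shutdown",
            "exit",
        ])
    return commands


preview = vlan_no_shutdown_commands()
print(len(preview))   # 36 (four commands for each of nine VLANs)
print(preview[:4])    # ['configure terminal', 'vlan 1', 'no shutdown', 'exit']
```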

Similarly, a Python script can be referenced in an EEM applet. The Python script is also saved in the bootflash, with the .py extension. Example 2-27 illustrates a Python script and its reference in the EEM applet. In this example, the applet is triggered when the traffic on an interface exceeds the configured storm-control threshold. When that happens, the Python script collects the output of multiple show commands.

Example 2-27 Python Script with EEM

! Save the Python script in bootflash:
import cisco

# Append the output of each show command to a file on bootflash
cisco.cli("show module >> bootflash:EEM.txt")
cisco.cli("show redundancy >> bootflash:EEM.txt")
cisco.cli("show interface >> bootflash:EEM.txt")

! EEM Configuration referencing Python Script
event manager applet Py_EEM
 event storm-control
 action 1.0 syslog msg "Triggering Python Script on Storm-Control Event"
 action 2.0 cli local python EEM.py
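
As the list of commands grows, the collection script in Example 2-27 can be driven from a list instead of repeating the call. The sketch below is one possible arrangement, not the book's script: the collect function and its run_cli parameter are hypothetical, and the CLI runner is passed in so the logic can be dry-run off the switch. On the switch, cisco.cli from Example 2-27 would be passed as run_cli.

```python
COMMANDS = ["show module", "show redundancy", "show interface"]


def collect(commands, run_cli, outfile="bootflash:EEM.txt"):
    """Run each command with its output appended to outfile.

    run_cli is any callable that accepts one CLI string. Returns the
    full command strings that were issued, in order.
    """
    issued = []
    for command in commands:
        full_command = "{} >> {}".format(command, outfile)
        run_cli(full_command)
        issued.append(full_command)
    return issued


if __name__ == "__main__":
    # Off-box dry run: the runner does nothing, so only the strings are shown.
    # On the switch (assumption): import cisco; collect(COMMANDS, cisco.cli)
    for line in collect(COMMANDS, run_cli=lambda cmd: None):
        print(line)
```

Injecting the runner keeps the command list in one place and makes it easy to check exactly what would be written to bootflash before the applet is deployed.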
