Root Cause Analysis

 

Root Cause Analysis is a Fault Management feature in the Event Browser application that helps you diagnose trap events and recommends corrective actions. To access it, right-click an event and select Root Cause Analysis from the pop-up menu. The feature looks up the list of rules defined for the device and event type, performs the user-defined actions, searches the output of those actions, and highlights the expected results when they are found. The expected results can be used to diagnose the cause of the event and to suggest further action.

Figure 1: Root Cause Analysis

The list of rules is defined in rca-rules, a comma-separated values file located in the directory /u/wandl/db/config/.

Rules can be added, deleted, or modified by editing the entries in a text editor. The rca-rules file uses the following fields and keywords.



RCA-Rules Fields



  • Format of file: vendor, type, action, expected-result, comment

  • vendor is the name of the device vendor. For example: cisco, juniper, huawei.

  • type is the name of the SNMP trap. For example: linkUp, linkDown, jnxVpnPwDown.

  • action is the action to take. It can be a command executed through the device CLI, a command executed on the application server, an SNMP query, or the posting of an event. Conditional actions can also be defined.

  • expected-result is a string that is searched for, and highlighted, in the output of the defined action. For example: “line protocol is down”. It supports variables such as (ElementName), simple regular expressions, and the logical operators “&&” and “||”.

  • comment is the message displayed when the expected-result is found. For example: Check cable connection or if administratively down.
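
As a rough illustration of the five-field layout above, the following Python sketch splits one rule line into its fields. It assumes that the file follows ordinary CSV quoting, so that the standard csv module can read it. The RcaRule type and the parse_rule_line function are hypothetical names used only for this example and are not part of the product.

```python
import csv
from typing import NamedTuple, Optional


class RcaRule(NamedTuple):
    vendor: str
    trap_type: str
    action: str
    expected_result: str
    comment: str


def parse_rule_line(line: str) -> Optional[RcaRule]:
    """Split one rca-rules line into its five fields; None for a blank line."""
    stripped = line.strip()
    if not stripped:
        return None
    # Assumption: standard CSV quoting, so "Flags: Down" loses its quotes.
    fields = next(csv.reader([stripped]))
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return RcaRule(*(field.strip() for field in fields))


rule = parse_rule_line(
    'juniper,linkDown,@cli:show interface (ElementName),"Flags: Down",Confirmed status down'
)
print(rule.action)           # @cli:show interface (ElementName)
print(rule.expected_result)  # Flags: Down
```

A comment field that itself contains a comma would need to be quoted for this sketch to count five fields.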



RCA-Rules General Keywords



  • (ElementName) corresponds to the Element Name variable in the Event Browser.

  • (Device) corresponds to the Device ID variable in the Event Browser.

  • # comments out a line so that it is not parsed.
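
The keywords above can be pictured with a short Python sketch: one helper that recognizes a commented-out line and one that substitutes event values for the variables. Both function names are hypothetical, and the sketch assumes that the # sits at the start of the line and that a variable is replaced by plain text substitution.

```python
from typing import Dict


def is_commented_out(line: str) -> bool:
    """True when the line starts with '#' (assumed position of the marker)."""
    return line.lstrip().startswith("#")


def expand_variables(template: str, event: Dict[str, str]) -> str:
    """Replace each (Name) placeholder with the matching event value."""
    result = template
    for name, value in event.items():
        result = result.replace(f"({name})", value)
    return result


event = {"ElementName": "ge-0/0/1.3", "Device": "J4"}
print(expand_variables("show interface (ElementName)", event))
# show interface ge-0/0/1.3
print(is_commented_out("# juniper,linkDown,@cli:show interface,,"))
# True
```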



RCA-Rules Action Commands



  • @cli:<command> specifies that the action is a command on the device CLI. For example: @cli:show interface

  • @sh:<command> specifies that the action is a command on the application server. For example: @sh:/u/wandl/bin/status_mplsview

  • @snmp:<OID> specifies that the action is an SNMP query on the OID value. For example: @snmp:1.3.6.1.2.1.1.1.0
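
To make the three prefixes concrete, here is a minimal Python sketch that separates the action command type from the command or OID that follows it. The split_action function is a hypothetical helper, not the product's parser. It returns None as the type when no prefix is present.

```python
from typing import Optional, Tuple

# The three action command prefixes listed above.
ACTION_PREFIXES = ("@cli:", "@sh:", "@snmp:")


def split_action(action: str) -> Tuple[Optional[str], str]:
    """Return (type, body), where type is 'cli', 'sh', 'snmp', or None."""
    text = action.strip()
    for prefix in ACTION_PREFIXES:
        if text.startswith(prefix):
            kind = prefix[1:-1]  # drop the leading '@' and trailing ':'
            return kind, text[len(prefix):].strip()
    return None, text


print(split_action("@cli:show interface"))
# ('cli', 'show interface')
print(split_action("@snmp:1.3.6.1.2.1.1.1.0"))
# ('snmp', '1.3.6.1.2.1.1.1.0')
```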



RCA-Rules Conditional Action



  • Only the action command (@cli:, @sh:, or @snmp:) is required in the action field. The labelname:, @match:, and @notmatch: keywords are optional and are used for conditional action statements. If an action command is not specified, the root cause analysis parser attempts to identify the type of command, although defining the action command type is recommended.

  • Format of conditional action field: <labelname:> [@cli: | @sh: | @snmp:]<command> @match: <labelname> @notmatch: <labelname>

  • <labelname:> tags an action with a label used for conditional actions. For example: mylabel:

  • @match: <labelname> skips to the rule tagged with that labelname if the expected-result matches.

  • @notmatch: <labelname> skips to the rule tagged with that labelname if the expected-result does not match.

  • exit ignores all the remaining rules and exits the root cause analysis. It is used in place of a labelname, for example @match: exit.
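
The conditional action format can be sketched as a small Python parser that pulls the optional label and the optional @match: and @notmatch: targets out of an action field. This is an illustration only: the ConditionalAction type and parse_action_field function are hypothetical, and the sketch assumes that a label is a leading word followed by a colon and a space, and that jump targets contain no spaces.

```python
import re
from typing import NamedTuple, Optional


class ConditionalAction(NamedTuple):
    label: Optional[str]        # labelname tagging this action, if any
    command: str                # action command, including its @cli:/@sh:/@snmp: prefix
    on_match: Optional[str]     # target of @match:, if any
    on_notmatch: Optional[str]  # target of @notmatch:, if any


_LABEL = re.compile(r"^([A-Za-z_]\w*):\s+")
_JUMP = re.compile(r"\s+@(match|notmatch):\s*(\S+)")


def parse_action_field(field: str) -> ConditionalAction:
    text = field.strip()
    label = None
    found = _LABEL.match(text)
    if found:
        label = found.group(1)
        text = text[found.end():]
    jumps = dict(_JUMP.findall(text))
    command = _JUMP.sub("", text).strip()
    return ConditionalAction(label, command, jumps.get("match"), jumps.get("notmatch"))


print(parse_action_field("mylabel: @cli:show interface @match: exit"))
# ConditionalAction(label='mylabel', command='@cli:show interface', on_match='exit', on_notmatch=None)
```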



Usage



Once the rules have been defined in the rca-rules file, they appear in the Root Cause Analysis table. Multiple actions may be specified for the same vendor and type; they execute in sequential order. In the Root Cause Analysis table, select the entry you want to analyze.

The Root Cause Message tab will display the command action to take, the command type, the status of the action, the expected-result matching string, and the comment message.

Press Analyze to execute the actions. A pop-up window will allow you to select the commands to execute.

The Root Cause Message tab now displays the result of the action command. If the expected-result string is found, the Status indicates Matched and the string is highlighted. If the expected-result string is not found, the Status indicates Not Matched. If no expected-result string is defined, the Status indicates Executed. When conditional actions are used and a rule is skipped, the Status indicates Skipped.
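
The four Status values can be summarized in a short Python sketch. It is a simplification under stated assumptions, not the product's matching logic: plain substring search stands in for the simple regular expressions, “||” is assumed to bind more loosely than “&&”, and a lone “*” is treated as “any non-empty output”. The function names are hypothetical.

```python
def _term_found(term: str, output: str) -> bool:
    if term == "*":
        return bool(output.strip())  # wildcard: any value returned
    return term in output            # substring search (simplification)


def expected_result_found(expected: str, output: str) -> bool:
    """Evaluate '||' alternatives, each a conjunction of '&&' terms."""
    return any(
        all(_term_found(term.strip(), output) for term in alternative.split("&&"))
        for alternative in expected.split("||")
    )


def rule_status(expected: str, output: str, skipped: bool = False) -> str:
    if skipped:
        return "Skipped"
    if not expected.strip():
        return "Executed"  # no expected-result defined
    return "Matched" if expected_result_found(expected, output) else "Not Matched"


print(rule_status("Flags: Down", "Physical link is Down\n  Flags: Down"))
# Matched
print(rule_status("", "any output"))
# Executed
```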

The results can be saved to a file, viewed in a new window, or printed using the icons in the bottom left of the window.



Sample Cases



The following sample cases walk through creating new rules in the rca-rules file and using the Root Cause Analysis feature. The samples use a linkDown event and a CollectionError event to illustrate several of the keywords and action commands.

Figure 2: Sample Case for Link Down Events


Sample linkDown rca-rule



Open the rca-rules file located in the /u/wandl/db/config directory. Copy the following four statements into the file to create the rules for a Juniper link down event. Note that the “Rule #:” prefix is not part of the rule statement.

  • Rule 1: juniper,linkDown,@cli:show interface (ElementName),"Flags: Down",Confirmed status down

  • Rule 2: juniper,linkDown,@cli:show configuration interfaces (ElementName) @match: operation @notmatch: admin,"disable",Check administrative down

  • Rule 3: juniper,linkDown,admin: @snmp:1.3.6.1.2.1.2.2.1.7,,Check interface admin status 2 for down

  • Rule 4: juniper,linkDown,operation: @snmp:1.3.6.1.2.1.2.2.1.8,,Check interface operation status 2 or 7 for down

    Right-click a link down event and select Root Cause Analysis to open the Root Cause Analysis window. In this sample, an event of Type linkDown on Device ID J4 with ElementName ge-0/0/1.3 is selected. The table lists Device J4 as the device to be analyzed. The Root Cause Message tab shows the four rules defined for this event.

    Figure 3: Root Cause Message Tab

Descriptions for each rule:

  • Rule 1 runs the command “show interface ge-0/0/1.3” on the device CLI. The (ElementName) variable is equal to ge-0/0/1.3. The expected-result is the string “Flags: Down” in the output of the command. The comment message is “Confirmed status down.”

  • Rule 2 is a conditional action. It runs the command “show configuration interfaces ge-0/0/1.3” on the device CLI. The (ElementName) variable is equal to ge-0/0/1.3. The expected-result is the string “disable”. The comment message is “Check administrative down.” If the expected-result matches, execution skips to the labelname “operation”, which tags Rule 4. If the expected-result does not match, execution skips to the labelname “admin”, which tags Rule 3.

  • Rule 3 has the labelname “admin”. It runs an SNMP query on OID 1.3.6.1.2.1.2.2.1.7 (ifAdminStatus). There is no expected-result string to match. The comment message is “Check interface admin status 2 for down.” The SNMP query returns a list of all the interfaces. The MIB Index of the interface ge-0/0/1.3 is shown in the Event Details of the Event Browser window. See Figure 192, which highlights the MIB attributes.

  • Rule 4 has the labelname “operation”. It runs an SNMP query on OID 1.3.6.1.2.1.2.2.1.8 (ifOperStatus). There is no expected-result string to match. The comment message is “Check interface operation status 2 or 7 for down.” The SNMP query returns a list of all the interfaces. The MIB Index of the interface ge-0/0/1.3 is shown in the Event Details of the Event Browser window. See Figure 192, which highlights the MIB attributes.
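
The jump behavior described for Rules 1 to 4 can be traced with a small Python sketch. It is not the product's engine: Step and executed_rules are hypothetical names, a rule without an expected-result is assumed never to jump, and execution is assumed to continue sequentially after landing on a label.

```python
from typing import Dict, List, NamedTuple, Optional


class Step(NamedTuple):
    name: str
    label: Optional[str] = None        # labelname tagging this rule
    on_match: Optional[str] = None     # @match: target
    on_notmatch: Optional[str] = None  # @notmatch: target


def executed_rules(steps: List[Step], matched: Dict[str, bool]) -> List[str]:
    """Return the names of the rules that run, in order.

    matched maps a rule name to whether its expected-result was found;
    rules missing from the map have no expected-result.
    """
    label_index = {s.label: i for i, s in enumerate(steps) if s.label}
    ran: List[str] = []
    i = 0
    while i < len(steps):
        step = steps[i]
        ran.append(step.name)
        hit = matched.get(step.name)
        if hit is True:
            target = step.on_match
        elif hit is False:
            target = step.on_notmatch
        else:
            target = None
        if target is None:
            i += 1          # no jump: continue with the next rule
        elif target == "exit":
            break           # exit: ignore all remaining rules
        else:
            nxt = label_index[target]
            if nxt <= i:
                raise ValueError("backward jumps are not handled in this sketch")
            i = nxt         # rules between i and nxt are skipped
    return ran


link_down = [
    Step("Rule 1"),
    Step("Rule 2", on_match="operation", on_notmatch="admin"),
    Step("Rule 3", label="admin"),
    Step("Rule 4", label="operation"),
]

print(executed_rules(link_down, {"Rule 1": True, "Rule 2": True}))
# ['Rule 1', 'Rule 2', 'Rule 4']
print(executed_rules(link_down, {"Rule 1": True, "Rule 2": False}))
# ['Rule 1', 'Rule 2', 'Rule 3', 'Rule 4']
```

In the first call “disable” is found, so Rule 3 is skipped; in the second it is not found, so the admin status query runs before the operation status query.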

    Select the Node entry from the table and press Analyze to open a list of the commands to run.

    Figure 4: Check Command List

    The Root Cause Message tab displays the results of executing the rules. In this sample, the linkDown event was found to be caused by an administrator disabling the interface. The combination of rules also confirms that the SNMP query returned an operation status of down. The results can be saved to a file, viewed in a new window, or printed using the icons in the bottom left of the window.

    Figure 5: Root Cause Message Tab Result


Sample CollectionError rca-rule



Open the rca-rules file located in the /u/wandl/db/config directory. Copy the following four statements into the file to create the rules for a Cisco collection error event. Note that the “Rule #:” prefix is not part of the rule statement.

  • Rule 5: cisco,CollectionError,@sh:ping (SourceIP),(SourceIP) is alive,Device is reachable

  • Rule 6: cisco,CollectionError,@snmp:1.3.6.1.2.1.1.5.0 @match: exit,*,SNMP query on sysName returns a value

  • Rule 7: cisco,CollectionError,@sh: grep "TRAP_IP" /u/wandl/bin/mplsenvsetup.sh | nawk -F"=|;" '{print $2}',,This is the SNMP server IP receiving traps

  • Rule 8: cisco,CollectionError,@cli:show run | begin snmp-server,snmp-server enable traps && snmp-server host,Check host target contains SNMP server IP

    Right-click a collection error event and select Root Cause Analysis to open the Root Cause Analysis window. In this sample, an event of Type CollectionError on Device ID 2924 with Source IP 192.10.21.188 is selected. The table lists Device 2924 as the device to be analyzed. The Root Cause Message tab shows the four rules defined for this event.

    Descriptions for each rule:

  • Rule 5 runs the command “ping 192.10.21.188” on the application server. The (SourceIP) variable is equal to 192.10.21.188. The expected-result is the string “192.10.21.188 is alive” in the output of the command. The comment message is “Device is reachable.”

  • Rule 6 is a conditional action. It runs an SNMP query on OID 1.3.6.1.2.1.1.5.0. The expected-result string is a wildcard character, which matches any returned value. The comment message is “SNMP query on sysName returns a value.” If the query returns a value, the keyword exit takes effect: all remaining rules are ignored and the root cause analysis exits. If the query does not return a value, the next rule is executed.

  • Rule 7 runs the command “grep "TRAP_IP" /u/wandl/bin/mplsenvsetup.sh | nawk -F"=|;" '{print $2}'” on the application server. This returns the configured IP address of the SNMP server that receives traps. There is no expected-result string to match. The comment message is “This is the SNMP server IP receiving traps.”

  • Rule 8 runs the command “show run | begin snmp-server” on the device CLI. The expected-result string uses the && operator, which requires both “snmp-server enable traps” and “snmp-server host” to be found for a match. The comment message is “Check host target contains SNMP server IP.”
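
For readers less familiar with the pipeline in Rule 7, the Python sketch below mimics what it does: keep the lines that contain TRAP_IP, split each on “=” or “;”, and return the second field. The sample line is hypothetical; the actual contents of mplsenvsetup.sh are not shown in this document, and trap_ip_fields is a name invented for this example.

```python
import re
from typing import List


def trap_ip_fields(script_text: str) -> List[str]:
    """Mimic: grep "TRAP_IP" <file> | nawk -F"=|;" '{print $2}'"""
    values = []
    for line in script_text.splitlines():
        if "TRAP_IP" not in line:       # grep "TRAP_IP"
            continue
        fields = re.split(r"=|;", line)  # -F"=|;"
        # print $2: the second field, or an empty string if there is none
        values.append(fields[1] if len(fields) > 1 else "")
    return values


# Hypothetical line; the real file layout is an assumption.
sample = "OTHER_SETTING=1\nTRAP_IP=192.0.2.10; export TRAP_IP\n"
print(trap_ip_fields(sample))
# ['192.0.2.10']
```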

    In this sample, the combination of rules checks a CollectionError event by first trying to ping the device. It then runs an SNMP query on sysName to check whether an SNMP get returns any value. If sysName can be queried, the entire rule set exits because SNMP collection should be working. If the SNMP query fails, the next rule displays the SNMP server IP configured on the application server. The final rule displays the snmp-server configuration on the device and reminds you to check whether your SNMP server IP is configured as a host target.