Automatically Monitor Device Health and Detect Anomalies

Use this topic to understand how Paragon Automation automatically monitors device health and detects anomalies, and how you can use the GUI to view anomalies related to device health.

Device Health Monitoring and Anomaly Detection Overview

Note:
  • Device health monitoring and anomaly detection is a beta feature in this release.

  • To monitor device health, you must enable AI/ML (install-aiml) and device health monitoring (enable-device-health) when you install the Paragon Automation cluster. For more information, see Deploy the Cluster.

To ascertain the health of a network, you need to monitor the health of the devices in the network. Paragon Automation uses AI/ML (artificial intelligence [AI] and machine learning [ML]) techniques to automatically monitor Key Performance Indicators (KPIs) related to a device's health, and automatically detects any anomalies that occur. Paragon Automation also performs a root-cause analysis (RCA) of device temperature anomalies when the device is in operation.

The periodic monitoring of the device's health status and the timely detection of device health anomalies enable operators to take action and minimize the impact of any issues that occur.

Paragon Automation monitors device health in the following scenarios:

  • During device onboarding: When a device is being onboarded, Paragon Automation monitors the device's health and generates an alert if any anomalies occur.

    If other devices of the same model were previously onboarded, Paragon Automation compares the new device's data with the data from those devices to detect anomalies. However, if a device of a particular model is being onboarded for the first time, the efficacy of the anomaly detection is limited because of the lack of historical data.

  • During device operation: After the device is onboarded successfully and is managed, Paragon Automation continuously monitors the KPIs related to device health. For each KPI of each device, Paragon Automation monitors the KPI, forecasts the expected range, and detects any anomalies that occur. During device operation, Paragon Automation detects device health anomalies (within 30 minutes) based on historical data for that device and the forecasted range.
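
The idea of comparing each new KPI sample against a range forecast from that device's history can be illustrated with a minimal Python sketch. This is not the model that Paragon Automation uses; the topic does not describe the ML models. The sketch assumes a simple rolling mean plus or minus k standard deviations as the forecasted range, and the function names, window size, and sample data are hypothetical.

```python
from statistics import mean, stdev

def forecast_range(history, k=3.0):
    """Return (lower, upper) bounds derived from past samples of one KPI.

    Assumption: a mean +/- k * stdev band stands in for the real forecast.
    """
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread

def detect_anomalies(samples, window=12, k=3.0):
    """Return (index, value, lower, upper) for each sample outside its range."""
    anomalies = []
    for i in range(window, len(samples)):
        # Forecast the range for sample i from the preceding `window` samples.
        lower, upper = forecast_range(samples[i - window:i], k)
        if not lower <= samples[i] <= upper:
            anomalies.append((i, samples[i], lower, upper))
    return anomalies

# Hypothetical Routing Engine CPU utilization samples (%).
cpu_util = [41, 43, 42, 40, 44, 42, 41, 43, 42, 44, 43, 41, 42, 88, 43]
for index, value, lower, upper in detect_anomalies(cpu_util):
    print(f"sample {index}: {value}% is outside [{lower:.1f}, {upper:.1f}]")
```

In this sketch, only the spike at index 13 is reported. A device model with no history has no samples to forecast from, which mirrors the limitation described for first-time onboarding.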

Note:

In the validation phase, the mean absolute percentage error (MAPE) score for the ML models used in device health monitoring varied between 2.5 and 6.5.
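
For reference, MAPE (mean absolute percentage error) is conventionally computed as the average of |actual - forecast| / |actual|, expressed as a percentage. The following Python sketch shows that standard definition. It assumes the scores quoted above are percentages, which this topic does not state explicitly, and the sample values are hypothetical.

```python
def mape(actual, forecast):
    """Return the mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)

# Hypothetical observed and forecasted temperatures.
actual_temp = [50.0, 40.0, 80.0]
forecast_temp = [48.0, 42.0, 76.0]
score = mape(actual_temp, forecast_temp)
print(f"MAPE: {score:.2f}")
```

A lower score means that the forecasts tracked the observed values more closely.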

RCA of Temperature Anomalies

When a device is in operation, Paragon Automation provides RCA for issues related to the Routing Engine temperature and Routing Engine CPU temperature. Paragon Automation analyzes the different attributes (CPU utilization percentage, fan RPM percentage, and inlet temperature) that could cause a temperature issue. Paragon Automation also compares the device's temperature to an expected range. Based on the analysis and comparison, Paragon Automation provides an alert, an expected reason for the issue, and details on the events that might have caused the issue. Figure 1 displays a sample page showing the RCA logs for an anomaly in the Routing Engine temperature.
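
As a rough illustration of this kind of analysis, the following Python sketch checks each contributing attribute against an expected range and reports the attributes that fall outside it as possible causes. It is a simplified stand-in, not the RCA logic that Paragon Automation implements. The attribute keys, the ranges, and the readings are all hypothetical.

```python
# Hypothetical expected ranges (low, high) for each contributing attribute.
EXPECTED = {
    "cpu_utilization_pct": (0.0, 70.0),
    "fan_rpm_pct": (30.0, 100.0),
    "inlet_temperature_c": (15.0, 35.0),
}

def probable_causes(readings, expected=EXPECTED):
    """Return a description of each attribute outside its expected range."""
    causes = []
    for name, (low, high) in expected.items():
        value = readings[name]
        if value > high:
            causes.append(f"{name} is high ({value} > {high})")
        elif value < low:
            causes.append(f"{name} is low ({value} < {low})")
    return causes

# Hypothetical readings taken when a temperature anomaly was raised.
readings = {
    "cpu_utilization_pct": 93.0,
    "fan_rpm_pct": 55.0,
    "inlet_temperature_c": 24.0,
}
causes = probable_causes(readings)
for cause in causes:
    print("Possible cause:", cause)
```

With these readings, only the CPU utilization is flagged, which would point to load rather than cooling or ambient conditions as the likely reason for the temperature rise.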

Figure 1: Sample Page Showing RCA for Device Temperature Anomaly

  1. Device Temperature RCA Details


Device Health KPIs

Table 1 displays the device health KPIs that Paragon Automation monitors for each device.

Table 1: KPIs Related to Device Health

KPI | Component | Parameters
CPU | Routing Engine, Line card | CPU Utilization Percentage (%)
Memory | Routing Engine, Line card | Memory Utilization Percentage (%)
Fan | Not applicable | RPM Percentage (%)
Temperature | Routing Engine (RE), Routing Engine CPU, Line card, Line card CPU | Current temperature

Device Health Anomalies in the GUI

You can view and monitor the device health anomalies for a device on the Hardware accordion of the Device-Name page.

To view and monitor device health anomalies:

  1. Do one of the following.
    • To view and monitor device health anomalies during device onboarding, select Inventory > Device Onboarding > Onboarding Dashboard > Put Devices into Service > Device-Name.

    • To view and monitor device health anomalies during device operation, select Observability > Troubleshoot Devices > Device-Name.

    The Device-Name page appears.

  2. Scroll to the Hardware accordion and click > to expand the accordion.
    • The Chassis section of the accordion displays the health status of the following KPIs monitored by Paragon Automation:

      • Fans

      • CPU

      • Linecards

      • Memory

      • Temperature

    • Device events appear under Relevant Events with the following information:

      • Event notification message

      • Date and time that the last event was received by Paragon Automation.

  3. Hover over or click View Details to view the details of the event, including the number of times that the event recurred.
  4. (Optional) Click View All Relevant Events to view all the health-related events for the device.

    The events appear on the Events for Device-Name page.

  5. You can view detailed information about each KPI related to device health by doing the following:
    1. Click the health status link for the KPI; for example, Fans or Temperature.

      The Hardware Details for Device-Name page appears, displaying the section for the KPI that you clicked on the preceding page.

      For example, if you click the link for Fans, then the Fans section is expanded and the graphs related to the fans are displayed.

      Figure 2 shows a sample section (Fans) of the Hardware Details for Device-Name page.

    2. To view the details of an anomaly, click the yellow triangle icon on the graph.

      The details of the anomaly appear in a pop-up, as shown in Figure 2.

  6. Click Close or the X icon to go to the Device-Name page.

For more information on the Hardware accordion, see Hardware Data and Test Results.

Figure 2: Sample Hardware Details for Device-Name Page

  1. KPI
  2. Legend showing the colors for different sub-components used in the graphs
  3. Circle icons indicating that the KPI is normal
  4. Critical threshold marker
  5. High threshold marker
  6. Triangle icons indicating an anomaly when the higher threshold is breached
  7. Pop-up showing details of device health anomaly
  8. Upper and lower boundaries (dynamic thresholds) for the data displayed in the graph
  9. Hexagon icons indicating an anomaly when the critical threshold is breached
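
The relationship between the graph icons and the threshold markers in Figure 2 can be summarized with a short Python sketch. It rests on several assumptions that this topic does not confirm: that the "higher threshold" for triangle icons is the high threshold marker, that the critical threshold lies above the high threshold, and that a breach means the value exceeds the threshold. The dynamic upper and lower boundaries are ignored. The function name and the sample values are hypothetical.

```python
def marker_for(value, high, critical):
    """Return the icon shape expected for a KPI sample.

    Assumptions: critical >= high, and a breach means value > threshold.
    """
    if critical < high:
        raise ValueError("critical threshold must not be below high threshold")
    if value > critical:
        return "hexagon"   # anomaly: critical threshold breached
    if value > high:
        return "triangle"  # anomaly: high threshold breached
    return "circle"        # KPI is normal

# Hypothetical fan RPM percentages with hypothetical thresholds.
for rpm_pct in (45.0, 82.0, 97.0):
    print(rpm_pct, marker_for(rpm_pct, high=80.0, critical=95.0))
```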