Automatically Monitor Device Health and Detect Anomalies

Use this topic to understand how Paragon Automation automatically monitors device health and detects anomalies, and how you can use the GUI to view anomalies related to device health.

Device Health Monitoring and Anomaly Detection Overview

Note:

Device health monitoring and anomaly detection is a beta feature in this release.
To monitor device health, you must enable AI/ML (install-aiml) and device health monitoring (enable-device-health) when you install the Paragon Automation cluster. For more information, see Deploy the Cluster.

To ascertain the health of a network, you need to monitor the health of the devices in the network. Paragon Automation uses AI/ML (artificial intelligence [AI] and machine learning [ML]) techniques to automatically monitor Key Performance Indicators (KPIs) related to a device's health, and automatically detects any anomalies that occur. Paragon Automation also performs a root-cause analysis (RCA) of device temperature anomalies when the device is in operation.

The periodic monitoring of the device's health status and the timely detection of device health anomalies enable operators to take action and minimize the impact of any issues that occur.

Paragon Automation monitors device health in the following scenarios:

- During device onboarding: When a device is being onboarded, Paragon Automation monitors the device's health and generates an alert if any anomalies occur.

  When a device is being onboarded, if other devices of the same model were previously onboarded, Paragon Automation compares the data to detect anomalies. However, if a device of a particular model is being onboarded for the first time, the efficacy of the anomaly detection is limited because of the lack of historical data.
- During device operation: After the device is onboarded successfully and is managed, Paragon Automation continuously monitors the KPIs related to device health. For each KPI of each device, Paragon Automation monitors the KPI, forecasts the range, and detects any anomalies that occur. During device operation, Paragon Automation detects device health anomalies (within 30 minutes) based on historical data for that device and the forecasted range.
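
The idea of comparing a KPI sample against a forecasted range can be illustrated with a minimal sketch. This is not the model that Paragon Automation uses; the band below is a simple stand-in (historical mean plus or minus k standard deviations), and the function names, the value of k, and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def forecast_range(history, k=3.0):
    """Return a (low, high) band around the historical mean.

    Assumption: mean +/- k standard deviations stands in for a real
    forecasting model, which the documentation does not describe.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomaly(value, history, k=3.0):
    """Flag a sample that falls outside the forecasted band."""
    low, high = forecast_range(history, k)
    return value < low or value > high


# Hypothetical CPU utilization samples (%) for one device.
cpu_history = [41.0, 43.5, 42.0, 44.0, 42.5, 43.0]

print(is_anomaly(43.2, cpu_history))  # sample inside the band
print(is_anomaly(95.0, cpu_history))  # sample far above the band
```

With little or no history, as in the first onboarding of a device model, a band like this cannot be estimated reliably, which is consistent with the limitation described above.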

Note:

In the validation phase, the mean absolute percentage error (MAPE) score for the ML models used in device health monitoring varied between 2.5 and 6.5.
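
For readers unfamiliar with MAPE, the following sketch computes it using the conventional definition (the mean of the absolute errors relative to the actual values, expressed as a percentage). Whether the scores quoted above are percentages is an assumption, and the sample values are hypothetical.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, returned in percent.

    Conventional definition; undefined when an actual value is zero.
    """
    if not actual or len(actual) != len(forecast):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical observed and forecasted KPI values.
actual = [50.0, 40.0, 80.0]
forecast = [48.0, 42.0, 76.0]

print(round(mape(actual, forecast), 2))
```

A lower score means that the forecasts track the observed values more closely.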


RCA of Temperature Anomalies

When a device is in operation, Paragon Automation provides RCA for issues related to the Routing Engine temperature and Routing Engine CPU temperature. Paragon Automation analyzes the different attributes (CPU utilization percentage, fan RPM percentage, and inlet temperature) that could cause a temperature issue. Paragon Automation also compares the device's temperature to an expected range. Based on the analysis and comparison, Paragon Automation provides an alert, an expected reason for the issue, and details on the events that might have caused the issue. Figure 1 displays a sample page showing the RCA logs for an anomaly in the Routing Engine temperature.
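
As a rough illustration of this kind of analysis, the sketch below checks each contributing attribute against an expected range and reports the ones that fall outside it as candidate causes. It is not Paragon Automation's RCA logic; the attribute keys, the ranges, and the readings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExpectedRange:
    low: float
    high: float

    def contains(self, value):
        return self.low <= value <= self.high


def candidate_causes(readings, expected):
    """Return the attributes whose reading is outside its expected range."""
    return [
        name
        for name, value in readings.items()
        if name in expected and not expected[name].contains(value)
    ]


# Hypothetical expected ranges for the attributes named above.
expected = {
    "cpu_utilization_pct": ExpectedRange(0.0, 70.0),
    "fan_rpm_pct": ExpectedRange(30.0, 100.0),
    "inlet_temperature_c": ExpectedRange(15.0, 35.0),
}

# Hypothetical readings taken when the temperature anomaly was raised.
readings = {
    "cpu_utilization_pct": 92.0,
    "fan_rpm_pct": 55.0,
    "inlet_temperature_c": 27.0,
}

print(candidate_causes(readings, expected))
```

In this example only the CPU utilization is out of range, so it would be reported as the likely contributor to the temperature anomaly.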

Figure 1: Sample Page Showing RCA for Device Temperature Anomaly

Line graph showing temperature from March 22-28 with thresholds at 100°C Critical and 95°C High. March 26, 44°C; CPU alert 50°C outside 30.77-55.57°C range.

Device Temperature RCA Details

Device Health KPIs

Table 1 displays the device health KPIs that Paragon Automation monitors for each device.

Table 1: KPIs Related to Device Health
KPI	Component	Parameters
CPU	Routing Engine, Line card	CPU Utilization Percentage (%)
Memory	Routing Engine, Line card	Memory Utilization Percentage (%)
Fan	Not applicable	RPM Percentage (%)
Temperature	Routing Engine (RE), Routing Engine CPU, Line card, Line card CPU	Current temperature

Device Health Anomalies in the GUI

You can view and monitor the device health anomalies for a device on the Hardware accordion of the Device-Name page.

To view and monitor device health anomalies:

Do one of the following.
- To view and monitor device health anomalies during device onboarding, select Inventory > Device Onboarding > Onboarding Dashboard > Put Devices into Service > Device-Name.
- To view and monitor device health anomalies during device operation, select Observability > Troubleshoot Devices > Device-Name.
The Device-Name page appears.
Scroll to the Hardware accordion and click > to expand the accordion.
- The Chassis section of the accordion displays the health status of the following KPIs monitored by Paragon Automation:
  - Fans
  - CPU
  - Linecards
  - Memory
  - Temperature
- Device events appear under Relevant Events with the following information:
  - Event notification message
  - Date and time that the last event was received by Paragon Automation.
Hover over or click View Details to view the details of the event, including the number of times that the event recurred.
(Optional) Click View All Relevant Events to view all the health-related events for the device.

The events appear on the Events for Device-Name page.
You can view detailed information about each KPI related to device health by doing the following:
1. Click the health status link for the KPI; for example, Fans or Temperature.
  
  The Hardware Details for Device-Name page appears, displaying the section for the KPI that you clicked on the preceding page.
  
  For example, if you click the link for Fans, then the Fans section is expanded and the graphs related to the fans are displayed.
  
  Figure 2 shows a sample section (Fans) of the Hardware Details for Device-Name page.
2. To view the details of an anomaly, click the yellow triangle icon on the graph.
  
  The details of the anomaly appear in a pop-up, as shown in Figure 2.
Click Close or the X icon to go to the Device-Name page.

For more information on the Hardware accordion, see Hardware Data and Test Results.

Figure 2: Sample Hardware Details for Device-Name Page

Graph showing fan speed monitoring in a system with fan list, critical thresholds, and alerts for performance issues.

1 - KPI
2 - Legend showing the colors for different sub-components used in the graphs
3 - Circle icons indicating that the KPI is normal
4 - Critical threshold marker
5 - High threshold marker
6 - Triangle icons indicating an anomaly when the high threshold is breached
7 - Pop-up showing details of device health anomaly
8 - Upper and lower boundaries (dynamic thresholds) for the data displayed in the graph
9 - Hexagon icons indicating an anomaly when the critical threshold is breached
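
The relationship between the threshold markers and the icons in the legend can be sketched as a simple classification. This is only an illustration: the function name is hypothetical, treating a value equal to a threshold as a breach is an assumption, and the 95 and 100 values are borrowed from the temperature thresholds described for Figure 1.

```python
def classify_sample(value, high, critical):
    """Map a KPI sample to the icon described in the legend.

    Assumption: a value at or above a threshold counts as a breach.
    """
    if value >= critical:
        return "hexagon"   # critical threshold breached
    if value >= high:
        return "triangle"  # high threshold breached
    return "circle"        # KPI is normal


# Illustrative thresholds in degrees Celsius (high = 95, critical = 100).
for sample in (44.0, 97.0, 101.0):
    print(sample, classify_sample(sample, high=95.0, critical=100.0))
```

The dynamic upper and lower boundaries (legend item 8) are separate from these fixed thresholds and are not modeled here.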
