Device Telemetry Health Probe

Probe Overview

The device telemetry health probe verifies telemetry collector health. It runs analytics on the collection statistics from available service execution and if the telemetry collection health degrades, anomalies are raised.

For more information about this probe, from the blueprint, navigate to Analytics > Probes, click Create Probe, then select Instantiate Predefined Probe from the drop-down list. Select the probe from the Predefined Probe drop-down list to see details specific to the probe.

AI Fabric Enhancements

Version 6.0 enhances the Device Telemetry Health Probe to monitor additional telemetry services and automatically include future services. These updates enhance telemetry collection and monitoring, with streamlined integration of future services.

The Device Telemetry Health Probe monitors telemetry collector health by analyzing service execution statistics. The probe also detects anomalies when telemetry collection degrades, enabling a proactive response. Key features include:

Monitoring of critical telemetry services across compute systems
Automatic integration of new telemetry services
Historical data capture for trend analysis

Telemetry Services Monitoring

As of version 6.0, the Device Telemetry Health Probe monitors the health of eight critical telemetry services running on compute systems. Below is a short description of each monitored service.

Existing Services:

Interface: Tracks the operational state and configuration of network interfaces.
Interface_Counters: Monitors interface statistics such as packet counts, errors, and drops.
LLDP (Link Layer Discovery Protocol): Collects neighbor discovery information to verify topology and connectivity between devices.
Hostname: Ensures that the hostname of the device is defined correctly.
resource_util: Monitors system resource utilization, such as CPU, memory, and storage.
disk_util: Monitors disk usage statistics, such as read/write operations and capacity utilization.

Services Added in 6.0:

gpu_hardware_counters: Monitors GPU hardware metrics such as utilization and performance counters.
gpu_infiniband_dev_to_interface: Tracks the mapping between InfiniBand Mellanox interfaces and GPU NICS. Monitors the state of the interfaces.

Probe Settings

The probe has the following settings:

Max Waiting Time: 120 seconds (maximum time for service execution).
Anomaly Time Window: 10 minutes (observation period for anomalies).
Threshold Duration: 6 minutes (time required for sustained failures to trigger alerts).
History Retention Period: 30 days (time to retain telemetry data).
Max gRPC Errors in Period: 1 error (threshold for gRPC errors).
gRPC Error Monitoring Period: 5 minutes (time window for gRPC error tracking).

ON THIS PAGE

Device Telemetry Health Probe

Probe Overview

AI Fabric Enhancements