Perform a Cluster Health Check

This topic describes the health check commands available in Routing Director.

Purpose

Perform a health check on the cluster, view the overall and detailed status of the cluster, and troubleshoot specific issues.

You can use either the Deployment Shell or the Linux root shell to execute a cluster health check.

The command runs multiple health tests on the cluster and returns a detailed list of the tests conducted and the result of each. The health-check command checks multiple parameters, such as:

Kubernetes status
Health of each node (CPU, disk space, memory, I/O latency, and so on)
Database health (Postgres, ArangoDB, OpenSearch, Kafka, and so on)
Ceph storage health

The overall health status is categorized as green, amber, or red. A green status indicates a healthy cluster in which all health checks have passed. A red status indicates critical issues in the cluster. An amber status indicates that there may be noncritical issues in the cluster. The status is returned as amber in the following instances:

Nodes have taints
Disk usage or memory usage on any node exceeds 80% of available space
Disk I/O latency on any node exceeds 100000 ms
Rook Ceph status shows HEALTH_WARN
Number of Kafka in-sync replicas is not equal to the number of Kafka replicas
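
The amber conditions listed above can be sketched as a small classification routine. This is only an illustration of the rules as stated on this page, not the implementation used by the health-check command: the NodeStats fields, the function names, and the critical flag are hypothetical, and the page does not list the criteria for a red status, so the caller is assumed to supply that decision.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Per-node readings. Field names are illustrative, not the tool's schema."""
    has_taints: bool
    disk_usage_pct: float
    memory_usage_pct: float
    io_latency_ms: float


def amber_reasons(nodes, ceph_status, kafka_in_sync, kafka_replicas):
    """Return the amber conditions from the list above that currently hold."""
    reasons = []
    if any(n.has_taints for n in nodes):
        reasons.append("nodes have taints")
    if any(n.disk_usage_pct > 80 or n.memory_usage_pct > 80 for n in nodes):
        reasons.append("disk or memory usage above 80%")
    if any(n.io_latency_ms > 100000 for n in nodes):
        reasons.append("disk I/O latency above 100000 ms")
    if ceph_status == "HEALTH_WARN":
        reasons.append("Rook Ceph reports HEALTH_WARN")
    if kafka_in_sync != kafka_replicas:
        reasons.append("Kafka in-sync replicas differ from Kafka replicas")
    return reasons


def overall_status(reasons, critical=False):
    """Map findings to green, amber, or red.

    Assumption: the caller decides what counts as critical, because
    the red criteria are not listed in the text above.
    """
    if critical:
        return "red"
    return "amber" if reasons else "green"
```

For example, a cluster whose only finding is a node at 85% disk usage would be reported as amber under these rules, and a cluster with no findings as green.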

The results of the health-check command are stored in a Postgres database. The result of the latest health-check command can be retrieved from the database using the Routing Director GUI (Settings > Health Checks).

Additionally, a cron job runs at the top of every hour to check the health of the cluster automatically and store the resulting output in the Postgres database.
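
To estimate when the next automatic result will be available, the hourly schedule can be modeled as follows. This is a minimal sketch: the cron expression shown is the standard way to write "minute 0 of every hour" and is an assumption, since this page does not show the actual cron job definition; the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed schedule: standard cron syntax for minute 0 of every hour.
HOURLY_SCHEDULE = "0 * * * *"


def next_hourly_run(now):
    """Return the next top-of-the-hour time strictly after `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)
```

For example, a call made at 10:15 returns 11:00 on the same day.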

Action

Perform a health check using Deployment Shell
Perform a health check using the Linux root shell


Perform a health check using Deployment Shell

In the Deployment Shell, run the request deployment health-check command to check the health of the cluster.

Sample Output

Perform a health check using the Linux root shell

Log in to a cluster node and type exit to leave the Deployment Shell and drop to the Linux root shell. Use the following commands to retrieve the Routing Director cluster health status.

Default health-check
Check the health of specific functions
Enable verbose mode logging
Check full cluster health

Default health-check

Use the health-check command to check, retrieve, and display the status of the Routing Director cluster health. The health of the cluster is checked for the default parameters. The output of this command is the same as the request deployment health-check command output in the Deployment Shell.

Sample Output

Check the health of specific functions

Use the -t function-name option to check the health of a specific function. Examples of functions include check_node_cpu_memory_status, check_etcd_logs, check_registry_status, and check_replicas_status.

Sample Output

Use check_node_cpu_memory_status to check the CPU and memory usage on all nodes.

Use check_etcd_logs to check the etcd pod logs for disk latency issues.

Use check_replicas_status to check Deployments/StatefulSets replica health.

Enable verbose mode logging

Use the -v option to enable verbose mode logging while checking the complete health of the cluster (health-check -v) or while checking the health of specific functions (health-check -v -t function-name). Verbose mode displays additional information in the logged output.

Sample Output

Use check_node_cpu_memory_status to check the CPU and memory usage on all nodes.
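
When the command is driven from a script, the -t and -v forms described above can be assembled programmatically. The following is a minimal sketch: the helper names are hypothetical, and only the health-check command, the -v and -t options, and the function names mentioned on this page are used.

```python
import shlex


def health_check_argv(function_name=None, verbose=False):
    """Build the argument list for the health-check command.

    Only the -v and -t options described above are used.
    """
    argv = ["health-check"]
    if verbose:
        argv.append("-v")
    if function_name:
        argv.extend(["-t", function_name])
    return argv


def as_command_line(argv):
    """Render the argument list as a single shell-safe string."""
    return shlex.join(argv)
```

For example, health_check_argv("check_etcd_logs", verbose=True) yields the arguments for health-check -v -t check_etcd_logs. The resulting list could be passed to subprocess.run on a cluster node, assuming the script runs from the Linux root shell where the health-check command is available.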

Check full cluster health

Use the -a option to check the complete health of the cluster. The cluster health is checked for the default parameters and also the following additional parameters:

OpenSearch shard status
Registry status
etcd logs

Sample Output
