Perform Health Checks
Overview
The Paragon Health Check Script is designed for Kubernetes-based infrastructures and offers a straightforward way to monitor and assess the operational health of system components. This Bash-based script performs a series of automated checks across the following critical infrastructure elements:
- Kubernetes cluster health (nodes, pods, containers, services)
- Storage systems (Rook Ceph)
- Databases (PostgreSQL with Patroni)
- Message queues (RabbitMQ)
- Time-series databases (InfluxDB)
- Configuration services (ConfigServer)
- Container registries
- System resources (CPU, memory, disk space)
The script generates a detailed health report that is easy to interpret through its color-coded status indicators. By reviewing these reports, you can quickly identify and address potential issues before they escalate, ensuring the continuous availability and performance of your infrastructure.
Benefits of Paragon Automation Health Check Script
- Enhance system reliability by automating comprehensive health checks across key infrastructure components, enabling early detection of potential issues.
- Provide clear and actionable insights through detailed health reports with color-coded status indicators, allowing for quick identification and resolution of problems.
- Support efficient resource management by monitoring disk space, CPU, and memory usage, ensuring optimal utilization and preventing performance bottlenecks.
- Facilitate secure operations by integrating authentication for Kubernetes and PostgreSQL, safeguarding sensitive data and credentials during health assessments.
- Improve operational efficiency through automation of health checks with cron jobs, ensuring continuous monitoring without manual intervention.
Prerequisites
Environment Requirements
- Operating System—Linux (RHEL, CentOS, Ubuntu)
- Kubernetes Cluster—Running and accessible
- User Permissions—Root or sudo access required
- Network Access—SSH access to cluster nodes (for disk space checks); a quick verification sketch follows this list
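Before running the script, you can quickly confirm these requirements. The following sketch assumes a root login, the default SSH key path used elsewhere in this document, and that node addresses are read from the cluster itself; adjust it for your environment:

```bash
# Illustrative prerequisite check (adjust key path and user for your environment).

# Confirm root (or sudo) access.
[ "$(id -u)" -eq 0 ] || echo "WARNING: not running as root"

# Confirm the Kubernetes cluster is reachable.
kubectl cluster-info >/dev/null 2>&1 && echo "Cluster reachable" || echo "Cluster unreachable"

# Confirm SSH access to each node (needed for the disk space checks).
for node in $(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  ssh -i /root/.ssh/id_rsa -o BatchMode=yes -o ConnectTimeout=5 root@"$node" true \
    && echo "SSH OK: $node" || echo "SSH FAILED: $node"
done
```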
Required Tools
The following command-line tools must be installed and available in
$PATH:
| Tool | Purpose | Installation |
|---|---|---|
| kubectl | Kubernetes CLI | `curl -LO https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl` |
| jq | JSON parsing | `yum install jq` or `apt-get install jq` |
| bc | Arithmetic calculations | `yum install bc` or `apt-get install bc` |
| curl | HTTP requests | Preinstalled |
| ssh | Remote node access | Preinstalled |
| base64 | Credential decoding | Preinstalled |
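To confirm that all of these tools are present in `$PATH` before you run the script, you can use a quick loop such as the following (a convenience sketch, not part of the health check script itself):

```bash
# Report which required CLI tools are available in $PATH.
for tool in kubectl jq bc curl ssh base64; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK:      $tool ($(command -v "$tool"))"
  else
    echo "MISSING: $tool"
  fi
done
```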
Required Configurations
- KUBECONFIG—Must be set to `/etc/kubernetes/admin.conf` or equivalent.
- Cluster Access—You must have cluster-admin privileges to run the script.
- Metrics Server—Must be deployed for resource monitoring (`kubectl top` commands).
- Environment Variables:

      # PostgreSQL Configuration
      COMMON_PG_NS="common"                   # PostgreSQL namespace
      COMMON_PG_CLUSTER="atom-db"             # PostgreSQL cluster name
      SECRET_NAME="atom.atom-db.credentials"  # Kubernetes secret containing DB credentials
      LAG_THRESHOLD_MINUTES=1                 # Replication lag threshold (minutes)
      WAL_SIZE_THRESHOLD_GB=1                 # WAL directory size threshold (GB)
      REPLICATION_LAG_THRESHOLD_MB=730        # Replica lag threshold (MB)
      EXPECTED_REPLICA_COUNT=2                # Expected number of replicas

      # Global Settings
      KUBECTL_TIMEOUT=30                      # kubectl command timeout (seconds)
      MAX_RETRIES=3                           # Maximum retry attempts
      KUBECTL_RETRY_DELAY=5                   # Delay between retries (seconds)

- File Paths:

      DAILY_SUMMARY_FILE="/var/log/dv/health-check-daily-summary.log"
      KUBECONFIG="/etc/kubernetes/admin.conf"
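If your deployment uses different namespaces, cluster names, or thresholds, one approach is to export the variables above in the shell (or a small wrapper script) before invoking the health check. Whether a given release of the script honors environment overrides rather than only its internal defaults may vary, so treat this as an illustrative sketch that reuses the variable names listed above:

```bash
# Point kubectl at the cluster admin kubeconfig.
export KUBECONFIG="/etc/kubernetes/admin.conf"

# Example overrides, reusing the variable names from the configuration list above.
export COMMON_PG_NS="common"          # PostgreSQL namespace
export COMMON_PG_CLUSTER="atom-db"    # PostgreSQL cluster name
export KUBECTL_TIMEOUT=60             # allow slower API responses
export EXPECTED_REPLICA_COUNT=2       # expected number of PostgreSQL replicas

./health-check.sh -v
```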
Usage Instructions
Basic Usage
Table 2 lists the basic command line options that you can use to perform health checks.
| Option | Description | Example |
|---|---|---|
| -v | Enable verbose output | `./health-check.sh -v` |
| -s | Silent mode (minimal output) | `./health-check.sh -s` |
| -t | Run specific checks (comma-separated) | `./health-check.sh -t check_node_status,check_pods_status` (see Table 3 for the checks you can run) |
| -u | SSH username for node access | `./health-check.sh -u root -k /root/.ssh/id_rsa` |
| -k | Path to SSH private key | `./health-check.sh -k /root/.ssh/id_rsa` |
| -a | Full check mode | `./health-check.sh -a` |
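You can combine these options. For example, a targeted run of only the node, pod, and disk space checks, with verbose output and SSH credentials for the node-level checks, might look like this (the key path is a placeholder):

```bash
# Run selected checks verbosely, supplying SSH credentials for node-level checks.
./health-check.sh -v \
  -t check_node_status,check_pods_status,check_disk_space_status \
  -u root -k /root/.ssh/id_rsa
```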
Available Check Functions
Table 3 lists the available check functions.
| Function | Description |
|---|---|
| `check_node_count` | Verifies the Kubernetes cluster has the expected number of nodes (≥4 for standard clusters, ≥1 for mini deployments). |
| `check_node_status` | Ensures all cluster nodes are in the Ready state and reports any nodes that are NotReady or in error states. |
| `check_node_readiness_status` | Counts and reports nodes that are not in the Ready state, identifying potential node-level issues. |
| `check_node_taint_status` | Checks for taints on nodes that could prevent pod scheduling (expected on primary nodes if scheduling is disabled). |
| `check_disk_pressure_status` | Monitors nodes for disk pressure conditions that could prevent new pods from being scheduled. |
| `check_network_status` | Validates network connectivity and Calico CNI health across all nodes. |
| `check_memory_pressure_status` | Detects nodes experiencing memory pressure that could trigger pod evictions. |
| `check_pidpressure_status` | Checks if nodes are running out of process IDs, which would prevent new processes from starting. |
| `check_pods_status` | Scans all namespaces for pods in error states (BackOff, Error, Init, Terminating, Pending). |
| `check_pod_restarts` | Identifies pods that have experienced restart events in the last hour, indicating instability. |
| `check_containers_status` | Verifies all init containers completed successfully and regular containers are in the Running state. |
| `check_replicas_status` | Compares desired versus ready replica counts for deployments and stateful sets to detect scaling issues. |
| `check_services_status` | Checks for Kubernetes services stuck in the Pending state, which indicates load balancer provisioning issues. |
| `check_disk_space_status` | Uses SSH to check disk usage on the / and /export partitions of each node (requires an SSH key); warns if utilization is greater than 80 percent. |
| `check_node_cpu_memory_status` | Uses metrics-server to monitor CPU and memory usage per node; alerts if utilization is greater than 80 percent. |
| `check_etcd_logs` | Analyzes etcd pod logs for disk latency warnings that indicate slow storage affecting cluster performance. |
| `check_rook_ceph_status` | Executes ceph status in the rook-ceph-tools pod to verify Ceph cluster health (HEALTH_OK/HEALTH_WARN/HEALTH_ERR). |
| `check_registry_status` | Validates container registry accessibility on master nodes and checks for missing images referenced in metadata. |
| `check_configserver_status` | Verifies all ConfigServer pods in the healthbot namespace are running and healthy. |
| `check_influxdb_status` | Ensures all InfluxDB time-series database pods in the healthbot namespace are operational. |
| `check_rabbitmq_status` | Validates RabbitMQ cluster health, checks that all pods are running, and confirms the AMQP (5672) and management (15672) ports are listening. |
| `check_postgres_status` | Comprehensive PostgreSQL check, including schema validation, Patroni cluster status, leader election, replica health, replication lag, and WAL size monitoring. |
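When one of these checks reports a problem, you can often reproduce it manually with standard kubectl commands before digging deeper. The following spot checks are a sketch only; the rook-ceph-tools deployment and the atom-db Patroni pod names follow the conventions used in this document and may differ in your deployment:

```bash
# Node health and pressure conditions (check_node_status, check_*_pressure_status).
kubectl get nodes -o wide
kubectl describe nodes | grep -A 5 "Conditions:"

# Pods in error states across all namespaces (check_pods_status).
kubectl get pods -A | grep -Ev 'Running|Completed'

# Ceph cluster health (check_rook_ceph_status).
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status

# Patroni/PostgreSQL cluster status (check_postgres_status).
kubectl -n common exec atom-db-0 -- patronictl list
```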
Run Health Check
To run the health check script, log in to the primary node and execute the following command:

    root@primary-node:~# ./health-check.sh -v
    2025-10-03 10:15:23 Health status checking...
    ======================================================
    Get node count of Kubernetes cluster
    ======================================================
    OK There are 4 nodes in the cluster.
    ======================================================
    Get node status of Kubernetes cluster
    ======================================================
    OK 4 nodes are in the Ready state.
    NAME            STATUS   ROLES           AGE   VERSION
    master-node-1   Ready    control-plane   45d   v1.28.0
    master-node-2   Ready    control-plane   45d   v1.28.0
    worker-node-1   Ready    <none>          45d   v1.28.0
    worker-node-2   Ready    <none>          45d   v1.28.0
    ...
    ======================================================
    Overall cluster status
    ======================================================
    GREEN
    2025-10-03 10:18:45 Health status checking completed!
The script generates a comprehensive health report with color-coded status indicators (GREEN, AMBER, RED) and logs the results to `/var/log/dv/health-check-daily-summary.log`.
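Because each run appends to this daily summary file, you can review recent results or scan for non-GREEN statuses directly from the log. For example:

```bash
# Show the most recent entries in the daily summary log.
tail -n 50 /var/log/dv/health-check-daily-summary.log

# List any warning or failure statuses that have been recorded.
grep -E 'AMBER|RED' /var/log/dv/health-check-daily-summary.log
```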
You can also schedule the health check with cron. For example:

    # Edit crontab
    crontab -e

    # Run health check daily at 2 AM
    0 2 * * * /path/to/health-check.sh -k /root/.ssh/id_rsa >> /var/log/dv/cron-health-check.log 2>&1

    # Run every 6 hours with verbose output
    0 */6 * * * /path/to/health-check.sh -v -k /root/.ssh/id_rsa >> /var/log/dv/cron-health-check.log 2>&1
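After saving the crontab entry, you can confirm that the job is scheduled and watch the scheduled runs as they are logged:

```bash
# Confirm the health check job is scheduled.
crontab -l | grep health-check

# Follow output from the cron-driven runs.
tail -f /var/log/dv/cron-health-check.log
```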
Color Code
Table 4 lists the color codes used in the health check output.
| Color | Status |
|---|---|
| GREEN | All checks passed successfully |
| AMBER | Warning conditions detected (degraded but operational) |
| RED | Critical failures detected |
Troubleshooting
Table 5 lists the common errors that can occur while running the health checks.
| Error | Cause | Solution |
|---|---|---|
| Kubernetes cluster is unreachable | kubectl cannot connect to the cluster | Verify KUBECONFIG: `echo $KUBECONFIG`<br>Test cluster connectivity: `kubectl cluster-info`<br>Check the API server: `systemctl status kube-apiserver`<br>Verify network connectivity: `ping <api-server-ip>` |
| Failed to retrieve node status - cluster may be unreachable | kubectl commands are timing out | Increase the timeout: `export KUBECTL_TIMEOUT=60`<br>Check cluster load: `kubectl top nodes`<br>Review API server logs: `journalctl -u kube-apiserver -f` |
| SSH key file not found | Invalid SSH key path | Verify the SSH key exists: `ls -la /path/to/ssh_key`<br>Check file permissions: `chmod 600 /path/to/ssh_key`<br>Test SSH connectivity: `ssh -i /path/to/ssh_key root@<node-ip> "echo Connection successful"` |
| jq command not available | jq not installed | RHEL/CentOS: `yum install -y jq`<br>Ubuntu/Debian: `apt-get install -y jq`<br>Verify the installation: `jq --version` |
| No RabbitMQ pods found | RabbitMQ not deployed or namespace incorrect | Check the RabbitMQ deployment: `kubectl get pods -n northstar \| grep rabbitmq`<br>Verify the namespace: `kubectl get namespaces \| grep northstar`<br>Check the deployment status: `kubectl get statefulset -n northstar rabbitmq` |
| No replica pod found to check replication lag | PostgreSQL replicas are not in streaming state | Check Patroni cluster status: `kubectl exec -n common atom-db-0 -- patronictl list`<br>Review PostgreSQL logs: `kubectl logs -n common atom-db-1 --tail=100`<br>Check replication status: `kubectl exec -n common atom-db-0 -- psql -U atom -d atom -c "SELECT * FROM pg_stat_replication;"` |