Perform Health Checks

Overview

The Paragon Automation Health Check Script is designed for Kubernetes-based infrastructures and provides a straightforward way to monitor and assess the operational health of system components. When executed, this Bash-based script performs a series of automated checks across the following critical infrastructure elements:

  • Kubernetes cluster health (nodes, pods, containers, services)

  • Storage systems (Rook Ceph)

  • Databases (PostgreSQL with Patroni)

  • Message queues (RabbitMQ)

  • Time-series databases (InfluxDB)

  • Configuration services (ConfigServer)

  • Container registries

  • System resources (CPU, memory, disk space)

The script generates a detailed health report that is easy to interpret through its color-coded status indicators. By reviewing these reports, you can quickly identify and address potential issues before they escalate, ensuring the continuous availability and performance of your infrastructure.

Benefits of Paragon Automation Health Check Script

  • Enhance system reliability by automating comprehensive health checks across key infrastructure components, enabling early detection of potential issues.

  • Provide clear and actionable insights through detailed health reports with color-coded status indicators, allowing for quick identification and resolution of problems.

  • Support efficient resource management by monitoring disk space, CPU, and memory usage, ensuring optimal utilization and preventing performance bottlenecks.

  • Facilitate secure operations by integrating authentication for Kubernetes and PostgreSQL, safeguarding sensitive data and credentials during health assessments.

  • Improve operational efficiency through automation of health checks with cron jobs, ensuring continuous monitoring without manual intervention.

Prerequisites

Environment Requirements

  • Operating System—Linux (RHEL, CentOS, Ubuntu)

  • Kubernetes Cluster—Running and accessible

  • User Permissions—Root or sudo access required

  • Network Access—SSH access to cluster nodes (for disk space checks)

Required Tools

The following command-line tools must be installed and available in $PATH:

Table 1: Command Line Tools

Tool    | Purpose                 | Installation
kubectl | Kubernetes CLI          | curl -LO https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl
jq      | JSON parsing            | yum install jq or apt-get install jq
bc      | Arithmetic calculations | yum install bc or apt-get install bc
curl    | HTTP requests           | Preinstalled
ssh     | Remote node access      | Preinstalled
base64  | Credential decoding     | Preinstalled
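
Before running the health check, you can confirm that all of these tools are present with a quick pre-flight loop. This is a simple sketch; the tool list is taken from Table 1:

```shell
# Pre-flight check: verify each required tool from Table 1 is available in $PATH
for tool in kubectl jq bc curl ssh base64; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK:      $tool"
  else
    echo "MISSING: $tool"
  fi
done
```

Any tool reported as MISSING must be installed (see the Installation column in Table 1) before the script will run correctly.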

Required Configurations

  • KUBECONFIG—Must be set to /etc/kubernetes/admin.conf or equivalent.

  • Cluster Access—You must have cluster-admin privileges to run the script.

  • Metrics Server—Must be deployed for resource monitoring (kubectl top commands).

  • Environment Variables:

  • File Path:
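
For example, you can set the KUBECONFIG environment variable and confirm the file is readable before a run. The path below is the default stated above; adjust it for your deployment:

```shell
# Point kubectl (and the health check script) at the cluster admin kubeconfig
export KUBECONFIG=/etc/kubernetes/admin.conf

# Sanity-check that the variable is set and the file is readable
if [ -r "$KUBECONFIG" ]; then
  echo "Using kubeconfig: $KUBECONFIG"
else
  echo "WARNING: $KUBECONFIG is missing or unreadable"
fi
```

If the warning is printed, point KUBECONFIG at your cluster's admin kubeconfig before running the script.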

Usage Instructions

Basic Usage

Table 2 lists the basic command line options that you can use to perform health checks.

Table 2: Command Line Options

Option | Description                           | Example
-v     | Enable verbose output                 | ./health-check.sh -v
-s     | Silent mode (minimal output)          | ./health-check.sh -s
-t     | Run specific checks (comma-separated) | ./health-check.sh -t check_node_status,check_pods_status
-u     | SSH username for node access          | ./health-check.sh -u root -k /root/.ssh/id_rsa
-k     | Path to SSH private key               | ./health-check.sh -k /root/.ssh/id_rsa
-a     | Full check mode                       | ./health-check.sh -a

With the -t option, you can run the following checks:

  • check_node_status—Checks node status and node health only.

  • check_pods_status—Checks pod status and pod health only.

  • check_postgres_status—Checks PostgreSQL status only.

  • check_disk_space_status -k /root/.ssh/id_rsa -u root—Checks disk space using the specified SSH key.

Available Check Functions

Table 3 lists the available check functions.

Table 3: Available Check Functions
Function Description
check_node_count Verifies the Kubernetes cluster has the expected number of nodes (≥4 for standard clusters, ≥1 for mini deployments).
check_node_status Ensures all cluster nodes are in the Ready state and reports any nodes that are NotReady or in error states.
check_node_readiness_status Counts and reports nodes that are not in the Ready state, identifying potential node-level issues.
check_node_taint_status Checks for taints on nodes that could prevent pod scheduling (expected on primary nodes if scheduling is disabled).
check_disk_pressure_status Monitors nodes for disk pressure conditions that could prevent new pods from being scheduled.
check_network_status Validates network connectivity and Calico CNI health across all nodes.
check_memory_pressure_status Detects nodes experiencing memory pressure that could trigger pod evictions.
check_pidpressure_status Checks if nodes are running out of process IDs, which would prevent new processes from starting.
check_pods_status Scans all namespaces for pods in error states (BackOff, Error, Init, Terminating, Pending).
check_pod_restarts Identifies pods that have experienced restart events in the last hour, indicating instability.
check_containers_status Verifies all init containers completed successfully and regular containers are in Running state.
check_replicas_status Compares desired versus ready replica counts for deployments and stateful sets to detect scaling issues.
check_services_status Checks for Kubernetes services stuck in Pending state, which indicates load balancer provisioning issues.
check_disk_space_status Uses SSH to connect to each node and check disk usage on the / and /export partitions (requires an SSH key); warns if utilization is greater than 80 percent.
check_node_cpu_memory_status Uses metrics-server to monitor CPU and memory usage per node; alerts if utilization is greater than 80 percent.
check_etcd_logs Analyzes etcd pod logs for disk latency warnings that indicate slow storage affecting cluster performance.
check_rook_ceph_status Executes ceph status in rook-ceph-tools pod to verify Ceph cluster health (HEALTH_OK/HEALTH_WARN/HEALTH_ERR).
check_registry_status Validates container registry accessibility on master nodes and checks for missing images referenced in metadata.
check_configserver_status Verifies all ConfigServer pods in the healthbot namespace are running and healthy.
check_influxdb_status Ensures all InfluxDB time-series database pods in healthbot namespace are operational.
check_rabbitmq_status Validates RabbitMQ cluster health, checks all pods are running, and confirms AMQP (5672) and management (15672) ports are listening.
check_postgres_status Comprehensive PostgreSQL check including schema validation, Patroni cluster status, leader election, replica health, replication lag, and WAL size monitoring.
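
To illustrate the general shape of these checks, a simplified node-status check might look like the following. This is a hypothetical sketch, not the shipped implementation; the function name check_node_status_sketch and the parsing of kubectl's STATUS column are assumptions for illustration:

```shell
# Hypothetical sketch of a node-status check; not the actual script source.
# Assumes kubectl is configured via KUBECONFIG.
check_node_status_sketch() {
  local not_ready
  # Collect the name of every node whose STATUS column is not "Ready"
  not_ready=$(kubectl get nodes --no-headers 2>/dev/null | awk '$2 != "Ready" {print $1}')
  if [ -z "$not_ready" ]; then
    echo "GREEN: all nodes are Ready"
  else
    echo "RED: nodes not Ready:" $not_ready
  fi
}
```

Calling check_node_status_sketch prints a GREEN or RED status line, mirroring the color-coded output style described below.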

Run Health Check

Use the following procedure to run the health check script:

  1. Log in to the primary node.

  2. Execute the following command:

    root@primary-node:~# health-check.sh

The script generates a comprehensive health report with color-coded status indicators (GREEN, AMBER, RED) and logs results to /var/log/dv/health-check-daily-summary.log.
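
To surface only the warning and failure lines from this log, you can filter it after a run. This small sketch uses the log path stated above and assumes the AMBER and RED keywords appear literally in the log lines:

```shell
# Show the most recent AMBER and RED findings from the daily summary log
LOG=/var/log/dv/health-check-daily-summary.log
if [ -f "$LOG" ]; then
  grep -E 'AMBER|RED' "$LOG" | tail -n 20
fi
```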

In addition, you can schedule health checks to run automatically by using cron jobs.
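
One way to automate this is a crontab entry that runs the full check daily and appends the output to the summary log. The 2:00 AM schedule and the /usr/local/bin script path below are illustrative assumptions; substitute your own values:

```
# Illustrative crontab entry; add it with: crontab -e
# Runs the full health check (-a) daily at 02:00 and appends to the summary log
0 2 * * * /usr/local/bin/health-check.sh -a >> /var/log/dv/health-check-daily-summary.log 2>&1
```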

Color Code

Table 4 lists the color code in the health check output.

Table 4: Color Code in the Health Check Output
Color Status
GREEN All checks passed successfully
AMBER Warning conditions detected (degraded but operational)
RED Critical failures detected

Troubleshooting

Table 5 lists the common errors that can occur while running the health checks, along with their causes and solutions.

Table 5: Common Errors and Solutions

Error: Kubernetes cluster is unreachable
Cause: kubectl cannot connect to the cluster
Solution:

    # Verify KUBECONFIG
    echo $KUBECONFIG
    # Test cluster connectivity
    kubectl cluster-info
    # Check API server
    systemctl status kube-apiserver
    # Verify network connectivity
    ping <api-server-ip>

Error: Failed to retrieve node status - cluster may be unreachable
Cause: kubectl commands are timing out
Solution:

    # Increase timeout
    export KUBECTL_TIMEOUT=60
    # Check cluster load
    kubectl top nodes
    # Review API server logs
    journalctl -u kube-apiserver -f

Error: SSH key file not found
Cause: Invalid SSH key path
Solution:

    # Verify SSH key exists
    ls -la /path/to/ssh_key
    # Check file permissions
    chmod 600 /path/to/ssh_key
    # Test SSH connectivity
    ssh -i /path/to/ssh_key root@<node-ip> "echo Connection successful"

Error: jq command not available
Cause: jq not installed
Solution:

    # RHEL/CentOS
    yum install -y jq
    # Ubuntu/Debian
    apt-get install -y jq
    # Verify installation
    jq --version

Error: No RabbitMQ pods found
Cause: RabbitMQ not deployed or namespace incorrect
Solution:

    # Check RabbitMQ deployment
    kubectl get pods -n northstar | grep rabbitmq
    # Verify namespace
    kubectl get namespaces | grep northstar
    # Check deployment status
    kubectl get statefulset -n northstar rabbitmq

Error: No replica pod found to check replication lag
Cause: PostgreSQL replicas are not in streaming state
Solution:

    # Check Patroni cluster status
    kubectl exec -n common atom-db-0 -- patronictl list
    # Review PostgreSQL logs
    kubectl logs -n common atom-db-1 --tail=100
    # Check replication status
    kubectl exec -n common atom-db-0 -- psql -U atom -d atom -c "SELECT * FROM pg_stat_replication;"