Perform Health Checks
Overview
The Paragon Health Check Script is designed for Kubernetes-based infrastructures and offers a straightforward way to monitor and assess the operational health of system components. This Bash-based script performs a series of automated checks across the following critical infrastructure elements:
- Kubernetes cluster health (nodes, pods, containers, services)
- Storage systems (Rook Ceph)
- Databases (PostgreSQL with Patroni)
- Message queues (RabbitMQ)
- Time-series databases (InfluxDB)
- Configuration services (ConfigServer)
- Container registries
- System resources (CPU, memory, disk space)
The script generates a detailed health report that is easy to interpret through its color-coded status indicators. By reviewing these reports, you can quickly identify and address potential issues before they escalate, ensuring the continuous availability and performance of your infrastructure.
Benefits of Paragon Automation Health Check Script
- Enhance system reliability by automating comprehensive health checks across key infrastructure components, enabling early detection of potential issues.
- Provide clear and actionable insights through detailed health reports with color-coded status indicators, allowing for quick identification and resolution of problems.
- Support efficient resource management by monitoring disk space, CPU, and memory usage, ensuring optimal utilization and preventing performance bottlenecks.
- Facilitate secure operations by integrating authentication for Kubernetes and PostgreSQL, safeguarding sensitive data and credentials during health assessments.
- Improve operational efficiency through automation of health checks with cron jobs, ensuring continuous monitoring without manual intervention.
Prerequisites
Environment Requirements
- Operating System—Linux (RHEL, CentOS, Ubuntu)
- Kubernetes Cluster—Running and accessible
- User Permissions—Root or sudo access required
- Network Access—SSH access to cluster nodes (for disk space checks); a quick verification sketch follows this list
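Before running the script, you can quickly confirm these requirements. The following sketch assumes a root login, the default SSH key path used elsewhere in this document, and that node addresses are read from the cluster itself; adjust it for your environment:

```bash
# Illustrative prerequisite check (adjust key path and user for your environment).

# Confirm root (or sudo) access.
[ "$(id -u)" -eq 0 ] || echo "WARNING: not running as root"

# Confirm the Kubernetes cluster is reachable.
kubectl cluster-info >/dev/null 2>&1 && echo "Cluster reachable" || echo "Cluster unreachable"

# Confirm SSH access to each node (needed for the disk space checks).
for node in $(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  ssh -i /root/.ssh/id_rsa -o BatchMode=yes -o ConnectTimeout=5 root@"$node" true \
    && echo "SSH OK: $node" || echo "SSH FAILED: $node"
done
```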
Required Tools
The following command-line tools must be installed and available in
$PATH:
| Tool | Purpose | Installation |
|---|---|---|
| kubectl | Kubernetes CLI | `curl -LO https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl` |
| jq | JSON parsing | `yum install jq` or `apt-get install jq` |
| bc | Arithmetic calculations | `yum install bc` or `apt-get install bc` |
| curl | HTTP requests | Preinstalled |
| ssh | Remote node access | Preinstalled |
| base64 | Credential decoding | Preinstalled |
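To confirm that all of these tools are present in `$PATH` before you run the script, you can use a quick loop such as the following (a convenience sketch, not part of the health check script itself):

```bash
# Report which required CLI tools are available in $PATH.
for tool in kubectl jq bc curl ssh base64; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK:      $tool ($(command -v "$tool"))"
  else
    echo "MISSING: $tool"
  fi
done
```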
Required Configurations
- KUBECONFIG—Must be set to `/etc/kubernetes/admin.conf` or equivalent.
- Cluster Access—You must have cluster-admin privileges to run the script.
- Metrics Server—Must be deployed for resource monitoring (`kubectl top` commands).
- Environment Variables:

      # PostgreSQL Configuration
      COMMON_PG_NS="common"                   # PostgreSQL namespace
      COMMON_PG_CLUSTER="atom-db"             # PostgreSQL cluster name
      SECRET_NAME="atom.atom-db.credentials"  # Kubernetes secret containing DB credentials
      LAG_THRESHOLD_MINUTES=1                 # Replication lag threshold (minutes)
      WAL_SIZE_THRESHOLD_GB=1                 # WAL directory size threshold (GB)
      REPLICATION_LAG_THRESHOLD_MB=730        # Replica lag threshold (MB)
      EXPECTED_REPLICA_COUNT=2                # Expected number of replicas

      # Global Settings
      KUBECTL_TIMEOUT=30                      # kubectl command timeout (seconds)
      MAX_RETRIES=3                           # Maximum retry attempts
      KUBECTL_RETRY_DELAY=5                   # Delay between retries (seconds)

- File Paths:

      DAILY_SUMMARY_FILE="/var/log/dv/health-check-daily-summary.log"
      KUBECONFIG="/etc/kubernetes/admin.conf"
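If your deployment uses different namespaces, cluster names, or thresholds, one approach is to export the variables above in the shell (or a small wrapper script) before invoking the health check. Whether a given release of the script honors environment overrides rather than only its internal defaults may vary, so treat this as an illustrative sketch that reuses the variable names listed above:

```bash
# Point kubectl at the cluster admin kubeconfig.
export KUBECONFIG="/etc/kubernetes/admin.conf"

# Example overrides, reusing the variable names from the configuration list above.
export COMMON_PG_NS="common"          # PostgreSQL namespace
export COMMON_PG_CLUSTER="atom-db"    # PostgreSQL cluster name
export KUBECTL_TIMEOUT=60             # allow slower API responses
export EXPECTED_REPLICA_COUNT=2       # expected number of PostgreSQL replicas

./health-check.sh -v
```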
Usage Instructions
Basic Usage
Table 2 lists the basic command line options that you can use to perform health checks.
| Option | Description | Example |
|---|---|---|
| -v | Enable verbose output | `./health-check.sh -v` |
| -s | Silent mode (minimal output) | `./health-check.sh -s` |
| -t | Run specific checks (comma-separated) | `./health-check.sh -t check_node_status,check_pods_status` (see Table 3 for the checks you can run) |
| -u | SSH username for node access | `./health-check.sh -u root -k /root/.ssh/id_rsa` |
| -k | Path to SSH private key | `./health-check.sh -k /root/.ssh/id_rsa` |
| -a | Full check mode | `./health-check.sh -a` |
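You can combine these options. For example, a targeted run of only the node, pod, and disk space checks, with verbose output and SSH credentials for the node-level checks, might look like this (the key path is a placeholder):

```bash
# Run selected checks verbosely, supplying SSH credentials for node-level checks.
./health-check.sh -v \
  -t check_node_status,check_pods_status,check_disk_space_status \
  -u root -k /root/.ssh/id_rsa
```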
Available Check Functions
Table 3 lists the available check functions.
| Function | Description |
|---|---|
| `check_node_count` | Verifies the Kubernetes cluster has the expected number of nodes (≥4 for standard clusters, ≥1 for mini deployments). |
| `check_node_status` | Ensures all cluster nodes are in the Ready state and reports any nodes that are NotReady or in error states. |
| `check_node_readiness_status` | Counts and reports nodes that are not in the Ready state, identifying potential node-level issues. |
| `check_node_taint_status` | Checks for taints on nodes that could prevent pod scheduling (expected on primary nodes if scheduling is disabled). |
| `check_disk_pressure_status` | Monitors nodes for disk pressure conditions that could prevent new pods from being scheduled. |
| `check_network_status` | Validates network connectivity and Calico CNI health across all nodes. |
| `check_memory_pressure_status` | Detects nodes experiencing memory pressure that could trigger pod evictions. |
| `check_pidpressure_status` | Checks if nodes are running out of process IDs, which would prevent new processes from starting. |
| `check_pods_status` | Scans all namespaces for pods in error states (BackOff, Error, Init, Terminating, Pending). |
| `check_pod_restarts` | Identifies pods that have experienced restart events in the last hour, indicating instability. |
| `check_containers_status` | Verifies all init containers completed successfully and regular containers are in the Running state. |
| `check_replicas_status` | Compares desired versus ready replica counts for deployments and stateful sets to detect scaling issues. |
| `check_services_status` | Checks for Kubernetes services stuck in the Pending state, which indicates load balancer provisioning issues. |
| `check_disk_space_status` | Uses SSH to check disk usage on the / and /export partitions of each node (requires an SSH key); warns if utilization is greater than 80 percent. |
| `check_node_cpu_memory_status` | Uses metrics-server to monitor CPU and memory usage per node; alerts if utilization is greater than 80 percent. |
| `check_etcd_logs` | Analyzes etcd pod logs for disk latency warnings that indicate slow storage affecting cluster performance. |
| `check_rook_ceph_status` | Executes ceph status in the rook-ceph-tools pod to verify Ceph cluster health (HEALTH_OK/HEALTH_WARN/HEALTH_ERR). |
| `check_registry_status` | Validates container registry accessibility on master nodes and checks for missing images referenced in metadata. |
| `check_configserver_status` | Verifies all ConfigServer pods in the healthbot namespace are running and healthy. |
| `check_influxdb_status` | Ensures all InfluxDB time-series database pods in the healthbot namespace are operational. |
| `check_rabbitmq_status` | Validates RabbitMQ cluster health, checks that all pods are running, and confirms the AMQP (5672) and management (15672) ports are listening. |
| `check_postgres_status` | Comprehensive PostgreSQL check, including schema validation, Patroni cluster status, leader election, replica health, replication lag, and WAL size monitoring. |
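When one of these checks reports a problem, you can often reproduce it manually with standard kubectl commands before digging deeper. The following spot checks are a sketch only; the rook-ceph-tools deployment and the atom-db Patroni pod names follow the conventions used in this document and may differ in your deployment:

```bash
# Node health and pressure conditions (check_node_status, check_*_pressure_status).
kubectl get nodes -o wide
kubectl describe nodes | grep -A 5 "Conditions:"

# Pods in error states across all namespaces (check_pods_status).
kubectl get pods -A | grep -Ev 'Running|Completed'

# Ceph cluster health (check_rook_ceph_status).
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status

# Patroni/PostgreSQL cluster status (check_postgres_status).
kubectl -n common exec atom-db-0 -- patronictl list
```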
Run Health Check
To run the health check script, log in to the primary node and execute the following command:

    root@primary-node:~# ./health-check.sh -v
    2025-10-03 10:15:23 Health status checking...
    ======================================================
    Get node count of Kubernetes cluster
    ======================================================
    OK There are 4 nodes in the cluster.
    ======================================================
    Get node status of Kubernetes cluster
    ======================================================
    OK 4 nodes are in the Ready state.
    NAME            STATUS   ROLES           AGE   VERSION
    master-node-1   Ready    control-plane   45d   v1.28.0
    master-node-2   Ready    control-plane   45d   v1.28.0
    worker-node-1   Ready    <none>          45d   v1.28.0
    worker-node-2   Ready    <none>          45d   v1.28.0
    ...
    ======================================================
    Overall cluster status
    ======================================================
    GREEN
    2025-10-03 10:18:45 Health status checking completed!
The script generates a comprehensive health report with color-coded status indicators (GREEN, AMBER, RED) and logs the results to `/var/log/dv/health-check-daily-summary.log`.
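Because each run appends to this daily summary file, you can review recent results or scan for non-GREEN statuses directly from the log. For example:

```bash
# Show the most recent entries in the daily summary log.
tail -n 50 /var/log/dv/health-check-daily-summary.log

# List any warning or failure statuses that have been recorded.
grep -E 'AMBER|RED' /var/log/dv/health-check-daily-summary.log
```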
You can also schedule the health check with cron. For example:

    # Edit crontab
    crontab -e

    # Run health check daily at 2 AM
    0 2 * * * /path/to/health-check.sh -k /root/.ssh/id_rsa >> /var/log/dv/cron-health-check.log 2>&1

    # Run every 6 hours with verbose output
    0 */6 * * * /path/to/health-check.sh -v -k /root/.ssh/id_rsa >> /var/log/dv/cron-health-check.log 2>&1
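After saving the crontab entry, you can confirm that the job is scheduled and watch the scheduled runs as they are logged:

```bash
# Confirm the health check job is scheduled.
crontab -l | grep health-check

# Follow output from the cron-driven runs.
tail -f /var/log/dv/cron-health-check.log
```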
Color Code
Table 4 lists the color codes used in the health check output.
| Color | Status |
|---|---|
| GREEN | All checks passed successfully |
| AMBER | Warning conditions detected (degraded but operational) |
| RED | Critical failures detected |
Troubleshooting
Table 5 lists the common errors that can occur while running the health checks.
| Error | Cause | Solution |
|---|---|---|
| Kubernetes cluster is unreachable | kubectl cannot connect to the cluster | Verify KUBECONFIG: `echo $KUBECONFIG`<br>Test cluster connectivity: `kubectl cluster-info`<br>Check the API server: `systemctl status kube-apiserver`<br>Verify network connectivity: `ping <api-server-ip>` |
| Failed to retrieve node status - cluster may be unreachable | kubectl commands are timing out | Increase the timeout: `export KUBECTL_TIMEOUT=60`<br>Check cluster load: `kubectl top nodes`<br>Review API server logs: `journalctl -u kube-apiserver -f` |
| SSH key file not found | Invalid SSH key path | Verify the SSH key exists: `ls -la /path/to/ssh_key`<br>Check file permissions: `chmod 600 /path/to/ssh_key`<br>Test SSH connectivity: `ssh -i /path/to/ssh_key root@<node-ip> "echo Connection successful"` |
| jq command not available | jq not installed | RHEL/CentOS: `yum install -y jq`<br>Ubuntu/Debian: `apt-get install -y jq`<br>Verify the installation: `jq --version` |
| No RabbitMQ pods found | RabbitMQ not deployed or namespace incorrect | Check the RabbitMQ deployment: `kubectl get pods -n northstar \| grep rabbitmq`<br>Verify the namespace: `kubectl get namespaces \| grep northstar`<br>Check the deployment status: `kubectl get statefulset -n northstar rabbitmq` |
| No replica pod found to check replication lag | PostgreSQL replicas are not in streaming state | Check Patroni cluster status: `kubectl exec -n common atom-db-0 -- patronictl list`<br>Review PostgreSQL logs: `kubectl logs -n common atom-db-1 --tail=100`<br>Check replication status: `kubectl exec -n common atom-db-0 -- psql -U atom -d atom -c "SELECT * FROM pg_stat_replication;"` |