Reboot Nodes in Paragon Automation
Read these instructions to reboot the nodes in your Paragon Automation cluster. You reboot the nodes one at a time. Follow these steps to reboot a node:
1. Back up your current Paragon Automation cluster data.
root@primary-node:~# data.sh --backup
2. Copy the backed-up data to a secure secondary server outside the cluster. The output of the data.sh script includes the location of the backup file. Run the scp -prv command to copy the backup file from the local host to the secondary server outside the cluster.
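For example, assuming the backup was written to a directory reported by data.sh and the secondary server is reachable as backup-server (the path, user, and host below are illustrative, not from the product documentation), the copy might look like this:
root@primary-node:~# scp -prv /<backup-directory> user@backup-server:/var/paragon-backups/
The -p option preserves file timestamps and modes, -r copies the directory recursively, and -v prints verbose progress output.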
3. Check for any errors in the pods by using the health-check.sh script.
root@primary-node:~# health-check.sh
4. Use the kubectl get nodes command to view the status of the cluster nodes. The status of the nodes must be Ready, and the roles must be either control-plane or none.
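Sample output is shown below; the node addresses match the examples later in this topic, but which nodes hold the control-plane role, the ages, and the versions are illustrative:
root@primary-node:~# kubectl get nodes
NAME            STATUS   ROLES           AGE   VERSION
172.25.152.18   Ready    control-plane   42d   v1.25.6
172.25.152.19   Ready    control-plane   42d   v1.25.6
172.25.152.20   Ready    <none>          42d   v1.25.6
172.25.152.21   Ready    <none>          42d   v1.25.6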
5. Cordon off the primary node to remove it from scheduling.
Cordoning a Kubernetes node marks it as unavailable to the Kubernetes scheduler, preventing it from hosting any new pods. This is useful when you need to perform maintenance on a node without affecting the currently running pods.
root@primary-node:~# kubectl cordon <ip-address>
node/<ip-address> cordoned
The command cordons the node, making it ineligible to host any new pods.
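To verify that the node is cordoned, you can list it again; a cordoned node reports SchedulingDisabled alongside its status (output is illustrative):
root@primary-node:~# kubectl get nodes <ip-address>
NAME            STATUS                     ROLES           AGE   VERSION
172.25.152.18   Ready,SchedulingDisabled   control-plane   42d   v1.25.6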
6. After cordoning a node, drain it to evict the running pods and reschedule them onto other nodes. Use the following command to drain the node (safely evict all pods from the node).
root@primary-node:~# kubectl drain <node-name/ip-address> --ignore-daemonsets --grace-period=0 --force --delete-emptydir-data
The --ignore-daemonsets option skips pods managed by DaemonSets (which cannot be evicted), --force evicts pods that are not managed by a controller, and --delete-emptydir-data deletes pods that use emptyDir volumes.
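The drain command prints a line for each pod it evicts and ends with a drained message. Typical output looks similar to the following sketch (the evicted pod name is taken from the sample output later in this topic and is illustrative):
root@primary-node:~# kubectl drain 172.25.152.18 --ignore-daemonsets --grace-period=0 --force --delete-emptydir-data
node/172.25.152.18 already cordoned
evicting pod kube-system/backup-29140585-t5s6g
pod/backup-29140585-t5s6g evicted
node/172.25.152.18 drained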
7. Identify whether any pods are waiting to be rescheduled.
root@primary-node:~# kubectl get po -A -o wide | grep -v Running | grep -v Completed
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE   IP   NODE   NOMINATED NODE   READINESS GATES
If only the header line is displayed, as in this example, no pods are waiting to be rescheduled. Pods that are pending on the cordoned node are listed in the output; fields that do not have a value are marked <none>. For example, the following output shows only Completed pods, which have finished running and do not need to be rescheduled:
[root@rhel-84-node1 ~]# kubectl get po -A -o wide | grep -v Running
NAMESPACE     NAME                                        READY   STATUS      RESTARTS   AGE     IP              NODE            NOMINATED NODE   READINESS GATES
auditlog      auditlog-purge-cron-29139840-g7l4r          0/1     Completed   0          12h     10.244.2.158    172.25.152.20   <none>           <none>
ems           jobmanager-purge-cron-29138400-f4ln5        0/1     Completed   0          36h     10.244.2.175    172.25.152.20   <none>           <none>
ems           jobmanager-purge-cron-29139840-drllc        0/1     Completed   0          12h     10.244.2.176    172.25.152.20   <none>           <none>
kube-system   backup-29140575-p6tpx                       0/1     Completed   0          14m     172.25.152.18   172.25.152.18   <none>           <none>
kube-system   backup-29140580-24pq5                       0/1     Completed   0          9m50s   172.25.152.18   172.25.152.18   <none>           <none>
kube-system   backup-29140585-t5s6g                       0/1     Completed   0          4m50s   172.25.152.18   172.25.152.18   <none>           <none>
rook-ceph     rook-ceph-osd-prepare-172.25.152.18-g72hc   0/1     Completed   0          4h55m   10.244.142.9    172.25.152.18   <none>           <none>
rook-ceph     rook-ceph-osd-prepare-172.25.152.19-9vvx4   0/1     Completed   0          4h55m   10.244.227.21   172.25.152.19   <none>           <none>
rook-ceph     rook-ceph-osd-prepare-172.25.152.20-sklvr   0/1     Completed   0          4h55m   10.244.2.174    172.25.152.20   <none>           <none>
rook-ceph     rook-ceph-osd-prepare-172.25.152.21-q7vdx   0/1     Completed   0          4h55m   10.244.91.143   172.25.152.21   <none>           <none>
8. If there are no pods waiting to be rescheduled, recheck for any errors in the pods by using the health-check.sh script.
root@primary-node:~# health-check.sh
9. Reboot the cordoned node.
The node takes roughly 5 to 10 minutes to reboot.
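The documentation does not mandate a specific reboot method; one option, assuming you have SSH access to the node as root, is to reboot it remotely:
root@primary-node:~# ssh root@<ip-address> reboot
While the node is down, kubectl get nodes reports its status as NotReady,SchedulingDisabled; once the node is back up, the status returns to Ready,SchedulingDisabled because the node is still cordoned.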
10. Uncordon the rebooted node to return it to scheduling. Run the following command on primary node 1.
root@primary-node:~# kubectl uncordon <ip-address>
The pods in the cluster are redistributed within 15 minutes of running the command.
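You can confirm that the node is schedulable again by checking that SchedulingDisabled no longer appears in its status (output is illustrative):
root@primary-node:~# kubectl get nodes <ip-address>
NAME            STATUS   ROLES           AGE   VERSION
172.25.152.18   Ready    control-plane   42d   v1.25.6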
11. After the pods are redistributed, check for any errors in the pods by using the health-check.sh script.
root@primary-node:~# health-check.sh
12. Identify the newly rebooted node.
root@primary-node:~# kubectl get po -A -o wide | grep -v Running
NAMESPACE     NAME                                        READY   STATUS      RESTARTS   AGE     IP              NODE            NOMINATED NODE   READINESS GATES
auditlog      auditlog-purge-cron-29139840-g7l4r          0/1     Completed   0          12h     10.244.2.158    172.25.152.20   <none>           <none>
ems           jobmanager-purge-cron-29138400-f4ln5        0/1     Completed   0          36h     10.244.2.175    172.25.152.20   <none>           <none>
ems           jobmanager-purge-cron-29139840-drllc        0/1     Completed   0          12h     10.244.2.176    172.25.152.20   <none>           <none>
kube-system   backup-29140575-p6tpx                       0/1     Completed   0          14m     172.25.152.18   172.25.152.18   <none>           <none>
kube-system   backup-29140580-24pq5                       0/1     Completed   0          9m50s   172.25.152.18   172.25.152.18   <none>           <none>
kube-system   backup-29140585-t5s6g                       0/1     Completed   0          4m50s   172.25.152.18   172.25.152.18   <none>           <none>
rook-ceph     rook-ceph-osd-prepare-172.25.152.18-g72hc   0/1     Completed   0          4h55m   10.244.142.9    172.25.152.18   <none>           <none>
rook-ceph     rook-ceph-osd-prepare-172.25.152.19-9vvx4   0/1     Completed   0          4h55m   10.244.227.21   172.25.152.19   <none>           <none>
rook-ceph     rook-ceph-osd-prepare-172.25.152.20-sklvr   0/1     Completed   0          4h55m   10.244.2.174    172.25.152.20   <none>           <none>
rook-ceph     rook-ceph-osd-prepare-172.25.152.21-q7vdx   0/1     Completed   0          4h55m   10.244.91.143   172.25.152.21   <none>           <none>
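As an alternative to filtering with grep, you can list only the pods scheduled on the rebooted node and inspect the AGE column; recently restarted pods indicate the node that came back up (a sketch, not part of the documented procedure):
root@primary-node:~# kubectl get po -A -o wide --field-selector spec.nodeName=<ip-address>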
13. Repeat step 3 through step 12 to reboot the other nodes in Paragon Automation.