Repair or Replace Cluster Nodes
You can repair and replace faulty nodes in your Paragon Automation cluster using Paragon Shell. This topic describes how to repair and replace nodes in your cluster.
Repair Nodes
To repair a faulty node in your existing Paragon Automation cluster, perform the following steps.
Log in to the Linux root shell of the faulty node.
If you are unable to log in to the faulty node, go to step 6.
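For example, assuming the faulty node is reachable over SSH and has the hypothetical IP address 10.1.2.5 (substitute your node's actual address), you might log in as follows:
root@workstation:~# ssh root@10.1.2.5   # 10.1.2.5 is a hypothetical faulty-node address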
Stop and kill all RKE2 services on the faulty node.
root@node-f:~# rke2-killall.sh
If you see the following error, reboot the node using the shutdown -r now command:
rm: cannot remove '/var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.cephfs.csi.ceph.com/aadbbffbb9d894e02c7c7a2510fe5fd72e28d2fe4a23565d764d84176f055d62/globalmount': Device or resource busy
Uninstall RKE2.
root@node-f:~# rke2-uninstall.sh
Reboot the node using the shutdown -r now command.
Clear the data on the disk partition used for Ceph.
root@node-f:~# wipefs -a -f /dev/partition
root@node-f:~# dd if=/dev/zero of=/dev/partition bs=1M count=100
Use /dev/sdb if you used the OVA bundle to deploy your cluster on an ESXi server. Use /dev/vdb if you deployed on Proxmox VE.
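If you are not sure which device backs the Ceph partition, you can list the block devices before wiping. This check is an illustration rather than part of the documented procedure, and the device names are examples:
root@node-f:~# lsblk -f   # confirm the Ceph data device (for example, /dev/sdb or /dev/vdb) before wiping it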
Log in to the Linux root shell of the node you deployed the cluster from. Note that you can repair the faulty node from any functional node in the cluster.
If the installer node is faulty, then perform the following steps.
Log in to one of the other two primary nodes (that is, a node with Kubernetes index 2 or 3).
Type exit to exit to the Linux root shell.
Edit the inventory file to reorder the list of node IP addresses. Using a text editor, move the node that you logged in to to the top of the three IP addresses listed under the master section.
Delete the faulty node from the cluster.
root@primary1:~# kubectl delete nodes faulty-node-hostname
Where faulty-node-hostname is the hostname of the node you want to repair.
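If you need to confirm the hostname and status of the faulty node before deleting it, a quick check with standard kubectl (not specific to Paragon Shell) is:
root@primary1:~# kubectl get nodes -o wide   # the faulty node typically shows a NotReady status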
Type cli to enter Paragon Shell. If you are not logged in to the node from which you deployed the cluster, log out of the current node and log in to the installer node.
Repair the node from Paragon Shell.
root@primary1> request paragon repair-node address ip-address-of-faulty-node
Where ip-address-of-faulty-node is the IP address of the node that you want to repair.
(Optional) Ensure that the cluster is healthy. Type request paragon health-check to determine cluster health.
Repair node failure scenario
You might encounter issues with node repair where the repair procedure is executed successfully, but rook-ceph-osd is in the Init:CrashLoopBackOff state on the repaired node. For example:
root@primary3> show paragon cluster pods | grep rook-ceph-osd
rook-ceph   rook-ceph-osd-0-c744fd5f9-jgnlr    2/2   Running                 0              5h39m
rook-ceph   rook-ceph-osd-1-c8f7c8c4f-k98cr    2/2   Running                 0              5h39m
rook-ceph   rook-ceph-osd-2-9957b89f-jv6nf     0/2   Init:CrashLoopBackOff   22 (81s ago)   142m
rook-ceph   rook-ceph-osd-3-77f7b7f548-26d67   2/2   Running                 0              5h38m
rook-ceph   rook-ceph-osd-prepare-Primary1     0/1   Completed               0              100m
rook-ceph   rook-ceph-osd-prepare-Primary2     0/1   Completed               0              100m
rook-ceph   rook-ceph-osd-prepare-Worker4      0/1   Completed               0              99m
Here the failed OSD number is 2.
Perform the following additional steps to fix the issue.
Stop the rook-ceph-operator by scaling its replicas to 0.
root@primary3:~# kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
Remove the failing rook-ceph-osd-osd-number OSD processes. For example:
root@primary3:~# kubectl delete deploy -n rook-ceph rook-ceph-osd-2
Connect to the toolbox.
root@primary1:~# kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash
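Inside the toolbox, you can optionally confirm which OSD is down before removing it. These are standard Ceph CLI commands, and the output depends on your cluster:
bash-4.4$ ceph osd tree     # lists OSDs with their up/down status
bash-4.4$ ceph osd status   # shows per-OSD usage and state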
Mark out the failed osd-number OSD. For example:
bash-4.4$ ceph osd out 2
Remove the failed osd-number OSD, and then exit the toolbox. For example:
bash-4.4$ ceph osd purge 2 --yes-i-really-mean-it
bash-4.4$ exit
Clear the data on the disk partition used for Ceph.
root@primary3:~# wipefs -a -f /dev/vdb
root@primary3:~# dd if=/dev/zero of=/dev/vdb bs=1M count=100
Use /dev/sdb if you used the OVA bundle to deploy your cluster on an ESXi server. Use /dev/vdb if you deployed on Proxmox VE.
Reboot the node.
root@primary3:~# shutdown -r now
Restart the rook-ceph-operator.
root@primary3:~# kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
Replace Faulty Nodes
Log in to the Linux root shell of one of the functional nodes of the cluster and delete the faulty node.
root@primary1:~# kubectl delete node faulty-node-hostname
Where faulty-node-hostname is the hostname of the node you want to replace.
Prepare the replacement node.
You must create and configure the replacement node before replacing a faulty node.
The node which is replacing the faulty node can have a new IP address or the same IP address as the faulty node, but you will still need to create and configure the node VM.
To create the node, depending on your hypervisor server, perform the steps detailed in Create the Node VMs.
Once the node is created, configure the node as detailed in Configure the node VMs.
Replace the Faulty Node:
Once the replacement node is prepared, log in to the node from which you deployed your existing Paragon Automation cluster. You are placed in Paragon Shell.
If the IP address of the replacement node is the same as the IP address of the faulty node, go to step 5.
If the IP address of the replacement node is different from the IP address of the faulty node, perform the following steps.
To edit the cluster, type configure to enter configuration mode.
root@primary1> configure
Entering configuration mode
[edit]
root@primary1#
Remove the faulty node using the delete paragon cluster nodes kubernetes x command.
Where x is the index number of the node that you want to remove. For example, if the node with Kubernetes index 3 is faulty, use:
root@primary1# delete paragon cluster nodes kubernetes 3
Add the replacement node to the cluster configuration in place of the node you removed using the set paragon cluster nodes kubernetes x address node-x-new-IP command.
Where x is the index number of the faulty node and node-x-new-IP is the IP address of the replacement node. For example, if 10.1.2.11 is the IP address of the replacement node and the node with Kubernetes index 3 is faulty:
root@primary1# set paragon cluster nodes kubernetes 3 address 10.1.2.11
Commit the configuration.
root@primary1# commit
commit complete
(Optional) Verify the cluster configuration.
root@primary1# show paragon cluster nodes
kubernetes 1 {
    address 10.1.2.3;
}
kubernetes 2 {
    address 10.1.2.4;
}
kubernetes 3 {
    address 10.1.2.11;
}
kubernetes 4 {
    address 10.1.2.6;
}
Exit configuration mode and regenerate the configuration files.
root@primary1# exit
Exiting configuration mode
root@primary1> request paragon config
Paragon inventory file saved at /epic/config/inventory
Paragon config file saved at /epic/config/config
Regenerate SSH keys on the cluster nodes.
When prompted, enter the SSH password for all the existing VMs and the new VM. Enter the same password that you configured to log in to the VMs.
root@primary1> request paragon ssh-key
Please enter comma-separated list of IP addresses: 10.1.2.3,10.1.2.4,10.1.2.6,10.1.2.11
Please enter SSH username for the node(s): root
Please enter SSH password for the node(s): password
checking server reachability and ssh connectivity ...
Connectivity ok for 10.1.2.3
Connectivity ok for 10.1.2.4
Connectivity ok for 10.1.2.6
Connectivity ok for 10.1.2.11
<output snipped>
Replace the node.
root@primary1> request paragon replace-node address 10.1.2.11
Process running with PID: 23xx032
To track progress, run 'monitor start /epic/config/log'
Where 10.1.2.11 is the IP address of the replacement node.
(Optional) Ensure that the cluster is healthy. Type request paragon health-check to determine cluster health.
Replace-node failure scenario
You might encounter issues with node replacement when the IP address (or hostname) of the replacement node is different from the IP address (or hostname) of the faulty node. If you do, perform the following additional steps to fix the issue.
Log in to the Linux root shell of the node from where you deployed the cluster.
Delete the local volume PVC associated with the faulty node, if any.
root@primary1:~# paragon-remove-object-from-deleted-node -t faulty-node-hostname -y
Run the following command to check whether any Rook and Ceph pods installed in the rook-ceph namespace are in a state other than Running or Completed.
root@primary1:~# kubectl get po -n rook-ceph
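As a convenience, assuming a standard bash shell, you can filter the listing to show only pods that are not Running or Completed; the grep pipeline is an illustration, not part of the documented procedure:
root@primary1:~# kubectl get po -n rook-ceph | grep -Ev 'Running|Completed'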
Remove the failing OSD processes.
root@primary1:~# kubectl delete deploy -n rook-ceph rook-ceph-osd-number
Connect to the toolbox.
root@primary1:~# kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash
Identify the failing OSD.
bash-4.4$ ceph osd status
Mark out the failed OSD.
bash-4.4$ ceph osd out osd-ID-number
Remove the failed OSD.
bash-4.4$ ceph osd purge osd-ID-number --yes-i-really-mean-it
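After the purge, you can optionally confirm that the failed OSD is gone and then exit the toolbox. ceph osd tree and ceph status are standard Ceph CLI commands, and the output depends on your cluster:
bash-4.4$ ceph osd tree   # the purged OSD should no longer be listed
bash-4.4$ ceph status     # cluster health should return to HEALTH_OK once recovery completes
bash-4.4$ exit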