Repair or Replace Cluster Nodes

You can repair and replace faulty nodes in your Paragon Automation cluster using Paragon Shell. This topic describes how to repair and replace nodes in your cluster.

Repair Nodes

To repair a faulty node in your existing Paragon Automation cluster:

  1. Log in to the Linux root shell of the faulty node.

    If you are unable to log in to the faulty node, go to step 6.

  2. Stop and kill all RKE2 services on the faulty node.

    If you see the following error, reboot the node using the shutdown -r now command.

    rm: cannot remove '/var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.cephfs.csi.ceph.com/aadbbffbb9d894e02c7c7a2510fe5fd72e28d2fe4a23565d764d84176f055d62/globalmount': Device or resource busy
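
    To stop and kill the RKE2 services, you can use the RKE2 service unit and the kill-all script that RKE2 installs. A minimal sketch, assuming default install paths and that the faulty node runs the rke2-server service (use rke2-agent on a worker node):

    systemctl stop rke2-server.service
    /usr/local/bin/rke2-killall.sh
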
  3. Uninstall RKE2.
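
    A minimal sketch, assuming RKE2 installed its uninstall script in the default location:

    /usr/local/bin/rke2-uninstall.sh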

  4. Reboot the node using the shutdown -r now command.

  5. Clear the data on the disk partition used for Ceph.

    Use /dev/sdb if you used the OVA bundle to deploy your cluster on an ESXi server. Use /dev/vdb if you deployed on Proxmox VE.
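
    One way to clear the partition is with standard disk utilities; a sketch, assuming the Ceph device is /dev/sdb:

    sgdisk --zap-all /dev/sdb
    dd if=/dev/zero of=/dev/sdb bs=1M count=100 oflag=direct,dsync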

  6. Log in to the Linux root shell of the node you deployed the cluster from. Note that you can repair the faulty node from any functional node in the cluster.

    If the installer node is faulty, then perform the following steps.

    1. Log in to one of the other two primary nodes (that is, nodes with Kubernetes index 2 or 3).

    2. Type exit to exit to the Linux root shell.

    3. Edit the inventory file to reorder the list of node IP addresses. Using a text editor, move the IP address of the node that you logged in to so that it appears first among the three addresses listed under the master section.

  7. Delete the faulty node from the cluster.

    Where faulty-node-hostname is the hostname of the node you want to repair.
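
    The node is typically deleted with kubectl; a sketch using the placeholder from this step:

    kubectl delete node faulty-node-hostname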

  8. Type cli to enter Paragon Shell.

    If you are not logged in to the node from which you deployed the cluster, then log out of the current node, and log in to the installer node.

  9. Repair the node from Paragon Shell.

    Where ip-address is the IP address of the node that you want to repair.
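
    The exact Paragon Shell syntax is not shown here; it is likely of the following form (the repair-node keyword is an assumption, so confirm it against your release):

    request paragon repair-node address ip-address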

  10. (Optional) Ensure that the cluster is healthy. Type request paragon health-check to determine cluster health.

Repair-Node Failure Scenario

You might encounter issues with node repair, where the repair procedure completes successfully but the rook-ceph-osd pod on the repaired node is stuck in the Init:CrashLoopBackOff state. For example:
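
To see which OSD is affected, you can list the OSD pods in the rook-ceph namespace; the failing pod's name contains the OSD number (Rook names these pods rook-ceph-osd-<number>-<hash>):

  kubectl -n rook-ceph get pods | grep rook-ceph-osd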

Here the failed OSD number is 2.

Perform the following additional steps to fix the issue.

  1. Stop the rook-ceph-operator by setting its replica count to 0.
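
    A sketch using kubectl:

    kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0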

  2. Remove the failing rook-ceph-osd-osd-number OSD processes. For example:
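
    A sketch, assuming the failed OSD number is 2 as in the example above:

    kubectl -n rook-ceph scale deployment rook-ceph-osd-2 --replicas=0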

  3. Connect to the toolbox.
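
    Assuming the standard Rook toolbox deployment name, rook-ceph-tools:

    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash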

  4. Mark out the failed osd-number OSD. For example:
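
    For the example OSD number 2:

    ceph osd out osd.2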

  5. Remove the failed osd-number OSD. For example:
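
    For the example OSD number 2:

    ceph osd purge 2 --yes-i-really-mean-it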

  6. Clear the data on the disk partition used for Ceph.

    Use /dev/sdb if you used the OVA bundle to deploy your cluster on an ESXi server. Use /dev/vdb if you deployed on Proxmox VE.

  7. Reboot the node.

  8. Restart the rook-ceph-operator.
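
    A sketch that scales the operator deployment back up:

    kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1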

Replace Faulty Nodes

You can replace a faulty node with a replacement node. To replace a node, you must prepare the replacement node, delete the faulty node, and then add the replacement node to the cluster.
  1. Log in to the Linux root shell of one of the functional nodes of the cluster and delete the faulty node.

    Where faulty-node-hostname is the hostname of the node you want to replace.

  2. Prepare the replacement node.

    You must create and configure the replacement node before replacing a faulty node.

    The replacement node can have a new IP address or the same IP address as the faulty node; in either case, you must create and configure the node VM.

    1. To create the node, depending on your hypervisor server, perform the steps detailed in Create the Node VMs.

    2. Once the node is created, configure the node as detailed in Configure the Node VMs.

  3. Replace the faulty node:

    Once the replacement node is prepared, log in to the node from which you deployed your existing Paragon Automation cluster. You are placed in Paragon Shell.

  4. If the IP address of the replacement node is the same as the IP address of the faulty node, go to step 5. If the IP address is different, perform the following steps.
    1. To edit the cluster, type configure to enter the configuration mode.

    2. Remove the faulty node using the delete paragon cluster nodes kubernetes x command.

      Where x is the index number of the node that you want to remove. For example, if the node with Kubernetes index 3 is faulty, use:
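
      delete paragon cluster nodes kubernetes 3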

    3. Add the replacement node to the cluster configuration in place of the node you removed using the set paragon cluster nodes kubernetes x address node-x-new-IP command.

      Where x is the index number of the faulty node and node-x-new-IP is the IP address of the replacement node. For example, if 10.1.2.11 is the IP address of the replacement node and if the node with Kubernetes index 3 is faulty:
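
      set paragon cluster nodes kubernetes 3 address 10.1.2.11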

    4. Commit the configuration.
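
      Paragon Shell uses Junos-style configuration mode, so the commit is simply:

      commit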

    5. (Optional) Verify the cluster configuration.

    6. Exit configuration mode and regenerate the configuration files.

  5. Regenerate SSH keys on the cluster nodes.

    When prompted, enter the SSH password for all the existing VMs and the new VM. Enter the same password that you configured to log in to the VMs.

  6. Replace the node.

    Where 10.1.2.11 is the IP address of the replacement node.
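
    The exact Paragon Shell syntax is not shown here; it is likely of the following form (the replace-node keyword is an assumption, so confirm it against your release):

    request paragon replace-node address 10.1.2.11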

  7. (Optional) Ensure that the cluster is healthy. Type request paragon health-check to determine cluster health.

Replace-Node Failure Scenario

You might encounter issues with node replacement when the IP address (or hostname) of the replacement node is different from that of the faulty node. Perform the following additional steps to fix the issue.

  1. Log in to the Linux root shell of the node from where you deployed the cluster.

  2. Delete the local volume pvc associated with the faulty node, if any.
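
    A sketch of how you might find and delete such a PVC; the volume and claim names depend on your cluster, so treat the angle-bracket values as illustrative placeholders:

    kubectl get pv                                   # list persistent volumes and their claims
    kubectl describe pv <pv-name>                    # check the node affinity to find volumes tied to the faulty node
    kubectl -n <namespace> delete pvc <pvc-name>     # delete the claim associated with the faulty node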

  3. Run the following command and check whether any of the Rook and Ceph pods installed in the rook-ceph namespace are in a state other than Running or Completed.
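
    The check is likely a pod listing such as:

    kubectl get pods -n rook-ceph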

  4. Remove the failing OSD processes, as sketched after this list.

  5. Connect to the toolbox.

  6. Identify the failing OSD.

  7. Mark out the failed OSD.

  8. Remove the failed OSD.
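
    Steps 4 through 8 mirror the repair failure scenario earlier in this topic. As a sketch, assuming the failing OSD again turns out to be number 2:

    kubectl -n rook-ceph scale deployment rook-ceph-osd-2 --replicas=0    # step 4: stop the failing OSD deployment
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash          # step 5: connect to the toolbox
    ceph osd tree                                                         # step 6: identify the OSD that is down
    ceph osd out osd.2                                                    # step 7: mark out the failed OSD
    ceph osd purge 2 --yes-i-really-mean-it                               # step 8: remove the failed OSD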