Repair or Replace Cluster Nodes
You can repair and replace faulty nodes in your Routing Director cluster using Paragon Shell. This topic describes both procedures.
Repair Nodes
To repair a faulty node in your existing Routing Director cluster:
Log in to the Linux root shell of the faulty node.
If you are unable to log in to the faulty node, skip ahead to logging in to the node from which you deployed the cluster (step 6).
Stop and kill all RKE2 services on the faulty node.
root@node-f:~# rke2-killall.sh
If you see the following error, reboot the node using the shutdown -r now command:
rm: cannot remove '/var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.cephfs.csi.ceph.com/aadbbffbb9d894e02c7c7a2510fe5fd72e28d2fe4a23565d764d84176f055d62/globalmount': Device or resource busy
Uninstall RKE2.
root@node-f:~# rke2-uninstall.sh
Reboot the node using the shutdown -r now command.
Clear the data on the disk partition used for Ceph.
root@node-f:~# wipefs -a -f /dev/partition
root@node-f:~# dd if=/dev/zero of=/dev/partition bs=1M count=100
Use /dev/sdb if you used the OVA bundle to deploy your cluster on an ESXi server. Use /dev/vdb if you deployed on Proxmox VE.
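If you want to confirm which block device Ceph is using before wiping it (an optional check, not part of the documented procedure), you can list the block devices and their partitions on the node:
root@node-f:~# lsblk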
Log in to the Linux root shell of the node you deployed the cluster from. Note that you can repair the faulty node from any functional node in the cluster.
If the installer node is faulty, perform the following steps.
Log in to one of the other two primary nodes (that is, a node with Kubernetes index 2 or 3).
Type exit to exit to the Linux root shell.
Edit the inventory file to reorder the list of node IP addresses. Use a text editor to move the IP address of the node that you are logged in to so that it appears first among the three IP addresses listed under the master section.
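For illustration only: the following sketch assumes an Ansible-style inventory file (saved at /epic/config/inventory, as shown later in this topic), a section named master, and hypothetical addresses; the exact markup of your inventory file may differ. If you logged in to 10.1.2.4, move that address to the top of the master section:
[master]
10.1.2.4
10.1.2.3
10.1.2.5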
Delete the faulty node from the cluster.
root@primary1:~# kubectl delete nodes faulty-node-hostname
Where faulty-node-hostname is the hostname of the node you want to repair.
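If you need to confirm the hostname of the faulty node first (an optional check, not part of the documented procedure), you can list the cluster nodes together with their IP addresses:
root@primary1:~# kubectl get nodes -o wide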
Type cli to enter Paragon Shell.
If you are not logged in to the node from which you deployed the cluster, log out of the current node and log in to the installer node.
Repair the node from Paragon Shell.
root@primary1> request paragon repair-node address ip-address-of-faulty-node
Where ip-address-of-faulty-node is the IP address of the node that you want to repair.
(Optional) Ensure that the cluster is healthy. Type request paragon health-check to determine cluster health.
Repair node failure scenario
You might encounter issues with node repair where the repair procedure completes successfully, but the rook-ceph-osd pod is in the Init:CrashLoopBackOff state on the repaired node. For example:
root@primary3> show paragon cluster pods | grep rook-ceph-osd
rook-ceph    rook-ceph-osd-0-c744fd5f9-jgnlr     2/2   Running                 0              5h39m
rook-ceph    rook-ceph-osd-1-c8f7c8c4f-k98cr     2/2   Running                 0              5h39m
rook-ceph    rook-ceph-osd-2-9957b89f-jv6nf      0/2   Init:CrashLoopBackOff   22 (81s ago)   142m
rook-ceph    rook-ceph-osd-3-77f7b7f548-26d67    2/2   Running                 0              5h38m
rook-ceph    rook-ceph-osd-prepare-Primary1      0/1   Completed               0              100m
rook-ceph    rook-ceph-osd-prepare-Primary2      0/1   Completed               0              100m
rook-ceph    rook-ceph-osd-prepare-Worker4       0/1   Completed               0              99m
Here the failed OSD number is 2.
Perform the following additional steps to fix the issue.
Stop the rook-ceph-operator by setting its replica count to 0.
root@primary3:~# kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
Remove the failing rook-ceph-osd-osd-number OSD processes. For example:
root@primary3:~# kubectl delete deploy -n rook-ceph rook-ceph-osd-2
Connect to the toolbox.
root@primary3:~# kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash
Mark out the failed osd-number OSD. For example:
bash-4.4$ ceph osd out 2
Remove the failed osd-number OSD. For example:
bash-4.4$ ceph osd purge 2 --yes-i-really-mean-it
bash-4.4$ exit
Clear the data on the disk partition used for Ceph.
root@primary3:~# wipefs -a -f /dev/vdb
root@primary3:~# dd if=/dev/zero of=/dev/vdb bs=1M count=100
Use /dev/sdb if you used the OVA bundle to deploy your cluster on an ESXi server. Use /dev/vdb if you deployed on Proxmox VE.
Reboot the node.
root@primary3:~# shutdown -r now
Restart the rook-ceph-operator.
root@primary3:~# kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
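To confirm that the repaired OSD comes back up (an optional check, not part of the documented procedure), you can list the rook-ceph OSD pods and rerun the command until the previously failing pod reports Running:
root@primary3:~# kubectl get po -n rook-ceph | grep rook-ceph-osd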
Replace Faulty Nodes
Log in to the Linux root shell of one of the functional nodes of the cluster and delete the faulty node.
root@primary1:~# kubectl delete node faulty-node-hostname
Where faulty-node-hostname is the hostname of the node you want to replace.
Prepare the replacement node.
You must create and configure the replacement node before replacing a faulty node.
The node that replaces the faulty node can have a new IP address or the same IP address as the faulty node; in either case, you must create and configure the node VM.
To create the node, depending on your hypervisor server, perform the steps detailed in Create the Node VMs.
Once the node is created, configure the node as detailed in Configure the Node VMs.
Replace the Faulty Node:
Once the replacement node is prepared, log in to the node from which you deployed your existing Routing Director cluster. You are placed in Paragon Shell.
If the IP address of the replacement node is the same as the IP address of the faulty node, go to step 5.
If the IP address of the replacement node is different from the IP address of the faulty node, perform the following steps.
To edit the cluster, type configure to enter configuration mode.
root@primary1> configure
Entering configuration mode
[edit]
root@primary1#
Remove the faulty node using the delete paragon cluster nodes kubernetes x command, where x is the index number of the node that you want to remove. For example, if the node with Kubernetes index 3 is faulty, use:
root@primary1# delete paragon cluster nodes kubernetes 3
Add the replacement node to the cluster configuration in place of the node you removed, using the set paragon cluster nodes kubernetes x address node-x-new-IP command, where x is the index number of the faulty node and node-x-new-IP is the IP address of the replacement node. For example, if 10.1.2.11 is the IP address of the replacement node and the node with Kubernetes index 3 is faulty:
root@primary1# set paragon cluster nodes kubernetes 3 address 10.1.2.11
Commit the configuration.
root@primary1# commit
commit complete
(Optional) Verify the cluster configuration.
root@primary1# show paragon cluster nodes
kubernetes 1 {
    address 10.1.2.3;
}
kubernetes 2 {
    address 10.1.2.4;
}
kubernetes 3 {
    address 10.1.2.11;
}
kubernetes 4 {
    address 10.1.2.6;
}
Exit configuration mode and regenerate the configuration files.
root@primary1# exit
Exiting configuration mode
root@primary1> request paragon config
Paragon inventory file saved at /epic/config/inventory
Paragon config file saved at /epic/config/config
Regenerate SSH keys on the cluster nodes.
When prompted, enter the SSH password for all the existing VMs and the new VM. Enter the same password that you configured to log in to the VMs.
root@primary1> request paragon ssh-key
Please enter comma-separated list of IP addresses: 10.1.2.3,10.1.2.4,10.1.2.6,10.1.2.11
Please enter SSH username for the node(s): root
Please enter SSH password for the node(s): password
checking server reachability and ssh connectivity ...
Connectivity ok for 10.1.2.3
Connectivity ok for 10.1.2.4
Connectivity ok for 10.1.2.6
Connectivity ok for 10.1.2.11
<output snipped>
Replace the node.
root@primary1> request paragon replace-node address 10.1.2.11
Process running with PID: 23xx032
To track progress, run 'monitor start /epic/config/log'
Where 10.1.2.11 is the IP address of the replacement node.
Type exit to exit to the Linux root shell.
Execute the post-install script to sync all configuration.
root@primary1:~# ./post_install.sh -i 10.1.2.11
(Optional) Ensure that the cluster is healthy. Type request paragon health-check in Paragon Shell to determine cluster health.
Replace-node failure scenario
You might encounter issues with node replacement when the IP address (or hostname) of the replacement node is different from that of the faulty node. Perform the following additional steps to fix the issue.
Log in to the Linux root shell of the node from where you deployed the cluster.
Delete the local volume PVC associated with the faulty node, if any.
root@primary1:~# paragon-remove-object-from-deleted-node -t faulty-node-hostname -y
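To verify that no stale persistent volumes or claims remain bound to the deleted node (an optional check, not part of the documented procedure), you can list them:
root@primary1:~# kubectl get pv
root@primary1:~# kubectl get pvc -A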
Run the following command to check whether any of the Rook and Ceph pods installed in the rook-ceph namespace are not in the Running or Completed state.
root@primary1:~# kubectl get po -n rook-ceph
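As an optional shortcut (not part of the documented procedure), you can filter the output to show only pods that are not Running or Completed:
root@primary1:~# kubectl get po -n rook-ceph | grep -vE 'Running|Completed'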
Remove the failing OSD processes.
root@primary1:~# kubectl delete deploy -n rook-ceph rook-ceph-osd-osd-ID-number
Connect to the toolbox.
root@primary1:~# kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash
Identify the failing OSD.
bash-4.4$ ceph osd status
Mark out the failed OSD.
bash-4.4$ ceph osd out osd-ID-number
Remove the failed OSD.
bash-4.4$ ceph osd purge osd-ID-number --yes-i-really-mean-it
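(Optional, not part of the documented procedure) Before exiting the toolbox, you can confirm that the OSD was removed and check overall cluster health:
bash-4.4$ ceph osd tree
bash-4.4$ ceph status
bash-4.4$ exit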