Replace a Control Plane Node
SUMMARY Learn how to identify and replace an unhealthy node in an OpenShift cluster.
Replacing a control plane node requires you to first identify and remove the unhealthy node. After you remove the unhealthy node, you can then add the new replacement node.
We provide these example procedures purely for informational purposes. See Red Hat OpenShift documentation (https://docs.openshift.com/) for the official procedure.
Remove an Unhealthy Control Plane Node
-
Check the status of the control plane nodes to identify the unhealthy member.
user@ai-client:~# oc get nodes -l node-role.kubernetes.io/master
NAME                          STATUS     ROLES    AGE   VERSION
ocp1.mycluster.contrail.lan   Ready      master   16d   v1.21.6+bb8d50a
ocp2.mycluster.contrail.lan   Ready      master   16d   v1.21.6+bb8d50a
ocp3.mycluster.contrail.lan   NotReady   master   16d   v1.21.6+bb8d50a
In this example, ocp3 is the unhealthy node.
-
Back up the etcd database on one of the healthy nodes by following the procedure in Back Up the Etcd Database.
-
List the etcd members.
user@ai-client:~# oc get pods -n openshift-etcd -o wide | grep -v etcd-quorum-guard | grep etcd
etcd-ocp1   4/4   Running   0   18d   172.16.0.11   ocp1   <none>   <none>
etcd-ocp2   4/4   Running   0   18d   172.16.0.12   ocp2   <none>   <none>
etcd-ocp3   4/4   Running   0   18d   172.16.0.13   ocp3   <none>   <none>
-
Open a remote shell to an etcd pod on a healthy node (for example, ocp1).
user@ai-client:~# oc rsh -n openshift-etcd etcd-ocp1
Defaulted container "etcdctl" out of: etcdctl, etcd, etcd-metrics, etcd-health-monitor, setup (init), etcd-ensure-env-vars (init), etcd-resources-copy (init)
-
List the etcd members.
sh-4.4# etcdctl member list -w table
+------------------+---------+------+--------------------------+--------------------------+------------+
|        ID        | STATUS  | NAME |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
+------------------+---------+------+--------------------------+--------------------------+------------+
| 19faf45778a1ddd3 | started | ocp3 | https://172.16.0.13:2380 | https://172.16.0.13:2379 |      false |
| ad4840148f3f241c | started | ocp1 | https://172.16.0.11:2380 | https://172.16.0.11:2379 |      false |
| b1f7a55fb3caa3b6 | started | ocp2 | https://172.16.0.12:2380 | https://172.16.0.12:2379 |      false |
+------------------+---------+------+--------------------------+--------------------------+------------+
-
Remove the etcd member on the unhealthy node.
sh-4.4# etcdctl member remove 19faf45778a1ddd3
Member 19faf45778a1ddd3 removed from cluster 60f3fdb1b921fd7
-
View the member list again.
sh-4.4# etcdctl member list -w table
+------------------+---------+------+--------------------------+--------------------------+------------+
|        ID        | STATUS  | NAME |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
+------------------+---------+------+--------------------------+--------------------------+------------+
| ad4840148f3f241c | started | ocp1 | https://172.16.0.11:2380 | https://172.16.0.11:2379 |      false |
| b1f7a55fb3caa3b6 | started | ocp2 | https://172.16.0.12:2380 | https://172.16.0.12:2379 |      false |
+------------------+---------+------+--------------------------+--------------------------+------------+
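Optionally, before exiting the remote shell, you can run a quick health check against the remaining members. This is a sanity-check sketch, not part of the official procedure; the flags assume the etcdctl v3 binary that ships in the etcdctl container:
sh-4.4# etcdctl endpoint health --cluster -w table
Both remaining members should report as healthy before you continue.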
- Type exit to exit the remote shell.
-
Back on the AI client node, remove the old secrets for the unhealthy etcd member.
-
List the secrets for the unhealthy (removed) member.
user@ai-client:~# oc get secrets -n openshift-etcd | grep ocp3
etcd-peer-ocp3              kubernetes.io/tls   2   18d
etcd-serving-metrics-ocp3   kubernetes.io/tls   2   18d
etcd-serving-ocp3           kubernetes.io/tls   2   18d
-
Delete the peer secrets.
user@ai-client:~# oc delete secret -n openshift-etcd etcd-peer-ocp3
secret "etcd-peer-ocp3" deleted
-
Delete the metrics secrets.
user@ai-client:~# oc delete secret -n openshift-etcd etcd-serving-metrics-ocp3
secret "etcd-serving-metrics-ocp3" deleted
-
Delete the serving secrets.
user@ai-client:~# oc delete secret -n openshift-etcd etcd-serving-ocp3
secret "etcd-serving-ocp3" deleted
-
Finally, delete the unhealthy node.
-
Cordon the unhealthy node.
user@ai-client:~# oc adm cordon ocp3
node/ocp3 cordoned
-
Drain the unhealthy node.
user@ai-client:~# oc adm drain ocp3 --ignore-daemonsets=true --delete-emptydir-data --force
node/ocp3 already cordoned
<trimmed>
-
Delete the unhealthy node.
user@ai-client:~# oc delete node ocp3
node "ocp3" deleted
-
List the nodes.
user@ai-client:~# oc get nodes
NAME   STATUS   ROLES    AGE   VERSION
ocp1   Ready    master   19d   v1.21.6+bb8d50a
ocp2   Ready    master   19d   v1.21.6+bb8d50a
ocp4   Ready    worker   18d   v1.21.6+bb8d50a
ocp5   Ready    worker   18d   v1.21.6+bb8d50a
You have now identified and removed the unhealthy node.
Add a Replacement Control Plane Node
Use this procedure to add a replacement control plane node to an existing OpenShift cluster. An OpenShift cluster has exactly 3 control plane nodes. You cannot use this procedure to add a node to a cluster that already has 3 control plane nodes.
This procedure shows an example of late binding. In late binding, you generate an ISO and boot the node with that ISO. After the node boots, you bind the node to the existing cluster.
This causes one or more CertificateSigningRequests (CSRs) to be sent from the new node to the existing cluster. A CSR is simply a request to obtain the client certificates for the (existing) cluster. You'll need to explicitly approve these requests. Once approved, the existing cluster provides the client certificates to the new node, and the new node is allowed to join the existing cluster.
- Log in to the machine (VM or BMS) that you're using as the Assisted Installer client. The Assisted Installer client machine is where you issue Assisted Installer API calls to the Assisted Installer server hosted by Red Hat.
-
Prepare the deployment by setting the environment variables that you'll use in later
steps.
-
Set up the same SSH key that you use for the existing cluster.
In this example, we retrieve that SSH key from its default location ~/.ssh/id_rsa.pub and store it in a variable.
export CLUSTER_SSHKEY=$(cat ~/.ssh/id_rsa.pub)
-
If you no longer have the image pull secret, then download the image pull secret
from your Red Hat account onto your local computer. The pull secret allows your
installation to access services and registries that serve container images for
OpenShift components.
If you're using the Red Hat hosted Assisted Installer, you can download the pull secret file (pull-secret) from the https://console.redhat.com/openshift/downloads page. Copy the pull-secret file to the Assisted Installer client machine. In this example, we store the pull-secret in a file called pull-secret.txt.
Strip out any blank lines, convert the contents to JSON string format, and store the result in an environment variable, as follows:
export PULL_SECRET=$(sed '/^[[:space:]]*$/d' pull-secret.txt | jq -R .)
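As a quick, optional check (assuming you have jq installed and the pull secret uses the standard top-level auths object), you can decode the variable and list the registries it authenticates to:
# Decode the JSON-string-encoded secret and list the registries it covers
echo "$PULL_SECRET" | jq -r '.' | jq '.auths | keys'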
-
If you no longer have your offline access token, then copy the offline access token
from your Red Hat account. The OpenShift Cluster Manager API Token allows you (on the
Assisted Installer client machine) to interact with the Assisted Installer API service
hosted by Red Hat.
The token is a string that you can copy and paste to a local environment variable. If you're using the Red Hat hosted Assisted Installer, you can copy the API token from https://console.redhat.com/openshift/downloads.
export OFFLINE_ACCESS_TOKEN='<paste offline access token here>'
-
Generate (refresh) the token from the OFFLINE_ACCESS_TOKEN. You will use this
generated token whenever you issue API commands.
export TOKEN=$(curl --silent --data-urlencode "grant_type=refresh_token" --data-urlencode "client_id=cloud-services" --data-urlencode "refresh_token=${OFFLINE_ACCESS_TOKEN}" https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token | jq -r .access_token)
Note: This token expires regularly. When this token expires, you will get an HTTP 4xx response whenever you issue an API command. Refresh the token when it expires, or alternatively, refresh the token regularly prior to expiry. There is no harm in refreshing the token when it hasn't expired.
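Because the token expires, it can be convenient to wrap the refresh command in a small shell function so you can re-run it quickly. This is just a convenience sketch (the function name is ours) that reuses the command above:
# Re-generate TOKEN from OFFLINE_ACCESS_TOKEN whenever an API call returns HTTP 4xx
refresh_token() {
  export TOKEN=$(curl --silent \
    --data-urlencode "grant_type=refresh_token" \
    --data-urlencode "client_id=cloud-services" \
    --data-urlencode "refresh_token=${OFFLINE_ACCESS_TOKEN}" \
    https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token | jq -r .access_token)
}
# Call it as needed:
refresh_token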
-
Get the OpenShift cluster ID of the existing cluster.
For example:
oc get clusterversion -o jsonpath='{.items[].spec.clusterID}{"\n"}'
1777102a-1fe1-407a-9441-9d0bad4f5968
Save it to a variable:
export OS_CLUSTER_ID="1777102a-1fe1-407a-9441-9d0bad4f5968"
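Alternatively, you can capture the cluster ID directly into the variable instead of pasting it by hand. This is the same command as above, wrapped in a shell substitution:
export OS_CLUSTER_ID=$(oc get clusterversion -o jsonpath='{.items[].spec.clusterID}')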
-
Set up the remaining environment variables.
Table 1 lists all the environment variables that you need to set in this procedure, including the ones described in the previous steps.
Table 1: Environment Variables

Variable | Description | Example
CLUSTER_SSHKEY | The (public) SSH key you use for the existing cluster. You must use this same key for the new node you're adding. | –
PULL_SECRET | The image pull secret that you downloaded, stripped, and converted to JSON string format. | –
OFFLINE_ACCESS_TOKEN | The OpenShift Cluster Manager API Token that you copied. | –
TOKEN | The token that you generated (refreshed) from the OFFLINE_ACCESS_TOKEN. | –
CLUSTER_NAME | The name of the existing cluster. | mycluster
CLUSTER_DOMAIN | The base domain of the existing cluster. | contrail.lan
OS_CLUSTER_ID | The OpenShift cluster ID of the existing cluster. | 1777102a-1fe1-407a-9441-9d0bad4f5968
AI_URL | The URL of the Assisted Installer service. This example uses the Red Hat hosted Assisted Installer. | https://api.openshift.com
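For example, using the example values from Table 1 (substitute your own cluster name, domain, and Assisted Installer URL), the remaining variables are set as follows:
export CLUSTER_NAME="mycluster"
export CLUSTER_DOMAIN="contrail.lan"
export AI_URL="https://api.openshift.com"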
-
Generate the discovery boot ISO. You will use this ISO to boot the node that you're
adding to the cluster.
The ISO is customized to your infrastructure based on the infrastructure environment that you'll set up.
-
Create a file that describes the infrastructure environment. In this example, we
name it infra-envs-addhost.json.
cat << EOF > ./infra-envs-addhost.json
{
  "name": "<InfraEnv Name>",
  "ssh_authorized_key": "$CLUSTER_SSHKEY",
  "pull_secret": $PULL_SECRET,
  "openshift_version": "4.8",
  "user_managed_networking": <same as for existing cluster>,
  "vip_dhcp_allocation": <same as for existing cluster>,
  "base_dns_domain": "$CLUSTER_DOMAIN"
}
EOF
where:
- InfraEnv Name is the name you want to call the InfraEnv.
- user_managed_networking and vip_dhcp_allocation are set to the same values as for the existing cluster.
-
Register the InfraEnv. In response, the Assisted Installer service assigns an
InfraEnv ID and builds the discovery boot ISO based on the specified infrastructure
environment.
curl -X POST "$AI_URL/api/assisted-install/v2/infra-envs" -H "accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" -d @infra-envs-addhosts.json
When you register the InfraEnv, the Assisted Installer service returns an InfraEnv ID. Look carefully for the InfraEnv ID embedded in the response. For example:
"id":"5c858ed9-26cf-446d-817c-4c4261541657"
Store the InfraEnv ID into a variable. For example:
export INFRA_ENV_ID="5c858ed9-26cf-446d-817c-4c4261541657"
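If you're scripting this, you can register the InfraEnv and capture its ID in one step instead of copying it manually. Run this instead of (not in addition to) the separate POST above, so you don't register a duplicate InfraEnv; it assumes the response is valid JSON with a top-level id field, as shown in the example:
# Register the InfraEnv and store the returned ID directly
export INFRA_ENV_ID=$(curl -s -X POST "$AI_URL/api/assisted-install/v2/infra-envs" \
  -H "accept: application/json" -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d @infra-envs-addhost.json | jq -r '.id')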
-
Get the image download URL.
curl -s $AI_URL/api/assisted-install/v2/infra-envs/$INFRA_ENV_ID/downloads/image-url -H "accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" | jq '.url'
The Assisted Installer service returns the image URL.
-
Download the ISO and save it to a file. In this example, we save it to
ai-liveiso-addhosts.iso.
curl -L "<image URL>" -H "Authorization: Bearer $TOKEN" -o ./ai-liveiso-addhosts.iso
-
Boot the new node with the discovery boot ISO. Choose the boot method most convenient
for your infrastructure. Ensure that the new node boots up attached to a network that has
access to the Red Hat hosted Assisted Installer.
Check the status of the host:
curl -s -X GET --header "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" $AI_URL/api/assisted-install/v2/infra-envs/$INFRA_ENV_ID/hosts | jq '.'
Store the host ID into a variable.
export HOST_ID=$(curl -s -X GET --header "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" $AI_URL/api/assisted-install/v2/infra-envs/$INFRA_ENV_ID/hosts | jq -r '.[].id')
-
Configure the new node as a control plane node.
curl -X PATCH --header "Content-Type: application/json" "$AI_URL/api/assisted-install/v2/infra-envs/$INFRA_ENV_ID/hosts/$HOST_ID" -H "Authorization: Bearer $TOKEN" -d "{ \"machine_config_pool_name\": \"master\"}" | jq -r
Check to see that the following is embedded in the response:
"machine_config_pool_name": "master",
-
Import the existing cluster.
curl -X POST "$AI_URL/api/assisted-install/v2/clusters/import" -H "accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" -d "{\"name\":\"$CLUSTER_NAME\",\"openshift_cluster_id\":\"$OS_CLUSTER_ID\",\"api_vip_dnsname\":\"api.$CLUSTER_NAME.$CLUSTER_DOMAIN\"}"
When you import the cluster, the Assisted Installer service returns a cluster ID for the AddHostsCluster. Look carefully for the cluster ID embedded in the response. For example:
"id":"c5bbb159-78bc-41c9-99b7-d8a4727a3890"
-
Bind the new host to the cluster, referencing the cluster ID of the
AddHostsCluster.
curl -X POST "$AI_URL/api/assisted-install/v2/infra-envs/$INFRA_ENV_ID/hosts/$HOST_ID/actions/bind" -H "accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" -d '{"cluster_id":"c5bbb159-78bc-41c9-99b7-d8a4727a3890"}'
Check the status of the host regularly:
curl -s -X GET --header "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" $AI_URL/api/assisted-install/v2/infra-envs/$INFRA_ENV_ID/hosts | jq '.'
Proceed to the next step when you see the following output:
"status": "known", "status_info": "Host is ready to be installed",
-
Install the new node.
curl -X POST -H "accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" "$AI_URL/api/assisted-install/v2/infra-envs/$INFRA_ENV_ID/hosts/$HOST_ID/actions/install"
Check on the status of the node:
curl -s -X GET --header "Content-Type: application/json" -H "Authorization: Bearer $TOKEN" $AI_URL/api/assisted-install/v2/infra-envs/$INFRA_ENV_ID/hosts | jq '.'
Look for the following status, which indicates that the node has rebooted:
"status_info": "Host has rebooted and no further updates will be posted. Please check console for progress and to possibly approve pending CSRs",
-
Once the new node has rebooted, it will try to join the existing cluster. This causes
one or more CertificateSigningRequests (CSRs) to be sent from the new node to the existing
cluster. You will need to approve the CSRs.
-
Check for the CSRs.
For example:
root@ai-client:~/contrail# oc get csr -A
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                     CONDITION
csr-gblnm   20s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper    Pending
You may need to repeat this command periodically until you see pending CSRs.
-
Approve the CSRs.
For example:
root@ai-client:~/contrail# oc adm certificate approve csr-gblnm
certificatesigningrequest.certificates.k8s.io/csr-gblnm approved
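A new node typically generates more than one CSR as it joins, so you may need to approve several. One common shortcut (not part of the official procedure) is to approve all CSRs that have no status yet in one pass:
# Approve every CSR that is still pending (has no status block)
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve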
-
Verify that the new node is up and running in the existing cluster.
oc get nodes
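For example, you can confirm that the control plane is back to three members with the same checks used earlier in this procedure:
# The new node should show Ready with the master role
oc get nodes -l node-role.kubernetes.io/master
# And a third etcd pod should be running again
oc get pods -n openshift-etcd -o wide | grep -v etcd-quorum-guard | grep etcd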