Rebuilding/Adding Worker Nodes

This article shows how to recover from a lost worker node; the same process can also be used to add a new worker node. It assumes a UPI-based install, but the process should work the same way with IPI-based installs. The advantage of an IPI-based install is the use of MachineSets, but that won’t be covered here (yet).

For demonstration purposes, let’s assume that worker4 needs to be replaced.

Replacing/Adding Worker4

To replace (or add back) worker4, we need an ignition file that includes the CA certificate of the OCP cluster.

1. To get the contents of the CA certificate, run the following oc command:
oc describe cm root-ca -n kube-system

Copy the certificate (starting at the BEGIN CERTIFICATE line and ending at the END CERTIFICATE line) and save it to a file (such as ca.crt).

2. Encode the contents of the ca.crt file to base64 on a single line (-w0 disables line wrapping):

cat ca.crt | base64 -w0 > ca.crt.base64
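Steps 1 and 2 can also be scripted. The sketch below assumes GNU sed and GNU coreutils base64; the function names extract_pem and encode_ca are hypothetical. extract_pem keeps every PEM block it finds in the oc describe output, so a bundle of several certificates is kept whole. As an alternative, oc get cm root-ca -n kube-system -o jsonpath='{.data.ca\.crt}' should print the certificate directly, assuming it is stored under the ca.crt key.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the PEM certificate block(s) from text on stdin,
# e.g. the output of `oc describe cm root-ca -n kube-system`.
extract_pem() {
  sed -n '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/p'
}

# Hypothetical helper: base64-encode a file on a single line (GNU base64 -w0)
# and confirm that the encoding decodes back to the original bytes.
encode_ca() {
  local src="$1" dst="$2"
  base64 -w0 "$src" > "$dst" || return 1
  base64 -d "$dst" | cmp -s - "$src"
}

# Usage against a live cluster (assumes you are logged in with oc):
#   oc describe cm root-ca -n kube-system | extract_pem > ca.crt
#   encode_ca ca.crt ca.crt.base64 && echo "ca.crt.base64 OK"
```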

To build the ignition file needed to restore/replace this worker, we will use the following template:

{"ignition":{"config":{"merge":[{"source":"https://<apifqdn>:22623/config/worker"}]},"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,<base64encodedcacrt>"}]}},"version":"3.1.0"}}

Replace <apifqdn> with the FQDN of your API server.
Replace <base64encodedcacrt> with the contents of ca.crt.base64.
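Filling in the template by hand is error-prone (a stray space or newline inside the data URL is enough to break it), so here is a minimal bash sketch that writes the file with printf. The function name build_worker_ign, the placeholder FQDN api.cluster.example.com, and the output name worker.ign are assumptions for illustration; substitute your own values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a worker ignition pointer config that merges the
# machine config server's worker config and trusts the cluster CA.
#   $1 = API server FQDN
#   $2 = file holding the single-line base64-encoded CA (e.g. ca.crt.base64)
#   $3 = output file (defaults to worker.ign)
build_worker_ign() {
  local apifqdn="$1" ca_b64_file="$2" out="${3:-worker.ign}"
  local ca_b64
  ca_b64="$(tr -d '\n' < "$ca_b64_file")"   # guard against a stray trailing newline
  printf '{"ignition":{"config":{"merge":[{"source":"https://%s:22623/config/worker"}]},"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,%s"}]}},"version":"3.1.0"}}\n' \
    "$apifqdn" "$ca_b64" > "$out"
}

# Example (api.cluster.example.com is a placeholder, not a real endpoint):
#   build_worker_ign api.cluster.example.com ca.crt.base64 worker.ign
#   python3 -m json.tool worker.ign > /dev/null && echo "worker.ign is valid JSON"
```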

A sample of this template is located at:

Save the result to a file (such as worker.ign) and place it on a web server that is reachable from the control-plane network.

1. Download the ISO that matches the version of OCP you are running (this cluster is 4.7.13). The ISO images are located at:

My specific ISO is:

2. Boot the machine from the ISO. In my case, I am using a virtual machine, but you can also boot a physical machine from this ISO.

3. When the ISO is booted, run the following command:

sudo coreos-installer install \
  --ignition-url=https://host/worker.ign /dev/sda

If the ignition file is served over plain HTTP, add the --insecure-ignition switch to the command above.

4. Once the install is complete, reboot the node (for example, with sudo reboot); it does not always reboot automatically.

5. The node will reboot a few times to apply the various machine configs.

6. Next, watch for new CSRs (certificate signing requests) from the new worker node by running:

oc get csr

7. To approve the requests, loop through each of them with the following command:

for i in $(oc get csr --no-headers | awk '{print $1}'); do oc adm certificate approve "$i"; done

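You may need to run this command a few times, because new pending requests keep arriving: typically the node's client CSR comes first, followed by its kubelet serving CSR once the first one is approved.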
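Rather than re-approving every CSR on each pass, you can approve only the ones still pending. The sketch below assumes the default oc get csr table layout, where the CONDITION column is last and reads Pending for unapproved requests; the function name pending_csrs is hypothetical. The go-template variant in the comments avoids depending on the column layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `oc get csr` table output on stdin (header included)
# and print the names of requests whose CONDITION (last column) is Pending.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage (assumes a logged-in oc session; GNU xargs -r skips the run when empty):
#   oc get csr | pending_csrs | xargs -r oc adm certificate approve
# Column-independent alternative using a go-template (unapproved CSRs have no status):
#   oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
#     | xargs -r oc adm certificate approve
```

Rerunning the pipeline is harmless: once everything is approved, pending_csrs prints nothing and xargs -r does not invoke oc at all.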

After approving the CSRs, run the “oc get node” command. The new worker should now appear as registered to the cluster. It may briefly show a “NotReady” status but will eventually change to “Ready”.
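If you would rather script the wait for the Ready status, oc wait --for=condition=Ready node/<name> is one option. The sketch below instead polls oc get node output; it assumes the node is listed as worker4 (yours may be listed under its FQDN) and that STATUS is the second column. The function name node_status is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `oc get node` output on stdin (header included) and
# print the STATUS column (second field) for the node named in $1.
node_status() {
  awk -v node="$1" 'NR > 1 && $1 == node { print $2 }'
}

# Usage (assumes a logged-in oc session and a node named worker4):
#   until [ "$(oc get node | node_status worker4)" = "Ready" ]; do
#     sleep 15   # the node may report NotReady while it finishes configuring
#   done
#   echo "worker4 is Ready"
```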