Troubleshoot Paragon Automation Installation

SUMMARY Read the following topics to learn how to troubleshoot typical problems that you might encounter during and after installation.

Resolve Merge Conflicts of the Configuration File

The init script creates the template configuration files. If you update an existing installation using the same config-dir directory that was used for the installation, the template files that the init script creates are merged with the existing configuration files. Sometimes, this merging action creates a merge conflict that you must resolve. The script prompts you about how to resolve the conflict. When prompted, select one of the following options:

  • C—You can retain the existing configuration file and discard the new template file. This is the default option.

  • n—You can discard the existing configuration file and reinitialize the template file.

  • m—You can merge the files manually. Conflicting sections are marked with lines starting with "<<<<<<<<", "||||||||", "========", and ">>>>>>>>". You must edit the file and remove the merge markers before you proceed with the update (see the example after this list).

  • d—You can view the differences between the files before you decide how to resolve the conflict.
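
For example, a conflicted section in a configuration file might look similar to the following. The key and values shown are only illustrative. Edit the file so that only the content you want to keep remains, and delete the lines that contain the merge markers.

<<<<<<<< existing configuration
ntp_servers:
  - 172.16.0.1
||||||||
ntp_servers:
  - 10.0.0.1
========
ntp_servers:
  - 192.0.2.1
>>>>>>>> new template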

Resolve Common Backup and Restore Issues

Suppose you destroy an existing cluster and redeploy a software image on the same cluster nodes. If you then try to restore a configuration from a previously backed-up configuration folder, the restore operation might fail because the mount path for the backed-up configuration has changed. When you destroy an existing cluster, the persistent volume is deleted. When you redeploy the new image, the persistent volume is re-created on whichever cluster node has space available, which is not necessarily the node on which it was previously present. As a result, the restore operation fails.

To work around these backup and restore issues (example commands follow these steps):

  1. Determine the mount path of the new persistent volume.

  2. Copy the contents of the previous persistent volume's mount path to the new path.

  3. Retry the restore operation.
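
For example, you might identify the new mount path and copy the data with commands similar to the following. The persistent volume name and paths are illustrative, and you must run the copy on the node that hosts the new persistent volume:

# kubectl get pv
# kubectl describe pv pv-name | grep -i path
# rsync -a /old/mount/path/ /new/mount/path/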

View Installation Log Files

If the deploy script fails, you must check the installation log files in the config-dir directory. By default, the config-dir directory stores six zipped log files. The current log file is saved as log, and the previous log files are saved as log.1 through log.5 files. Every time you run the deploy script, the current log is saved, and the oldest one is discarded.

You typically find error messages at the end of a log file. View the error message, and fix the configuration.

View Log Files in Kibana

You use Open Distro to consolidate and index application logs. The Kibana application is the visualization tool that you can use to search logs using keywords and filters.

To view logs in the Kibana application:

  1. Use one of the following methods to access Kibana:
    • Use the virtual IP (VIP) address of the ingress controller: Open a browser and enter https://vip-of-ingress-controller-or-hostname-of-main-web-application/kibana in the URL field.
    • Use the Logs page: In the Paragon Automation UI, click Monitoring > Logs in the left-nav bar.
  2. Enter the opendistro_es_admin_user username and the opendistro_es_admin_password password that you configured in the config.yml file during installation. The default username is admin.

    If you do not configure the opendistro_es_admin_password password, the installer generates a random password. You can retrieve the password using the following command:

    # kubectl -n kube-system get secret opendistro-es-account -o jsonpath={..password} | base64 -d

  3. If you are logging in for the first time, create an index pattern by clicking the Create index pattern option.
  4. Enter logstash-* in the Index pattern name field, and then click Next Step >.
  5. Select @timestamp from the Time field list, and then click Create index pattern to create an index pattern.
  6. Click the hamburger icon and select Discover from the left-nav bar to browse the log files, and to add or remove filters as required.

Troubleshooting Using the kubectl Interface

kubectl (Kube Control) is a command-line utility that interacts with the Kubernetes API and is the most common command-line tool for controlling Kubernetes clusters.

You can issue kubectl commands on the primary node right after installation. To issue kubectl commands on the worker nodes, you need to copy the admin.conf file and set the KUBECONFIG environment variable, or use the export KUBECONFIG=config-dir/admin.conf command. The admin.conf file is copied to the config-dir directory on the control host as part of the installation process.

You use the kubectl command-line tool to communicate with the Kubernetes API to obtain information about API resources such as nodes, pods, and services, to view log files, and to create, delete, or modify those resources.

The syntax of kubectl commands is as follows:

kubectl [command] [TYPE] [NAME] [flags]

[command] is simply the action that you want to execute.

You can use the following command to view a list of kubectl commands:

root@primary-node:/# kubectl [enter]

You can ask for help to get details about a particular command and to list all of its associated flags and options. For example:

root@primary-node:/# kubectl get -h

To verify and troubleshoot the operations in Paragon Automation, you'll use the following commands:

[command] Description
get Display one or many resources. The output shows a table of the most important information about the specified resources.
describe Show details of a specific resource or a group of resources.
explain Documentation of resources.
logs Print the logs for a container in a pod.
rollout restart Manage the rollout of a resource.
edit Edit a resource.
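
For example, you might use these commands as follows; the namespace, pod, deployment, and service names are illustrative:

root@primary-node:/# kubectl get po -n northstar
root@primary-node:/# kubectl describe po -n northstar pod-name
root@primary-node:/# kubectl explain deployment
root@primary-node:/# kubectl logs -n northstar pod-name
root@primary-node:/# kubectl rollout restart deploy -n northstar deployment-name
root@primary-node:/# kubectl edit svc -n northstar service-name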

[TYPE] represents the type of resource that you want to view. Resource types are case-insensitive, and you can use singular, plural, or abbreviated forms.

For example, pod, node, service, or deployment. For a complete list of resources and allowed abbreviations (for example, pod = po), issue this command:

kubectl api-resources

To learn more about a resource, issue this command:

kubectl explain [TYPE]

[NAME] is the name of a specific resource—for example, the name of a service or pod. Names are case-sensitive. For example:

root@primary-node:/# kubectl get pod pod_name

[flags] provide additional options for a command. For example, the -o wide flag lists more attributes for a resource. Use the help flag (-h) to get information about the available flags.

Note that most Kubernetes resources (such as pods and services) belong to a namespace, while others (such as nodes) do not.

Namespaces provide a mechanism for isolating groups of resources within a single cluster. Names of resources need to be unique within a namespace, but not across namespaces.

When you use a command on a resource that is in a namespace, you must include the namespace as part of the command. Namespaces are case-sensitive. Without the proper namespace, the specific resource you are interested in might not be displayed.

You can get a list of all namespaces by issuing the kubectl get namespace command.

If you want to display resources for all namespaces, or you are not sure which namespace the specific resource you are interested in belongs to, you can enter --all-namespaces or -A.
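
For example, the following commands list the namespaces, the services in a single namespace, and the services in all namespaces; northstar is one of the namespaces that Paragon Automation uses:

root@primary-node:/# kubectl get namespace
root@primary-node:/# kubectl get svc -n northstar
root@primary-node:/# kubectl get svc -A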

For more information about Kubernetes, see the Kubernetes documentation at https://kubernetes.io/docs/.

Use the following topics to troubleshoot and view installation details using the kubectl interface.

View Node Status

Use the kubectl get nodes command, abbreviated as the kubectl get no command, to view the status of the cluster nodes. The status of the nodes must be Ready, and the roles must be either control-plane or none. For example:

If a node is not Ready, verify whether the kubelet process is running. You can also use the system log of the node to investigate the issue.

To verify that the kubelet process is running: root@primary-node:/# systemctl status kubelet
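
You can also check the system log of the node for kubelet messages. On a systemd-based node, a generic command similar to the following might help; this command is not specific to Paragon Automation:

root@primary-node:/# journalctl -u kubelet --since "1 hour ago"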

View Pod Status

Use the kubectl get po –n namespace or kubectl get po -A command to view the status of a pod. You can specify an individual namespace (such as healthbot, northstar, and common) or you can use the -A parameter to view the status of all namespaces. For example:

The status of healthy pods must be Running or Completed, and the number of ready containers must match the total number of containers. If the status of a pod is not Running or if the number of ready containers does not match the total, use the kubectl describe po or kubectl logs (POD | TYPE/NAME) [-c CONTAINER] command to troubleshoot the issue further.
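
For example, one quick way to list only the pods that are not healthy is to filter out the Running and Completed pods; adjust the filter as needed:

root@primary-node:/# kubectl get po -A | grep -vE 'Running|Completed'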

View Detailed Information About a Pod

Use the kubectl describe po -n namespace pod-name command to view detailed information about a specific pod. For example:

View the Logs for a Container in a Pod

Use the kubectl logs -n namespace pod-name [-c container-name] command to view the logs for a particular pod. If a pod has multiple containers, you must specify the container for which you want to view the logs. For example:

Run a Command on a Container in a Pod

Use the kubectl exec -ti -n namespace pod-name [-c container-name] -- command-line command to run commands on a container inside a pod. For example:

After you run the exec command, you get a bash shell into the Postgres database server. You can access the bash shell inside the container and run commands to connect to the database. Not all containers provide a bash shell. Some containers provide only SSH, and some containers do not have any shell.
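
For example, a minimal sketch of opening a shell in the Postgres database pod might look like the following; the namespace, pod name, and database user are illustrative, and you can find the actual pod name with the kubectl get po -A command:

root@primary-node:/# kubectl exec -ti -n common postgres-pod-name -- bash
# psql -U postgres

The paragon-db wrapper script described later in this topic provides a shortcut for connecting to the database.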

View Services

Use the kubectl get svc -n namespace or kubectl get svc -A command to view the cluster services. You can specify an individual namespace (such as healthbot, northstar, and common), or you can use the -A parameter to view the services for all namespaces. For example:

In this example, the services are sorted by type, and only services of type LoadBalancer are displayed. You can view the services that are provided by the cluster and the external IP addresses that are selected by the load balancer to access those services.

You can access these services from outside the cluster. The external IP address is exposed and accessible from devices outside the cluster.
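
For example, to list only the services of type LoadBalancer together with their external IP addresses, you can filter the output; the filter is illustrative:

root@primary-node:/# kubectl get svc -A | grep LoadBalancer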

Frequently Used kubectl Commands

  • List the replication controllers (example commands for the tasks in this list follow the list):

  • Restart a component:

  • Edit a Kubernetes resource: You can edit a deployment or any Kubernetes API object, and these changes are saved to the cluster. However, if you reinstall the cluster, these changes are not preserved.
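
The following commands are one possible way to perform the tasks in the preceding list, in the same order; the namespace and deployment name are illustrative:

root@primary-node:/# kubectl get rc -A
root@primary-node:/# kubectl rollout restart deploy -n northstar deployment-name
root@primary-node:/# kubectl edit deploy -n northstar deployment-name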

Troubleshoot Ceph and Rook

Ceph requires a relatively new kernel version. If your Linux kernel is very old, consider upgrading it or reinstalling a newer one.

Use this section to troubleshoot issues with Ceph and Rook.

Insufficient Disk Space

A common reason for installation failure is that the object storage daemons (OSDs) are not created. An OSD configures the storage on a cluster node. OSDs might not be created because disk resources are unavailable, either because there is insufficient space or because the disk space is incorrectly partitioned. Ensure that the nodes have sufficient unpartitioned disk space available.
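
For example, you can check the disks and partitions on each node with the generic lsblk Linux command; a disk or partition intended for Ceph must not contain a filesystem or existing data:

# lsblk -f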

Reformat a Disk

Examine the logs of the "rook-ceph-osd-prepare-hostname-*" jobs. The logs are descriptive. If you need to reformat the disk or partition and restart Rook, perform the following steps:

  1. Use one of the following methods to reformat an existing disk or partition (example commands are shown after these steps).
    • If you have a block storage device that should have been used for Ceph, but wasn't used because it was in an unusable state, you can reformat the disk completely.
    • If you have a disk partition that should have been used for Ceph, you can clear the data on the partition completely.
    Note:

    These commands completely reformat the disk or partitions that you are using and you will lose all data on them.

  2. Restart Rook to save the changes and reattempt the OSD creation process.
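
The exact commands depend on your disk layout. The following sketch shows generic Linux commands for wiping a disk or a partition, and one typical way to restart the Rook operator; the device names are placeholders, and the wipe commands destroy all data on the devices:

# sgdisk --zap-all /dev/sdX
# wipefs --all /dev/sdX1
# kubectl rollout restart deploy -n rook-ceph rook-ceph-operator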

View Pod Status

To check the status of Rook and Ceph pods installed in the rook-ceph namespace, use the # kubectl get po -n rook-ceph command. The following pods must be in the running state.

  • rook-ceph-mon-*—Typically, three monitor pods are created.
  • rook-ceph-mgr-*—One manager pod
  • rook-ceph-osd-*—Three or more OSD pods
  • rook-ceph-mds-cephfs-*—Metadata servers
  • rook-ceph-rgw-object-store-*—ObjectStore gateway
  • rook-ceph-tools*—For additional debugging options.

    To connect to the toolbox, use the command:

    $ kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash

    Some of the common commands you can use in the toolbox are:

    # ceph status
    # ceph osd status
    # ceph osd df
    # ceph osd utilization
    # ceph osd pool stats
    # ceph osd tree
    # ceph pg stat

Troubleshoot Ceph OSD Failure

Check the status of pods installed in the rook-ceph namespace.

# kubectl get po -n rook-ceph

If a rook-ceph-osd-* pod is in the Error or CrashLoopBackOff state, you must repair the disk.

  1. Stop the rook-ceph-operator.

    # kubectl scale deploy -n rook-ceph rook-ceph-operator --replicas=0

  2. Remove the failing OSD processes.

    # kubectl delete deploy -n rook-ceph rook-ceph-osd-number

  3. Connect to the toolbox.

    $ kubectl exec -ti -n rook-ceph $(kubectl get po -n rook-ceph -l app=rook-ceph-tools -o jsonpath={..metadata.name}) -- bash

  4. Identify the failing OSD.

    # ceph osd status

  5. Mark out the failed OSD (see the example command after these steps).

  6. Remove the failed OSD.

    # ceph osd purge number --yes-i-really-mean-it

  7. Connect to the node that hosted the failed OSD and do one of the following:
    • Replace the hard disk in case of a hardware failure.
    • Reformat the disk completely.
    • Reformat the partition completely.
  8. Restart rook-ceph-operator.

    # kubectl scale deploy -n rook-ceph rook-ceph-operator --replicas=1

  9. Monitor the OSD pods.

    # kubectl get po -n rook-ceph

    If the OSD does not recover, use the same procedure to remove the OSD, and then remove the disk or delete the partition before restarting rook-ceph-operator.
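
Step 5 does not show a command. In the Ceph toolbox, marking out an OSD typically looks like the following, where number is the OSD ID that you identified in step 4:

# ceph osd out number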

Troubleshoot Air-Gap Installation Failure

The air-gap installation, as well as the kube-apiserver, fails with an error when the /etc/resolv.conf file does not exist.

To create the file, run the touch /etc/resolv.conf command as the root user, and then redeploy the Paragon Automation cluster.

Recover from a RabbitMQ Cluster Failure

If your Paragon Automation cluster fails (for example, from a power outage), the RabbitMQ message bus may not restart properly.

To check for this condition, run the kubectl get po -n northstar -l app=rabbitmq command. This command should show three pods with their status as Running. For example:

However, if the status of one or more pods is Error, use the following recovery procedure:

  1. Delete RabbitMQ.

    kubectl delete po -n northstar -l app=rabbitmq

  2. Check the status of the pods.

    kubectl get po -n northstar -l app=rabbitmq

    Repeat kubectl delete po -n northstar -l app=rabbitmq until the status of all pods is Running.

  3. Restart the Paragon Pathfinder applications (see the example command below).
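
Step 3 does not show a command. One possible way to restart the Paragon Pathfinder applications is to restart all deployments in the northstar namespace, as shown in the following sketch; confirm that this is appropriate for your deployment before you run it:

kubectl rollout restart deploy -n northstar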

Disable udevd Daemon During OSD Creation

The udevd daemon manages newly added hardware such as disks, network cards, and CDs. During the creation of OSDs, the udevd daemon detects the OSDs and can lock them before they are fully initialized. The Paragon Automation installer disables systemd-udevd during installation and reenables it after Rook has initialized the OSDs.

When adding or replacing nodes and repairing failed nodes, you must manually disable the udevd daemon so that OSD creation does not fail. You can reenable the daemon after the OSDs are created.

Use these commands to manually disable and enable udevd.

  1. Log in to the node that you want to add or repair.
  2. Disable the udevd daemon.
    1. Check whether udevd is running.

      # systemctl is-active systemd-udevd

    2. If udevd is active, disable it.

      # systemctl mask systemd-udevd --now
  3. When you repair or replace a node, the Ceph distributed filesystems are not automatically updated. If the data disks are destroyed as part of the repair process, then you must recover the object storage daemons (OSDs) hosted on those data disks.

    1. Connect to the Ceph toolbox and view the status of OSDs. The ceph-tools script is installed on a primary node. You can log in to the primary node and use the kubectl interface to access ceph-tools. To use a node other than the primary node, you must copy the admin.conf file (in the config-dir directory on the control host) and set the kubeconfig environment variable or use the export KUBECONFIG=config-dir/admin.conf command.

      $ ceph-tools
      # ceph osd status

    2. Verify that all OSDs are listed as exists,up. If OSDs are damaged, follow the troubleshooting instructions explained in Troubleshoot Ceph and Rook.

  4. Log in to the node that you added or repaired after verifying that all OSDs are created.
  5. Reenable udevd on the node.

    # systemctl unmask systemd-udevd

Alternatively, you can set disable_udevd: true in the config.yml file and run the ./run -c config-dir deploy command. However, we do not recommend that you redeploy the cluster only to disable the udevd daemon.

Wrapper Scripts for Common Utility Commands

You can use the following wrapper scripts installed in /usr/local/bin to connect to and run commands on pods running in the system.
Command Description
paragon-db [arguments] Connect to the database server and start the Postgres SQL shell using the superuser account. Optional arguments are passed to the Postgres SQL command.
pf-cmgd [arguments] Start the CLI in the Paragon Pathfinder CMGD pod. Optional arguments are executed by the CLI.
pf-crpd [arguments] Start the CLI in the Paragon Pathfinder CRPD pod. Optional arguments are executed by the CLI.
pf-redis [arguments] Start the (authenticated) redis-cli in the Paragon Pathfinder Redis pod. Optional arguments are executed by the Redis pod.
pf-debugutils [arguments] Start the shell in the Paragon Pathfinder debugutils pod. Optional arguments are executed by the shell. Pathfinder debugutils utilities are installed if install_northstar_debugutils: true is configured in the config.yml file.
ceph-tools [arguments] Start the shell to the Ceph toolbox. Optional arguments are executed by the shell.
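
For example, the following invocations show how you might use the wrapper scripts; the arguments are illustrative:

# paragon-db
# pf-crpd show route summary
# ceph-tools ceph status

The first command opens the Postgres SQL shell using the superuser account, the second runs the show route summary CLI command in the Paragon Pathfinder CRPD pod, and the third runs the ceph status command in the Ceph toolbox.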

Back Up the Control Host

To be able to rebuild your cluster if the control host fails, you must back up the config-dir directory to a remote location. The config-dir directory contains the inventory, config.yml, and id_rsa files.

Alternatively, you can also rebuild the inventory and config.yml files by downloading information from the cluster using the following commands:

# kubectl get cm -n common metadata -o jsonpath={..inventory} > inventory

# kubectl get cm -n common metadata -o jsonpath={..config_yml} > config.yml

You cannot recover SSH keys; you must replace failed keys with new keys.

User Service Accounts for Debugging

Paragon Pathfinder, telemetry manager, and base platform applications internally use Paragon Insights for telemetry collection. To debug configuration issues associated with these applications, three user service accounts are created by default during Paragon Automation installation. The scope of these service accounts is limited to debugging the corresponding application only. The service account details are listed in the following table.

Table 1: Service Account Details
Application Name and Scope Account Username Account Default Password
Paragon Pathfinder (northstar) hb-northstar-admin Admin123!
Telemetry manager (tm) hb-tm-admin
Base platform (ems-dmon) hb-ems-dmon

You must use these accounts solely for debugging purposes. Do not use these accounts for day-to-day operations or for modifying any configuration. We recommend that you change the login credentials for security reasons.