Understanding Log Error Messages for Troubleshooting ISSU-Related Problems

 

The following problems might occur during an ISSU upgrade. You can identify the errors by using the details in the logs. For detailed information about specific system log messages, see System Log Explorer.

Chassisd Process Errors

Problem

Description: Errors related to chassisd.

Solution

Use the error messages to understand the issues related to chassisd.

When ISSU starts, a request is sent to chassisd to check whether there are any problems related to the ISSU from a chassis perspective. If there is a problem, a log message is created.

Understanding Common Error Handling for ISSU

Problem

Description: You might encounter some problems in the course of an ISSU. This section provides details on how to handle them.

Solution

Any errors encountered during an ISSU result in the creation of log messages, and ISSU continues to function without impact to traffic. If reverting to previous versions is required, the event is either logged or the ISSU is halted, so as not to create any mismatched versions on both nodes of the chassis cluster. Table 1 provides some of the common error conditions and the workarounds for them. The sample messages used in the Table 1 are from the SRX1500 device and are also applicable to all supported SRX Series devices.

Table 1: ISSU-Related Errors and Solutions

Error Conditions

Solutions

Attempt to initiate an ISSU when previous instance of an ISSU is already in progress

The following message is displayed:

warning: ISSU in progress

You can abort the current ISSU process, and initiate the ISSU again using the request chassis cluster in-service-upgrade abort command.

Reboot failure on the secondary node

No service downtime occurs, because the primary node continues to provide required services. Detailed console messages are displayed requesting that you manually clear existing ISSU states and restore the chassis cluster.

error: [Oct  6 12:30:16]: Reboot secondary node failed (error-code: 4.1)

       error: [Oct  6 12:30:16]: ISSU Aborted! Backup node maybe in inconsistent state, Please restore backup node
       [Oct  6 12:30:16]: ISSU aborted. But, both nodes are in ISSU window.
       Please do the following:
       1. Rollback the node with the newer image using rollback command
          Note: use the 'node' option in the rollback command
          otherwise, images on both nodes will be rolled back
       2. Make sure that both nodes (will) have the same image
       3. Ensure the node with older image is primary for all RGs
       4. Abort ISSU on both nodes
       5. Reboot the rolled back node

Starting with Junos OS Release 17.4R1, the hold timer for the initial reboot of the secondary node during the ISSU process is extended from 15 minutes (900 seconds) to 45 minutes (2700 seconds) in chassis clusters on SRX1500, SRX4100, SRX4200, and SRX4600 devices.

Secondary node failed to complete the cold synchronization

The primary node times out if the secondary node fails to complete the cold synchronization. Detailed console messages are displayed that you manually clear existing ISSU states and restore the chassis cluster. No service downtime occurs in this scenario.

[Oct  3 14:00:46]: timeout waiting for secondary node node1 to sync(error-code: 6.1)
        Chassis control process started, pid 36707 

       error: [Oct  3 14:00:46]: ISSU Aborted! Backup node has been upgraded, Please restore backup node 
       [Oct  3 14:00:46]: ISSU aborted. But, both nodes are in ISSU window. 
       Please do the following: 
      1. Rollback the node with the newer image using rollback command 
          Note: use the 'node' option in the rollback command 
          otherwise, images on both nodes will be rolled back 
      2. Make sure that both nodes (will) have the same image 
      3. Ensure the node with older image is primary for all RGs 
      4. Abort ISSU on both nodes 
      5. Reboot the rolled back node  

Failover of newly upgraded secondary failed

No service downtime occurs, because the primary node continues to provide required services. Detailed console messages are displayed requesting that you manually clear existing ISSU states and restore the chassis cluster.

[Aug 27 15:28:17]: Secondary node0 ready for failover.
[Aug 27 15:28:17]: Failing over all redundancy-groups to node0
ISSU: Preparing for Switchover
error: remote rg1 priority zero, abort failover.
[Aug 27 15:28:17]: failover all RGs to node node0 failed (error-code: 7.1)
error: [Aug 27 15:28:17]: ISSU Aborted!
[Aug 27 15:28:17]: ISSU aborted. But, both nodes are in ISSU window.
Please do the following:
1. Rollback the node with the newer image using rollback command
    Note: use the 'node' option in the rollback command
           otherwise, images on both nodes will be rolled back
2. Make sure that both nodes (will) have the same image
3. Ensure the node with older image is primary for all RGs
4. Abort ISSU on both nodes
5. Reboot the rolled back node
{primary:node1}

Upgrade failure on primary

No service downtime occurs, because the secondary node fails over as primary and continues to provide required services.

Reboot failure on primary node

Before the reboot of the primary node, devices being out of the ISSU setup, no ISSU-related error messages are displayed. The following reboot error message is displayed if any other failure is detected:

Reboot failure on     Before the reboot of primary node, devices will be out of ISSU setup and no primary node error messages will be displayed.
Primary node

ISSU Support-Related Errors

Problem

Description: Installation failure occurs because of unsupported software and unsupported feature configuration.

Solution

Use the following error messages to understand the compatibility-related problems:

Initial Validation Checks Failure

Problem

Description: The initial validation checks fail.

Solution

The validation checks fail if the image is not present or if the image file is corrupt. The following error messages are displayed when initial validation checks fail when the image is not present and the ISSU is aborted:

When Image Is Not Present

When Image File Is Corrupted

If the image file is corrupted, the following output displays:

The primary node validates the device configuration to ensure that it can be committed using the new software version. If anything goes wrong, the ISSU aborts and error messages are displayed.

Installation-Related Errors

Problem

Description: The install image file does not exist or the remote site is inaccessible.

Solution

Use the following error messages to understand the installation-related problems:

ISSU downloads the install image as specified in the ISSU command as an argument. The image file can be a local file or located at a remote site. If the file does not exist or the remote site is inaccessible, an error is reported.

Redundancy Group Failover Errors

Problem

Description: Problem with automatic redundancy group (RG) failure.

Solution

Use the following error messages to understand the problem:

Kernel State Synchronization Errors

Problem

Description: Errors related to ksyncd.

Solution

Use the following error messages to understand the issues related to ksyncd:

ISSU checks whether there are any ksyncd errors on the secondary node (node 1) and displays the error message if there are any problems and aborts the upgrade.

Release History Table
Release
Description
Starting with Junos OS Release 17.4R1, the hold timer for the initial reboot of the secondary node during the ISSU process is extended from 15 minutes (900 seconds) to 45 minutes (2700 seconds) in chassis clusters on SRX1500, SRX4100, SRX4200, and SRX4600 devices.