Upgrading a Chassis Cluster Using In-Service Software Upgrade

 

In-service software upgrade (ISSU) enables a software upgrade from one Junos OS version to a later Junos OS version with minimal downtime. For more information, see the following topics:

Understanding ISSU for a Chassis Cluster

In-service software upgrade (ISSU) enables a software upgrade from one Junos OS version to a later Junos OS version with little or no downtime. ISSU is performed when the devices are operating in chassis cluster mode only.

The chassis cluster ISSU feature enables both devices in a cluster to be upgraded from supported Junos OS versions with a minimal disruption in traffic and without a disruption in service.

Starting with Junos OS Release 15.1X49-D80, SRX4100 and SRX4200 devices support ISSU.

Starting with Junos OS Release 15.1X49-D70, SRX1500 devices support ISSU.

  • On SRX1500, SRX4100, and SRX4200 devices, ISSU is not supported for upgrading to 17.4 releases from previous Junos OS releases. ISSU is supported for upgrading from Junos OS 17.4 to successive 17.4 releases.

  • On SRX5400, SRX5600 and SRX5800 devices, ISSU is not supported for upgrading to 17.3 and higher releases from earlier Junos OS releases. ISSU is supported for upgrading from Junos OS 17.3 to Junos 17.4 releases.

  • SRX300 Series devices, SRX550M devices and vSRX do not support ISSU.

Note

You can use the in-band cluster upgrade (ICU) commands on SRX4100 and SRX4200 devices to upgrade between the following Junos OS releases:

  • Junos OS Release 15.1X49-D65 to Junos OS Release 15.1X49-D70

  • Junos OS Release 15.1X49-D70 to Junos OS Release 15.1X49-D80

You must use the in-band cluster upgrade (ICU) commands on SRX1500 devices to upgrade between the following Junos OS releases:

  • Junos OS Release 15.1X49-D50 to Junos OS Release 15.1X49-D60

  • Junos OS Release 15.1X49-D60 to Junos OS Release 15.1X49-D70

  • Junos OS Release 15.1X49-D50 to Junos OS Release 15.1X49-D70

ISSU provides the following benefits:

  • Eliminates network downtime during software image upgrades

  • Reduces operating costs, while delivering higher service levels

  • Allows fast implementation of new features

Note

ISSU has the following limitations:

  • ISSU is available only for Junos OS Release 10.4R4 or later.

  • ISSU does not support software downgrades.

  • If you upgrade from a Junos OS version that supports only IPv4 to a version that supports both IPv4 and IPv6, the IPv4 traffic continues to work during the upgrade process. If you upgrade from a Junos OS version that supports both IPv4 and IPv6 to a version that supports both IPv4 and IPv6, both the IPv4 and IPv6 traffic continue to work during the upgrade process. Junos OS Release 10.2 and later releases support flow-based processing for IPv6 traffic.

  • During an ISSU, you cannot bring any PICs online. You cannot perform operations such as commit, restart, or halt.

  • During an ISSU, operations like fabric monitoring, control link recovery, and RGX preempt are suspended.

  • During an ISSU, you cannot commit any configurations.

Note

For details about ISSU support status, see knowledge base article KB17946.

The following process occurs during an ISSU for devices in a chassis cluster. The sequences given below are applicable when RG-0 is node 0 (primary node). Note that you must initiate an ISSU from RG-0 primary. If you initiate the upgrade on node 1 (RG-0 secondary), an error message is displayed.

  1. At the beginning of a chassis cluster ISSU, the system automatically fails over all RG-1+ redundancy groups that are not primary on the node from which the ISSU is initiated. This action ensures that all the redundancy groups are active on only the RG-0 primary node.

    Note

    The automatic failover of all RG-1+ redundancy groups is available from Junos OS Release 12.1 or later. If you are using Junos OS Release 11.4 or earlier, before starting the ISSU, ensure that all the redundancy groups are active on only the RG-0 primary node.

    After the system fails over all RG-1+ redundancy groups, it sets the manual failover bit and changes all RG-1+ primary node priorities to 255, regardless of whether the redundancy group failed over to the RG-0 primary node.

  2. The primary node (node 0) validates the device configuration to ensure that it can be committed using the new software version. Checks are made for disk space availability for the /var file system on both nodes, unsupported configurations, and unsupported Physical Interface Cards (PICs).

    If the disk space available on either of the Routing Engines is insufficient, the ISSU process fails and returns an error message. However, unsupported PICs do not prevent the ISSU. The software issues a warning to indicate that these PICs will restart during the upgrade. Similarly, an unsupported protocol configuration does not prevent the ISSU. However, the software issues a warning that packet loss might occur for the protocol during the upgrade.

  3. When the validation succeeds, the kernel state synchronization daemon (ksyncd) synchronizes the kernel on the secondary node (node 1) with node 0.

  4. Node 1 is upgraded with the new software image. Before being upgraded, node 1 gets the configuration file from node 0 and validates the configuration to ensure that it can be committed using the new software version. After being upgraded, it is resynchronized with node 0.

  5. The chassis cluster process (chassisd) on node 0 prepares other software processes for the ISSU. When all the processes are ready, chassisd sends a message to the PICs installed in the device.

  6. The Packet Forwarding Engine on each Flexible PIC Concentrator (FPC) saves its state and downloads the new software image from node 1. Next, each Packet Forwarding Engine sends a message (unified-ISSU ready) to the chassisd.

  7. After receiving the message (unified-ISSU ready) from a Packet Forwarding Engine, the chassisd sends a reboot message to the FPC on which the Packet Forwarding Engine resides. The FPC reboots with the new software image. After the FPC is rebooted, the Packet Forwarding Engine restores the FPC state and a high-speed internal link is established with node 1 running the new software. The chassisd is also reestablished with node 0.

  8. After all Packet Forwarding Engines have sent a ready message using the chassisd on node 0, other software processes are prepared for a node switchover. The system is ready for a switchover at this point.

  9. Node switchover occurs and node 1 becomes the new primary node (hitherto secondary node 1).

  10. The new secondary node (hitherto primary node 0) is now upgraded to the new software image.

When both nodes are successfully upgraded, the ISSU is complete.
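The ordering in the steps above (secondary upgraded first, then switchover, then the former primary) can be traced with a toy model. This is a minimal sketch for illustration only: the Node class, the simulate_issu function, and the release labels are hypothetical and do not model Junos OS internals such as ksyncd or chassisd.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    primary: bool


def simulate_issu(old, new):
    """Return one snapshot string per phase of the documented upgrade order."""
    node0 = Node("node0", old, primary=True)    # RG-0 primary, where ISSU starts
    node1 = Node("node1", old, primary=False)   # secondary node
    events = []

    def snapshot(step):
        roles = ", ".join(
            f"{n.name}={'primary' if n.primary else 'secondary'}@{n.version}"
            for n in (node0, node1)
        )
        events.append(f"{step}: {roles}")

    snapshot("start")
    node1.version = new                          # step 4: secondary is upgraded
    snapshot("secondary upgraded")
    node0.primary, node1.primary = False, True   # step 9: node switchover
    snapshot("switchover")
    node0.version = new                          # step 10: former primary is upgraded
    snapshot("former primary upgraded")
    return events


if __name__ == "__main__":
    for line in simulate_issu("old-release", "new-release"):
        print(line)
```

At no point in the trace are both nodes down or both being upgraded at once, which is why traffic keeps flowing through whichever node is primary.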

Note

When upgrading a cluster from a version that does not support encryption to a version that supports encryption, upgrade the first node to the new version. Without the encryption configured and enabled, two nodes with different versions can still communicate with each other and service is not broken. After upgrading the first node, upgrade the second node to the new version. Users can decide whether to turn on the encryption feature after completing the upgrade. Encryption must be deactivated before downgrading to a version that does not support encryption. This ensures that communication between an encryption-enabled version node and a downgraded node does not break, because both are no longer encrypted.

ISSU System Requirements

You can use ISSU to upgrade from an ISSU-capable software release to a later release.

To perform an ISSU, your device must be running a Junos OS release that supports ISSU for the specific platform. See Table 1 for platform support.

Table 1: ISSU Platform Support

Device     Junos OS Release
SRX5800    10.4R4 or later
SRX5600    10.4R4 or later
SRX5400    12.1X46-D20 or later
SRX1500    15.1X49-D70 or later
SRX4100    15.1X49-D80 or later
SRX4200    15.1X49-D80 or later
SRX4600    17.4R1 or later
SRX1400    12.1X47-D10
SRX3400    12.1X47-D10
SRX3600    12.1X47-D10
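The following sketch shows one way to check a device and running release against Table 1. It is an illustration under stated assumptions: the function names are hypothetical, entries without "or later" are treated as exact matches, release strings are expected in the short forms used in this topic (for example, 15.1X49-D70 or 17.4R1), and the check covers only the Table 1 minimum, not the upgrade-path restrictions listed earlier.

```python
import re

# Releases copied from Table 1. The boolean records whether the table says
# "or later"; treating the other rows as exact matches is an assumption.
ISSU_SUPPORT = {
    "SRX5800": ("10.4R4", True),
    "SRX5600": ("10.4R4", True),
    "SRX5400": ("12.1X46-D20", True),
    "SRX1500": ("15.1X49-D70", True),
    "SRX4100": ("15.1X49-D80", True),
    "SRX4200": ("15.1X49-D80", True),
    "SRX4600": ("17.4R1", True),
    "SRX1400": ("12.1X47-D10", False),
    "SRX3400": ("12.1X47-D10", False),
    "SRX3600": ("12.1X47-D10", False),
}

_RELEASE = re.compile(r"^(\d+)\.(\d+)([A-Z])(\d+)(?:-D(\d+))?$")


def parse_release(text):
    """Split a release string into (train, type letter, (number, D-number))."""
    match = _RELEASE.match(text)
    if not match:
        raise ValueError(f"unrecognized release string: {text!r}")
    major, minor, kind, number, dnum = match.groups()
    return (int(major), int(minor)), kind, (int(number), int(dnum or 0))


def meets_table1_minimum(device, running):
    """Return True if the running release satisfies the Table 1 entry."""
    entry = ISSU_SUPPORT.get(device)
    if entry is None:
        return False                      # device is not listed in Table 1
    minimum, or_later = entry
    if running == minimum:
        return True
    if not or_later:
        return False
    run_train, run_kind, run_rest = parse_release(running)
    min_train, min_kind, min_rest = parse_release(minimum)
    if run_train != min_train:
        return run_train > min_train      # compare major.minor first
    if run_kind != min_kind:
        # Ordering R and X releases within one train is not defined here.
        raise ValueError("cannot order different release types in one train")
    return run_rest >= min_rest


if __name__ == "__main__":
    print(meets_table1_minimum("SRX1500", "15.1X49-D60"))
```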

Note

For additional details on ISSU support and limitations, see ISSU/ICU Upgrade Limitations on SRX Series Devices.

Note the following limitations related to an ISSU:

  • The ISSU process is aborted if the Junos OS version specified for installation is a version earlier than the one currently running on the device.

  • The ISSU process is aborted if the specified upgrade conflicts with the current configuration, the components supported, and so forth.

  • ISSU does not support the extension application packages developed using the Junos OS SDK.

  • ISSU does not support version downgrading on all supported SRX Series devices.

  • ISSU occasionally fails under heavy CPU load.

Note

To downgrade from an ISSU-capable release to an earlier release (ISSU-capable or not), use the request system software add command. Unlike an upgrade using the ISSU process, a downgrade using the request system software add command might cause network disruptions and loss of data.

We strongly recommend that you perform ISSU under the following conditions:

  • When both the primary and secondary nodes are healthy

  • During system maintenance period

  • During the lowest possible traffic period

  • When the Routing Engine CPU usage is less than 40 percent
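The recommended conditions above lend themselves to a simple pre-check before an ISSU is started. The sketch below is illustrative: ClusterStatus and issu_blockers are hypothetical names, the inputs are assumed to come from your own monitoring, and the "lowest possible traffic period" condition is left to operator judgment because the text gives no threshold for it.

```python
from dataclasses import dataclass

CPU_LIMIT_PERCENT = 40  # threshold from the recommendation above


@dataclass
class ClusterStatus:
    primary_healthy: bool
    secondary_healthy: bool
    in_maintenance_window: bool
    re_cpu_percent: float


def issu_blockers(status):
    """Return the list of recommended conditions that are not met."""
    blockers = []
    if not (status.primary_healthy and status.secondary_healthy):
        blockers.append("both nodes must be healthy")
    if not status.in_maintenance_window:
        blockers.append("not in a system maintenance period")
    if status.re_cpu_percent >= CPU_LIMIT_PERCENT:
        blockers.append(
            f"Routing Engine CPU at {status.re_cpu_percent}% "
            f"(recommended: below {CPU_LIMIT_PERCENT}%)"
        )
    return blockers


if __name__ == "__main__":
    print(issu_blockers(ClusterStatus(True, True, True, 25.0)))
```

An empty list means none of the modeled conditions argues against starting the upgrade.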

In cases where ISSU is not supported or not recommended but downtime during the system upgrade must still be minimized, you can use the minimal downtime procedure. See knowledge base article KB17947.

Upgrading Both Devices in a Chassis Cluster Using ISSU

The chassis cluster ISSU feature enables both devices in a cluster to be upgraded from supported Junos OS versions with a traffic impact similar to that of redundancy group failovers.

Before you begin the ISSU for upgrading both the devices, note the following guidelines:

  • Back up the system software to the device’s hard disk by using the request system snapshot command on each Routing Engine.

  • If you are using Junos OS Release 11.4 or earlier, before starting the ISSU, set the failover for all redundancy groups so that they are all active on only one node (primary). See Initiating a Chassis Cluster Manual Redundancy Group Failover.

    If you are using Junos OS Release 12.1 or later, Junos OS automatically fails over all RGs to the RG0 primary.

  • We recommend that you enable graceful restart for routing protocols before you start an ISSU.

Note

On all supported SRX Series devices, the earliest release recommended as the starting ("from") release for an ISSU is Junos OS Release 10.4R4.

Starting with Junos OS Release 15.1X49-D70, SRX1500 devices support ISSU.

Starting with Junos OS Release 15.1X49-D80, SRX4100 and SRX4200 devices support ISSU.

Starting with Junos OS Release 17.4R1, SRX4600 devices support ISSU.

To perform an ISSU from the CLI:

  1. Download the software package from the Juniper Networks Support website: https://www.juniper.net/support/downloads/
  2. Copy the package to the primary node of the cluster. We recommend that you copy the package to the /var/tmp directory, which is a large file system on the hard disk. Note that the node from which you initiate the ISSU must have the software image.

    user@host> file copy ftp://username:prompt@ftp.hostname.net/filename /var/tmp/filename

  3. Verify the current software version running on both nodes by issuing the show version command on the primary node.
  4. Start the ISSU from the node that is primary for all the redundancy groups by entering the following command:
    user@host> request system software in-service-upgrade image-name-with-full-path reboot
    Note

    For SRX1500, SRX4100, SRX4200, SRX4600, SRX5400, SRX5600, and SRX5800 devices, you must include reboot in the command. If reboot is not included, the commit fails.

    Note

    For SRX1500, SRX4100, and SRX4200 devices, you can optionally remove the original image file by including unlink in the command.

    user@host> request system software in-service-upgrade image-name-with-full-path reboot unlink

    Wait for both nodes to complete the upgrade, after which you are logged out of the device.

  5. Wait a few minutes, and then log in to the device again. Verify by using the show version command that both devices in the cluster are running the new Junos OS release.
  6. Verify that all policies, zones, redundancy groups, and other real-time objects (RTOs) return to their correct states.
  7. Make node 0 the primary node again by issuing the request chassis cluster failover node node-number redundancy-group group-number command.
Note

If you want redundancy groups to automatically return to node 0 as the primary after an in-service software upgrade (ISSU), you must set the redundancy group priority such that node 0 is primary and enable the preempt option. Note that this method works for all redundancy groups except redundancy group 0. You must manually set the failover for redundancy group 0.

To set the redundancy group priority and enable the preempt option, see Example: Configuring Chassis Cluster Redundancy Groups.

To manually set the failover for a redundancy group, see Initiating a Chassis Cluster Manual Redundancy Group Failover.
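The command forms shown in step 4 can be assembled programmatically, for example when generating a runbook. This is a minimal sketch: issu_command is a hypothetical helper, the image path in the usage line is a placeholder, and restricting unlink to SRX1500, SRX4100, and SRX4200 is this sketch's reading of the note in step 4, not a statement about what the CLI accepts on other platforms.

```python
# Platforms for which the procedure above mentions the optional unlink keyword.
UNLINK_DOCUMENTED = {"SRX1500", "SRX4100", "SRX4200"}


def issu_command(device, image_path, unlink=False):
    """Build the operational-mode ISSU command string from step 4."""
    if not image_path.startswith("/"):
        raise ValueError("specify the image name with its full path")
    if unlink and device not in UNLINK_DOCUMENTED:
        raise ValueError(f"unlink is documented here only for {sorted(UNLINK_DOCUMENTED)}")
    parts = ["request system software in-service-upgrade", image_path, "reboot"]
    if unlink:
        parts.append("unlink")            # removes the original image file
    return " ".join(parts)


if __name__ == "__main__":
    # "/var/tmp/image.tgz" is a placeholder file name, not a real package.
    print(issu_command("SRX1500", "/var/tmp/image.tgz", unlink=True))
```

The helper always appends reboot, because the documented command form includes it and the listed platforms require it.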

Note

During the upgrade, both devices might experience redundancy group failovers, but traffic is not disrupted. Each device validates the package and checks version compatibility before beginning the upgrade. If the system finds that the new package version is not compatible with the currently installed version, the device refuses the upgrade or prompts you to take corrective action. Sometimes a single feature is not compatible, in which case, the upgrade software prompts you to either abort the upgrade or turn off the feature before beginning the upgrade.

Note

If you want to operate the SRX Series device as a standalone device again, or to remove a node from a chassis cluster, ensure that you have aborted the ISSU procedure on both nodes (if an ISSU procedure was initiated).

Rolling Back Devices in a Chassis Cluster After an ISSU

If an ISSU fails to complete and only one device in the cluster is upgraded, you can roll back to the previous configuration on the upgraded device alone by issuing one of the following commands on the upgraded device:

  • request chassis cluster in-service-upgrade abort

  • request system software rollback node node-id reboot

  • request system reboot

Enabling an Automatic Chassis Cluster Node Failback After an ISSU

If you want redundancy groups to automatically return to node 0 as the primary after an in-service software upgrade (ISSU), you must set the redundancy group priority such that node 0 is primary and enable the preempt option. Note that this method works for all redundancy groups except redundancy group 0. You must manually set the failover for redundancy group 0. To set the redundancy group priority and enable the preempt option, see Example: Configuring Chassis Cluster Redundancy Groups. To manually set the failover for a redundancy group, see Initiating a Chassis Cluster Manual Redundancy Group Failover.
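As an illustration of the priority and preempt settings described above, the following sketch generates configuration-mode statements for one redundancy group. The helper name failback_config and the priority values 200 and 100 are assumptions chosen for the example; use the values that suit your cluster, and see the referenced configuration example for the authoritative procedure.

```python
def failback_config(group, node0_priority=200, node1_priority=100):
    """Return set statements that make node 0 preferred and enable preempt."""
    if group == 0:
        # Preempt does not apply to redundancy group 0; fail it over manually.
        raise ValueError("redundancy group 0 must be failed over manually")
    if node0_priority <= node1_priority:
        raise ValueError("node 0 needs the higher priority to become primary")
    prefix = f"set chassis cluster redundancy-group {group}"
    return [
        f"{prefix} node 0 priority {node0_priority}",
        f"{prefix} node 1 priority {node1_priority}",
        f"{prefix} preempt",
    ]


if __name__ == "__main__":
    for statement in failback_config(1):
        print(statement)
```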

Note

To upgrade node 0 and make it available in the chassis cluster, manually reboot node 0. Node 0 does not reboot automatically.

Understanding Log Error Messages for Troubleshooting ISSU-Related Problems

The following problems might occur during an ISSU. You can identify the errors by using the details in the logs. For detailed information about specific system log messages, see System Log Explorer.

Chassisd Process Errors

Problem

Description: Errors related to chassisd.

Solution

Use the error messages to understand the issues related to chassisd.

When ISSU starts, a request is sent to chassisd to check whether there are any problems related to the ISSU from a chassis perspective. If there is a problem, a log message is created.

Understanding Common Error Handling for ISSU

Problem

Description: You might encounter some problems in the course of an ISSU. This section provides details on how to handle them.

Solution

Any errors encountered during an ISSU result in the creation of log messages, and ISSU continues to function without impact to traffic. If reverting to previous versions is required, the event is either logged or the ISSU is halted, so as not to create any mismatched versions on both nodes of the chassis cluster. Table 2 provides some of the common error conditions and the workarounds for them. The sample messages used in Table 2 are from the SRX1500 device and are also applicable to all supported SRX Series devices.

Table 2: ISSU-Related Errors and Solutions

Error Conditions

Solutions

Attempt to initiate an ISSU when previous instance of an ISSU is already in progress

The following message is displayed:

warning: ISSU in progress

You can abort the current ISSU process by using the request chassis cluster in-service-upgrade abort command, and then initiate the ISSU again.

Reboot failure on the secondary node

No service downtime occurs, because the primary node continues to provide required services. Detailed console messages are displayed requesting that you manually clear existing ISSU states and restore the chassis cluster.

error: [Oct  6 12:30:16]: Reboot secondary node failed (error-code: 4.1)

       error: [Oct  6 12:30:16]: ISSU Aborted! Backup node maybe in inconsistent state, Please restore backup node
       [Oct  6 12:30:16]: ISSU aborted. But, both nodes are in ISSU window.
       Please do the following:
       1. Rollback the node with the newer image using rollback command
          Note: use the 'node' option in the rollback command
          otherwise, images on both nodes will be rolled back
       2. Make sure that both nodes (will) have the same image
       3. Ensure the node with older image is primary for all RGs
       4. Abort ISSU on both nodes
       5. Reboot the rolled back node

Starting with Junos OS Release 17.4R1, the hold timer for the initial reboot of the secondary node during the ISSU process is extended from 15 minutes (900 seconds) to 45 minutes (2700 seconds) in chassis clusters on SRX1500, SRX4100, SRX4200, and SRX4600 devices.

Secondary node failed to complete the cold synchronization

The primary node times out if the secondary node fails to complete the cold synchronization. Detailed console messages are displayed requesting that you manually clear existing ISSU states and restore the chassis cluster. No service downtime occurs in this scenario.

[Oct  3 14:00:46]: timeout waiting for secondary node node1 to sync(error-code: 6.1)
        Chassis control process started, pid 36707 

       error: [Oct  3 14:00:46]: ISSU Aborted! Backup node has been upgraded, Please restore backup node 
       [Oct  3 14:00:46]: ISSU aborted. But, both nodes are in ISSU window. 
       Please do the following: 
      1. Rollback the node with the newer image using rollback command 
          Note: use the 'node' option in the rollback command 
          otherwise, images on both nodes will be rolled back 
      2. Make sure that both nodes (will) have the same image 
      3. Ensure the node with older image is primary for all RGs 
      4. Abort ISSU on both nodes 
      5. Reboot the rolled back node  

Failover of newly upgraded secondary failed

No service downtime occurs, because the primary node continues to provide required services. Detailed console messages are displayed requesting that you manually clear existing ISSU states and restore the chassis cluster.

[Aug 27 15:28:17]: Secondary node0 ready for failover.
[Aug 27 15:28:17]: Failing over all redundancy-groups to node0
ISSU: Preparing for Switchover
error: remote rg1 priority zero, abort failover.
[Aug 27 15:28:17]: failover all RGs to node node0 failed (error-code: 7.1)
error: [Aug 27 15:28:17]: ISSU Aborted!
[Aug 27 15:28:17]: ISSU aborted. But, both nodes are in ISSU window.
Please do the following:
1. Rollback the node with the newer image using rollback command
    Note: use the 'node' option in the rollback command
           otherwise, images on both nodes will be rolled back
2. Make sure that both nodes (will) have the same image
3. Ensure the node with older image is primary for all RGs
4. Abort ISSU on both nodes
5. Reboot the rolled back node
{primary:node1}

Upgrade failure on primary

No service downtime occurs, because the secondary node fails over as primary and continues to provide required services.

Reboot failure on primary node

Before the primary node reboots, the devices are already out of the ISSU setup, so no ISSU-related error messages are displayed. A reboot error message is displayed if any other failure is detected.
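The sample console messages in Table 2 share a recognizable shape: a bracketed timestamp, a description, and in some cases an error code in parentheses. The sketch below extracts those fields from captured console text. It is an illustration only: summarize_issu_log is a hypothetical helper, and the code-to-condition mapping lists just the three pairings that appear in the samples in Table 2, not a complete list of ISSU error codes.

```python
import re

# Pairings taken from the sample messages in Table 2 only.
CONDITIONS = {
    "4.1": "Reboot failure on the secondary node",
    "6.1": "Secondary node failed to complete the cold synchronization",
    "7.1": "Failover of newly upgraded secondary failed",
}

_ERROR_CODE = re.compile(r"\(error-code:\s*(\d+\.\d+)\)")
_TIMESTAMP = re.compile(r"\[([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2})\]")


def summarize_issu_log(text):
    """Pull error codes, abort status, and the first timestamp from console text."""
    codes = _ERROR_CODE.findall(text)
    stamps = _TIMESTAMP.findall(text)
    return {
        "error_codes": codes,
        "conditions": [CONDITIONS.get(code, "unknown") for code in codes],
        "aborted": "ISSU Aborted!" in text or "ISSU aborted." in text,
        "first_timestamp": stamps[0] if stamps else None,
    }


if __name__ == "__main__":
    sample = (
        "error: [Oct  6 12:30:16]: Reboot secondary node failed (error-code: 4.1)\n"
        "error: [Oct  6 12:30:16]: ISSU Aborted! Backup node maybe in inconsistent state, "
        "Please restore backup node\n"
    )
    print(summarize_issu_log(sample))
```

A summary like this can be attached to a JTAC case alongside the full console capture.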

ISSU Support-Related Errors

Problem

Description: Installation failure occurs because of unsupported software and unsupported feature configuration.

Solution

Use the following error messages to understand the compatibility-related problems:

Initial Validation Checks Failure

Problem

Description: The initial validation checks fail.

Solution

The validation checks fail if the image is not present or if the image file is corrupt. The following error messages are displayed when initial validation checks fail when the image is not present and the ISSU is aborted:

When Image Is Not Present

When Image File Is Corrupted

If the image file is corrupted, the following output displays:

The primary node validates the device configuration to ensure that it can be committed using the new software version. If anything goes wrong, the ISSU aborts and error messages are displayed.

Installation-Related Errors

Problem

Description: The install image file does not exist or the remote site is inaccessible.

Solution

Use the following error messages to understand the installation-related problems:

ISSU downloads the install image as specified in the ISSU command as an argument. The image file can be a local file or located at a remote site. If the file does not exist or the remote site is inaccessible, an error is reported.

Redundancy Group Failover Errors

Problem

Description: Problem with automatic redundancy group (RG) failover.

Solution

Use the following error messages to understand the problem:

Kernel State Synchronization Errors

Problem

Description: Errors related to ksyncd.

Solution

Use the following error messages to understand the issues related to ksyncd:

ISSU checks whether there are any ksyncd errors on the secondary node (node 1) and displays the error message if there are any problems and aborts the upgrade.

This topic includes the following sections:

Viewing ISSU Progress

Problem

Description: Rather than wait for an ISSU failure, you can display the progress of the ISSU as it occurs, noting any message indicating that the ISSU was unsuccessful. Providing such messages to JTAC can help with resolving the issue.

Solution

After starting an ISSU, issue the show chassis cluster information issu command. Output similar to the following is displayed indicating the progress of the ISSU for all Services Processing Units (SPUs).

Stopping ISSU Process if it Halts During an Upgrade

Problem

Description: The ISSU process halts in the middle of an upgrade.

Solution

If the ISSU fails to complete and only one device in the cluster is upgraded, you can roll back to the previous configuration on the upgraded device alone by issuing one of the following commands on the upgraded device:

  • request chassis cluster in-service-upgrade abort to abort the ISSU on both nodes.

  • request system software rollback node node-id reboot to roll back the image.

  • request system reboot to reboot the rolled back node.

Recovering the Node in Case of a Failed ISSU

Problem

Description: The ISSU procedure stops progressing.

Solution

Open a new session on the primary device and issue the request chassis cluster in-service-upgrade abort command.

This step aborts an in-progress ISSU. This command must be issued from a session other than the one on which you issued the request system software in-service-upgrade command that launched the ISSU. If the node is being upgraded, this command cancels the upgrade. The command is also helpful in recovering the node in case of a failed ISSU.

When an ISSU encounters an unexpected situation that necessitates an abort, the system message provides you with detailed information about when and why the upgrade stopped along with recommendations for the next steps to take.

For example, the following message is issued when a node fails to become RG-0 secondary when it boots up:

Note

If you attempt to upgrade a device pair running a Junos OS release earlier than Release 9.6, ISSU fails without changing anything on either device in the cluster. Devices running Junos OS releases earlier than Release 9.6 must be upgraded separately using individual device upgrade procedures.

If the secondary device experiences a power-off condition before it boots up using the new image specified when the ISSU was initiated, the newly upgraded device will still be waiting to end the ISSU after power is restored. To end the ISSU, issue the request chassis cluster in-service-upgrade abort command.

Release History Table
Release
Description
Starting with Junos OS Release 17.4R1, SRX4600 devices support ISSU.
Starting with Junos OS Release 17.4R1, the hold timer for the initial reboot of the secondary node during the ISSU process is extended from 15 minutes (900 seconds) to 45 minutes (2700 seconds) in chassis clusters on SRX1500, SRX4100, SRX4200, and SRX4600 devices.
Starting with Junos OS Release 15.1X49-D80, SRX4100 and SRX4200 devices support ISSU.
Starting with Junos OS Release 15.1X49-D70, SRX1500 devices support ISSU.