Chassis Cluster Overview

A chassis cluster provides high availability on SRX Series Firewalls by having two devices operate as a single device. Chassis clustering includes the synchronization of configuration files and dynamic runtime session states between the SRX Series Firewalls that are part of the chassis cluster setup.

Chassis Cluster Overview

Junos OS provides high availability on SRX Series Firewalls by using chassis clustering. SRX Series Firewalls can be configured to operate in cluster mode, where a pair of devices is connected and configured to operate like a single node, providing device, interface, and service level redundancy.

For SRX Series Firewalls, which act as stateful firewalls, it is important to preserve the state of the traffic between two devices. In a chassis cluster setup, in the event of failure, session persistence is required so that the established sessions are not dropped even if the failed device was forwarding traffic.

When configured as a chassis cluster, the two nodes back up each other, with one node acting as the primary device and the other as the secondary device, ensuring stateful failover of processes and services in the event of system or hardware failure. If the primary device fails, the secondary device takes over the processing of traffic. The cluster nodes are connected by two links, the control link and the fabric link. Devices in a chassis cluster synchronize the configuration, kernel, and Packet Forwarding Engine (PFE) session states across the cluster to facilitate high availability, failover of stateful services, and load balancing.

No separate license is required to enable chassis cluster. However, some Junos OS software features require a license to activate the feature. For more information, see Understanding Chassis Cluster Licensing Requirements, Installing Licenses on the SRX Series Devices in a Chassis Cluster and Verifying Licenses on an SRX Series Device in a Chassis Cluster. Refer to the Juniper Licensing Guide for general information about License Management. Refer to the product Data Sheets at SRX Series Services Gateways for details, or contact your Juniper Account Team or Juniper Partner.

Benefits of Chassis Cluster

  • Prevents single device failure that results in a loss of connectivity.

  • Provides high availability between devices when connecting branch and remote site links to larger corporate offices. By leveraging the chassis cluster feature, enterprises can ensure connectivity in the event of device or link failure.

Chassis Cluster Functionality

Chassis cluster functionality includes:

  • Resilient system architecture, with a single active control plane for the entire cluster and multiple Packet Forwarding Engines. This architecture presents a single device view of the cluster.

  • Synchronization of configuration and dynamic runtime states between nodes within a cluster.

  • Monitoring of physical interfaces, and failover if the failure parameters cross a configured threshold.
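The threshold-based failover in the last item can be pictured with a minimal Python sketch. It assumes a weight-based model in which each monitored interface carries a weight that is subtracted from a redundancy group's threshold when that interface goes down, and failover is triggered once the remaining threshold reaches zero. The starting value of 255, the interface names, the weights, and the function name are illustrative assumptions and are not taken from this topic.

```python
# Conceptual sketch only: models weight-based interface monitoring.
# INITIAL_THRESHOLD, the interface names, and the weights are assumptions.
INITIAL_THRESHOLD = 255


def should_fail_over(weights, down_interfaces, threshold=INITIAL_THRESHOLD):
    """Return True when the weights of failed interfaces exhaust the threshold."""
    lost = sum(weights[name] for name in down_interfaces)
    return threshold - lost <= 0


weights = {"ge-0/0/1": 255, "ge-0/0/2": 128, "ge-0/0/3": 128}

print(should_fail_over(weights, {"ge-0/0/2"}))              # False: 255 - 128 > 0
print(should_fail_over(weights, {"ge-0/0/2", "ge-0/0/3"}))  # True: 255 - 256 <= 0
print(should_fail_over(weights, {"ge-0/0/1"}))              # True: 255 - 255 <= 0
```

In this model a single heavily weighted interface can trigger failover on its own, while lightly weighted interfaces trigger it only when several fail together.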

Chassis Cluster Modes

A chassis cluster can be configured in an active/active or active/passive mode.

  • Active/passive mode: In active/passive mode, transit traffic passes through the primary node while the backup node is used only in the event of a failure. When a failure occurs, the backup device becomes primary and takes over all forwarding tasks.

  • Active/active mode: In active/active mode, transit traffic passes through both nodes of the cluster all of the time.

How Chassis Clustering Works

The control ports on the respective nodes are connected to form a control plane that synchronizes configuration and kernel state to facilitate the high availability of interfaces and services.

The data plane on the respective nodes is connected over the fabric ports to form a unified data plane.

The fabric link allows for the management of cross-node flow processing and for the management of session redundancy.

The control plane software operates in active or backup mode. When configured as a chassis cluster, the two nodes back up each other, with one node acting as the primary device and the other as the secondary device, ensuring stateful failover of processes and services in the event of system or hardware failure. If the primary device fails, the secondary device takes over processing of traffic.

The data plane software operates in active/active mode. In a chassis cluster, session information is updated as traffic traverses either device, and this information is transmitted between the nodes over the fabric link to guarantee that established sessions are not dropped when a failover occurs. In active/active mode, it is possible for traffic to ingress the cluster on one node and egress from the other node. When a device joins a cluster, it becomes a node of that cluster. With the exception of unique node settings and management IP addresses, nodes in a cluster share the same configuration.
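As a rough illustration of why synchronized session state lets established sessions survive a failover, the following Python sketch models two nodes that mirror every session update to each other. It is a conceptual toy, not a representation of Junos OS internals: the Node and Cluster classes, their methods, and the sample flow are all hypothetical.

```python
# Conceptual model only; class and method names are hypothetical and do not
# correspond to any Junos OS interface.
class Node:
    def __init__(self, name):
        self.name = name
        self.sessions = {}  # session key -> state


class Cluster:
    def __init__(self, node0, node1):
        self.nodes = [node0, node1]
        self.primary = node0

    def peer_of(self, node):
        return self.nodes[1] if node is self.nodes[0] else self.nodes[0]

    def update_session(self, node, key, state):
        """Record a session on the node that saw the traffic, then mirror it
        to the peer (standing in for synchronization over the fabric link)."""
        node.sessions[key] = state
        self.peer_of(node).sessions[key] = state

    def fail_over(self):
        """Promote the other node to primary and return it."""
        self.primary = self.peer_of(self.primary)
        return self.primary


node0, node1 = Node("node0"), Node("node1")
cluster = Cluster(node0, node1)
flow = ("10.0.0.5", 40000, "203.0.113.7", 443, "tcp")
cluster.update_session(node0, flow, "established")

new_primary = cluster.fail_over()
print(new_primary.name, new_primary.sessions[flow])  # node1 established
```

Because the session was mirrored before the failover, the newly promoted node already holds its state and can continue processing the flow.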

At any given instant, a cluster can be in one of the following states: hold, primary, secondary-hold, secondary, ineligible, and disabled. A state transition can be triggered because of any event, such as interface monitoring, SPU monitoring, failures, and manual failovers.

IPv6 Clustering Support

SRX Series Firewalls running IP version 6 (IPv6) can be deployed in active/active (failover) chassis cluster configurations in addition to the existing support of active/passive (failover) chassis cluster configurations. An interface can be configured with an IPv4 address, IPv6 address, or both. Address book entries can include any combination of IPv4 addresses, IPv6 addresses, and Domain Name System (DNS) names.

Chassis cluster supports Generic Routing Encapsulation (GRE) tunnels used to route encapsulated IPv4/IPv6 traffic by means of an internal interface, gr-0/0/0. This interface is created by Junos OS at system bootup and is used only for processing GRE tunnels. See the Interfaces User Guide for Security Devices.

Use Case for SRX Chassis Clusters

Enterprise and service provider networks employ various redundancy and resiliency methods at the customer edge network tier. As this tier represents the entrance or peering point to the Internet, its stability and uptime are of great importance. Customer transactional information, email, Voice over IP (VoIP), and site-to-site traffic can all utilize this single entry point to the public network. In environments where a site-to-site VPN is the only interconnect between customer sites and the headquarter site, this link becomes even more vital.

Traditionally, multiple devices with discrete configurations have been used to provide redundancy at this network layer with mixed results. In these configurations, the enterprise relies on routing and redundancy protocols to enable a highly available and redundant customer edge. These protocols are often slow to recognize failure and do not typically allow for the synchronization required to properly handle stateful traffic. Given that a fair amount of enterprise traffic passing through the edge (to/from the Internet, or between customer sites) is stateful, a consistent challenge in the configuration of this network tier has been ensuring session state is not lost when failover or reversion occurs.

Another challenge in the configuration of redundant devices is the need to configure, manage, and maintain separate physical devices with different configurations. Synchronizing those configurations can also be a challenge because as the need for and complexity of security measures increase, so too does the probability that configurations are mismatched. In a secure environment, a mismatched configuration can cause something as simple as a loss of connectivity or as complex and costly as a total security breach. Any anomalous event on the customer edge can affect uptime, which consequently impacts the ability to service customers, or possibly the ability to keep customer data secure.

An answer to the problem of redundant customer edge configuration is to introduce a state-aware clustering architecture that allows two or more devices to operate as a single device. Devices in this type of architecture are able to share session information between all devices to allow for near instantaneous failover and reversion of stateful traffic. A key measure of success in this space is the ability of the cluster to fail over and revert traffic while maintaining the state of active sessions.

Using the SRX Chassis Cluster configuration described in Example: Configuring an SRX Series Services Gateway as a Full Mesh Chassis Cluster will reduce system downtime.

Devices in an effective clustering architecture can also be managed as a single device, sharing a single control plane. This function is vital because it reduces the OpEx associated with managing multiple devices. Rather than managing and operating separate devices with different configurations and management portals, you can manage multiple devices that serve the same function through a single management point.

Finally, in a cluster configuration, devices have the ability to monitor active interfaces to determine their service state. An effective cluster proactively monitors all revenue interfaces and should fail over to backup interfaces if a failure is detected. This should be done at nearly instantaneous intervals to minimize the impact of a service failure (dropped customer calls, and so on).

Chassis Cluster Limitations

The SRX Series Firewalls have the following chassis cluster limitations:

Chassis Cluster

  • Group VPN is not supported.

  • On all SRX Series Firewalls in a chassis cluster, flow monitoring for version 5 and version 8 is supported. However, flow monitoring for version 9 is not supported.

  • When an SRX Series Firewall is operating in chassis cluster mode and encounters any IA-chip access issue in an SPC or an I/O Card (IOC), a minor FPC alarm is activated to trigger redundancy group failover.

  • On SRX5400, SRX5600, and SRX5800 devices, screen statistics data can be gathered on the primary device only.

  • On SRX4600, SRX5400, SRX5600, and SRX5800 devices in large chassis cluster configurations that use more than 1000 logical interfaces, we recommend increasing the cluster heartbeat timers from the default wait time before failover is triggered. In a full-capacity implementation, we recommend increasing the wait time to 8 seconds by modifying the heartbeat-threshold and heartbeat-interval values in the [edit chassis cluster] hierarchy.

    The product of the heartbeat-threshold and heartbeat-interval values defines the time before failover. The default values (heartbeat-threshold of 3 beats and heartbeat-interval of 1000 milliseconds) produce a wait time of 3 seconds.

    To change the wait time, modify the option values so that the product equals the desired setting. For example, setting the heartbeat-threshold to 8 and maintaining the default value for the heartbeat-interval (1000 milliseconds) yields a wait time of 8 seconds. Likewise, setting the heartbeat-threshold to 4 and the heartbeat-interval to 2000 milliseconds also yields a wait time of 8 seconds.

  • On SRX5400, SRX5600, and SRX5800 devices, eight-queue configurations are not reflected on the chassis cluster interface.
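The heartbeat wait-time arithmetic described in the list above can be checked with a short Python sketch. The function name is illustrative; the input values are the default and the two 8-second combinations given in the text.

```python
# Sketch of the wait-time arithmetic; the function name is illustrative.
def failover_wait_seconds(heartbeat_threshold, heartbeat_interval_ms):
    """Wait time before failover = threshold (beats) x interval (ms)."""
    return heartbeat_threshold * heartbeat_interval_ms / 1000


print(failover_wait_seconds(3, 1000))  # 3.0 (default values)
print(failover_wait_seconds(8, 1000))  # 8.0
print(failover_wait_seconds(4, 2000))  # 8.0
```

Both 8-second combinations give the same wait time. The difference is how often heartbeats are sent and how many must be missed before failover.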

Flow and Processing

  • If you use packet capture on reth interfaces, two files are created, one for ingress packets and the other for egress packets based on the reth interface name. These files can be merged outside of the device using tools such as Wireshark or Mergecap.

  • If you use port mirroring on reth interfaces, the reth interface cannot be configured as the output interface. You must use a physical interface as the output interface. If you configure the reth interface as an output interface using the set forwarding-options port-mirroring family inet output command, the following error message is displayed:

    Port-mirroring configuration error. Interface type in reth1.0 is not valid for port-mirroring or next-hop-group config

  • When an SRX Series Firewall is operating in chassis cluster mode and encounters any IA-chip access issue in an SPC or an I/O Card (IOC), a minor FPC alarm is activated to trigger redundancy group failover. The IA-chip is part of the Juniper SPC1 and IOC1 and has a direct impact on the SPC1/IOC1 control plane.

  • On SRX Series Firewalls in a chassis cluster, when two logical systems are configured, the scaling limit crosses 13,000, which is very close to the standard scaling limit of 15,000, and a convergence time of 5 minutes results. This issue occurs because multicast route learning takes more time when the number of routes is increased.

  • On SRX4600, SRX5400, SRX5600, and SRX5800 devices in a chassis cluster, if the primary node running the LACP process (lacpd) undergoes a graceful or ungraceful restart, the lacpd on the new primary node might take a few seconds to start or to reset interfaces and state machines to recover from unexpected synchronization results. Also, during failover, when the system is processing traffic packets or internal high-priority packets (deleting sessions or reestablishing tasks), medium-priority LACP packets from the peer (switch) are held back in the waiting queues, causing further delay.

Flowd monitoring is supported on SRX300, SRX320, SRX340, SRX345, SRX380, SRX1500, SRX1600, SRX2300, and SRX4300 devices.

Installation and Upgrade

  • For SRX300, SRX320, SRX340, SRX345, and SRX380 devices, the reboot parameter is not available, because the devices in a cluster are automatically rebooted following an in-band cluster upgrade (ICU).

Interfaces

  • On the lsq-0/0/0 interface, the link services MLPPP, MLFR, and CRTP are not supported.

  • On the lt-0/0/0 interface, CoS for RPM is not supported.

  • The 3G dialer interface is not supported.

  • Queuing on the ae interface is not supported.

Layer 2 Switching

  • On SRX Series Firewall failover, access points on the Layer 2 switch reboot and all wireless clients lose connectivity for 4 to 6 minutes.

MIBs

  • The Chassis Cluster MIB is not supported.

Monitoring

  • The maximum number of monitoring IPs that can be configured per cluster is 64 for SRX300, SRX320, SRX340, SRX345, SRX380, SRX1500, SRX1600, SRX2300, and SRX4300 devices.

  • On SRX300, SRX320, SRX340, SRX345, SRX380, SRX1500, SRX1600, SRX2300, and SRX4300 devices, logs cannot be sent to NSM when logging is configured in stream mode. Logs cannot be sent because the security log does not support configuration of the source IP address for the fxp0 interface, and the security log destination in stream mode cannot be routed through the fxp0 interface. This implies that you cannot configure the security log server in the same subnet as the fxp0 interface and route the log server through the fxp0 interface.

IPv6

  • Redundancy group IP address monitoring is not supported for IPv6 destinations.

GPRS

  • On SRX5400, SRX5600, and SRX5800 devices, an APN or an IMSI filter must be limited to 600 for each GTP profile. The number of filters is directly proportional to the number of IMSI prefix entries. For example, if one APN is configured with two IMSI prefix entries, then the number of filters is two.
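The filter-count example above can be expressed as a small Python check. This is only a sketch: the dictionary layout, the APN name, the IMSI prefix values, and the function names are assumptions made for illustration, and it counts one filter per IMSI prefix entry as in the example given in the text.

```python
# Illustrative check only; the data layout and names are assumptions.
MAX_FILTERS_PER_GTP_PROFILE = 600  # limit stated in the text above


def count_imsi_filters(apn_to_imsi_prefixes):
    """Count one filter per IMSI prefix entry across all APNs in a profile."""
    return sum(len(prefixes) for prefixes in apn_to_imsi_prefixes.values())


def within_filter_limit(apn_to_imsi_prefixes):
    """Return True if the profile stays within the per-profile filter limit."""
    return count_imsi_filters(apn_to_imsi_prefixes) <= MAX_FILTERS_PER_GTP_PROFILE


profile = {"apn.example.net": ["23415", "31026"]}
print(count_imsi_filters(profile))   # 2
print(within_filter_limit(profile))  # True
```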

Nonstop Active Routing (NSR)

  • NSR preserves interface and kernel information and saves routing protocol information by running the routing protocol process (RPD) on the backup Routing Engine. However, most SRX Series platforms do not yet support NSR, so no RPD process runs on the secondary node. After an RG0 failover, the new RG0 primary node starts a new RPD process and must renegotiate with its peer devices. Only SRX5000 line devices running Junos OS Release 17.4R2 or later support NSR.

Starting with Junos OS Release 12.1X45-D10, sampling features such as flow monitoring, packet capture, and port mirroring are supported on reth interfaces.

Change History Table

Feature support is determined by the platform and release you are using. Use Feature Explorer to determine if a feature is supported on your platform.

Release    Description
12.1X45    Starting with Junos OS Release 12.1X45-D10, sampling features such as flow monitoring, packet capture, and port mirroring are supported on reth interfaces.