Chassis Cluster Overview
A chassis cluster provides high availability on SRX Series devices by allowing two devices to operate as a single device. Chassis clustering includes synchronization of configuration files and dynamic runtime session states between the SRX Series devices that are part of the cluster.
Junos OS provides high availability on SRX Series devices through chassis clustering. SRX Series Services Gateways can be configured to operate in cluster mode, where a pair of devices is connected and configured to operate as a single node, providing device-level, interface-level, and service-level redundancy.
Because SRX Series devices act as stateful firewalls, it is important to preserve the state of traffic across the two devices. In a chassis cluster setup, session persistence is required in the event of a failure, so that established sessions are not dropped even if the failed device was forwarding traffic.
When configured as a chassis cluster, the two nodes back up each other, with one node acting as the primary device and the other as the secondary device, ensuring stateful failover of processes and services in the event of system or hardware failure. If the primary device fails, the secondary device takes over the processing of traffic. The cluster nodes are connected by two links, the control link and the fabric link. Devices in a chassis cluster synchronize the configuration, kernel, and Packet Forwarding Engine (PFE) session states across the cluster to facilitate high availability, failover of stateful services, and load balancing.
This feature requires a license. To learn more about chassis cluster licensing, see Understanding Chassis Cluster Licensing Requirements, Installing Licenses on the SRX Series Devices in a Chassis Cluster, and Verifying Licenses on an SRX Series Device in a Chassis Cluster. Refer to the Juniper Licensing Guide for general information about license management. Refer to the product data sheets at SRX Series Services Gateways for details, or contact your Juniper account team or Juniper partner.
Benefits of Chassis Cluster
Prevents single device failure that results in a loss of connectivity.
Provides high availability between devices when connecting branch and remote site links to larger corporate offices. By leveraging the chassis cluster feature, enterprises can ensure connectivity in the event of device or link failure.
Chassis Cluster Functionality
Chassis cluster functionality includes:
Resilient system architecture, with a single active control plane for the entire cluster and multiple Packet Forwarding Engines. This architecture presents a single device view of the cluster.
Synchronization of configuration and dynamic runtime states between nodes within a cluster.
Monitoring of physical interfaces, and failover if the failure parameters cross a configured threshold.
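As a minimal sketch of how a cluster is formed, the following operational-mode commands enable chassis cluster mode on each device; the cluster ID (1) and node IDs shown are example values, and both devices reboot into cluster mode after the command is issued:

```
{node-0}
user@host> set chassis cluster cluster-id 1 node 0 reboot

{node-1}
user@host> set chassis cluster cluster-id 1 node 1 reboot
```

Both devices must use the same cluster ID; the node ID (0 or 1) must be unique within the cluster.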
Chassis Cluster Modes
A chassis cluster can be configured in an active/active or active/passive mode.
Active/passive mode: In active/passive mode, transit traffic passes through the primary node, while the backup node is used only in the event of a failure. When a failure occurs, the backup device becomes primary and takes over all forwarding tasks.
Active/active mode: In active/active mode, transit traffic passes through both nodes of the cluster at all times.
How Chassis Clustering Works
When creating a chassis cluster, the control ports on the respective nodes are connected to form a control plane that synchronizes the configuration and kernel state to facilitate the high availability of interfaces and services.
Similarly, the data plane on the respective nodes is connected over the fabric ports to form a unified data plane.
The fabric link allows for the management of cross-node flow processing and for the management of session redundancy.
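A minimal fabric-link configuration might look like the following, where fab0 belongs to node 0 and fab1 to node 1; the member interfaces shown (ge-0/0/2 and ge-5/0/2) are placeholders that depend on your platform and slot numbering:

```
set interfaces fab0 fabric-options member-interfaces ge-0/0/2
set interfaces fab1 fabric-options member-interfaces ge-5/0/2
```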
The control plane software operates in active or backup mode. As described earlier, the two nodes back up each other, with one node acting as the primary device and the other as the secondary device, ensuring stateful failover of processes and services in the event of system or hardware failure. If the primary device fails, the secondary device takes over processing of traffic.
The data plane software operates in active/active mode. In a chassis cluster, session information is updated as traffic traverses either device, and this information is transmitted between the nodes over the fabric link to guarantee that established sessions are not dropped when a failover occurs. In active/active mode, it is possible for traffic to ingress the cluster on one node and egress from the other node. When a device joins a cluster, it becomes a node of that cluster. With the exception of unique node settings and management IP addresses, nodes in a cluster share the same configuration.
At any given instant, a cluster node can be in one of the following states: hold, primary, secondary-hold, secondary, ineligible, and disabled. A state transition can be triggered by events such as interface monitoring, SPU monitoring, failures, and manual failovers.
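To illustrate, the node states and manual failovers mentioned above can be inspected and triggered from operational mode; redundancy-group 1 and node 1 here are example values:

```
user@host> show chassis cluster status
user@host> request chassis cluster failover redundancy-group 1 node 1
user@host> request chassis cluster failover reset redundancy-group 1
```

The failover reset command clears the manual failover flag so that automatic failover behavior resumes for the redundancy group.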
IPv6 Clustering Support
SRX Series devices running IP version 6 (IPv6) can be deployed in active/active (failover) chassis cluster configurations in addition to the existing support of active/passive (failover) chassis cluster configurations. An interface can be configured with an IPv4 address, IPv6 address, or both. Address book entries can include any combination of IPv4 addresses, IPv6 addresses, and Domain Name System (DNS) names.
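For example, a redundant Ethernet (reth) interface can carry both address families at once; the interface name and the documentation-range addresses below are placeholders:

```
set interfaces reth0 unit 0 family inet address 192.0.2.1/24
set interfaces reth0 unit 0 family inet6 address 2001:db8::1/64
```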
Chassis cluster supports Generic Routing Encapsulation (GRE) tunnels used to route encapsulated IPv4/IPv6 traffic by means of an internal interface, gr-0/0/0. This interface is created by Junos OS at system bootup and is used only for processing GRE tunnels. See the Interfaces User Guide for Security Devices.
Chassis Cluster Limitations
The SRX Series devices have the following chassis cluster limitations:
Group VPN is not supported.
On all SRX Series devices in a chassis cluster, flow monitoring for version 5 and version 8 is supported. However, flow monitoring for version 9 is not supported.
When an SRX Series device operating in chassis cluster mode encounters an IA-chip access issue in a Services Processing Card (SPC) or an I/O card (IOC), a minor FPC alarm is activated to trigger redundancy group failover. (The IA chip is part of the Juniper SPC1 and IOC1 and has a direct impact on the SPC1/IOC1 control plane.)
On SRX5400, SRX5600, and SRX5800 devices, screen statistics data can be gathered on the primary device only.
On SRX4600, SRX5400, SRX5600, and SRX5800 devices, in large chassis cluster configurations with more than 1000 logical interfaces, we recommend increasing the cluster heartbeat timers from the default wait time before failover is triggered. In a full-capacity implementation, we recommend increasing the wait to 8 seconds by modifying the heartbeat-threshold and heartbeat-interval values in the [edit chassis cluster] hierarchy.
The product of the heartbeat-threshold and heartbeat-interval values defines the time before failover. The default values (heartbeat-threshold of 3 beats and heartbeat-interval of 1000 milliseconds) produce a wait time of 3 seconds.
To change the wait time, modify the option values so that the product equals the desired setting. For example, setting the heartbeat-threshold to 8 and maintaining the default value for the heartbeat-interval (1000 milliseconds) yields a wait time of 8 seconds. Likewise, setting the heartbeat-threshold to 4 and the heartbeat-interval to 2000 milliseconds also yields a wait time of 8 seconds.
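The first example above maps to the following configuration statements in the [edit chassis cluster] hierarchy:

```
set chassis cluster heartbeat-threshold 8
set chassis cluster heartbeat-interval 1000
```

Equivalently, a heartbeat-threshold of 4 with a heartbeat-interval of 2000 milliseconds yields the same 8-second wait.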
On SRX5400, SRX5600, and SRX5800 devices, eight-queue configurations are not reflected on the chassis cluster interface.
Flow and Processing
If you use packet capture on reth interfaces, two files are created, one for ingress packets and the other for egress packets based on the reth interface name. These files can be merged outside of the device using tools such as Wireshark or Mergecap.
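As an example of merging the two capture files off-box, Wireshark's mergecap utility can combine them into a single file; the filenames below are placeholders for the per-direction files generated from the reth interface name:

```
mergecap -w reth0-merged.pcap reth0-ingress.pcap reth0-egress.pcap
```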
If you use port mirroring on reth interfaces, the reth interface cannot be configured as the output interface; you must use a physical interface as the output interface instead. If you configure the reth interface as an output interface using the set forwarding-options port-mirroring family inet output command, the following error message is displayed:
Port-mirroring configuration error.
Interface type in reth1.0 is not valid for port-mirroring or next-hop-group config
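A working configuration therefore points the mirror output at a physical interface; the interface name and next-hop address below are placeholder values:

```
set forwarding-options port-mirroring input rate 1
set forwarding-options port-mirroring family inet output interface ge-0/0/1.0 next-hop 192.0.2.2
```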
On SRX Series devices in a chassis cluster, when two logical systems are configured, the number of routes can exceed 13,000, which is very close to the standard scaling limit of 15,000, resulting in a convergence time of 5 minutes. This issue occurs because multicast route learning takes more time as the number of routes increases.
On SRX4600, SRX5400, SRX5600, and SRX5800 devices in a chassis cluster, if the primary node running the LACP process (lacpd) undergoes a graceful or ungraceful restart, the lacpd on the new primary node might take a few seconds to start, or to reset interfaces and state machines to recover from unexpected synchronization results. In addition, during failover, when the system is processing traffic packets or internal high-priority packets (deleting sessions or reestablishing tasks), medium-priority LACP packets from the peer (switch) are held in waiting queues, causing further delay.
Flowd monitoring is supported on SRX100, SRX210, SRX240, SRX300, SRX320, SRX340, SRX345, SRX380, SRX550M, SRX650, and SRX1500 devices.
Installation and Upgrade
For SRX300, SRX320, SRX340, SRX345, SRX380, and SRX550M devices, the reboot parameter is not available, because the devices in a cluster are automatically rebooted following an in-band cluster upgrade (ICU).
On the lsq-0/0/0 interface, Link services MLPPP, MLFR, and CRTP are not supported.
On the lt-0/0/0 interface, CoS for RPM is not supported.
The 3G dialer interface is not supported.
Queuing on the ae interface is not supported.
Layer 2 Switching
On SRX Series device failover, access points on the Layer 2 switch reboot and all wireless clients lose connectivity for 4 to 6 minutes.
The Chassis Cluster MIB is not supported.
The maximum number of monitoring IPs that can be configured per cluster is 64 for SRX300, SRX320, SRX340, SRX345, SRX380, SRX550M, and SRX1500 devices.
On SRX300, SRX320, SRX340, SRX345, SRX380, SRX550M, and SRX1500 devices, logs cannot be sent to NSM when logging is configured in stream mode, because the security log does not support configuring the source IP address on the fxp0 interface, and a security log destination in stream mode cannot be routed through the fxp0 interface. This means that you cannot configure the security log server in the same subnet as the fxp0 interface and route the log server through the fxp0 interface.
Redundancy group IP address monitoring is not supported for IPv6 destinations.
On SRX5400, SRX5600, and SRX5800 devices, the number of APN or IMSI filters must be limited to 600 for each GTP profile. The number of filters is directly proportional to the number of IMSI prefix entries; for example, if one APN is configured with two IMSI prefix entries, the number of filters is two.
Starting with Junos OS Release 12.1X45-D10, sampling features such as flow monitoring, packet capture, and port mirroring are supported on reth interfaces.