Multinode High Availability

SUMMARY Learn about the Multinode High Availability solution and how you can use it in simple and reliable deployment models. Currently, we support two nodes in any Multinode High Availability deployment.

Overview

Business continuity is an important requirement of the modern network. Downtime of even a few seconds can cause disruption and inconvenience, in addition to increasing OpEx and CapEx. Modern networks also have data centers spread across multiple geographic areas. In such scenarios, achieving high availability can be very challenging.

Juniper Networks® SRX Series Firewalls support a new solution, Multinode High Availability, to address high availability requirements for modern data centers. In this solution, both the control plane and the data plane of the participating devices (nodes) are active at the same time. Thus, the solution provides interchassis resiliency.

The participating devices could be co-located or physically separated across geographical areas or other locations such as different rooms or buildings. Having nodes with high availability across geographical locations ensures resilient service. If a disaster affects one physical location, Multinode High Availability can fail over to a node in another physical location, thereby ensuring continuity.

Benefits of Multinode High Availability

  • Reduced CapEx and OpEx—Eliminates the need for a switched network surrounding the firewall complex and the need for direct Layer 2 (L2) connectivity between nodes.

  • Network flexibility—Provides greater network flexibility by supporting high availability across Layer 3 (L3) and switched network segments.

  • Stateful resilient solution—Supports active control plane and data plane at the same time on both nodes.

  • Business continuity and disaster recovery—Maximizes availability, increasing redundancy within and across data centers and geographies.

  • Smooth upgrades—Supports running different Junos OS releases on the two nodes, enabling smooth upgrades from one release to another.

We support two nodes in a Multinode High Availability solution.

Active/Backup Multinode High Availability

We support active/backup Multinode High Availability on:

  • SRX5800, SRX5600, SRX5400 with SPC3, IOC3, IOC4, SCB3, SCB4, and RE3 (in Junos OS Release 20.4R1)

  • SRX4600, SRX4200, SRX4100, and SRX1500 (in Junos OS Release 22.3R1)

  • vSRX3.0 virtual firewalls (in Junos OS Release 22.3R1) for the following private and public cloud platforms:

    • KVM (kernel-based virtual machine)
    • VMware ESXi
    • Amazon Web Services (AWS)

Active/Active Multinode High Availability

Starting in Junos OS Release 22.4R1, you can operate Multinode High Availability in active-active mode with support for multiple services redundancy groups (SRGs).

Multi SRG support is available on SRX5400, SRX5600, and SRX5800 with SPC3, IOC3, IOC4, SCB3, SCB4, and RE3.

Supported Features

SRX Series Firewalls with Multinode High Availability support firewall and advanced security services such as application security, Content Security, intrusion prevention system (IPS), firewall user authentication, NAT, and ALGs.

For the complete list of features supported with Multinode High Availability, see Feature Explorer.

Multinode High Availability does not support transparent mode high availability (HA).

Deployment Scenarios

Multinode High Availability supports two SRX Series Firewalls presenting themselves as independent nodes to the rest of the network. The nodes are connected to adjacent infrastructure belonging to the same or different networks, depending on the deployment mode. These nodes can be either co-located or separated across geographies. Participating nodes back up each other to ensure a fast, synchronized failover in case of system or hardware failure.

We support the following types of network deployment models for Multinode High Availability:

  • Route mode (all interfaces connected using a Layer 3 topology)
    Figure 1: Layer 3 Mode
  • Default gateway mode (all interfaces connected using a Layer 2 topology), used in more traditional environments. This mode is a common deployment for DMZ networks, where the firewall acts as the default gateway for hosts and applications on the same segment.
    Figure 2: Default Gateway Mode
  • Hybrid mode (one or more interfaces connected using a Layer 3 topology and one or more interfaces connected using a Layer 2 topology)
    Figure 3: Hybrid Mode
  • AWS deployment
    Figure 4: Public Cloud Deployment

How Is Multinode High Availability Different from Chassis Cluster?

A chassis cluster operates in Layer 2 network environment and requires two links between the nodes (control link and fabric link). These links connect both nodes over dedicated VLANs using back-to-back cabling or over dark fiber. Control links and fabric links use dedicated physical ports on the SRX Series Firewall.

Multinode High Availability uses an encrypted logical interchassis link (ICL). The ICL connects the nodes over a routed path instead of a dedicated Layer 2 network. This routed path can use one or more revenue ports. For the best resiliency, you can even dedicate a separate routing instance to these ports and paths to ensure complete isolation.

Figure 5 and Figure 6 show two architectures.

Figure 5: Chassis Cluster Topology in a Layer 2 Network
Figure 6: Multinode High Availability in a Layer 3 Network

Table 1 lists the differences between the two architectures.

Table 1: Comparing Chassis Cluster and Multinode High Availability

  • Network topology
    • Chassis cluster: Nodes connect to a broadcast domain.
    • Multinode High Availability: Nodes connect to a router, a broadcast domain, or a combination of both.
  • Network environment
    • Chassis cluster: Layer 2.
    • Multinode High Availability: Layer 3 (route mode); Layer 2 (default gateway mode); a combination of Layer 3 and Layer 2 (hybrid mode); or public cloud (AWS) deployments.
  • Traffic switchover approach
    • Chassis cluster: The SRX Series Firewall sends a GARP to the switch.
    • Multinode High Availability: Switchover using IP path selection by a peer Layer 3 router, or Layer 2 GARP from an SRX Series Firewall to a peer Layer 2 switch. In route mode, switchover uses IP path selection (route updates); in hybrid mode, switchover uses IP path selection (route updates) plus a GARP sent to the switch; in default gateway mode, the SRX Series Firewall sends a GARP to the switch.
  • Public cloud
    • Chassis cluster: Not supported.
    • Multinode High Availability: Supported.
  • Dynamic routing function
    • Chassis cluster: The routing process is active on the SRX Series Firewall where the control plane (RG0) is active.
    • Multinode High Availability: The routing process is active on each SRX Series Firewall participating in Multinode High Availability.
  • Connection between SRX Series Firewalls
    • Chassis cluster: Control link (Layer 2 path) and fabric link (Layer 2 path).
    • Multinode High Availability: Interchassis link (Layer 3 path).
  • Connectivity/geo-redundancy
    • Chassis cluster: Requires a dedicated Layer 2 stretch between the SRX Series nodes for the control link and fabric link.
    • Multinode High Availability: Uses any routed path between the nodes for the interchassis link.
  • Monitoring to detect network failure
    • Chassis cluster: Interface monitoring; IP monitoring using IPv4 addresses.
    • Multinode High Availability: Interface monitoring; IP monitoring using IPv4 and IPv6 addresses; Bidirectional Forwarding Detection (BFD) using IPv4 addresses.

Multinode High Availability Glossary

Let's begin by getting familiar with the Multinode High Availability terms used in this documentation.

Table 2: Multinode High Availability Glossary
Term Description
active/active state (SRG0) All security services/flows are inspected at each node and backed up on the other node. Security flows must be symmetric.
active/backup state (SRG1+) An SRG1+ remains active on one node at any given time and remains in the backup state on the other node. An SRG1+ in the backup state is ready to take over traffic from the active SRG1+ in case of a failure.
device priority Priority value determines whether a node can act as the active node in a Multinode High Availability setup. The node with a lower numerical value has a higher priority and, therefore, acts as the active node while the other node acts as the backup node.
device preemption Preemptive behavior allows the device with the higher priority (lower numerical value) to resume the active role after it recovers from a failure. If you need a specific device in Multinode High Availability to act as the active node, you must enable preemptive behavior on both devices and assign a device priority value to each device.
failover A failover happens when one node detects a failure (hardware, software, and so on) and traffic transitions to the other node in a stateful manner. As a result, the backup node in a high availability system takes over the tasks of the active node when the active node fails.
floating IP address or activeness probing IP address An IP address that moves from an active node to the backup node during failover in a Multinode High Availability setup. This mechanism enables clients to communicate with the nodes using a single IP address.

high availability/resiliency Ability of a system to eliminate single points of failure to ensure continuous operations over an extended period of time.
interchassis link IP-based link (logical link) that connects nodes over a routed network in a Multinode High Availability deployment. The ICL is normally bound to the loopback interfaces for the most flexible deployments. The connection can use any routed or switched path, as long as the two IP addresses are reachable from each other.

The security device uses the ICL to synchronize and maintain state information and to handle device failover scenarios.

Interchassis link encryption Link encryption provides data privacy for messages traversing the network. Because the ICL carries private data, it is important to encrypt the link. You must encrypt the ICL using IPsec VPN.
monitoring (BFD) Monitoring of one or more links using Bidirectional Forwarding Detection (BFD). BFD monitoring triggers a routing path change or a system failover, depending on system configuration.
monitoring (IP) Monitoring of a reliable IP address and system state in case of loss of communication with the peer node.
monitoring (path) Method that uses ICMP to verify the reachability of the IP address. The default interval for ICMP ping probes is 1 second.
monitoring (system) Monitoring of key hardware and software resources and infrastructures by triggering failover when a failure is detected on a node.
probing Mechanism used to exchange messages between active and backup nodes in the high availability setup. The messages determine the status and health of the application on each individual node.
real-time object (RTO) Special payload packet that contains the necessary information to synchronize the data from one node to the other node.
split-brain detection (also known as control plane detection or activeness conflict detection) Event where the ICL between two Multinode High Availability nodes is down, and both nodes initiate an activeness determination probe (split-brain probe). Based on the response to the probe, subsequent failover to a new role is triggered.
services redundancy group (SRG) Failover unit that includes and manages a collection of objects on the participating nodes. The SRG on one node switches over to the other node when a failover is detected.
SRG0 Manages all control plane stateless services such as firewall, NAT, and ALG. SRG0 is active on all participating nodes and handles symmetric security flows.
SRG1+ Manages stateful control plane services (IPsec VPN, or virtual IPs in hybrid or default gateway mode).
synchronization Process where control and data plane states are synchronized across the nodes.
virtual IP (VIP) address Virtual IP addresses in hybrid or default gateway mode are used for activeness determination and enforcement on the switching side in a Multinode High Availability setup. The virtual IP is controlled by the SRG1+.
virtual MAC (VMAC) address (For hybrid and default gateway deployments.) Virtual MAC address dynamically assigned to the interface on the active node that faces the switching side.

Now that we are familiar with Multinode High Availability features and terminology, let's look at how Multinode High Availability works.

How Multinode High Availability Works

Note:

We support a two-node configuration for the Multinode High Availability solution.

In a Multinode High Availability setup, you connect two SRX Series Firewalls to adjacent upstream and downstream routers (for Layer 3 deployments), routers and switches (hybrid deployment), or switches (default gateway deployment) using the revenue interfaces.

The nodes communicate with each other over an interchassis link (ICL). The ICL uses Layer 3 connectivity between the nodes; this communication can take place over a routed network (Layer 3) or over a directly connected Layer 2 path. We recommend that you bind the ICL to the loopback interface and use more than one physical link (LAG/LACP) to ensure path diversity for the highest resiliency.
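For orientation, a minimal ICL definition might look like the following sketch. The node IDs, addresses, and loopback interface are placeholders, and the exact statement hierarchy under chassis high-availability can vary by Junos OS release, so treat the statement names as assumptions and confirm them against your release:

    set interfaces lo0 unit 0 family inet address 10.0.0.1/32
    set chassis high-availability local-id 1
    set chassis high-availability local-ip 10.0.0.1
    set chassis high-availability peer-id 2 peer-ip 10.0.0.2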

Multinode High Availability operates in active/active mode for the data plane and in active/backup mode for control plane services. The active SRX Series Firewall hosts the floating IP address and uses it to steer traffic toward itself.

Multinode High Availability operates in:

  • Active/active mode (SRG0) for the security services
  • Active/backup mode (SRG1 and above) for security and system services

Floating IP addresses controlled by SRG1 or above move between the nodes. The active SRG1+ hosts and controls the floating IP address. In failover scenarios, this IP address 'floats' to the SRG1+ that becomes active, based on configuration, system health, or path-monitoring decisions. The newly active SRG1+ takes over the function of the now-standby SRG1+ and starts responding to incoming requests.

Figure 7, Figure 8, and Figure 9 show deployments in Layer 3, hybrid, and default gateway modes.

Figure 7: Layer 3 Deployment

In this topology, two SRX Series Firewalls are part of a Multinode High Availability setup. The setup has Layer 3 connectivity between the SRX Series Firewalls and the neighboring routers. The devices run on separate physical Layer 3 networks and operate as two independent nodes. The nodes shown in the illustration are co-located; the nodes can also be geographically separated.

Figure 8: Default Gateway Deployment

In a typical default gateway deployment, hosts and servers in a LAN are configured with the security device as their default gateway. The security device must therefore host a virtual IP (VIP) address that moves between nodes based on activeness. The configuration on the hosts remains static, and a security device failover is seamless from the hosts' perspective.

You must configure static routes or dynamic routing on the SRX Series Firewalls to reach networks that are not directly connected.

Figure 9: Hybrid Deployment

In hybrid mode, an SRX Series Firewall uses a VIP address on the Layer 2 side to draw traffic toward it. You can optionally configure a static ARP entry for the VIP using the VMAC address to ensure that the IP-to-MAC binding does not change during a failover.
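For illustration only, a static ARP binding in Junos OS takes the following general form. The interface, addresses, and MAC value here are hypothetical placeholders (the MAC is from the documentation range 00:00:5E:00:53:xx); the actual VIP and VMAC values come from your SRG configuration:

    set interfaces ge-0/0/1 unit 0 family inet address 10.1.1.2/24 arp 10.1.1.1 mac 00:00:5e:00:53:01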

Let's now understand the components and functionality of Multinode High Availability in detail.

Services Redundancy Groups

A services redundancy group (SRG) is a failover unit in a Multinode High Availability setup. There are two types of SRGs:

  • SRG0—Manages security services from Layer 4 through Layer 7, except IPsec VPN services. SRG0 operates in active mode on both nodes at any point in time. With SRG0, each security session must traverse a node in a symmetric flow; these flows are fully state-synchronized to the other node as backup.
  • SRG1+—Manages IPsec services, and virtual IPs in hybrid and default gateway modes; these services are backed up to the other node. An SRG1+ operates in active mode on one node and in backup mode on the other node.

Figure 10 shows SRG0 and SRG1 in a Multinode High Availability setup.

Figure 10: Single SRG Support in Multinode High Availability (Active-Backup Mode)

Figure 11 shows SRG0 and SRG1+ in a Multinode High Availability setup.

Figure 11: Multi SRG Support in Multinode High Availability (Active-Active Mode)

Starting in Junos OS Release 22.4R1, you can configure Multinode High Availability to operate in active-active mode with support for multiple SRG1s (SRG1+). In this mode, some SRGs remain active on one node and some SRGs remain active on the other node. A particular SRG always operates in active-backup mode; it operates in active mode on one node and in backup mode on the other node. As a result, both nodes can have active SRG1+ instances forwarding stateful services. Each node has a different set of floating IP addresses assigned to its SRG1+ instances.

Note:

Starting in Junos OS Release 22.4R1, you can configure up to 20 SRGs in a Multinode High Availability setup.

Table 3 explains the behavior of SRGs in a Multinode High Availability setup.

Table 3: Services Redundancy Group Details in Multinode High Availability

  • SRG0
    • Managed services: Security services from Layer 4 through Layer 7, except IPsec VPN.
    • Operates in: Active/active mode.
    • Synchronization type: Stateful synchronization of security services.
    • When the active node fails: Traffic processed on the failed node transitions to the healthy node in a stateful manner.
    • Configuration options: Shutdown-on-failure option; install-on-failure route option.
  • SRG1+
    • Managed services: IPsec and virtual IP addresses with associated security services.
    • Operates in: Active/backup mode.
    • Synchronization type: Stateful synchronization of security services.
    • When the active node fails: Traffic processed on the failed node transitions to the healthy node in a stateful manner.
    • Configuration options: Active/backup signal route; deployment type; activeness priority and preemption; virtual IP address (for default gateway deployments); activeness probing; process-packet-on-backup option; BFD monitoring; IP monitoring; interface monitoring.
Note:

When you configure monitoring (BFD, IP, or interface) options on an SRG1+, we recommend that you do not configure the shutdown-on-failure option on SRG0.

Activeness Determination and Enforcement

In a Multinode High Availability setup, activeness is determined at the service level, not at the node level. The active/backup state is maintained at the SRG level, and traffic is steered toward the active SRG. SRG0 remains active on both nodes, whereas an SRG1+ is in the active state on one node and in the backup state on the other.

If you prefer a certain node to take over as the active node on boot, you can do one of the following:

  • Configure the upstream routers to include preferences for the path where the node is located.
  • Configure activeness priority.
  • Allow the node with the higher node ID to take the active role (when you do not configure either of the preceding two options).

In a Multinode High Availability setup, both SRX Series Firewalls initially advertise the route for the floating IP address to the upstream routers. There isn't a specific preference between the two paths advertised by the SRX Series Firewalls. However, the router can prefer one of the paths depending on the configured metrics.

Figure 12 represents the sequence of events for activeness determination and activeness enforcement.

Figure 12: Activeness Determination and Enforcement
  1. On boot, devices enter the hold state and start probing continuously. The devices use the floating IP address (activeness-probing source IP address) as the source IP address and IP addresses of the upstream routers as the destination IP address for the activeness determination probe.
  2. The router hosting the probe destination IP address replies to the SRX Series Firewall that is available on its preferred routing path. In the following example, SRX-1 gets the response from the upstream router.

    Figure 13: Activeness Determination and Enforcement
  3. SRX-1 promotes itself to the active role because it received the probe reply, and communicates its role change to the other device.

  4. After the activeness is determined, the active node (SRX-1):

    • Hosts the floating IP address assigned to it.
    • Advertises the high-preference path to adjacent BGP neighbors.
    • Continues to advertise the active (higher) preference path for all remote and local routes to draw the traffic.
    • Notifies the active node status to the other node through the ICL.
  5. The other device (SRX-2) stops probing and takes over the backup role. The backup node advertises the default (lower) priority, ensuring that the upstream routers do not forward any packets to the backup node.

The Multinode High Availability module adds active and backup signal routes for the SRG to the routing table when the node moves to the active role. In case of a node failure, the ICL goes down and the current active node releases its active role and removes the active signal route. The backup node then detects this condition through its probes and transitions to the active role. The route preference is swapped to drive all the traffic toward the new active node.

The switch in the route preference advertisement is part of the routing policies configured on the SRX Series Firewalls. You must configure the routing policy to include the active signal route with the if-route-exists condition, as in the sketch below.
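The following is a minimal sketch of that approach. The signal route (10.100.1.1/32), the condition and policy names (SRG1-ACTIVE, ADVERTISE-PREF), the BGP group name (upstream), and the metrics are placeholders you would adapt to your design; the if-route-exists condition makes the preferred advertisement depend on the presence of the active signal route:

    set chassis high-availability services-redundancy-group 1 active-signal-route 10.100.1.1
    set policy-options condition SRG1-ACTIVE if-route-exists 10.100.1.1/32
    set policy-options condition SRG1-ACTIVE if-route-exists table inet.0
    set policy-options policy-statement ADVERTISE-PREF term active from condition SRG1-ACTIVE
    set policy-options policy-statement ADVERTISE-PREF term active then metric 10
    set policy-options policy-statement ADVERTISE-PREF term active then accept
    set policy-options policy-statement ADVERTISE-PREF term backup then metric 100
    set policy-options policy-statement ADVERTISE-PREF term backup then accept
    set protocols bgp group upstream export ADVERTISE-PREF

With a policy of this shape, the node that holds the active signal route advertises the lower (better) metric, so upstream routers steer traffic to it; when the signal route disappears on failover, the same policy falls through to the backup metric.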

For Default Gateway Deployments

If both nodes boot up at the same time, the Multinode High Availability system uses the configured priority value of an SRG to determine activeness. Activeness enforcement takes place when the node with an active SRG1+ takes ownership of the virtual IP (VIP) address and the virtual MAC (VMAC) address. This action triggers Gratuitous ARP (GARP) messages toward the switches on both sides, which updates the MAC tables on the switches.

For Hybrid Deployments

Activeness enforcement takes place on the Layer 3 side when the configured signal route enforces activeness with the corresponding route advertisements. On the Layer 2 side, the SRX Series Firewall triggers a Gratuitous ARP (GARP) toward the switch layer and takes ownership of the VIP and VMAC addresses.

When the failover happens and the old backup node transitions to the active role, the route preference is swapped to drive all the traffic to the new active node.

Activeness Priority and Preemption

Configure the preemption priority (1 through 254) for an SRG1+. You must configure the preemption value on both nodes. The preempt option ensures that traffic always falls back to the specified node when that node recovers from a failover.

You can configure activeness priority and preemption for an SRG1+ as in the following sample:
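The sample below is a minimal sketch, assuming SRG 1 and a placeholder priority value of 100; verify the exact statement names against your Junos OS release before use:

    set chassis high-availability services-redundancy-group 1 activeness-priority 100
    set chassis high-availability services-redundancy-group 1 preemption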

See Configuring Multinode High Availability In a Layer 3 Network for the complete configuration example.

As long as the nodes can communicate with each other through the ICL, the activeness priority is honored.

Configuring Activeness Probe Settings

Starting in Junos OS Release 22.4R1, in default gateway (switching) and hybrid deployments of Multinode High Availability, you can optionally configure activeness probe parameters using the following statements:
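A sketch of the two statements, assuming they sit under the activeness-probe hierarchy of the SRG (the SRG number and values shown are placeholders; confirm the hierarchy against your release):

    set chassis high-availability services-redundancy-group 1 activeness-probe minimal-interval 1000
    set chassis high-availability services-redundancy-group 1 activeness-probe multiplier 2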

The probe interval sets the time period between the probes sent to the destination IP addresses. For example, you can set the probe interval to 1000 milliseconds.

The multiplier value determines how long the backup node waits for responses to its activeness probes from the peer node before it transitions to the active state.

The default value is 2, the minimum is 2, and the maximum is 15.

Example: If you configure the multiplier value as 2 with a 1000-millisecond probe interval, the backup node transitions to the active state if it does not receive a response to its activeness probe requests from the peer node within two seconds.

You can configure multiplier and minimal-interval in switching and hybrid deployments.

In hybrid mode deployments, if you've configured the probe destination IP address for activeness determination (by using the activeness-probe dest-ip statement), do not configure the multiplier and minimal-interval values. Configure these parameters only when you are using VIP-based activeness probing.
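For reference, a destination-based activeness probe of the kind referred to above takes this general form, where the destination address is a placeholder for an upstream router IP address:

    set chassis high-availability services-redundancy-group 1 activeness-probe dest-ip 10.30.1.2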

Resiliency and Failover

The Multinode High Availability solution supports redundancy at the service level. Service-level redundancy minimizes the effort needed to synchronize the control plane across the nodes.

After the Multinode High Availability setup determines activeness, it negotiates subsequent high availability (HA) states through the ICL. The backup node sends ICMP probes using the floating IP address. If the ICL is up, the node gets a response to its probe and remains the backup node. If the ICL is down and there is no probe response, the backup node transitions to the active role.

The SRG1 of the previous backup node now transitions to the active state and continues to operate seamlessly. When the transition happens, the floating IP address is assigned to the active SRG1. In this way, the IP address floats between the active and backup nodes and remains reachable to all the connected hosts. Thus, traffic continues to flow without any disruption.

Services, such as IPsec VPN, that require both control plane and data plane states are synchronized across the nodes. Whenever an active node fails for this service function, both control plane and data plane fail over to the backup node at the same time.

The nodes use the following messages to synchronize data:

  • Routing Engine to Routing Engine control application messages
  • Routing Engine configuration-related messages
  • Data plane RTO messages

Interchassis Link (ICL) Encryption

In Multinode High Availability, the active and backup nodes communicate with each other using an interchassis link (ICL) connected over a routed network or connected directly. The ICL is a logical IP link and it is established using IP addresses that are routable in the network.

Nodes use the ICL to synchronize control plane and data plane states between them. ICL communication could go over a shared or untrusted network and packets sent over the ICL may traverse a path that is not always trusted. Therefore, you must secure the packets traversing the ICL by encrypting the traffic using IPsec standards.

IPsec protects traffic by establishing an encryption tunnel for the ICL. When you apply HA link encryption, the HA traffic flows between the nodes only through the secure, encrypted tunnel. Without HA link encryption, communication between the nodes may not be secure.

To encrypt the HA link for the ICL:

  • Install the Junos IKE package on your SRX Series Firewall by using the following command:

    request system software add optional://junos-ike.tgz

  • Configure a VPN profile for the HA traffic and apply the profile on both nodes. The IPsec tunnel negotiated between the SRX Series Firewalls uses the IKEv2 protocol.
  • Ensure that you include the ha-link-encryption statement in your IPsec VPN configuration. Example: user@host# set security ipsec vpn vpn-name ha-link-encryption. A configuration sketch follows this list.
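The following is a minimal sketch of such a VPN profile, assuming IKEv2 with a pre-shared key. All names (HA-IKE-PROP, HA-IKE-POL, HA-GW, HA-IPSEC-PROP, HA-IPSEC-POL, HA-VPN), addresses, and algorithm choices are placeholders to adapt to your security policy and platform recommendations:

    set security ike proposal HA-IKE-PROP authentication-method pre-shared-keys
    set security ike proposal HA-IKE-PROP dh-group group14
    set security ike proposal HA-IKE-PROP authentication-algorithm sha-256
    set security ike proposal HA-IKE-PROP encryption-algorithm aes-256-cbc
    set security ike policy HA-IKE-POL proposals HA-IKE-PROP
    set security ike policy HA-IKE-POL pre-shared-key ascii-text your-shared-secret
    set security ike gateway HA-GW ike-policy HA-IKE-POL
    set security ike gateway HA-GW address 10.0.0.2
    set security ike gateway HA-GW local-address 10.0.0.1
    set security ike gateway HA-GW external-interface lo0.0
    set security ike gateway HA-GW version v2-only
    set security ipsec proposal HA-IPSEC-PROP protocol esp
    set security ipsec proposal HA-IPSEC-PROP encryption-algorithm aes-256-gcm
    set security ipsec policy HA-IPSEC-POL proposals HA-IPSEC-PROP
    set security ipsec vpn HA-VPN ike gateway HA-GW
    set security ipsec vpn HA-VPN ike ipsec-policy HA-IPSEC-POL
    set security ipsec vpn HA-VPN ha-link-encryption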

We recommend the following when setting up an ICL:

  • Use ports and a network that are less likely to be saturated.
  • Do not use the dedicated HA ports (control and fabric ports, if available on your SRX Series Firewall).
  • Bind the ICL to the loopback interface (lo0) or an aggregated Ethernet interface (ae0), and use more than one physical link (LAG/LACP) to ensure path diversity for the highest resiliency.
  • You can use a revenue Ethernet port on the SRX Series Firewalls to set up the ICL connection. Ensure that you separate the transit traffic on revenue interfaces from the high availability (HA) traffic.

See Configuring Multinode High Availability for more details.

PKI-Based Link Encryption for ICL

Starting in Junos OS Release 22.3R1, we support PKI-based link encryption for the interchassis link (ICL) in Multinode High Availability. As part of this support, you can generate and store node-specific PKI objects, such as local key pairs, local certificates, and certificate signing requests, on both nodes. The objects are specific to the local node and are stored in specific locations on both nodes.

Node-local objects enable you to distinguish between the PKI objects used for ICL encryption and the PKI objects used for IPsec VPN tunnels created between two endpoints.

You can use the following commands, run on the local node, to work with node-specific PKI objects.
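As an illustration only, commands of the following general shape operate on node-specific objects. The node-local keyword placement and the certificate-id value are assumptions based on this feature's description, so verify the exact operational commands in the PKI documentation for your release:

    request security pki node-local generate-key-pair certificate-id icl-cert
    request security pki node-local generate-certificate-request certificate-id icl-cert
    show security pki node-local local-certificate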

On your security device in Multinode High Availability, if you've configured the automatic re-enrollment option and the ICL goes down at the time of the re-enrollment trigger, both devices start enrolling the same certificate separately with the CA server and download the same CRL file. Once Multinode High Availability re-establishes the ICL, the setup uses only one local certificate. You must synchronize the certificates from the active node to the backup node by running the user@host> request security pki sync-from-peer command on the backup node.

If you don't synchronize the certificates, the certificate mismatch between the peer nodes persists until the next re-enrollment.

Optionally, you can enable the Trusted Platform Module (TPM) on both nodes before generating any key pairs on the nodes. See Using Trusted Platform Module to Bind Secrets on SRX Series Devices.

Split-Brain Detection and Prevention

Split-brain detection, or activeness conflict, happens when the ICL between two Multinode High Availability nodes is down and the nodes can no longer reach each other to learn the status of the peer node.

Consider a scenario where two SRX Series Firewalls are part of a Multinode High Availability setup. Let's consider SRX-1 the local node and SRX-2 the remote node. The local node currently has the active role, and the upstream router has the higher-priority path to the local node.

Case 1: Active Node Is Up

  • The upstream router that hosts the probe destination IP address receives the ICMP probes from both nodes.

  • The upstream router replies only to the active node because its configuration has the higher preference path for the active node.

  • The active node retains the active role.

Case 2: Active Node Is Down

When the ICL between the nodes goes down, both nodes initiate an activeness determination probe (ICMP probe). The nodes use the floating IP address (activeness determination IP address) as the source IP address and the IP addresses of the upstream routers as the destination IP addresses for the probes.

  • The remote node restarts the activeness determination probes.
  • The router hosting the probe destination IP address has lost its higher preference path (through the former active node) and replies to the remote node.
  • The probe result is a success for the remote node and the remote node transitions to the active state.

As these cases demonstrate, activeness determination probes and the higher path preference configured on the upstream router ensure that one node always stays in the active role, preventing a split brain from taking place.

You must also ensure that ICMP packets are allowed and can reach all the way to the router hosting the probe destination IP address.

See Configuring Multinode High Availability In a Layer 3 Network for details.

Default Gateway Deployments

In a default gateway deployment, when the ICL connection is down, the SRX Series Firewalls cannot communicate with each other. In such a case, both devices could claim the active role. To prevent this, Multinode High Availability sends ICMP-based ping probes to the virtual IP address from the backup node. The following two scenarios are possible:

  • ICL Down and Active Node Up

    The active node owns the virtual IP address; when the backup node pings the virtual IP address using ICMP probes, the probe succeeds. The backup node remains in the backup state.

  • ICL and Active Node Down

    The backup node pings the virtual IP address using ICMP probes. Because the active node is down, it does not host the VIP and does not respond to the IP-based probes. After a specified number of failures, the backup node transitions to the active state.

    See Configuring Multinode High Availability In a Default Gateway Deployment for details.

Hybrid Deployments

You can use either the Layer 3 side or the Layer 2 side for split-brain prevention, but only one method at a time. If you use the Layer 2 side, Multinode High Availability uses the VIP probing method described in Default Gateway Deployments. If you use the Layer 3 side, Multinode High Availability uses the activeness determination probe (ICMP probe) method described for route mode.

See Configuring Multinode High Availability In a Default Gateway Deployment or Configuring Multinode High Availability In a Layer 3 Network for details.

In spite of the split-brain prevention mechanism, the nodes can theoretically still get into an active-active state. This happens when the ICL is down and, at the same time, other network issues at the probe router cause it to reply to probe requests from both nodes. In this case, once the situation improves and the ICL is up, one of the nodes takes the active role based on your activeness-priority configuration. If you have not configured activeness priority, the node with the lower local ID takes the backup role.

Multinode High Availability Monitoring

High availability failure detection monitors the system, software, and hardware for internal failures. The system can also monitor network or link connectivity by using interface monitoring, BFD path monitoring, and IP monitoring to detect the reachability of targets farther away.

Table 4 provides details on different monitoring types used in Multinode High Availability.

Table 4: Multinode High Availability Monitoring Types

  • BFD monitoring
    • What it does: Monitors reachability to the next hop by examining the link layer along with the actual link.
    • Detection type: Path failures and link failures.
    • Scope: Detects failures within its routing connectivity. Not intended to detect failures beyond direct connections/next hops.
  • IP monitoring
    • What it does: Monitors connectivity to hosts or services located beyond directly connected interfaces or next hops.
    • Detection type: Path failures and link failures.
    • Scope: Detects failures occurring at more distant hosts or services. Not intended to detect failures in directly connected links or next hops.
  • Interface monitoring
    • What it does: Examines whether the link layer is operational.
    • Detection type: Link failures.
    • Scope: Detects failures in directly connected links or next hops. Not intended for monitoring paths or connectivity to hosts or services located farther away.

In Multinode High Availability, when monitoring detects a connectivity failure to a host or service, it marks the affected path as down/unavailable and marks the corresponding services redundancy groups (SRGs) on the impacted node as ineligible. The affected SRGs transition in a stateful manner to the other node without causing any disruption to traffic.

To prevent any traffic from being lost, Multinode High Availability takes the following precautions:

  • Layer 3 mode—Routes are redrawn so that traffic is redirected correctly.
  • Default gateway or hybrid mode—The new active node for the SRG sends a Gratuitous ARP (GARP) to the connected switch to ensure the rerouting of traffic.

Multinode High Availability Failure Scenarios

The following sections describe possible failure scenarios: how a failure is detected, what recovery action to take, and, if applicable, the impact of the failure on the system.

Node Failure

Hardware Failure

  • Cause—A failed hardware component or an environmental issue such as a power failure.
  • Detection—In Multinode High Availability:
    • The affected device/node is not accessible.
    • The SRG1 status changes to INELIGIBLE on the node with the hardware failure.
  • Impact—Traffic fails over to the other node (if healthy), as shown in Figure 14.
    Figure 14: Hardware Failure in Multinode High Availability
  • Recovery—Recovery takes place when you clear the chassis hardware failure (for example, replace or repair the failed hardware component).
  • Results—Check status using the following commands:
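    For example (other status commands may be available depending on your release):

    show chassis high-availability information detail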

System/Software Failure

  • Cause—A failure in a software process or service, or issues with the operating system.
  • Detection—In Multinode High Availability:
    • The affected device/node is not accessible.
    • The system state changes to INELIGIBLE on the node with the system/software failure.
  • Impact—Traffic fails over to the other node (if healthy), as shown in Figure 15.
    Figure 15: Software Failure in Multinode High Availability
  • Recovery—The system automatically and gracefully recovers from the outage once the issue is addressed. The backup node that has taken the active role continues to remain active. The formerly active node remains the backup node.
  • Results—Check status using the show chassis high-availability information detail command.

Network/Connectivity Failure

Physical Interfaces (Link) Failure

  • Cause—A failure in interfaces could be due to network equipment outages, disruption of the physical cable, or inconsistent configurations.
  • Detection—In Multinode High Availability:
    • The affected device/node is not accessible.
    • The SRG1 status changes to INELIGIBLE on the node with the network or connectivity failure (if interface monitoring is configured). Path connectivity failures can also be detected with BFD or IP monitoring, which trigger an event based on the configured action.
  • Impact—A change in the link state of the interfaces triggers a failover. The backup node takes up the active role, and services that were running on the failed node are migrated to the other node, as shown in Figure 16.
    Figure 16: Interface Failure
  • Configuration—To configure BFD monitoring and interface monitoring, use the following configuration statement:
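    The statement form below is an assumption built from the option names this document uses (interface monitoring under the SRG hierarchy); verify the exact hierarchy, including the BFD monitoring statements, in the configuration guide for your release:

    set chassis high-availability services-redundancy-group 1 monitoring interface-monitor ge-0/0/1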

    All links critical to traffic flow should be monitored.

    See Configuring Multinode High Availability In a Layer 3 Network or Configuring Multinode High Availability In a Default Gateway Deployment for complete configuration details.

  • Recovery—Recovery occurs when you repair or replace the failed interface. After the network/connectivity failure recovers, the SRG1 moves from the INELIGIBLE state to the BACKUP state. The newly active node continues to advertise better metrics to its upstream router and processes traffic.
  • Results—Check status using the following commands:
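    For example (as above, release-dependent status commands may also apply):

    show chassis high-availability information detail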
  • For information on configuring interfaces, see Configuring Multinode High Availability In a Layer 3 Network, Configuring Multinode High Availability In a Hybrid Deployment, Configuring Multinode High Availability In a Default Gateway Deployment, Troubleshooting Interfaces.

Interchassis Link (ICL) Failure

  • Cause—A failure in the ICL could be due to network outages or inconsistent configurations.
  • Detection—In Multinode High Availability, the nodes cannot reach each other, and they initiate an activeness determination probe (ICMP probe).
  • Impact—In a Multinode High Availability system, the ICL connects the active and backup nodes. If the ICL goes down, both devices notice the change and start the activeness probe (ICMP probe). The activeness probe determines which node can take the active role for each SRG1+. Based on the probe result, one of the nodes transitions to the active state.

    As shown in Figure 17, the ICL between SRX-1 and SRX-2 goes down. The devices cannot reach each other and start sending activeness probes to the upstream router. Because SRX-1 is on the higher preferred path in the router configuration, it takes up the active role, continues to process traffic, and advertises the higher preference path. The other node takes up the backup role.

    Figure 17: ICL Failure in Multinode High Availability
  • Configuration—To configure the activeness probing, use the following configuration statement:
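    For example (the destination address is a placeholder for an upstream router IP address; see Configuring Activeness Probe Settings above):

    set chassis high-availability services-redundancy-group 1 activeness-probe dest-ip 10.30.1.2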

    See Configuring Multinode High Availability In a Layer 3 Network for complete configuration details.

  • Results—Check status using the following commands:
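    For example:

    show chassis high-availability information detail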
  • Recovery—Once one of the nodes assumes the active role, Multinode High Availability restarts the cold synchronization process and resynchronizes control plane services (IPsec VPN). SRG state information is re-exchanged between the nodes.

Node Remains in Isolated State

  • Cause—In a Multinode High Availability setup, a node remains in the isolated state after a reboot, and its associated interfaces remain down, when:
    • The interchassis link (ICL) has no connectivity to the other node after boot-up, until cold synchronization completes,

      and

    • The shutdown-on-failure option is configured on SRG0.

      Note:

      This situation can also occur if the other device is out of service.

  • Detection—The SRG0 status is displayed as ISOLATED in command output.
  • Recovery—The node recovers automatically when the other node comes online and the ICL can exchange system information, or when you remove the shutdown-on-failure statement and commit the configuration.

    Use the delete chassis high-availability services-redundancy-group 0 shutdown-on-failure command to remove the statement.

    If this solution is not suitable for your environment, you can use the install-on-failure-route option instead. With this option, the Multinode High Availability setup uses a defined signal route for more graceful handling of this situation through routing policy options, similar to the active-signal-route and backup-signal-route approach available for SRG1+.
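    As an assumed sketch only (the exact install-on-failure statement form is not confirmed here, and the route value is a placeholder; verify against your release), the option might be configured as:

    set chassis high-availability services-redundancy-group 0 install-on-failure route 10.255.255.254/32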