Multinode High Availability

SUMMARY Learn about the Multinode High Availability solution and how you can use it in simple and reliable deployment models. Currently, we support two nodes in any Multinode High Availability deployment.

Overview

Business continuity is an important requirement of the modern network. Downtime of even a few seconds can cause disruption and inconvenience, in addition to increasing operating and capital expenses (OpEx and CapEx). Modern networks also have data centers spread across multiple geographical areas. In such scenarios, achieving high availability can be very challenging.

Juniper Networks® SRX Series Firewalls support a new solution, Multinode High Availability, to address high availability requirements for modern data centers. In this solution, both the control plane and the data plane of the participating devices (nodes) are active at the same time. Thus, the solution provides interchassis resiliency.

The participating devices could be co-located or physically separated across geographical areas or other locations such as different rooms or buildings. Having nodes with high availability across geographical locations ensures resilient service. If a disaster affects one physical location, Multinode High Availability can fail over to a node in another physical location, thereby ensuring continuity.

Benefits of Multinode High Availability

  • Reduced CapEx and OpEx—Eliminates the need for a switched network surrounding the firewall complex and for direct Layer 2 (L2) connectivity between nodes.

  • Network flexibility—Provides greater network flexibility by supporting high availability across Layer 3 (L3) and switched network segments.

  • Stateful resilient solution—Supports active control plane and data plane at the same time on both nodes.

  • Business continuity and disaster recovery—Maximizes availability, increasing redundancy within and across data centers and geographies.

  • Smooth upgrades—Supports different versions of Junos OS on two nodes to ensure smooth upgrades between the Junos OS releases.

We support two nodes in a Multinode High Availability solution.

Active/Backup Multinode High Availability

We support active/backup Multinode High Availability on:

  • SRX5800, SRX5600, SRX5400 with SPC3, IOC3, SCB3, SCB4, and RE3 (in Junos OS Release 20.4R1)

  • SRX4600, SRX4200, SRX4100, and SRX1500 (in Junos OS Release 22.3R1)

  • vSRX3.0 virtual firewalls (in Junos OS Release 22.3R1) for the following private and public cloud platforms:

    • KVM (kernel-based virtual machine)
    • VMWare ESXi
    • Amazon Web Services (AWS)

Active/Active Multinode High Availability

Starting in Junos OS Release 22.4R1, you can operate Multinode High Availability in active-active mode with support for multiple services redundancy groups (SRGs).

Multi SRG support is available on SRX5400, SRX5600, and SRX5800 with SPC3, IOC3, SCB3, SCB4, and RE3.

Supported Features

SRX Series devices with Multinode High Availability support firewall and advanced security services such as application security, unified threat management (UTM), intrusion prevention system (IPS), firewall user authentication, NAT, and ALG.

For the complete list of features supported with Multinode High Availability, see Feature Explorer.

Multinode High Availability does not support transparent mode high availability (HA).

Deployment Scenarios

Note:

We support a two-node configuration for the Multinode High Availability solution.

Multinode High Availability supports two SRX Series devices presenting themselves as independent nodes to the rest of the network. The nodes are connected to adjacent infrastructure belonging to different networks. These nodes can either be collocated or separated across geographies. Participating nodes back up each other to ensure a fast synchronized failover in case of system or hardware failure.

We support the following types of network deployment models for Multinode High Availability:

  • Layer 3 network mode (fully routed environments with routers connected at both ends)
    Figure 1: Layer 3 Mode
  • Default gateway mode (Layer 2 switches connected at both ends). This is a common deployment for DMZ networks, where the firewall acts as the default gateway for the hosts and applications on the same segment.
    Figure 2: Default Gateway Mode
  • Hybrid network mode (a mix of routed networks on one side and locally connected networks on the other side)
    Figure 3: Hybrid Mode
  • Public cloud deployment (for example, AWS)
    Figure 4: Public Cloud Deployment

How Is Multinode High Availability Different from Chassis Cluster?

A chassis cluster operates in a Layer 2 network environment and requires two links: the control link and the fabric link. These links connect the two nodes over dedicated VLANs using back-to-back cabling or over dark (unlit) fiber connections. Control links and fabric links use dedicated physical ports on the SRX Series device.

Multinode High Availability uses an encrypted logical interchassis link (ICL). The ICL connects the nodes over a routed network instead of a dedicated Layer 2 network. You can use revenue ports on the SRX Series devices to set up the ICL connection, and you can configure a routing instance for the ICL path to ensure maximum segmentation.

Figure 5 and Figure 6 show two architectures.

Figure 5: Chassis Cluster Topology in a Layer 2 Network
Figure 6: Multinode High Availability in a Layer 3 Network

Table 1 lists the differences between the two architectures.

Table 1: Comparing Chassis Cluster and Multinode High Availability

Network topology
  Chassis cluster: Nodes connect to a broadcast domain and move the IP address during failover by sending Gratuitous Address Resolution Protocol (GARP) messages to the switch.
  Multinode High Availability: Nodes connect to a router, a broadcast domain, or a combination of both.

Network environment
  Chassis cluster: Layer 2.
  Multinode High Availability: Layer 3, Layer 2, a combination of Layer 3 and Layer 2 (hybrid mode), and public cloud (AWS) deployments.

Traffic switchover approach
  Chassis cluster: Switchover using Layer 2 GARP from an SRX Series device to a peer Layer 2 switch.
  Multinode High Availability: Switchover using IP path selection by a peer Layer 3 router, or Layer 2 GARP from an SRX Series device to a peer Layer 2 switch.

Public cloud
  Chassis cluster: Not supported.
  Multinode High Availability: Supported.

Dynamic routing function
  Chassis cluster: Active only on the SRX Series device in the primary RG0 state.
  Multinode High Availability: Active on each SRX Series device.

Deployment
  Chassis cluster: Requires a dedicated Layer 2 stretch between nodes to offer geo-redundancy.
  Multinode High Availability: Offers geo-redundancy without any switched/broadcast domain.

Connection between SRX Series devices
  Chassis cluster: Control link and fabric link (cables).
  Multinode High Availability: ICL (Layer 3-capable, IP-based link).

IP monitoring to detect network failure
  Chassis cluster: Supports IPv4 traffic.
  Multinode High Availability: Supports both IPv4 and IPv6 traffic.

Multinode High Availability Glossary

Let's begin by getting familiar with Multinode High Availability terms used in this documentation.

Table 2: Multinode High Availability Glossary

active/backup state: State in which SRG1 is in the active state on one node and in the backup state on the other node.

active/active state: State in which multiple SRG1s (SRG1+) are configured and each node has at least one SRG1 in the active state.

active node: The node that is currently active in the high availability deployment. The active node accepts connections and manages the sessions.

backup node: The node that takes over as the new active node if the active node goes down for any reason.

device priority: Priority value that determines whether a node can act as the active node in a Multinode High Availability setup. The node with the lower numerical value has the higher priority and therefore acts as the active node, while the other node acts as the backup node.

device preemption: Preemptive behavior that allows the device with the higher priority (lower numerical value) to resume the active role after it recovers from a failure. If you need a specific device in Multinode High Availability to act as the active node, you must enable preemptive behavior on both devices and assign a device priority value to each device.

failover: Event in which the backup node in a high availability system takes over the task of the active node when the active node fails. A failure can occur, for example, when a physical link goes down or an ICMP probe fails.

floating IP address (activeness probing IP address): An IP address that moves from the active node to the backup node during failover in a Multinode High Availability setup. This mechanism enables clients to communicate with the nodes using a single IP address. You configure the floating IP address on the interface that connects to the participating networks or segments.

high availability/resiliency: Ability of a system to eliminate single points of failure to ensure continuous operation over an extended period of time.

interchassis link (ICL): IP-based logical link that connects the nodes over a routed network in a Multinode High Availability deployment. The security device uses the ICL to synchronize and maintain state information and to handle device failover scenarios. You can use an ICL to connect the nodes directly. Alternatively, you can use a switch or a set of routers to connect the nodes (for geo-redundant deployments). Because this is an IP-based link, the local and peer IP addresses must be routable in the network.

link encryption: Encryption that provides data privacy for messages traversing the network. In Multinode High Availability, packets sent over the ICL may traverse a path on a public IP network, so the ICL is secured using IPsec VPN.

monitoring (BFD): Monitoring of one or more links using Bidirectional Forwarding Detection (BFD). BFD monitoring triggers a routing path change or a system failover, depending on the system configuration.

monitoring (IP): Monitoring of a reliable IP address and system state in case of loss of communication with the peer node.

monitoring (path): Method that uses ICMP to verify the reachability of an IP address. The default interval for ICMP ping probes is 1 second.

monitoring (system): Monitoring of key hardware and software resources and infrastructure; triggers failover when a failure is detected on a node.

probing: Mechanism used to exchange messages between the active and backup nodes in the high availability setup. The messages determine the status and health of the application on each node.

real-time object (RTO): Special payload packet that contains the information necessary to synchronize data from one node to the other node.

split-brain detection (also known as control plane detection or activeness conflict detection): Event in which the ICL between the two Multinode High Availability nodes is down, and both nodes initiate an activeness determination probe (split-brain probe). Based on the response to the probe, a subsequent failover to a new role is triggered.

services redundancy group (SRG): Failover unit that includes and manages a collection of objects on the participating nodes. The SRG on one node switches over to the other node when a failure is detected.

SRG0: Manages all control plane stateless services such as firewall, NAT, and ALG. SRG0 is active on all participating nodes and handles symmetric security flows.

SRG1: Manages control plane stateful service (IPsec VPN).

synchronization: Process in which control plane and data plane states are synchronized across the nodes.

virtual IP (VIP) address: (For hybrid and default gateway deployments.) Virtual IP address used for activeness determination and enforcement on the switching side of a Multinode High Availability setup.

virtual MAC (VMAC) address: (For hybrid and default gateway deployments.) Virtual MAC address dynamically assigned to the interface on the active node that faces the switching side.

Now that we are familiar with Multinode High Availability features and terminology, let's proceed to understand how Multinode High Availability works.

How Multinode High Availability Works

Note:

As of Junos OS Release 22.3R1, we support a two-node configuration for the Multinode High Availability solution.

In a Multinode High Availability setup, you connect two SRX Series devices to adjacent upstream and downstream routers (for Layer 3 deployments), routers and switches (hybrid deployment), or switches (default gateway deployment) using Gigabit Ethernet ports.

The nodes communicate with each other using an interchassis link (ICL) connected over a routed network or connected directly. You can establish the ICL by using:

  • Indirect connection through a Layer 3 network: Establish a logical IP link connecting both nodes using a loopback (lo0) interface or an aggregated Ethernet interface (ae0) (for nodes located in different locations).

  • Direct connection: Connect two ports on each node directly using a crossover cable (for co-located nodes).
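As an illustrative sketch, the ICL peer relationship is defined under the [edit chassis high-availability] hierarchy. The node IDs, IP addresses, interface name, and profile name below are placeholders, and the exact statement syntax can vary by Junos OS release:

```
# Placeholder values for illustration only; adjust to your network.
set chassis high-availability local-id 1
set chassis high-availability local-id 1 local-ip 10.0.0.1
set chassis high-availability peer-id 2 peer-ip 10.0.0.2
set chassis high-availability peer-id 2 interface lo0.0
set chassis high-availability peer-id 2 vpn-profile ICL-VPN
```

Here lo0.0 carries the ICL over the routed network; for directly connected, co-located nodes, a revenue port in place of lo0.0 would serve the same purpose.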

Multinode High Availability operates in active/active mode for the data plane and active/backup mode for control plane services. The active SRX Series device hosts the floating IP address, which steers traffic toward it.

During a failover, the floating IP address moves from the old active node to the new active node. This mechanism enables clients to communicate with the nodes using a single IP address. You configure the floating IP address on the interface that connects to participating networks or segments.

Figure 7, Figure 8, and Figure 9 show deployments in Layer 3, hybrid, and default gateway modes.

Figure 7: Layer 3 Deployment

In this topology, two SRX Series devices are part of a Multinode High Availability setup. The setup has Layer 3 connectivity between SRX Series devices and neighboring routers. The devices are running on separate physical Layer 3 networks and are operating as two independent nodes. The nodes shown in the illustration are co-located in the topology. The nodes can also be geographically separated.

Figure 8: Default Gateway Deployment

In a typical default gateway deployment, hosts and servers in a LAN are configured with a default route next-hop IP address to the security device. So the security device must host a virtual IP (VIP) address that moves between nodes based on the activeness. The configuration on hosts remains static, and security device failover is seamless from the hosts' perspective.

You must create static routes on the SRX Series devices for the routers or hosts beyond the switches in both directions.

Figure 9: Hybrid Deployment

In hybrid mode, an SRX Series device uses a VIP address on the Layer 2 side to draw traffic toward itself. On the Layer 3 side, routers can employ dynamic routing or use a static route for hosts with the VIP address as the next hop. You can also optionally configure a static ARP entry for the VIP using the VMAC to ensure that the IP address does not change during failover.

Let's now understand the components and functionality of Multinode High Availability in detail.

Services Redundancy Groups

A services redundancy group (SRG) is a failover unit in a Multinode High Availability setup. There are two types of SRGs:

  • SRG0 - Manages control plane stateless services. SRG0 remains in the active state on both active node and backup node.
  • SRG1 - Manages control plane stateful service (IPsec VPN). SRG1 remains active on one node and standby on another node.

Figure 10 shows SRG0 and SRG1 in a Multinode High Availability setup.

Figure 10: Single SRG Support in Multinode High Availability (Active-Backup Mode)

Figure 11 shows SRG0 and SRG1+ in a Multinode High Availability setup.

Figure 11: Multi SRG Support in Multinode High Availability (Active-Active Mode)

Starting in Junos OS Release 22.4R1, you can configure Multinode High Availability to operate in active-active mode with support for multiple SRG1s (SRG1+). In this mode, some SRGs remain active on one node and some remain active on the other node. A particular SRG always operates in active-backup mode: it is active on one node and backup on the other. As a result, both nodes can have an active SRG1 forwarding stateful services. Each node has a different set of floating IP addresses assigned to its SRG1+.

Note:

Starting in Junos OS Release 22.4R1, you can configure up to 20 SRGs in a Multinode High Availability setup.
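As a hedged sketch of how active-active distribution could look (the statement names and values are illustrative placeholders; verify the hierarchy against your release), each SRG can be given opposite activeness priorities on the two nodes so that each node hosts at least one active SRG:

```
# Illustrative only; a lower numerical value means higher priority.
# On node 1: SRG 1 preferred locally, SRG 2 preferred on the peer.
set chassis high-availability services-redundancy-group 1 activeness-priority 1
set chassis high-availability services-redundancy-group 2 activeness-priority 200

# On node 2: mirror the values so that SRG 2 is active here.
set chassis high-availability services-redundancy-group 1 activeness-priority 200
set chassis high-availability services-redundancy-group 2 activeness-priority 1
```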

Table 3 explains the behavior of SRGs in a Multinode High Availability setup.

Table 3: Services Redundancy Group Details in Multinode High Availability

SRG0
  Operates in: Active/active data plane.
  Affected services: Control plane stateless services, such as firewall, NAT, and ALG (services without control plane state). Stateless services operate in active/active mode and forward traffic on both nodes.
  Synchronization type: Only data plane states are synchronized across the nodes.
  When active node fails: The data plane fails over to the backup node. Activeness is maintained per session. The backup session remains in standby (or warm) state and moves to the active state when it receives the first packet. The new active node notifies the other node about the session move, and the other node changes the session state from active to standby.

SRG1
  Operates in: Active/backup control plane and data plane.
  Affected services: Control plane stateful service (IPsec VPN). On the backup SRG, the real-time objects (RTOs) are synchronized.
  Synchronization type: Both control plane and data plane states are synchronized across the nodes.
  When active node fails: Both the control plane and the data plane fail over to the backup node at the same time.

Activeness Determination and Enforcement

In a Multinode High Availability setup, activeness is determined at the service level, not at the node level. That is, the active/backup state is at the SRG level, and traffic is steered toward the active SRG. SRG0 remains active on both nodes, whereas SRG1 can be in the active or backup state on each node. The node where SRG1 is in the active state is considered the active node.

If you prefer a certain node to take over as the active node on boot, you can do one of the following:

  • Configure the upstream routers to include preferences for the path where the node is located.
  • Configure the activeness priority.
  • Allow the node with the higher node ID to take the active role (if neither of the first two options is configured).

In a Multinode High Availability setup, both the SRX Series devices initially advertise the route for the floating IP address to the upstream routers. There isn’t a specific preference between the two paths advertised by SRX Series devices. However, the router can have its own preferences on one of the paths depending on the configured metrics.

Figure 12 represents the sequence of events for activeness determination and activeness enforcement.

Figure 12: Activeness Determination and Enforcement
  1. On boot, devices enter the hold state and start probing continuously. The devices use the floating IP address (activeness-probing source IP address) as the source IP address and IP addresses of the upstream routers as the destination IP address for the activeness determination probe.
  2. The router hosting the probe destination IP address replies to the SRX Series device that is available on its preferred routing path. In the following example, SRX-1 gets the response from the upstream router.

    Figure 13: Activeness Determination and Enforcement
  3. SRX-1 promotes itself to the active role because it received the probe reply. SRX-1 communicates its role change to the other device.

  4. After the activeness is determined, the active node (SRX-1):

    • Hosts the floating IP address assigned to it.
    • Advertises the high-preference path to adjacent BGP neighbors.
    • Continues to advertise the active (higher) preference path for all remote and local routes to draw the traffic.
    • Notifies the active node status to the other node through the ICL.
  5. The other device (SRX-2) stops probing and takes over the backup role. The backup node advertises the default (lower) preference, ensuring that the upstream routers do not forward any packets to the backup node.

The Multinode High Availability module adds active and backup signal routes for the SRG to the routing table when the node moves to the active role. In case of node failures, the ICL goes down and the current active node releases its active role and removes the active signal route. Now the backup node detects the condition through its probes and transitions to the active role. The route preference is swapped to drive all the traffic towards the new active node.

The switch in the route preference advertisement is part of routing policies configured on SRX Series devices. You must configure the routing policy to include the active signal route with the if-route-exists condition.
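A minimal sketch of such a policy is shown below. The signal-route prefix, condition name, and policy name are hypothetical; the key idea is that the advertised metric changes depending on whether the active signal route exists:

```
# Hypothetical names and prefix. 10.255.255.1/32 stands in for the active
# signal route that Multinode High Availability installs on the active node.
set policy-options condition ACTIVE-ROUTE-EXISTS if-route-exists 10.255.255.1/32 table inet.0
set policy-options policy-statement ADVERTISE-PREF term ACTIVE from condition ACTIVE-ROUTE-EXISTS
set policy-options policy-statement ADVERTISE-PREF term ACTIVE then metric 10
set policy-options policy-statement ADVERTISE-PREF term ACTIVE then accept
set policy-options policy-statement ADVERTISE-PREF term BACKUP then metric 100
set policy-options policy-statement ADVERTISE-PREF term BACKUP then accept
```

Exported through BGP, the ACTIVE term advertises the better metric only on the node where the signal route exists, so the upstream routers steer traffic toward the active node.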

For Default Gateway Deployments

If both the nodes are booting up at the same time, then the Multinode High Availability system uses the configured priority value of an SRG to determine activeness. Activeness enforcement takes place when the node with an active SRG owns the virtual IP (VIP) address and the virtual MAC (VMAC) address. This action triggers Gratuitous ARP (GARP) toward the switches on both sides and results in updating the MAC tables on the switches.

For Hybrid Deployments

Activeness enforcement takes place on Layer 3 (router side) and Layer 2 (switch side). On the Layer 2 side, the SRX Series device enforces activeness by owning the VIP and VMAC addresses while triggering GARP. On the Layer 3 side, you enforce activeness by configuring signal route and triggering corresponding route advertisements.

When the failover happens and the old backup node transitions to the active role, the route preference is swapped to drive all the traffic to the new active node.

Activeness Priority and Preemption

Configure the activeness priority (1 through 254) for SRG1 and enable the preemptive behavior on both nodes. The preempt option ensures that traffic always falls back to the specified node after that node recovers from a failure.

You can configure activeness priority and preemption for an SRG1.
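A minimal sketch, assuming the services-redundancy-group hierarchy under [edit chassis high-availability] (values are placeholders; configure corresponding statements on both nodes):

```
# Illustrative values; the lower numerical priority is preferred.
set chassis high-availability services-redundancy-group 1 activeness-priority 1
set chassis high-availability services-redundancy-group 1 preemption
```

On the peer node, you would assign a higher (less preferred) activeness-priority value, such as 200.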

See Configuring Multinode High Availability In a Layer 3 Network for the complete configuration example.

As long as the nodes can communicate with each other through the ICL, the priority is honored.

Configuring Activeness Probe Settings

Starting in Junos OS Release 22.4R1, in default gateway (switching) and hybrid deployments of Multinode High Availability, you can optionally configure activeness probe parameters.
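A hedged sketch of these probe settings (the statement placement is an assumption based on the parameter names described in this section; the values are examples only):

```
# Example values: probe every 1000 milliseconds, fail over after 2 missed responses.
set chassis high-availability services-redundancy-group 1 activeness-probe minimal-interval 1000
set chassis high-availability services-redundancy-group 1 activeness-probe multiplier 2
```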

The probe interval sets the time period between the probes sent to the destination IP addresses. For example, you can set the probe interval to 1000 milliseconds.

The multiplier value determines how long the backup node waits before transitioning to the active state when it fails to receive responses to the activeness probes from the peer node.

The default and minimum values are 2; the maximum value is 15.

Example: If you configure the multiplier value as 2 with a 1000-millisecond probe interval, the backup node transitions to the active state if it does not receive a response to its activeness probe requests from the peer node for two seconds.

You can configure multiplier and minimal-interval in switching and hybrid deployments.

In hybrid mode deployments, if you've configured the probe destination IP details for activeness determination (by using the activeness-probe dest-ip statement), then do not configure the multiplier and minimal-interval values. Configure these parameters only when you are using VIP-based activeness probing.

Failover and Resiliency

The Multinode High Availability solution supports redundancy at the service level. Service-level redundancy minimizes the effort needed to synchronize the control plane state across the nodes.

After the Multinode High Availability setup determines activeness, it negotiates the subsequent high availability (HA) state through the ICL.​ The backup node sends ICMP probes using the floating IP address. If the ICL is up, the node gets a response to its probe and remains the backup node. If the ICL is down and there is no probe response, the backup node transitions to the active role.

The SRG1 of the previous backup node now transitions to the active state and continues to operate seamlessly. When the transition happens, the floating IP address is assigned to the active SRG1. In this way, the IP address floats between the active and backup nodes and remains reachable to all the connected hosts. Thus, traffic continues to flow without any disruption.

Services, such as IPsec VPN, that require both control plane and data plane states are synchronized across the nodes. Whenever an active node fails for this service function, both control plane and data plane fail over to the backup node at the same time.

The nodes use the following messages to synchronize data:

  • Routing Engine to Routing Engine control application messages
  • Routing Engine configuration-related messages
  • Data plane RTO messages

Interchassis Link (ICL) Encryption

In Multinode High Availability, active and backup nodes are connected by an ICL using IP addresses that are routable in the network.

You can establish the ICL as a logical IP link connecting both nodes through a loopback (lo0) interface or an aggregated Ethernet interface (ae0) when the nodes are located in different locations. If the nodes are in the same location, you can connect two ports on each node directly using a crossover cable.

Nodes use the ICL to synchronize control plane and data plane states between them. This communication goes over the network, including the upstream and downstream routers. Packets sent over the ICL may traverse a path that is not always trusted. Therefore, you must secure the packets traversing the ICL.

In Multinode High Availability, you must also separate the transit traffic on revenue interfaces from the high availability (HA) traffic.

For these reasons, you must encrypt the traffic traversing the ICL using IPsec standards. IPsec protects traffic by establishing an encryption tunnel for the ICL. When you apply HA link encryption, the HA traffic flows between the nodes only through the secure, encrypted tunnel. Without HA link encryption, communication between the nodes may not be secure.

To encrypt the HA link for the ICL:

  • Install the Junos IKE package on your SRX Series device by using the following command:

    request system software add optional://junos-ike.tgz

  • Configure a VPN profile for the HA traffic and apply the profile for both the nodes. The IPsec tunnel negotiated between the SRX Series devices uses the IKEv2 protocol.
  • Ensure you have included the statement ha-link-encryption in your IPsec VPN configuration. Example: user@host# set security ipsec vpn vpn-name ha-link-encryption.
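Putting these steps together, a sketch of an HA link encryption VPN profile might look like the following. The IKE/IPsec policy and gateway names are hypothetical, and only the statements directly tied to this feature are shown; a complete profile also needs proposals, authentication, and tunnel addresses:

```
# Hypothetical object names; configure the equivalent profile on both nodes.
set security ike gateway MNHA-ICL-GW ike-policy MNHA-IKE-POLICY
set security ike gateway MNHA-ICL-GW version v2-only
set security ipsec vpn MNHA-ICL-VPN ike gateway MNHA-ICL-GW
set security ipsec vpn MNHA-ICL-VPN ike ipsec-policy MNHA-IPSEC-POLICY
set security ipsec vpn MNHA-ICL-VPN ha-link-encryption
```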

For the ICL setup, we recommend the following:

  • Use ports and a network that are less likely to be saturated.
  • Do not use the dedicated HA ports (control and fabric ports, if available on your SRX Series device).

See Configuring Multinode High Availability for more details.

PKI-Based Link Encryption for ICL

Starting in Junos OS Release 22.3R1, we support PKI-based link encryption for interchassis link (ICL) in Multinode High Availability. As a part of this support, you can now generate and store node-specific PKI objects such as local keypairs, local certificates, and certificate-signing requests on both nodes. The objects are specific to local nodes and are stored in the specific locations on both nodes.

The node local objects enable you to distinguish between PKI objects that are used for ICL encryption and PKI objects used for IPsec VPN tunnel created between two endpoints.

You can run node-local variants of the PKI commands on each node to work with node-specific PKI objects.
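As an illustration (the certificate ID and CA profile names are placeholders, and enrollment typically takes additional parameters, such as subject and challenge password, that are omitted here), the node-local variants follow the familiar PKI command forms:

```
# Run in operational mode on the local node; "node-local" scopes the object to that node.
request security pki node-local generate-key-pair certificate-id ICL-CERT size 2048
request security pki node-local local-certificate enroll scep certificate-id ICL-CERT ca-profile ICL-CA
show security pki node-local local-certificate
```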

On your security device in Multinode High Availability, if you've configured the automatic re-enrollment option and the ICL goes down at the time of the re-enrollment trigger, both devices start enrolling the same certificate separately with the CA server and download the same CRL file. Once Multinode High Availability re-establishes the ICL, the setup uses only one local certificate. You must synchronize the certificates from the active node to the backup node by using the user@host> request security pki sync-from-peer command on the backup node.

If you don't synchronize the certificates, the certificate mismatch issue between the peer nodes persists until the next re-enrollment.

Optionally, you can enable the Trusted Platform Module (TPM) on both nodes before generating any keypairs on the nodes. See Using Trusted Platform Module to Bind Secrets on SRX Series devices.

Split-Brain Detection and Prevention

Split-brain detection, or activeness conflict, happens when the ICL between the two Multinode High Availability nodes is down and the nodes can no longer reach each other to determine the peer node's status.

Consider a scenario where two SRX Series devices are part of a Multinode High Availability setup. Let's consider SRX-1 as the local node and SRX-2 as the remote node. The local node is currently in the active role, and the upstream router has the higher-priority path toward the local node.

Case 1: Active Node Is Up

  • The upstream router that hosts the probe destination IP address receives the ICMP probes from both nodes.

  • The upstream router replies only to the active node, because its configuration has the higher-preference path for the active node.

  • The active node retains the active role.

Case 2: Active Node Is Down

When the ICL between the nodes goes down, both nodes initiate an activeness determination probe (ICMP probe). The nodes use the floating IP address (activeness determination IP address) as source IP address and IP addresses of the upstream routers as destination IP address for the probes.

  • The remote node restarts the activeness determination probes.
  • The router hosting the probe destination IP address has lost its higher-preference path (toward the former active node) and replies to the remote node.
  • The probe result is a success for the remote node and the remote node transitions to the active state.

As demonstrated in these cases, the activeness determination probes and the configuration of a higher path preference on the upstream router ensure that one node always stays in the active role, preventing a split-brain condition.

You must also ensure that the ICMP packets are allowed all the way to the router hosting the probe destination IP address.

See Configuring Multinode High Availability In a Layer 3 Network for details.

Default Gateway Deployments

In a default gateway deployment, when the ICL connection is down, the SRX Series devices cannot communicate with each other. In such a case, both devices could claim the active role. To prevent this, Multinode High Availability sends ICMP-based pings from the backup node to the virtual IP address. The following two scenarios are possible:

  • ICL Down and Active Node Up

    The active node owns the virtual IP address, so when the backup node pings the virtual IP address using ICMP probes, the probes succeed. The backup node remains in the backup state.

  • ICL and Active Node Down

    The backup node pings the virtual IP address using ICMP probes. Because the active node is down, it does not host the VIP and does not respond to the probes. After a specified number of failures, the backup node transitions to the active state.

    See Configuring Multinode High Availability In a Default Gateway Deployment for details.

Hybrid Deployments

You can use either the Layer 3 side or the Layer 2 side for split-brain prevention. If you configure activeness determination probes (ICMP probes), probing takes place on the Layer 3 side. Otherwise, Multinode High Availability uses the virtual IP probing mechanism, and the virtual IP address with index 1 is used for probing on the Layer 2 side.

See Configuring Multinode High Availability In a Default Gateway Deployment or Configuring Multinode High Availability In a Layer 3 Network for details.

Despite the split-brain prevention mechanism, the nodes can theoretically still get into an active-active state. This happens when the ICL is down and there are other network issues on the probe router at the same time, causing the probe router to reply to probe requests from both nodes. In this case, once the situation improves and the ICL is up, one of the nodes takes up the active role based on your activeness-priority configuration. If the activeness-priority configuration is not available, the node with the lower local ID takes up the backup role.
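As a sketch, the activeness-priority configuration mentioned above is set per services redundancy group on each node; the group number and priority values below are hypothetical:

```
# On node 1 (hypothetical values)
set chassis high-availability services-redundancy-group 1 activeness-priority 1

# On node 2
set chassis high-availability services-redundancy-group 1 activeness-priority 100
```

Configure a different priority on each node; the relative values determine which node takes the active role when the ICL comes back up.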

Multinode High Availability Monitoring

A high availability failure is a loss of connection between the nodes caused by a device failure or a network failure. In a typical deployment of Multinode High Availability, the network includes multiple hops on the northbound and southbound networks. A failure can occur at any of these hops, or there could be hardware or software issues. The Multinode High Availability system remains available even during failures through the following monitoring capabilities:

  • BFD Monitoring

    The Bidirectional Forwarding Detection (BFD) protocol is a simple hello mechanism that detects failures in a network. The devices send hello packets at a specified, regular interval and detect a failure when a routing device stops responding for a specified interval. BFD monitoring ensures reachability to the adjacent router and enables quick failover between the active and backup nodes.

    You can configure Multinode High Availability to monitor one or more links using BFD. This configuration triggers a failover in the event of a BFD failure. You can use this optional feature for additional reliability in Multinode High Availability. You configure BFD liveliness by specifying the source and destination IP addresses and the interface connecting to the peer device.
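A minimal BFD liveliness monitoring configuration might look like the following sketch. The addresses, interface name, and SRG number are hypothetical, and the option keywords shown here are illustrative; check the statement reference for your Junos release for the exact names:

```
# Hypothetical BFD liveliness monitoring: source and destination
# addresses and the interface connecting to the peer device.
set chassis high-availability services-redundancy-group 1 monitor bfd-liveliness source-ip 10.1.1.1 destination-ip 10.1.1.2 interface ge-0/0/1
```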

  • IP Monitoring

    IP monitoring checks the reachability of an IP address or a set of IP addresses using ICMP ping messages. When you use IP monitoring, both the active node and the backup node constantly ping a remote destination IP address at the same time to verify that the path works. The source IP address is the local interface IP address (not a floating IP address).

    If both nodes can successfully ping the target, no failover occurs. But if one node can ping the target and the other node cannot, the SRG on the failing node goes to the Ineligible state, which triggers a failover if that node was previously active.

  • Interface Monitoring

    In a Multinode High Availability setup, if an interface on either side goes down, it can cause an outage. For Layer 3 deployments, BFD monitoring addresses the problem. However, for default gateway and hybrid deployments, you must configure the interface monitoring option for each SRG.

    The node that detects the interface monitoring failure transitions to the ineligible state for the corresponding SRG, and the other node (if healthy) takes over the active role for that SRG. The subsequent gratuitous ARP (GARP) ensures traffic switching and recovery.
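An interface monitoring sketch for a default gateway or hybrid deployment could look like this; the SRG number and interface names are hypothetical, and the exact statement names can vary by Junos release:

```
# Hypothetical interface monitoring for SRG 1: a failure of a monitored
# interface moves the SRG on this node to the ineligible state.
set chassis high-availability services-redundancy-group 1 monitor interface ge-0/0/2
set chassis high-availability services-redundancy-group 1 monitor interface ge-0/0/3
```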

Multinode High Availability Failure Scenarios

The following sections describe possible failure scenarios: how a failure is detected, what recovery action to take, and if applicable, the impact on the system caused by the failure.

Node Failure

Hardware Failure

  • Cause—A failed hardware component or an environmental issue such as a power failure.
  • Detection— In Multinode High Availability
    • Affected device/node not accessible
    • SRG1 status changes to INELIGIBLE

    On SRX5000-line devices, Multinode High Availability automatically detects hardware failure based on chassis hardware failure detection results.

  • Impact—Hardware failure on the active node triggers a failover immediately. The backup node takes over the active role and continues to process traffic. As shown in Figure 14, when a hardware failure is detected in SRX-1, SRX-2 takes up the active role after a failover. Traffic is diverted to SRX-2.
    Figure 14: Hardware Failure in Multinode High Availability
  • Recovery—Recovery takes place when you clear the chassis hardware failure (for example, by replacing or repairing the failed hardware component).
  • Results—Check the status using the show chassis high-availability information detail command.

System/Software Failure

  • Cause—A failure in a software process or service, or issues with the operating system.
  • Detection— In Multinode High Availability
    • Affected device/node not accessible
    • SRG1 status changes to INELIGIBLE on the affected node.
  • Impact—A software or system failure on the active node triggers a failover immediately. The backup node takes over the active role and continues to process traffic. Figure 15 shows a software failure. When a software failure is detected in SRX-1, SRX-1 is isolated and SRX-2 takes up the active role. After the failover, traffic and services are migrated to SRX-2.
    Figure 15: Software Failure in Multinode High Availability
  • Recovery—The system automatically and gracefully recovers from the outage once the issue is addressed. The backup node that has taken the active role continues to remain active. The formerly active node remains the backup node.
  • Results—Check status using the show chassis high-availability information detail command.

Network/Connectivity Failure

Physical Interfaces (Link) Failure

  • Cause—A failure in interfaces could be due to network outages, disruption of the physical cable, or inconsistent configurations.
  • Detection— In Multinode High Availability
    • Affected device/node is not accessible.
    • SRG1 status changes to INELIGIBLE on the affected node.

    In Layer 3 deployments, BFD monitoring detects interface failure. For Layer 2 (default gateway) and hybrid deployments, interface monitoring detects interface failure.

  • Impact—A change in the link state of the interfaces triggers a failover. The backup node takes up the active role, and services that were running on the failed node are migrated to other node.

    As shown in Figure 16, the interface connected to SRX-1 goes down. In this case, SRX-1 changes to the BACKUP state, and SRX-2 takes over the active role and starts processing traffic.

    Figure 16: Interface Failure
  • Configuration—Configure BFD monitoring and interface monitoring at the [edit chassis high-availability services-redundancy-group] hierarchy level.

    All links critical to traffic flow should be monitored.

    See Configuring Multinode High Availability In a Layer 3 Network or Configuring Multinode High Availability In a Default Gateway Deployment for complete configuration details.

  • Recovery—Recovery takes place when you repair or replace the failed interface. After the network/connectivity failure recovers, SRG1 moves from the INELIGIBLE state to the BACKUP state. The new active node continues to advertise better metrics to its upstream router and processes traffic.
  • Results—Check the status using the show chassis high-availability information detail command.
  • For information on configuring interfaces, see Configuring Multinode High Availability In a Layer 3 Network, Configuring Multinode High Availability In a Hybrid Deployment, Configuring Multinode High Availability In a Default Gateway Deployment, Troubleshooting Interfaces.

Interchassis Link (ICL) Failure

  • Cause—A failure in the ICL could be due to network outages or inconsistent configurations.
  • Detection—In Multinode High Availability, the nodes cannot reach each other, and they initiate an activeness determination probe (ICMP probe).
  • Impact—In a Multinode High Availability system, the ICL connects the active and backup nodes. If the ICL goes down, both devices notice the change and start the activeness probe (ICMP probe). The activeness probe determines which node can take the active role for SRG1. Based on the probe result, one of the nodes transitions to the active state.

    As shown in Figure 17, the ICL between SRX-1 and SRX-2 goes down. The devices cannot reach each other and start sending activeness probes to the upstream router. Because SRX-1 is on the higher-preference path in the router configuration, it takes up the active role, continues to process traffic, and advertises the higher-preference path. The other node takes up the backup role.

    Figure 17: ICL Failure in Multinode High Availability
  • Configuration—Configure activeness probing at the [edit chassis high-availability services-redundancy-group] hierarchy level.

    See Configuring Multinode High Availability In a Layer 3 Network for complete configuration details.
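An illustrative activeness-probe configuration might look like the following; the destination address (an upstream router interface reachable from both nodes) and the SRG number are hypothetical:

```
# Hypothetical activeness determination probe: an upstream router's
# address is used as the probe destination.
set chassis high-availability services-redundancy-group 1 activeness-probe dest-ip 10.20.20.1
```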

  • Results—Check the status using the show chassis high-availability information detail command.
  • Recovery—Once one of the nodes assumes the active role, Multinode High Availability restarts the cold synchronization process and resynchronizes control-plane services (IPsec VPN). SRG state information is re-exchanged between the nodes.

Unreachable Upstream/Downstream Routers

  • Cause—A link failure or unreachable upstream routers on the untrust side results in an external path failure. A link failure or unreachable downstream routers on the trust side results in an internal path failure.
  • Detection— In Multinode High Availability
    • The nodes cannot reach each other, and they initiate an activeness determination probe (ICMP probe).
    • SRG1 changes to the INELIGIBLE state on the affected node.

    BFD monitoring can detect these path and link failures.

  • Impact—The backup node transitions to the active state. Routes are re-advertised to swap the preference so that packets start taking the path to the new active node.

    As shown in Figure 18, the upstream router on the SRX-1 side becomes inactive. In this case, SRX-1 changes to the ineligible state, and SRX-2 takes over the active role and starts processing traffic.

    Figure 18: Unreachable Upstream/Downstream Routers
  • Configuration—Configure BFD monitoring at the [edit chassis high-availability services-redundancy-group] hierarchy level.

    See Configuring Multinode High Availability In a Layer 3 Network for complete configuration details.

  • Results—Check the status using the show chassis high-availability information detail command.
  • Recovery—Recovery takes place when you restore the failed link or device. The backup node that has taken the active role continues to remain active. The formerly active node remains the backup node. If the backup node was down and never took over the active role for some reason, the active node resumes the active role.

Node Remains in Isolated State

  • Cause—In a Multinode High Availability setup, a node remains in the isolated state after a reboot, and the associated interfaces continue to remain down, when:
    • The interchassis link (ICL) has no connectivity to the other node after booting up, until cold synchronization completes

      and

    • The shutdown-on-failure option is configured on SRG0

      Note:

      The above condition can also occur if the other device is out of service.

  • Detection—The SRG0 status is displayed as ISOLATED in the command output.
  • Recovery—The node automatically recovers when the other node comes online and the ICL can exchange system information or when you remove the shutdown-on-failure statement and commit the configuration.

    Use the delete chassis high-availability services-redundancy-group 0 shutdown-on-failure command to remove the statement.

    If this solution is not suitable for your environment, you can use the install-on-failure-route option. With this option, the Multinode High Availability setup uses a defined signal route for more graceful handling of this situation through routing policy options, similar to the active-signal-route and backup-signal-route approach available in SRG1+.