    Understanding Chassis Cluster Control Link Failure and Recovery

    If the control link fails, Junos OS disables the secondary node so that both nodes cannot become primary for all redundancy groups, including redundancy group 0.

    A control link failure is defined as the condition in which heartbeats are no longer received over the control link while heartbeats are still received over the fabric link.

    In the event of a legitimate control link failure, redundancy group 0 remains primary on the node on which it is currently primary, inactive redundancy groups x on the primary node become active, and the secondary node enters a disabled state.

    Note: When the secondary node is disabled, you can still log in to the management port and run diagnostics.

    To determine whether a legitimate control link failure has occurred, the system relies on redundant liveness signals sent across the control link and the fabric data link.

    The system periodically transmits probes over the fabric data link and heartbeat signals over the control link. Probes and heartbeat signals share a common sequence number that maps them to a unique time event. The software identifies a legitimate control link failure if the following two conditions exist:

    • The threshold number of heartbeats has been lost.
    • At least one probe with a sequence number corresponding to that of a missing heartbeat signal was received on the data link.
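The two conditions above can be sketched in a few lines of Python. This is a minimal illustration, not the Junos OS implementation: the LinkMonitor class, its method names, and the threshold value of 3 are assumptions made for the example, and lost heartbeats are simply counted within a window of expected sequence numbers.

```python
from dataclasses import dataclass, field


@dataclass
class LinkMonitor:
    """Hypothetical model: records which sequence numbers arrived on each link."""
    heartbeat_threshold: int = 3  # assumed value for this sketch
    heartbeats_seen: set = field(default_factory=set)  # received on the control link
    probes_seen: set = field(default_factory=set)      # received on the fabric data link

    def record_heartbeat(self, seq: int) -> None:
        self.heartbeats_seen.add(seq)

    def record_probe(self, seq: int) -> None:
        self.probes_seen.add(seq)

    def is_control_link_failure(self, expected) -> bool:
        """Apply both conditions to the expected sequence numbers."""
        missing = [s for s in expected if s not in self.heartbeats_seen]
        # Condition 1: the threshold number of heartbeats has been lost.
        if len(missing) < self.heartbeat_threshold:
            return False
        # Condition 2: at least one probe whose sequence number matches a
        # missing heartbeat arrived on the data link, so the peer was alive.
        return any(s in self.probes_seen for s in missing)


monitor = LinkMonitor()
for seq in (1, 2):
    monitor.record_heartbeat(seq)
    monitor.record_probe(seq)
monitor.record_probe(4)  # heartbeats 3, 4, 5 never arrive; probe 4 does
print(monitor.is_control_link_failure(range(1, 6)))
```

In this run heartbeats 3 through 5 are missing and probe 4 was received, so both conditions hold. If no probe matched a missing heartbeat, the sketch would not report a control link failure, because the peer node itself might be down.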

    When a legitimate control link failure occurs, the following conditions apply:

    • Redundancy group 0 remains primary on the node on which it is presently primary (and thus its Routing Engine remains active), and all redundancy groups x on the node become primary.

      If the system cannot determine which Routing Engine is primary, the node with the higher priority value for redundancy group 0 is primary and its Routing Engine is active. (You configure the priority for each node when you configure the redundancy-group statement for redundancy group 0.)

    • The system disables the secondary node.

      To recover a device from the disabled mode, you must reboot the device. When you reboot the disabled node, the node synchronizes its dynamic state with the primary node.

    Note: If you make any changes to the configuration while the secondary node is disabled, execute the commit command to synchronize the configuration after you reboot the node. If you did not make configuration changes, the configuration file remains synchronized with that of the primary node.
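The outcome of a legitimate control link failure, as described in the conditions above, can be modeled as a small function. This is only a sketch: the Node class and resolve_control_link_failure function are hypothetical names, None stands for "primary cannot be determined", and breaking a priority tie in favor of the first argument is an arbitrary choice that the source does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: int
    rg0_priority: int             # priority configured for redundancy group 0
    rg0_primary: Optional[bool]   # None models "cannot be determined"
    disabled: bool = False


def resolve_control_link_failure(a: Node, b: Node) -> Node:
    """Return the node that stays primary and mark the other node disabled."""
    if a.rg0_primary is True and b.rg0_primary is not True:
        primary, secondary = a, b
    elif b.rg0_primary is True and a.rg0_primary is not True:
        primary, secondary = b, a
    else:
        # Primary is ambiguous: the higher redundancy group 0 priority wins.
        # On equal priority this sketch keeps the first argument (assumption).
        primary, secondary = (a, b) if a.rg0_priority >= b.rg0_priority else (b, a)
    primary.rg0_primary = True
    secondary.rg0_primary = False
    secondary.disabled = True     # leaving this state requires a reboot
    return primary


node0 = Node(node_id=0, rg0_priority=100, rg0_primary=None)
node1 = Node(node_id=1, rg0_priority=200, rg0_primary=None)
winner = resolve_control_link_failure(node0, node1)
print(winner.node_id, node0.disabled)
```

Here neither node is known to be primary, so node 1 (priority 200) becomes primary and node 0 is disabled.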

    You cannot enable preemption for redundancy group 0. If you want to change the primary node for redundancy group 0, you must do a manual failover.

    When you use dual control links (supported on the SRX1400 Services Gateways and SRX5000 and SRX3000 lines), note the following conditions:

    • Host inbound or outbound traffic can be impacted for up to 3 seconds during a control link failure. For example, consider a case where redundancy group 0 is primary on node 0 and there is a Telnet session to the Routing Engine through a network interface port on node 1. If the currently active control link fails, the Telnet session can lose packets for up to 3 seconds, until the failure is detected.
    • A control link failure that occurs while the commit process is running across the two nodes might cause the commit to fail. In this situation, wait 3 seconds and then run the commit command again.

    Note: For SRX5000 and SRX3000 lines, dual control links require a second Routing Engine on each node of the chassis cluster.

    You can have the system perform control link recovery automatically by configuring the control-link-recovery statement. In this case, once the system determines that the control link is healthy, it automatically reboots the disabled node. When the disabled node reboots, it rejoins the cluster.

    Published: 2012-06-29