Troubleshoot a Chassis Cluster with One Node in the Primary State and the Other Node in the Lost State
Problem
Description
The nodes of the chassis cluster are in primary and lost states.
Environment
Chassis cluster
Symptoms
One node of the cluster is in the primary state and the other node is in the lost state. Run the show chassis cluster status command on each node to view the status of that node. Here is a sample output:
{primary:node0}
root@primary-srx> show chassis cluster status
Cluster ID: 1
Node      Priority    Status     Preempt    Manual failover
Redundancy group: 0 , Failover count: 1
node0     100         primary    no         no
node1     0           lost       no         no
Redundancy group: 1 , Failover count: 1
node0     100         primary    no         no
node1     0           lost       no         no
Diagnosis
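The status output shown in the Symptoms section can also be checked programmatically when console output is captured to a text file. The following is a minimal Python sketch, not a Juniper tool: it assumes the plain-text layout of the sample output above, and the names `parse_cluster_status` and `lost_nodes` are hypothetical helpers introduced for illustration.

```python
import re

# Sample text copied from the output in the Symptoms section.
SAMPLE_OUTPUT = """\
Cluster ID: 1
Node Priority Status Preempt Manual failover
Redundancy group: 0 , Failover count: 1
node0 100 primary no no
node1 0 lost no no
Redundancy group: 1 , Failover count: 1
node0 100 primary no no
node1 0 lost no no
"""

# Assumed line shapes, based only on the sample output above.
GROUP_RE = re.compile(r"^Redundancy group:\s*(\d+)")
NODE_RE = re.compile(r"^(node\d+)\s+(\d+)\s+(\S+)")


def parse_cluster_status(text):
    """Return {redundancy_group: {node_name: status}} from captured CLI text."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = GROUP_RE.match(line)
        if match:
            current = int(match.group(1))
            groups[current] = {}
            continue
        match = NODE_RE.match(line)
        if match and current is not None:
            groups[current][match.group(1)] = match.group(3)
    return groups


def lost_nodes(groups):
    """Return the sorted names of nodes reported as lost in any group."""
    return sorted({node
                   for nodes in groups.values()
                   for node, status in nodes.items()
                   if status == "lost"})


if __name__ == "__main__":
    status = parse_cluster_status(SAMPLE_OUTPUT)
    print("Lost nodes:", lost_nodes(status))
```

A node that appears in the result is the one to work through in the steps that follow. Real output can contain extra columns or blank lines, so the regular expressions may need adjusting for a given Junos release.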
1. Is the node that is in the lost state powered on?
   - Yes: Are you able to access the node that is in the lost state through a console port? Do not use Telnet or SSH to access the node.
     - If you are able to access the node, proceed to Step 3.
     - If you are unable to access the node and the device is at a remote location, arrange console access to the node for further troubleshooting. If you have console access but do not see any output, it might indicate a hardware issue. Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
   - No: Power on the node and proceed to Step 2.
2. After both nodes are powered on, run the show chassis cluster status command again. Is the node still in the lost state?
   - Yes: Are you able to access the node that is in the lost state through a console port? Do not use Telnet or SSH to access the node.
     - If you are able to access the node, proceed to Step 3.
     - If you are unable to access the node and the node is at a remote location, arrange console access to the node for further troubleshooting. If you have console access but do not see any output, it might indicate a hardware issue. Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
   - No: Powering on the device has resolved the issue.
3. Connect a console to the primary node, and run the show chassis cluster status command. Does the output show this node as primary and the other node as lost?
   - Yes: This might indicate a split-brain scenario, in which each node shows itself as primary and the other node as lost. Run the following commands to verify which node is processing the traffic:
     - show security monitoring
     - show security flow session summary
     - monitor interface traffic
     Isolate the node that is not processing the traffic. You can isolate the node from the network by removing all the cables except the control and fabric links. Proceed to Step 4.
   - No: Proceed to Step 4.
4. Verify that all the FPCs are online on the node that is in the lost state by running the show chassis fpc pic-status command. Are all the FPCs online?
   - Yes: Proceed to Step 5.
   - No: Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
5. Are the nodes connected through a switch?
   - Yes: See Troubleshoot a Fabric Link Failure in a Chassis Cluster and Troubleshoot a Control Link Failure in a Chassis Cluster.
   - No: Proceed to Step 6.
6. Create a backup of the configuration from the node that is currently primary:
   {primary:node0}
   root@primary-srx# show configuration | save /var/tmp/cfg-bkp.txt
   Copy the configuration to the node that is in the lost state, and load the configuration:
   root@lost-srx# load override <terminal or filename>
   Note:
   If you use the terminal option, paste the complete configuration into the window. Make sure that you use Ctrl+D at the end of the configuration.
   If you use the filename option, provide the path to the configuration file (for example: /var/tmp/Primary_saved.conf), and press Enter.
   When you connect to the node in the lost state through a console, you might see the state as either primary or hold/disabled. If the node is in the hold/disabled state, a fabric link failure might have occurred before the device went into the lost state. To troubleshoot this issue, follow the steps in Troubleshoot a Fabric Link Failure in a Chassis Cluster.
   Commit the changes after the configuration is loaded. If the problem persists, replace the existing control and fabric links on this device with new cables and reboot the node:
   {primary:node1}[edit]
   root@lost-srx# request system reboot
   Is the issue resolved?
   - No: Open a case with your technical support representative for further troubleshooting. See Data Collection for Customer Support.
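The split-brain check in Step 3 comes down to comparing what each node reports about itself and about its peer. The following is a minimal Python sketch of that comparison under stated assumptions: the statuses have already been collected from each node's own console for a single redundancy group, and `is_split_brain` along with the dictionaries below are hypothetical names and data introduced for illustration, not part of Junos.

```python
def is_split_brain(view_a, view_b):
    """Return True if each node reports itself primary and its peer lost.

    Each view is a tuple (local_node_name, {node_name: status}) holding the
    statuses seen from that node's own console for one redundancy group.
    """
    (name_a, status_a), (name_b, status_b) = view_a, view_b
    a_claims_primary = (status_a.get(name_a) == "primary"
                        and status_a.get(name_b) == "lost")
    b_claims_primary = (status_b.get(name_b) == "primary"
                        and status_b.get(name_a) == "lost")
    return a_claims_primary and b_claims_primary


if __name__ == "__main__":
    # Hypothetical data: what each node's console shows for one group.
    seen_from_node0 = ("node0", {"node0": "primary", "node1": "lost"})
    seen_from_node1 = ("node1", {"node0": "lost", "node1": "primary"})
    print(is_split_brain(seen_from_node0, seen_from_node1))
```

The sketch only flags the condition. Deciding which node is actually processing traffic still requires the commands listed in Step 3.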