Double-Bit Errors on SRP Modules

SRP modules include error checking and correction (ECC) to protect their SDRAM. ECC detects single-bit and double-bit errors in the SDRAM and corrects single-bit errors.
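
As a rough illustration of what "correct single-bit errors, detect double-bit errors" means, the sketch below uses a textbook extended Hamming code on 4 data bits. It is not the code the SRP hardware uses; the word size, bit layout, and the names encode and decode are assumptions chosen only to keep the example small.

```python
# Toy SECDED (single-error-correct, double-error-detect) code on 4 data bits.
# Assumption: illustrative only; real SDRAM ECC protects much wider words.

# Positions 1, 2 and 4 hold Hamming check bits, position 0 holds overall parity.
DATA_POSITIONS = (3, 5, 6, 7)


def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of 0/1 values)."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) % 2  # makes the parity of the whole word even
    return bits


def decode(bits):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(bits) % 2
    if overall == 1:
        # Odd overall parity: exactly one bit flipped. The syndrome names its
        # position (0 means the overall parity bit itself), so flip it back.
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even overall parity but nonzero syndrome: two bits flipped.
        return "double-bit error", None
    else:
        status = "ok"
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, nibble
```

In this sketch a single flipped bit is repaired silently, while two flipped bits can only be reported, which mirrors why a double-bit error on the SRP module raises an alert instead of being corrected.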

Detecting Double-Bit Errors

The following message appears on the console when ECC detects a double-bit error:

ALERT 05/10/2004 13:10:33 os: failed: ECC DOUBLE BIT ERROR OCCURRED
  Address = 0xe95db10
  Data (Upper 32Bits) = 0xe95db20
  Data (Lower 32Bits) = 0x55d06c
  ECC Data Bits =  0x2b
  ECC 1Bit Error Counter =  0x0
ALERT 05/10/2004 13:10:34 os: PROCESSOR EXCEPTION: 0x200n
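
If console output is captured to a file, the alert shown above can be picked out automatically. The following is a minimal sketch in Python that assumes the message layout matches the single sample in this section; the function name find_double_bit_errors and the returned dictionary layout are hypothetical, not part of any Juniper tool.

```python
import re

# Header line of the alert. The timestamp layout (MM/DD/YYYY HH:MM:SS) is
# inferred from the one sample message and is an assumption.
ALERT_RE = re.compile(
    r"^ALERT (\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) os: failed: "
    r"ECC DOUBLE BIT ERROR OCCURRED\s*$"
)
# Indented detail lines such as "  Address = 0xe95db10".
FIELD_RE = re.compile(r"^\s+(.+?)\s*=\s*(0x[0-9a-fA-F]+)\s*$")


def find_double_bit_errors(log_text):
    """Return one dict per double-bit error alert found in captured console text."""
    events = []
    current = None
    for line in log_text.splitlines():
        header = ALERT_RE.match(line)
        if header:
            current = {"timestamp": header.group(1), "fields": {}}
            events.append(current)
            continue
        field = FIELD_RE.match(line)
        if current is not None and field:
            # Store each hexadecimal value as an integer, keyed by its label.
            current["fields"][field.group(1)] = int(field.group(2), 16)
        else:
            # Any other line ends the detail block of the current alert.
            current = None
    return events
```

Each returned entry carries the alert timestamp and the reported values (for example, the Address field), which can be handy to have on record when you troubleshoot the module.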

When ECC detects a double-bit error in a system that contains a redundant SRP module, the redundant module becomes active and the system continues to operate. However, you must still troubleshoot the SRP module with the double-bit error. When ECC detects a double-bit error in a system that does not contain a redundant SRP module, you must troubleshoot the SRP module immediately. See Fixing Double-Bit Errors.

Double-bit errors that ECC detects in the SDRAM of SRP modules are classified at the major severity level. They receive this classification because double-bit errors are also caused by uninitialized memory in the SDRAM. In addition, a double-bit error in the SDRAM causes a physical error in the transmit state machine, because the error is passed on to the egress transmit state machine.

Fixing Double-Bit Errors

To fix a double-bit error:

  1. Remove the second SRP module, if there is one.
  2. Reboot the system with the module reset button on the primary SRP module. (See Figure 5.)

These actions attempt to correct a transient double-bit error. However, if the console displays a memory test failure for the SRP module after you reboot, or if the FAIL LED on the SRP module stays on during rebooting, the SDRAM is permanently damaged and must be replaced. In that case, call the Juniper Networks Technical Assistance Center to arrange for repair.