Troubleshooting HDD Module Failures in a RAID Volume

 

The two QFX3100 hard disk drive (HDD) modules are organized as a RAID 1, or mirrored configuration. If one HDD module fails or needs to be replaced, the remaining HDD module becomes the primary drive in the RAID. The QFX3100 continues to operate with only the primary drive, but the RAID does not operate optimally because there is no redundancy. Normally, the primary drive is only used in standalone mode until you replace the failed drive. You can replace the failed drive while the unit is running (hot-swap) and the RAID automatically resynchronizes the primary drive to the new drive.

System operation is severely degraded while the RAID automatically synchronizes the primary drive to the replacement drive. Expect this synchronization process to take at least 17 hours. If you remove the primary drive before the replacement drive is fully synchronized with it, you must re-create the RAID and reinstall the QFX3100 software image.

Caution

If the primary (remaining) HDD module fails or is removed while the RAID is synchronizing, the RAID contents are lost and cannot be recovered.

The independent failure of two drives is extremely rare. When one or both HDD modules appear to fail, likely causes are:

  • One HDD module is not able to autocorrect, causing a hard failure. A replacement HDD module can be hot-swapped with the failed module.

  • Both HDD modules have been removed from the QFX3100 and moved to another QFX3100. The RAID becomes inactive and is not recognized as storage in the new location. If the RAID is inactive, you can restore it using an internal controller utility to activate the disk.

  • One HDD module has been removed and the new module inserted. However, before the two drives are synchronized, the primary drive is removed. In this scenario, the system loses track of the synchronization and corrupts the RAID. Once it has been corrupted, you must bring down the Director device, isolate the Director device, and use an internal controller utility to delete, re-create, and activate both disk members.

Best Practice

Allow the RAID to fully synchronize before removing an HDD module.
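The three likely causes listed above each map to one of the procedures later in this topic. The following Python sketch shows that mapping as a small decision helper. It is an illustration only: the function name, the enum, and the two boolean inputs are hypothetical and are not part of any QFabric tooling.

```python
from enum import Enum


class Procedure(Enum):
    """Procedures described later in this topic."""
    HOT_SWAP = "Hot-Swapping a Failed HDD Module"
    RESTORE_INACTIVE = "Restoring an Inactive RAID"
    RESTORE_CORRUPTED = "Restoring a Corrupted RAID"


def choose_procedure(both_modules_moved: bool,
                     primary_removed_during_sync: bool) -> Procedure:
    """Pick a procedure from the observed cause (hypothetical helper)."""
    # Removing the primary drive before synchronization finishes corrupts
    # the RAID, so this case takes priority over the others.
    if primary_removed_during_sync:
        return Procedure.RESTORE_CORRUPTED
    # Moving both modules to another QFX3100 leaves the RAID inactive.
    if both_modules_moved:
        return Procedure.RESTORE_INACTIVE
    # Otherwise assume a single module has failed and can be hot-swapped.
    return Procedure.HOT_SWAP
```

For example, `choose_procedure(both_modules_moved=True, primary_removed_during_sync=False)` selects the inactive-RAID procedure.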

Isolating a Director Device

Before restoring an inactive RAID or restoring a corrupted RAID, intentionally isolate the Director device.

  1. Gracefully bring down the failing Director device. See Powering Off a QFX3100 Director Device.
  2. Disconnect the cable in port 0 of the failing Director device, which connects to the control plane virtual chassis.
  3. Disable the interfaces on the EX4200 or EX4300 that connect the failing Director device to both Interconnect devices using the set interfaces <ge-x/y/z> disable command.
    • On a QFX3000-M QFabric system, disable ge-0/0/40 and ge-0/0/41.

      • Copper or fiber EX Series VC0 interfaces:

        • ge-0/0/20

        • ge-0/0/21

      • Copper or fiber EX Series VC1 interfaces:

        • ge-0/0/22

        • ge-0/0/23

    • On a QFX3000-G QFabric system using copper connections, disable port 40 for Director Group 1 failures or port 41 for Director Group 2 failures.

      • Copper EX Series VC0 interfaces:

        • ge-0/0/40

        • ge-1/0/40

        • ge-2/0/40

      • Copper EX Series VC1 interfaces:

        • ge-0/0/41

        • ge-1/0/41

        • ge-2/0/41

      • Fiber EX Series VC0 interfaces:

        • ge-0/0/22

        • ge-1/0/22

        • ge-2/0/22

      • Fiber EX Series VC1 interfaces:

        • ge-0/0/23

        • ge-1/0/23

        • ge-2/0/23

The Director device is now isolated from the rest of the QFabric system.
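Step 3 above can be scripted when many interfaces are involved. The following Python sketch generates one configuration statement per interface. It makes several assumptions: the dictionary keys and the function name are hypothetical labels, the interface names are transcribed from the VC0 and VC1 lists above (verify them against your actual cabling before use), and the statements use the Junos configuration-mode form set interfaces <name> disable.

```python
# Interface names transcribed from the lists in step 3.
# The (system, media) keys are hypothetical labels chosen for this sketch.
ISOLATION_INTERFACES = {
    ("QFX3000-M", "copper-or-fiber"): {
        "VC0": ["ge-0/0/20", "ge-0/0/21"],
        "VC1": ["ge-0/0/22", "ge-0/0/23"],
    },
    ("QFX3000-G", "copper"): {
        "VC0": ["ge-0/0/40", "ge-1/0/40", "ge-2/0/40"],
        "VC1": ["ge-0/0/41", "ge-1/0/41", "ge-2/0/41"],
    },
    ("QFX3000-G", "fiber"): {
        "VC0": ["ge-0/0/22", "ge-1/0/22", "ge-2/0/22"],
        "VC1": ["ge-0/0/23", "ge-1/0/23", "ge-2/0/23"],
    },
}


def disable_commands(system: str, media: str) -> list[str]:
    """Return the disable statements for one system type, VC0 first."""
    vcs = ISOLATION_INTERFACES[(system, media)]
    return [
        f"set interfaces {name} disable"
        for vc in ("VC0", "VC1")
        for name in vcs[vc]
    ]


if __name__ == "__main__":
    for command in disable_commands("QFX3000-G", "copper"):
        print(command)
```

The sketch only builds the statements as text. Entering them on the EX Series switches and committing the configuration remains a manual step.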

Hot-Swapping a Failed HDD Module

Problem:

Description: A single drive has failed and cannot autocorrect.

Solution

If the drive in a single HDD module fails, remove the failing HDD and insert a new one. System operation is degraded while the primary drive transfers data to the replacement drive. Synchronizing data while the system is operational can take up to 17 hours. When synchronization is complete, the RAID is considered optimal. If the system does not synchronize within 17 hours, or if the new HDD module is not recognized as storage, treat the RAID as corrupted.
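The rule in the last sentence above can be written as a simple check. The following Python sketch is an illustration under stated assumptions: the function and parameter names are hypothetical, and the caller is assumed to know from its own observation whether the new module was recognized and whether synchronization has finished.

```python
from datetime import datetime, timedelta

# Synchronization window stated in this procedure.
SYNC_LIMIT = timedelta(hours=17)


def treat_as_corrupted(recognized: bool, synchronized: bool,
                       started: datetime, now: datetime) -> bool:
    """Return True when the RAID should be treated as corrupted."""
    # A replacement module that is not recognized as storage is treated
    # as a corrupted RAID regardless of elapsed time.
    if not recognized:
        return True
    # Otherwise, corruption is assumed only when synchronization has not
    # completed within the 17-hour window.
    return (not synchronized) and (now - started) > SYNC_LIMIT
```

For example, a recognized module that is still synchronizing 10 hours after the swap does not yet meet the condition, whereas the same module still unsynchronized after 18 hours does.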

Restoring an Inactive RAID

Problem:

Description: A valid RAID has become inactive when both HDD modules were removed from a QFX3100 and inserted into another QFX3100.

Solution

To restore an inactive RAID:

  1. Start the reboot of one of the HDD modules.
  2. Press Ctrl+c to interrupt the reboot sequence at the BIOS page. The following example shows a typical BIOS page.

    Pressing Ctrl+c starts the configuration utility, which, after initialization, displays the Adapter List page.

  3. Use the arrow keys to select the default adapter, and press Enter to open the Adapter Properties page.
  4. Use the arrow keys to select RAID Properties, and press Enter to see the Array Type options.
  5. Use the arrow keys to select View Existing Array, and press Enter to see the existing RAID. The following example shows two inactive HDD modules:
  6. Review the RAID array information and ensure that the internal controller utility detects both modules.
  7. Use the arrow keys to select Manage Array, and press Enter. The Manage Array page appears, as shown in the following example:
  8. Select Activate Array, and press Enter to enable the system to use both HDD modules as a RAID. The utility returns you to the Adapter Properties page.
  9. Select RAID Properties, and press Enter to open the New Array Type options page.
  10. Select View Existing Array, and press Enter to see the status of the HDD modules.

    In the following example, the array is online, active, and synchronizing data. System operation is degraded while the RAID is synchronizing the data.

    Caution

    Do not reboot the HDD module until synchronization is complete. Otherwise, the synchronization process restarts from the beginning. Synchronization can take up to 17 hours to finish.

  11. Press the ESC key three times after synchronization is complete.
  12. Select Exit the Configuration Utility and Reboot using the arrow keys, and press Enter.
  13. After the drive is recognized and is synchronizing, you can monitor the progress of the synchronization using the show fabric administration inventory director-group status command. The View Array page in the utility displays both Primary and Secondary in the Drive Status column when synchronization of both modules is complete.
  14. Gracefully bring down the Director device. See Powering Off a QFX3100 Director Device.
  15. Reconnect cables and enable the interfaces. See Reconnecting the Director Device to the Control Plane.

Restoring a Corrupted RAID

Problem:

Description: The RAID has become corrupted.

Solution

To recover a corrupted RAID using the internal controller utility:

  1. Start the reboot of one of the HDD modules.
  2. Press Ctrl+c to interrupt the reboot sequence at the BIOS page. The following example shows a typical BIOS page.

    Pressing Ctrl+c starts the configuration utility, which, after initialization, displays the Adapter List page.

  3. Use the arrow keys to select the default adapter, and press Enter to open the Adapter Properties page.
  4. Use the arrow keys to highlight RAID Properties, and press Enter to open the New Array Type Options page.
  5. Select Create IM Volume, and press Enter to delete the existing volume and create a new volume. The new volume information appears on the Create New Array page.

    The following example shows that both HDD modules are visible but not part of the RAID.

  6. Select the first No field in the RAID Disk column and press the Spacebar to change the entry to Yes, to select a disk. The utility gives you the option to overwrite all the data on the drive or to synchronize the data with that on the other drive.
  7. Select D to delete the corrupt data from the disk.
  8. When the Create New Array page appears again, select the second disk and press the Spacebar.
  9. Select C to create the array after both fields in the RAID Disk column are recognized as storage.
  10. Select Save changes then exit this menu, and press Enter to start creating the array.

    When the RAID has been created, the Adapter Properties page appears again and shows the status as enabled.

  11. Press the ESC key twice.
  12. Select Exit the Configuration Utility and Reboot to complete the RAID recovery.
  13. Install the QFX3100 software image using either the QFabric USB install media or the QFabric Director Group recovery media.
  14. Gracefully bring down the Director device. See Powering Off a QFX3100 Director Device.
  15. Reconnect cables and enable the interfaces. See Reconnecting the Director Device to the Control Plane.

Reconnecting the Director Device to the Control Plane

After restoring an inactive RAID or a corrupted RAID, reconnect the cables and enable the interfaces:

  1. Reconnect the cable from port 0 of the Director device to control plane VC0 and port 6 to control plane VC1.
  2. Enable the interfaces on the EX4200 or EX4300 that connect the Director device to both control plane VCs by removing the disable statement with the delete interfaces <ge-x/y/z> disable command.
    • On a QFX3000-M QFabric system, enable ge-0/0/40 and ge-0/0/41.

      • Copper or fiber EX Series VC0 interfaces:

        • ge-0/0/20

        • ge-0/0/21

      • Copper or fiber EX Series VC1 interfaces:

        • ge-0/0/22

        • ge-0/0/23

    • On a QFX3000-G QFabric system using copper connections, enable port 40 for Director Group 1 or port 41 for Director Group 2.

      • Copper EX Series VC0 interfaces:

        • ge-0/0/40

        • ge-1/0/40

        • ge-2/0/40

      • Copper EX Series VC1 interfaces:

        • ge-0/0/41

        • ge-1/0/41

        • ge-2/0/41

      • Fiber EX Series VC0 interfaces:

        • ge-0/0/22

        • ge-1/0/22

        • ge-2/0/22

      • Fiber EX Series VC1 interfaces:

        • ge-0/0/23

        • ge-1/0/23

        • ge-2/0/23

  3. Power on the Director device. See Powering On a QFX3100 Director Device. Synchronization of the RAID begins after the device powers on. Synchronizing data while the system is operational can take up to 17 hours. When synchronization is complete, the RAID is considered optimal.
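Step 2 above reverses the isolation performed earlier. As a sketch, assuming each interface was disabled with a statement of the form set interfaces <name> disable, the matching statement that removes it in Junos configuration mode is delete interfaces <name> disable. The helper name below is hypothetical.

```python
def enable_command(disable_command: str) -> str:
    """Convert one 'set interfaces <name> disable' statement into the
    'delete interfaces <name> disable' statement that undoes it."""
    parts = disable_command.split()
    # Accept only the exact four-word form this sketch knows how to reverse.
    if (len(parts) != 4 or parts[0] != "set"
            or parts[1] != "interfaces" or parts[3] != "disable"):
        raise ValueError(f"unexpected statement: {disable_command!r}")
    return f"delete interfaces {parts[2]} disable"
```

Reusing the same list of statements for isolation and for reconnection reduces the chance of leaving one interface disabled. Commit the configuration on the EX Series switches after entering the statements.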