Troubleshooting HDD Module Failures in a RAID Volume
The two QFX3100 hard disk drive (HDD) modules are organized as a RAID 1, or mirrored configuration. If one HDD module fails or needs to be replaced, the remaining HDD module becomes the primary drive in the RAID. The QFX3100 continues to operate with only the primary drive, but the RAID does not operate optimally because there is no redundancy. Normally, the primary drive is only used in standalone mode until you replace the failed drive. You can replace the failed drive while the unit is running (hot-swap) and the RAID automatically resynchronizes the primary drive to the new drive.
System operation is severely degraded while the RAID automatically synchronizes the replacement drive with the remaining drive. Expect this synchronization process to take at least 17 hours. If you remove the primary drive before the new drive is synchronized with it, you must re-create the RAID and reinstall the QFX3100 software image.
If the primary (remaining) HDD module fails or is removed while the RAID is synchronizing, the RAID contents are lost and cannot be recovered.
The independent failure of two drives is extremely rare. When one or both HDD modules appear to fail, likely causes are:
One HDD module is not able to autocorrect, causing a hard failure. A replacement HDD module can be hot-swapped with the failed module.
Both HDD modules have been removed from the QFX3100 and moved to another QFX3100. The RAID becomes inactive and is not recognized as storage in the new location. If the RAID is inactive, you can restore it using an internal controller utility to activate the disk.
One HDD module has been removed and the new module inserted. However, before the two drives are synchronized, the primary drive is removed. In this scenario, the system loses track of the synchronization and corrupts the RAID. Once it has been corrupted, you must bring down the Director device, isolate the Director device, and use an internal controller utility to delete, re-create, and activate both disk members.
Allow the RAID to fully synchronize before removing an HDD module.
Isolating a Director Device
Before restoring an inactive RAID or restoring a corrupted RAID, intentionally isolate the Director device.
Gracefully bring down the failing Director device. See Powering Off a QFX3100 Director Device.
Disconnect the cable in port 0 of the failing Director device, which connects to the control plane virtual chassis.
Disable the interfaces on the EX4200 or EX4300 that connect the failing Director device to both Interconnect devices using the set interface <ge-x/y/z> disable command.
On a QFX3000-M QFabric system, disable ge-0/0/40 and ge-0/0/41.
Copper or fiber EX Series VC0 interfaces:
ge-0/0/20
ge-0/0/21
Copper or fiber EX Series VC1 interfaces:
ge-0/0/22
ge-0/0/23
On a QFX3000-G QFabric system using copper connections, disable port 40 for Director Group 1 failures or port 41 for Director Group 2 failures.
Copper EX Series VC0 interfaces:
ge-0/0/40
ge-1/0/40
ge-2/0/40
Copper EX Series VC1 interfaces:
ge-0/0/41
ge-1/0/41
ge-2/0/41
Fiber EX Series VC0 interfaces:
ge-0/0/22
ge-1/0/22
ge-2/0/22
Fiber EX Series VC1 interfaces:
ge-0/0/23
ge-1/0/23
ge-2/0/23
The Director device is now isolated from the rest of the QFabric system.
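The disable commands listed above can be generated rather than typed one at a time. The following Python sketch is illustrative only: the helper name disable_commands is hypothetical, the command text follows the set interface <ge-x/y/z> disable form given above, and the interface list you pass in must come from the list that matches your system type and connection type. The helper only builds text and does not contact any device. It also rejects names that do not match the ge-x/y/z pattern, which catches slips such as a missing hyphen.

```python
import re

# Interface names in this procedure follow the ge-x/y/z pattern.
_GE_NAME = re.compile(r"ge-\d+/\d+/\d+")


def disable_commands(interfaces):
    """Return one 'set interface <name> disable' line per interface.

    Hypothetical helper: it only builds command text for review and
    does not connect to the EX Series switch.
    """
    commands = []
    for name in interfaces:
        if not _GE_NAME.fullmatch(name):
            raise ValueError(f"unexpected interface name: {name!r}")
        commands.append(f"set interface {name} disable")
    return commands


if __name__ == "__main__":
    # Copper EX Series VC0 interfaces from the QFX3000-G list above.
    for line in disable_commands(["ge-0/0/40", "ge-1/0/40", "ge-2/0/40"]):
        print(line)
```

Review the generated lines against the lists above before entering them in configuration mode on the switch.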
Hot-Swapping a Failed HDD Module
Problem
Description
A single drive has failed and cannot autocorrect.
Solution
If the drive in a single HDD module fails, remove the failing HDD and insert a new one. System operation is degraded while the primary drive transfers data to the replacement drive. Synchronizing data while the system is operational can take up to 17 hours. When synchronization is complete, the RAID is considered optimal. If the system does not synchronize within 17 hours, or if the new HDD module is not recognized as storage, treat the RAID as corrupted.
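The 17-hour rule above can be expressed as a simple check. The following Python sketch is a minimal illustration, assuming you record the time at which the replacement HDD module was inserted. The function name, the constant name, and the idea of tracking the start time yourself are assumptions made for this example and are not features of the QFX3100 software.

```python
from datetime import datetime, timedelta

# Upper bound on resynchronization time stated in this section.
SYNC_LIMIT = timedelta(hours=17)


def sync_overdue(started_at, now, synchronized):
    """Return True when the RAID should be treated as corrupted.

    started_at, now: datetime values supplied by the operator.
    synchronized: True once the RAID is reported as optimal.
    """
    if synchronized:
        return False
    return now - started_at > SYNC_LIMIT


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 8, 0)  # example insertion time
    print(sync_overdue(start, start + timedelta(hours=18), False))
    print(sync_overdue(start, start + timedelta(hours=5), False))
```

A result of True corresponds to the case described above in which the system has not synchronized within 17 hours.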
Restoring an Inactive RAID
Problem
Description
A valid RAID has become inactive when both HDD modules were removed from a QFX3100 and inserted into another QFX3100.
Solution
To restore an inactive RAID:
Start the reboot of one of the HDD modules.
Press Ctrl+c to interrupt the reboot sequence at the BIOS page. The following example shows a typical BIOS page.
LSI Corporation MPT SAS BIOS
MPTBIOS-6.30.00.00 (2009.11.12)
Copyright 2000-2009 LSI Corporation.

Integrated RAID exception detected:
Volume (00:130) is currently in state INACTIVE/OPTIMAL
enter the LSI Corp Configuration Utility to investigate!

Press Ctrl-C to start LSI Corp Configuration Utility...
Pressing Ctrl+c starts the configuration utility, which, after initialization, displays the Adapter List page.
Use the arrow keys to select the default adapter, and press Enter to open the Adapter Properties page.
Use the arrow keys to select RAID Properties, and press Enter to see the Array Type options.
Use the arrow keys to select View Existing Array, and press Enter to see the existing RAID. The following example shows two inactive HDD modules:
LSI Corp Config Utility          v6.30.00.00 (2009.11.12)
View Array -- SAS1068E

    Array                  1 of 1
    Identifier
    Type                   IM
    Scan Order             ---
    Size(MB)               1907348
    Status                 Inactive

    Manage Array

    Slot  Device Identifier          RAID  Hot  Drive     Pred  Size
    Num                              Disk  Spr  Status    Fail  (MB)
    0     ATA WDC WD2003FYYS-01D01   Yes   No   Inactive  No    1907348
    1     ATA WDC WD2003FYYS-01D01   Yes   No   Inactive  No    1907348

Esc = Exit Menu      F1/Shift+1 = Help
Enter=Select Item    Alt+N=Next Array    C=Create an array    R=Refresh Display
Review the RAID array information and ensure that the internal controller utility detects both modules.
Use the arrow keys to select Manage Array, and press Enter. The Manage Array page appears, as shown in the following example:
LSI Corp Config Utility          v6.30.00.00 (2009.11.12)
Manage Array -- SAS1068E

    Identifier
    Type                   IM
    Scan Order             ---
    Size(MB)               1907348
    Status                 Inactive

    Manage Hot Spares

    Synchronize Array

    Activate Array

    Delete Array

Esc = Exit Menu      F1/Shift+1 = Help
Enter = Select Item
Select Activate Array, and press Enter to enable the system to use both HDD modules as a RAID. The utility returns you to the Adapter Properties page.
Select RAID Properties, and press Enter to open the New Array Type options page.
Select View Existing Array, and press Enter to see the status of the HDD modules.
In the following example, the array is online, active, and synchronizing data. System operation is degraded while the RAID is synchronizing the data.
LSI Corp Config Utility          v6.30.00.00 (2009.11.12)
View Array -- SAS1068E

    Array                  1 of 1
    Identifier             LSILOGICLogical Volume 3000
    Type                   IM
    Scan Order             12
    Size(MB)               1907348
    Status                 0% Syncd

    Manage Array

    Slot  Device Identifier          RAID  Hot  Drive      Pred  Size
    Num                              Disk  Spr  Status     Fail  (MB)
    0     ATA WDC WD2003FYYS-01D01   Yes   No   Not Syncd  No    1907348
    1     ATA WDC WD2003FYYS-01D01   Yes   No   Primary    No    1907348

Esc = Exit Menu      F1/Shift+1 = Help
Enter=Select Item    Alt+N=Next Array    C=Create an array    R=Refresh Display
CAUTION: Do not reboot the HDD module until synchronization is complete. Otherwise, the synchronization process restarts from the beginning. Synchronization can take up to 17 hours to finish.
Press the ESC key three times after synchronization is complete.
Select Exit the Configuration Utility and Reboot using the arrow keys, and press Enter.
After the drive is recognized and is synchronizing, you can monitor the progress of the synchronization using the show fabric administration inventory director-group status command. The View Array page in the utility displays both Primary and Secondary in the Drive Status column when synchronization of both modules is complete.
Gracefully bring down the Director device. See Powering Off a QFX3100 Director Device.
Reconnect cables and enable the interfaces. See Reconnecting the Director Device to the Control Plane.
Restoring a Corrupted RAID
Problem
Description
The RAID has become corrupted.
Solution
To recover a corrupted RAID using the internal controller utility:
Start the reboot of one of the HDD modules.
Press Ctrl+c to interrupt the reboot sequence at the BIOS page. The following example shows a typical BIOS page.
LSI Corporation MPT SAS BIOS
MPTBIOS-6.30.00.00 (2009.11.12)
Copyright 2000-2009 LSI Corporation.

Integrated RAID exception detected:
Volume (00:130) is currently in state INACTIVE/OPTIMAL
enter the LSI Corp Configuration Utility to investigate!

Press Ctrl-C to start LSI Corp Configuration Utility...
Pressing Ctrl+c starts the configuration utility, which, after initialization, displays the Adapter List page.
Use the arrow keys to select the default adapter, and press Enter to open the Adapter Properties page.
Use the arrow keys to highlight RAID Properties, and press Enter to open the New Array Type Options page.
Select Create IM Volume, and press Enter to delete the existing volume and create a new volume. The new volume information appears on the Create New Array page.
The following example shows that both HDD modules are visible but not part of the RAID.
LSI Corp Config Utility          v6.30.00.00 (2009.11.12)
Create New Array -- SAS1068E

    Array Type:            IM
    Array Size(MB):        ---------

    Slot  Device Identifier          RAID   Hot    Drive      Pred  Size
    Num                              Disk   Spr    Status     Fail  (MB)
    0     ATA WDC WD2003FYYS-01D01   [No]   [No]   ---------  ---   1907729
    1     ATA WDC WD2003FYYS-01D01   [No]   [No]   ---------  ---   1907729

Esc = Exit Menu      F1/Shift+1 = Help
Space/+/- = Select disk for array or hot spare    C = Create array
Select the first No field in the RAID Disk column, and press the Spacebar to change the entry to Yes, which selects the disk. The utility gives you the option to overwrite all the data on the drive or to synchronize the data with that on the other drive.
Select D to delete the corrupt data from the disk.
When the Create New Array page appears again, select the second disk and press the Spacebar.
Select C to create the array after both fields in the RAID Disk column are recognized as storage.
Select Save changes then exit this menu, and press Enter to start creating the array.
When the RAID has been created, the Adapter Properties page appears again and shows the status as enabled.
Press the ESC key twice.
Select Exit the Configuration Utility and Reboot to complete the RAID recovery.
Install the QFX3100 software image using either the QFabric USB install media or the QFabric Director Group recovery media.
Gracefully bring down the Director device. See Powering Off a QFX3100 Director Device.
Reconnect cables and enable the interfaces. See Reconnecting the Director Device to the Control Plane.
Reconnecting the Director Device to the Control Plane
After restoring an inactive RAID or a corrupted RAID, reconnect the cables and enable the interfaces: