Instructions for M-Series Metadata RAID Disk Failures
The normal course of events during a failure will be:
1. A drive fails and the volumes in its volume group show as "Degraded".
2. The data is automatically reconstructed onto a hot spare.
3. The failed drive is physically replaced.
4. Copyback of the data from the hot spare to the replacement drive starts automatically.
The LSI CLI command used in the majority of cases is /opt/SMgr/client/SMcli, which should be on root's default path. Note that the commands are directed to the array of interest; the valid storage system name for the M-Series Appliance is "Qarray1".
The controller/drive tray is numbered "85". Depending on the appliance and the optional expansion (M660 only), each tray contains 6+1 (M330/M440) or 22+2 (M660 base) disks, numbered left to right 1->24 when viewed from the front.
Hot spares are the leftmost drives in each tray (Slot 1 and Slot 2) but protect against failures in any tray.
Each RAID set has volumes named TRAY_<TRAY_NO>_VOL_<VOL_NO>, e.g. TRAY_85_VOL_1.
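All of the CLI examples on this page follow the same pattern: SMcli is given a controller host (here "Qarray1a"), the storage system name (-n), the array password (-p) and a script command (-c), with -S suppressing SMcli's informational messages. As a minimal sketch of the general form (the "show storageArray healthStatus;" script command is an assumption and is not used elsewhere on this page), a quick overall health check looks like:
[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show storageArray healthStatus;"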
The GUI is pretty good at tracking RAID disk failures and will show which disks have failed, in which way, and which volumes are affected. It can be used to determine which disk has failed and to monitor the reconstruction onto the hot spare and the copyback onto the replaced disk.
Failures are reported in the GUI against the various components as “Warning”, “Attention”, “Missing”, “Degraded” or “Failure”.
Drives are reported as "Missing" or "Failed" depending on the nature of the failure. "Missing" is used for drive slots that have entered the bypassed state.
“Degraded” is used for volumes that are compromised by the failure of a drive until they have had their data reconstructed onto the hot spare. (See “Monitoring Reconstruction” below).
Determining Hot Spare usage
The SanTricity GUI will show the hot spare drives within the HW View. If a hot spare is in use it will show a Warning or Failure, with the affected volumes and drives similarly showing Failure or Attention. For the M-Series Appliance MD array the hot spare(s) are assigned to Slot 1 and Slot 2 (M660 base).
In our example the drive at Tray 85, Slot 2 has failed; it is associated with Volume Group 0. The hot spare in Slot 1 has kicked in and is now associated with Volume Group 0.
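The same check can be made from the CLI. As a sketch, assuming the summary form of the allDrives command referred to in the Copyback section below, listing all drives will show the failed drive and the hot spare that has taken over (the exact status strings vary with firmware level):
[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show allDrives summary;"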
Monitoring Reconstruction
Reconstruction is the process of rebuilding the data of the volumes affected by a disk failure onto the hot spare.
The number of volumes affected by a single disk failure depends on which volumes are associated with the failed physical drive. You can right-click on the Volume or Volume Group and select "View Associated Physical Components" to see this.
In our example we see that drive slots 1-3 are associated with our degraded volume "TRAY_85_VOL_1", which has a failed disk in Slot 2 and is using the hot spare drive in Slot 1.
Once the reconstruction is completed all affected volumes will show as “Optimal” and only the Drives will show as failed.
To show which volumes are currently degraded from the CLI, use the show allVolumes command:
[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show allVolumes summary;"
STANDARD VOLUMES SUMMARY
Number of standard volumes: 9
Name           Thin Provisioned  Status    Capacity    Accessible by  Source
TRAY_85_VOL_1  No                Degraded  136.232 GB  Default Group  Volume Group 0
TRAY_85_VOL_2  No                Optimal   34.000 GB   Default Group  Volume Group 1
..
More precisely, you can see the progress of the reconstruction using the "actionProgress" command:
[root@M330_01nh ~]# SMcli -n Qarray1 -p Qa@Ar39! -S -c "show volume [TRAY_85_VOL_1] actionProgress;"
Volume TRAY_85_VOL_1
Action: Reconstruction
Percent Complete: 2%
Time To Completion: 66 minutes.
Note that volumes waiting for reconstruction, or that have completed it, will report "No action in progress."
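Where the GUI is not available, the actionProgress command can be wrapped in a simple shell loop that polls until the volume reports "No action in progress." (a sketch only; the volume name and 60-second interval are illustrative). Bear in mind the note above: a volume still waiting for reconstruction also reports "No action in progress.", so confirm reconstruction has actually started before relying on the loop:
[root@M330_01nh ~]# until SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show volume [TRAY_85_VOL_1] actionProgress;" | grep -q "No action in progress"; do sleep 60; done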
Copyback
Once reconstruction is complete and the failed disk has been replaced, the copyback of the data from the hot spare to the replacement disk should start automatically. In the SanTricity GUI the drive status will show the replaced drive as "Replaced" and the volumes will show "Copyback Progress":
In the CLI, "show allDrives" will show the drive as "Replaced" both in the summary and in the details of the drive itself. The actionProgress command detailed previously can be used to monitor the progress of the copyback:
[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show drive [85,2] summary;"
TRAY, SLOT  STATUS    CAPACITY    MEDIA TYPE       INTERFACE TYPE  CURRENT DATA RATE  PRODUCT ID   FIRMWARE VERSION  CAPABILITIES
85, 2       Replaced  136.732 GB  Hard Disk Drive  SAS             6 Gbps             ST9146803SS  MS04
[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show volume [TRAY_85_VOL_1] actionProgress;"
Volume TRAY_85_VOL_1
Action: Copyback
Percent Complete: 28%
Copyback Failure
In instances where a disk is replaced after reconstruction has completed and a copyback fails to initiate, a new Storage Array collect log needs to be obtained and escalated to Sustaining for further investigation to determine why automatic copyback didn’t start. A timeline of the actions carried out and the observed status following each action should accompany the escalation.
If the drive swap appeared to go smoothly and the service and failure lights on the new drive are off, check the status of the drive in the CLI to ensure it is "Optimal" or "Replaced" and has the same characteristics as the other drives in the volume. In particular, ensure it is running the same firmware, e.g.:
[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show drive [85,2];"
Drive at Tray 85, Slot 2
Status: Optimal
Mode: Unassigned
Associated volume group: None
...
Speed: 10,000 RPM
Current data rate: 6 Gbps
Product ID: ST9146803SS
Drive Firmware Version: MS04
...
If the drive compares favourably with the other drives in the volume, manually initiate the copyback procedure with the following command:
SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "replace drive [<drive>] replacementDrive=<drive>;"
where <drive> is the tray and slot of the replaced drive, e.g. for tray 85, slot 2 in Qarray1:
SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "replace drive [85,2] replacementDrive=85,2;"
After a short while the GUI should show the expected copyback indicators shown in “Copyback” above.
Early Swap Copyback Failure
One instance where we expect copyback to fail to start automatically is when a disk is physically replaced before reconstruction to the hot spare has finished. This would normally only be the case if we are proactively swapping a drive, or if we have been able to dispatch a disk and an Engineer within the reconstruction phase.
LSI's reasoning for not initiating copyback in such circumstances is to allow the option of reassigning the hot spare that took over as the permanent replacement for the failed drive. The new drive that then replaces the failed drive becomes the new hot spare.
In instances where this happens the procedure above for manually initiating copyback should be followed either remotely or by an onsite Engineer.
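Putting the steps together for this early-swap case, a minimal sketch (values illustrative, using only the commands already shown above) is: confirm the affected volume reports "No action in progress." (i.e. reconstruction has finished), sanity-check the replacement drive, then manually start copyback:
[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show volume [TRAY_85_VOL_1] actionProgress;"
[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show drive [85,2];"
[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "replace drive [85,2] replacementDrive=85,2;"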
Wrong Drive replacement
In the event that an FE swaps the wrong physical drive, the situation needs careful consideration in order to formulate the correct recovery procedure. Under no circumstances should the drive be re-swapped and revived, nor should the good drive that was removed be used as a replacement! Although it is technically possible to reset this drive and use it as a replacement, this should only be done as a last resort and only after internal escalation.
In normal circumstances the removal of a good drive will trigger the usual drive failure procedures and the data will automatically start rebuilding to a hot spare. In these circumstances the erroneous swap should be treated as a new failure, and the copyback to the replaced drive should be manually triggered after reconstruction has completed, using the procedures above. A fresh replacement drive should be ordered to replace the originally failed drive.
If the drive swapped by mistake was not part of the volume group associated with the original failure the 2 incidents can be resolved in parallel.
If the drive swapped by mistake was part of the same volume group then special care should be taken in recovery and internal escalation should be considered. Provided the original disk had completed reconstruction to the hot spare, the original disk can be replaced and copyback started. Replacement of the second "failure" should be addressed after the first has been successfully recovered from.
If the drive swapped by mistake was part of the same volume group and was swapped before reconstruction had completed, the filesystem may be compromised. Immediately escalate to Sustaining and create an LSI escalation to avoid data loss/corruption.