Instructions for M-Series Metadata RAID Disk Failures

 

The normal course of events during a failure is:

 

    • Disk fails
    • RAS events are logged
    • GUI indicates Storage Array failure
    • Data reconstructs to hot spare
    • Engineer physically replaces the failed drive
    • Copyback moves the data from the hot spare to the replacement drive

 

 

 

The LSI CLI command used in the majority of cases is /opt/SMgr/client/SMcli, which should be on root’s default path. Note that commands are directed at the array of interest. The valid storage system name for the M-Series Appliance is “Qarray1”.
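As a quick sanity check of the overall array state, a basic invocation looks like the example below. This is a minimal sketch: it reuses the controller address, array name and password that appear in the examples later in this document, and the “show storageArray healthStatus” query is a standard SANtricity command (confirm it against the installed CLI version):

SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show storageArray healthStatus;"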

 

The Controller/Drive Tray is numbered “85”. Depending on the Appliance and the optional Expansion (M660 only), each tray contains 6+1 (M330/M440) or 22+2 (M660 base) disks, numbered left to right 1->24 when viewed from the front.

 

Hot spares are the leftmost drives in the tray (Slot 1 & Slot 2) but protect against failures in any tray.

 

Each RAID set has volumes named TRAY_<TRAY_NO>_VOL_<VOL_NO>, e.g. TRAY_85_VOL_1.

 

 

 

 

The GUI is pretty good at tracking RAID disk failures and will show which disks have failed, in what way, and which volumes are affected. It can be used to determine which disk has failed and to monitor the reconstruction to the hot spare and the copyback onto the replaced disk.

 

Failures are reported in the GUI against the various components as “Warning”, “Attention”, “Missing”, “Degraded” or “Failure”.

 

Drives are reported as “Missing” or “Failed” depending on how they failed. “Missing” is used for drive slots that have entered the bypassed state.

 

 

“Degraded” is used for volumes that are compromised by the failure of a drive until they have had their data reconstructed onto the hot spare. (See “Monitoring Reconstruction” below).

 

 

 

 

 

Determining Hot Spare Usage

 

The SANtricity GUI shows the hot spare drives within the HW View. If a hot spare is in use, it will show a Warning or Failure, with the associated Volumes and Drives similarly showing Failure or Attention. For the M-Series Appliance MD Array, the hot spare(s) are assigned to Slot 1 & Slot 2 (M660 base).

 

 

In our example, the drive at Tray 85, Slot 2 failed; it is associated with Volume Group 0. The hot spare in Slot 1 kicked in and is now associated with Volume Group 0.
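Hot spare coverage and in-use status can also be checked from the CLI when the GUI is not available. The “show storageArray hotSpareCoverage” query below is part of the standard SANtricity command set; the example is a sketch that reuses the same array name and credentials as the other commands in this document:

SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show storageArray hotSpareCoverage;"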

 

 

 

 

Monitoring Reconstruction

 

Reconstruction is the process of rebuilding the data of the volumes affected by the failed disk onto the hot spare.

 

Volume_Reconstruct_InProgress

 

The number of volumes affected by a single disk failure depends on which volumes are associated with the failed physical drive. You can right-click the Volume or Volume Group and select “View Associated Physical Components”.

 

 

 

In our example, we see that Drive Slots 1-3 are associated with our degraded volume “TRAY_85_VOL_1”, which has a failed disk in Slot 2 and now owns the hot spare drive in Slot 1.

 

Once reconstruction is complete, all affected volumes will show as “Optimal” and only the failed drive will still show as “Failed”.

 

To show which volumes are currently degraded using the CLI, use the show allVolumes command:

 

[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show allVolumes summary;"

STANDARD VOLUMES SUMMARY

Number of standard volumes: 9

 

Name           Thin Provisioned     Status     Capacity     Accessible by     Source

TRAY_85_VOL_1  No                   Degraded   136.232 GB   Default Group     Volume Group 0

TRAY_85_VOL_2  No                   Optimal    34.000 GB    Default Group     Volume Group 1

..

 

You can see the progress of the reconstruction more precisely using the “actionProgress” command:

 

[root@M330_01nh ~]# SMcli -n Qarray1 -p Qa@Ar39! -S -c "show volume [TRAY_85_VOL_1] actionProgress;"

Volume TRAY_85_VOL_1

Action: Reconstruction

Percent Complete: 2%

Time To Completion: 66 minutes.

 

 

Note that volumes waiting for reconstruction, or that have already completed it, will report “No action in progress.”
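When several volumes are affected it can be quicker to check all in-progress operations at once. Depending on the installed CLI/firmware version, the array-wide “show storageArray longRunningOperations” query lists every reconstruction and copyback currently running; treat the example below as a hedged alternative to querying each volume individually:

SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show storageArray longRunningOperations;"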

 

 

Copyback

 

Once reconstruction is complete and the failed disk has been replaced, the copyback of the data from the hot spare to the replacement disk should start automatically. In the SANtricity GUI the drive status will show the replaced drive as “Replaced” and the volumes will show “Copyback Progress”:

 

 


 

In the CLI, “show allDrives” will show the drive as “Replaced” in the summary and in the details of the drive itself. The actionProgress command detailed previously can be used to monitor the progress of the copyback:

 

[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show drive [85,2] summary;"

TRAY, SLOT  STATUS    CAPACITY    MEDIA TYPE       INTERFACE TYPE  CURRENT DATA RATE  PRODUCT ID        FIRMWARE VERSION  CAPABILITIES

85,    2    Replaced  136.732 GB  Hard Disk Drive  SAS             6 Gbps             ST9146803SS       MS04

 

 

[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show volume [TRAY_85_VOL_1] actionProgress;"

  Volume TRAY_85_VOL_1

  Action: Copyback

  Percent Complete: 28%

 

 

 

Copyback Failure

 

In instances where a disk is replaced after reconstruction has completed and copyback fails to initiate, a new Storage Array collect log needs to be obtained and escalated to Sustaining to determine why automatic copyback did not start. A timeline of the actions carried out, and the observed status after each action, should accompany the escalation.
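If the collect log cannot be gathered through the GUI, the support bundle can usually be saved from the CLI with the standard “save storageArray supportData” command. The file name and path below are placeholders, and the archive format produced depends on the firmware version:

SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "save storageArray supportData file=\"/tmp/Qarray1_supportData.7z\";"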

 

If the drive swap appeared to go smoothly and the service and failure lights on the new drive are off, check the status of the drive in the CLI to ensure it is “Optimal” or “Replaced” and has the same characteristics as the other drives in the volume. In particular, ensure it is running the same firmware, e.g.:

 

[root@M330_01nh ~]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show drive [85,2];"

 

Drive at Tray 85, Slot 2

 

   Status:                   Optimal

 

   Mode:                     Unassigned

   Associated volume group:  None

...

   Speed:                          10,000 RPM

   Current data rate:              6 Gbps

   Product ID:                     ST9146803SS

   Drive Firmware Version:         MS04

...
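To compare the replacement against its peers in a single pass, the “show allDrives” command referenced earlier lists every drive in the array along with its firmware version, for example:

SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show allDrives;"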

 

 

If the drive compares favourably with the other drives in the volume, manually initiate the copyback with the following command:

 

SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "replace drive [<drive>] replacementDrive=<drive>;"

 

where <drive> is the tray and slot of the replaced drive, e.g. for tray 85, slot 2 in Qarray1:

 

SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "replace drive [85,2] replacementDrive=85,2;"

 

After a short while the GUI should show the expected copyback indicators described in “Copyback” above.

 

  

 

Early Swap Copyback Failure

 

One instance where we expect copyback to fail to start automatically is when a disk is physically replaced before reconstruction to the hot spare has finished. This would normally only be the case for a proactive drive swap, unless we have been able to dispatch a disk and an Engineer within the reconstruction phase.

 

LSI’s reasoning for not initiating copyback in such circumstances is to allow the option to reassign the replacing hot spare as the permanent replacement for the failed drive. The new drive that then replaces the failed drive becomes the new hot spare.

 

In instances where this happens the procedure above for manually initiating copyback should be followed either remotely or by an onsite Engineer.

 

 

 

 

Wrong Drive Replacement

 

In the event that an FE swaps the wrong physical drive, the situation needs careful consideration in order to formulate the correct recovery procedure. Under no circumstances should the drive be re-swapped and revived, or the good drive that was removed be used as a replacement! Although it is technically possible to reset this drive and use it as a replacement, this should only be done as a last resort and only after internal escalation.

 

In normal circumstances the removal of a good drive will trigger the usual drive failure procedures and the data will automatically start rebuilding onto a hot spare. In these circumstances the erroneous swap should be treated as a new failure, and the copyback to the replaced drive should be manually triggered after reconstruction has completed, using the procedures above. A fresh replacement drive should be ordered to replace the originally failed drive.
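For example, if the drive pulled in error sat in tray 85, slot 3 (a hypothetical slot used purely for illustration), the manual copyback onto its replacement would be triggered with the same command form shown in “Copyback Failure” above:

SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "replace drive [85,3] replacementDrive=85,3;"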

 

If the drive swapped by mistake was not part of the volume group associated with the original failure, the two incidents can be resolved in parallel.

 

If the drive swapped by mistake was part of the same volume group, special care should be taken in recovery and internal escalation should be considered. Provided the original disk had completed reconstruction to the hot spare, the original disk can be replaced and copyback started. Replacement of the second “failure” should only be addressed after the first has been successfully recovered from.

 

If the drive swapped by mistake was part of the same volume group and was swapped before reconstruction had completed, the filesystem may be compromised. Immediately escalate to Sustaining and create an LSI escalation to avoid data loss/corruption.

 

 


