New command vdmRecoverAllRAIDVols to recover volumes

New shell command vdmRecoverAllRAIDVols recovers a seemingly "failed" volume:

 

From Lou Poletti and Robby Robertson.

See attachment

 

 Two points I see here:

A)     Who to believe???

I have been burned many times in my years of supporting LSI/NetApp disk arrays by "trusting" what the SANtricity GUI is reporting AND/OR what the 'storage-array-profile.txt' file is reporting.

BOTH of these can lie to you!!!!

To get the truth → you must look at the 'state-capture-data.txt' file, and even here you have to perform a reality check on this file.

1)      Make sure there are shell command outputs from both controllers of the disk array in this file.

2)      Are the controllers synchronized??   The quickest way to perform this test is to see if both controllers are reporting the same state/status for all disk drives on the array.

a)      Look at the 'vdmShowDriveList' outputs for both controllers in the state-capture-data.txt file → if all drives report their state exactly the same on both controllers, then you "can trust" the file as the truth (a sketch of this comparison follows this list).

b)      If not, the controllers are not synchronized and you must perform staggered controller reboots (assuming the customer has failover implemented on all attached hosts). There is most likely a "masked condition" on the disk array, and you have to bring it to the surface (via the controller reboots) before you proceed with working on the array.
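A minimal sketch of the comparison from a) above, run on a workstation (Python is not part of the array shell), assuming you have copied each controller's vdmShowDriveList output out of state-capture-data.txt into its own text file; the file names below are hypothetical. It matches drives by WWN and flags any state mismatch between the two controllers.

import re

# Data rows of vdmShowDriveList: capture the State column and the drive WWN
# so drives can be matched across controllers.
ROW = re.compile(
    r"^0x[0-9a-f]+\s+0x[0-9a-f]+\s+\S+\s+\S+\s+\d+\s+\S+\s+(\S+)\s+\S+\s+\d+\s+([0-9a-f]{32})",
    re.I,
)

def drive_states(path):
    """Return {drive WWN: state string} parsed from one controller's output."""
    states = {}
    with open(path) as f:
        for line in f:
            m = ROW.match(line.strip())
            if m:
                states[m.group(2).lower()] = m.group(1)
    return states

ctrl_a = drive_states("ctrlA_vdmShowDriveList.txt")   # hypothetical file name
ctrl_b = drive_states("ctrlB_vdmShowDriveList.txt")   # hypothetical file name

mismatches = {wwn: (ctrl_a.get(wwn), ctrl_b.get(wwn))
              for wwn in set(ctrl_a) | set(ctrl_b)
              if ctrl_a.get(wwn) != ctrl_b.get(wwn)}

if mismatches:
    print("Controllers NOT synchronized; staggered reboots required:")
    for wwn, (a, b) in sorted(mismatches.items()):
        print(f"  {wwn}: A={a}  B={b}")
else:
    print("All drive states match; state-capture-data.txt can be trusted.")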

 

B)     There are two types of volume failures on these disk arrays.

1)      Hardware failure → a failed disk drive (depending on the RAID type, it is typically more than one disk drive in the same volume group).  {look at 'vdmShowDriveList'}

2)      Piece failure → all disk drive members in that volume group are online/optimal, yet the volume is failed because a piece member of the volume is in a failed state (this is a 'logical' failure, if you will).

                             {look at 'vdmShowVGInfo', then 'evfShowVol (SSID#)', then 'vdmShowOosPieces'}

For a hardware failure, 'vdmShowDriveList' would report a drive state of "Acc/GrA/F" or "Acc/GrA/NP" for the bad disk drive.

For a piece failure, the new "vdmRecoverAllRAIDVols" shell command is a shell script that automates the way this was previously handled (pre-08.10.13.00 f/w) → vdmShowVGInfo, evfShowVol, vdmShowOosPieces, etc.
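A minimal sketch of telling the two failure types apart, again off-array in Python, assuming the drive states and volume status flags have already been pulled out of the state-capture-data.txt outputs; the sample data at the bottom is made up. The recovery itself is what vdmRecoverAllRAIDVols automates.

# Drive states called out above as indicating a hardware failure.
FAILED_DRIVE_STATES = {"Acc/GrA/F", "Acc/GrA/NP"}

def classify_failure(drive_states, volume_flags):
    """drive_states: {wwn: state string}; volume_flags: {ssid: 'O' or 'F'}."""
    bad_drives = [w for w, s in drive_states.items() if s in FAILED_DRIVE_STATES]
    failed_vols = [ssid for ssid, flag in volume_flags.items() if flag == "F"]

    if bad_drives:
        return f"Hardware failure: failed/not-present drive(s) {bad_drives}"
    if failed_vols:
        return f"Piece failure: drives optimal but volume(s) {failed_vols} are [F]"
    return "No failed volumes detected"

# Made-up example mirroring the 'pretend' piece failure walked through below:
print(classify_failure(
    {"5000c50035035ab3": "Acc/GrA/Opt", "5000c500350332070": "Acc/GrA/Opt"},
    {2: "F"},
))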

 

Examples of the shell commands mentioned above:

login: shellUsr

Password:

-> vdmShowDriveList

Total Drives: 20

 

DriveAddr  Devnum      T/S  Secure  PI Med State       Ord/VG# Vols WWN

====================================================================================================

0x127c32dc 0x00010018 99/25 Capable  2 HDD Acc/GrA/Opt   2/2      1 5000c50034a4232f0000000000000000

0x127c3730 0x00010027 99/40 Capable  2 HDD Acc/GrA/Opt   2/5      1 5000c50034a614830000000000000000

0x127c2578 0x00010033 99/52 Capable  2 HDD Acc/GrA/Opt   1/5      1 5000c50034ad97170000000000000000

0x127c35e4 0x0001001e 99/31 Capable  2 HDD Acc/GrA/Opt   2/7      1 5000c50034b213b70000000000000000

0x127c20ec 0x0001002a 99/43 Capable  2 HDD Acc/GrA/Opt   1/7      1 5000c5003502d16b0000000000000000

0x127c2848 0x00010039 99/58 Capable  2 HDD Acc/GrA/Opt   1/10     1 5000c5003502d3070000000000000000

0x132542ac 0x0001000c 99/13 Capable  2 HDD Acc/GrA/Opt   1/1      1 5000c5003502d65f0000000000000000

0x127c23f4 0x00010000 99/1  Capable  2 HDD Acc/GrA/Opt   2/1      1 5000c5003502fbab0000000000000000

0x127c3120 0x00010015 99/22 Capable  2 HDD Acc/GrA/Opt   2/9      1 5000c50035030ba70000000000000000

0x127c1fc4 0x00010030 99/49 Capable  2 HDD Acc/GrA/Opt   2/3      1 5000c500350332070000000000000000

0x127c1e1c 0x00010024 99/37 Capable  2 HDD Acc/GrA/Opt   1/2      1 5000c5003503320f0000000000000000

0x127c3a70 0x00010012 99/19 Capable  2 HDD Acc/GrA/Opt   1/6      1 5000c5003503389b0000000000000000

0x12770f88 0x00010003 99/4  Capable  2 HDD Acc/GrA/Opt   1/3      1 5000c50035035ab30000000000000000

0x127c38b4 0x0001001b 99/28 Capable  2 HDD Acc/GrA/Opt   1/4      1 5000c5003503b62b0000000000000000

0x127c2df8 0x0001000f 99/16 Capable  2 HDD Acc/GrA/Opt   2/4      1 5000c5003503cf630000000000000000

0x13254be4 0x00010036 99/55 Capable  2 HDD Acc/GrA/Opt   2/8      1 5000c500350401a70000000000000000

0x127c2fd4 0x00010009 99/10 Capable  2 HDD Acc/GrA/Opt   1/8      1 5000c50040a0e1bf0000000000000000

0x127c26fc 0x00010006 99/7  Capable  2 HDD Acc/GrA/Opt   2/6      1 5000c50040a0f1fb0000000000000000

0x127c3bbc 0x00010021 99/34 Capable  2 HDD Acc/GrA/Opt   1/9      1 5000c50040ace4570000000000000000

0x127c3d40 0x0001002d 99/46 Capable  2 HDD Acc/GrA/Opt   2/10     1 5000c50040acefd30000000000000000

value = 1 = 0x1

→ (State column) The alternate controller MUST report the same state for each individual drive for the controllers to be 'synchronized'.

 

 

-> vdmShowVGInfo

Total Volume Groups: ............. 10

 

Seq:1 / RAID 1 / VGCompleteState / TLP:F / DLP:T / SSM:T / ActDrv:2 / InActDrv:0 / VolCnt:1 / Secure:Capable

BlockSize:512 / PI Capable:T - 2 / Label:0 / VGWwn:60080e50001f9d90000004fe4f5dbf8d

(Active) Drive:0x132542ac devnum:0x0001000c seqNum:1 Tray/Slot:99/13  State:Acc/GrA/Opt

(Active) Drive:0x127c23f4 devnum:0x00010000 seqNum:2 Tray/Slot:99/01  State:Acc/GrA/Opt

Volumes: 0x00000 [O]           → this volume's SSID is "0"

Seq:2 / RAID 1 / VGCompleteState / TLP:F / DLP:T / SSM:T / ActDrv:2 / InActDrv:0 / VolCnt:1 / Secure:Capable

BlockSize:512 / PI Capable:T - 2 / Label:1 / VGWwn:60080e50001f9d90000005014f5dbfe2

(Active) Drive:0x127c1e1c devnum:0x00010024 seqNum:1 Tray/Slot:99/37  State:Acc/GrA/Opt

(Active) Drive:0x127c32dc devnum:0x00010018 seqNum:2 Tray/Slot:99/25  State:Acc/GrA/Opt

Volumes: 0x00001 [O]          → this volume's SSID is "1"

Seq:3 / RAID 1 / VGCompleteState / TLP:F / DLP:T / SSM:T / ActDrv:2 / InActDrv:0 / VolCnt:1 / Secure:Capable

BlockSize:512 / PI Capable:T - 2 / Label:2 / VGWwn:60080e50001f9d90000005024f5dc018

(Active) Drive:0x12770f88 devnum:0x00010003 seqNum:1 Tray/Slot:99/04  State:Acc/GrA/Opt

(Active) Drive:0x127c1fc4 devnum:0x00010030 seqNum:2 Tray/Slot:99/49  State:Acc/GrA/Opt

Volumes: 0x00002 [O]      → this volume's SSID is "2".  If this were the failed volume, the [O] would be [F].  {Side note: you can have more than one volume listed in a Volume Group.}

                                              Both disk drives would be listed as "Optimal", but one or more volumes would be listed as [F] and the others as [O].

<<balance of output snipped>>
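A minimal off-array sketch, assuming the full vdmShowVGInfo output has been saved to a text file (the file name is hypothetical): it pulls every SSID whose status flag is [F] from the "Volumes:" lines, i.e. the SSIDs you would then feed to evfShowVol.

import re

VOL = re.compile(r"0x([0-9a-fA-F]{5})\s*\[([OF])\]")

failed_ssids = []
with open("vdmShowVGInfo.txt") as f:               # hypothetical file name
    for line in f:
        if line.strip().startswith("Volumes:"):
            for ssid_hex, flag in VOL.findall(line):
                if flag == "F":
                    failed_ssids.append(int(ssid_hex, 16))

print("Failed-volume SSIDs to feed to evfShowVol:", failed_ssids)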

 

 

-> evfShowVol 2

Volume 0x2(RAIDVolume)

4 VolumeListener(s)

     N3vdm11VolumeGroupE

     N3vdm17RAIDVolumeManagerE

     N3evf10CmdHandlerE

     N3evf22VolumeUserLabelManagerE

0 Children

Volume 0x2 Attributes:

     Volume Type:       1+1 RAID 1

     User Label:        3

     WWN:               60080e50001f9d90000005034f5dc02e

     Address:           0x130020a0

     Devnum:            0x10000002

     Capacity:          5859482533 blocks

     BlockSize:         512 bytes

     LargeIoSize:       4096 blocks

     Has Extents:       false

     Ownership:         Alternate

     PreferredPath:     Alternate

     Transferring:      false

     StopIOInProgress:  false

     Suspended:         0

     Quiesce Count:     0

     Quiesce Both Ctl:  0 (Local: 0/Alt: 0)

     State:             RV_OPTIMAL

     Unreadable Sectors:Not Present

     App Tag:           0xffff

     App Tag Owned:     false

     Protection Type:   1

     Permissions:       CONFIG=Y            CPYSRC=Y            CPYTGT=Y

<<snipped>>

*** Volume Group Info ***

 

    VG Label   : 2

    WWN        : 60080e50001f9d90000005024f5dc018

    Address    : 0x2ccbfbc0

    RAID Level : RAID 1

    Drive Count: 2

    Boundary   : 0  0x0

    Secure     : False

    PI Capable : True - 2

    Media Type : HDD

 

     Volumes on this Group:

Volume Count:                      1

0x00002 [O]    → our 'pretend' failed volume; here it would show [F]

 

*** Pieces ***

 

    Offset: 0x0                    Length:     0x15d40a000 (5859483648 dec)

    Count : 2                      Data Count: 1

 

    Piece     Devnum        Address  Tray/Slot   State

        0  0x00010003    0x2cce4540    99,4      PieceOptimalState 

        1  0x00010030    0x2cce42ec    99,49     PieceOptimalState     → one of these pieces {0 or 1} would be failed

 

<<balance of output snipped>>
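A minimal off-array sketch, assuming the evfShowVol output for one SSID has been saved to a text file (the file name is hypothetical): it lists any entry in the *** Pieces *** table whose state is not PieceOptimalState, i.e. the failed piece(s) behind a piece failure.

import re

# Piece rows look like: "0  0x00010003    0x2cce4540    99,4   PieceOptimalState"
PIECE = re.compile(r"^\s*(\d+)\s+(0x[0-9a-f]+)\s+0x[0-9a-f]+\s+(\d+,\d+)\s+(\S+)", re.I)

with open("evfShowVol_2.txt") as f:                # hypothetical file name
    for line in f:
        m = PIECE.match(line)
        if m and m.group(4) != "PieceOptimalState":
            piece, devnum, tray_slot, state = m.group(1), m.group(2), m.group(3), m.group(4)
            print(f"Failed piece {piece}: devnum {devnum} at tray/slot {tray_slot} -> {state}")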

 

-> vdmShowOosPieces

 

*************************** PieceManager OOS Pieces **********************

 

**************************************************************************

value = 76 = 0x4c = 'L'

->

Since we don't have any failed "pieces" on this array → nothing is listed, but this command would list the OOS (Out Of Service) timestamp for each failure.

The next step would be to bring the failed pieces back online in reverse order, but with the new "vdmRecoverAllRAIDVols" shell command this procedure is now automated!!! :-)
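For illustration only, a minimal sketch of that ordering rule ("reverse order" = most recent OOS timestamp first), using made-up OOS entries transcribed by hand from a vdmShowOosPieces listing. It does not invoke any revive command; that step is exactly what vdmRecoverAllRAIDVols now automates.

from datetime import datetime

# Hypothetical (piece, OOS timestamp) pairs; real values come from vdmShowOosPieces.
oos_pieces = [
    ("piece 0 / devnum 0x00010003", datetime(2015, 12, 1, 10, 15)),
    ("piece 1 / devnum 0x00010030", datetime(2015, 12, 1, 10, 42)),
]

# Most recent OOS timestamp first = the reverse of the order in which the pieces failed.
for piece, when in sorted(oos_pieces, key=lambda p: p[1], reverse=True):
    print(f"Recover {piece} (went OOS at {when})")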

 

Attachments

Title: vdmRecoverAllRAIDVols
Last Updated: 12/23/2015 04:19 PM
Updated By: Mamoon Ansari

