New command vdmRecoverAllRAIDVols to recover volumes

New shell command vdmRecoverAllRAIDVols recovers a seemingly "failed" volume:

 

From Lou Poletti and Robby Robertson.

See attachment

 

 Two points I see here:

A)     Who to believe???

I have been burned many times in my years of supporting LSI/NetApp disk arrays by "trusting" what the SANtricity GUI is reporting AND/OR what the 'storage-array-profile.txt' file is reporting.

BOTH of these can lie to you!!!!

To get the truth → you must look at the 'state-capture-data.txt' file, and even here you have to perform a reality check on this file.

1)      Make sure there are shell command outputs from both controllers of the disk array in this file.

2)      Are the controllers synchronized??   The quickest way to perform this test is to see if both controllers are reporting the same state/status for all disk drives on the array.

a)      Look at the 'vdmShowDriveList' outputs for both controllers in the state-capture-data.txt file → if all drives report their state exactly the same on both controllers, then you "can trust" the file as the truth (a sketch of this comparison follows this list).

b)      If not, the controllers are not synchronized and you must perform staggered controller reboots (assuming the customer has failover implemented on all attached hosts). There is most likely a "masked condition" on the disk array, and you have to bring it to the surface (via the controller reboots) before you proceed with working on the array.
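A minimal sketch of the comparison from a) above, run on a workstation (Python is not part of the array shell), assuming you have copied each controller's vdmShowDriveList output out of state-capture-data.txt into its own text file; the file names below are hypothetical. It matches drives by WWN and flags any state mismatch between the two controllers.

import re

# Data rows of vdmShowDriveList: capture the State column and the drive WWN
# so drives can be matched across controllers.
ROW = re.compile(
    r"^0x[0-9a-f]+\s+0x[0-9a-f]+\s+\S+\s+\S+\s+\d+\s+\S+\s+(\S+)\s+\S+\s+\d+\s+([0-9a-f]{32})",
    re.I,
)

def drive_states(path):
    """Return {drive WWN: state string} parsed from one controller's output."""
    states = {}
    with open(path) as f:
        for line in f:
            m = ROW.match(line.strip())
            if m:
                states[m.group(2).lower()] = m.group(1)
    return states

ctrl_a = drive_states("ctrlA_vdmShowDriveList.txt")   # hypothetical file name
ctrl_b = drive_states("ctrlB_vdmShowDriveList.txt")   # hypothetical file name

mismatches = {wwn: (ctrl_a.get(wwn), ctrl_b.get(wwn))
              for wwn in set(ctrl_a) | set(ctrl_b)
              if ctrl_a.get(wwn) != ctrl_b.get(wwn)}

if mismatches:
    print("Controllers NOT synchronized; staggered reboots required:")
    for wwn, (a, b) in sorted(mismatches.items()):
        print(f"  {wwn}: A={a}  B={b}")
else:
    print("All drive states match; state-capture-data.txt can be trusted.")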

 

B)     There are two types of volume failures on these disk arrays.

1)      Hardware failure → a failed disk drive (depending on the RAID type, it is typically more than one disk drive in the same volume group).  {look at 'vdmShowDriveList'}

2)      Piece failure → all disk drive members in that volume group are online/optimal, yet the volume is failed because a piece member of the volume is in a failed state (this is a 'logical' failure, if you will).

                             {look at 'vdmShowVGInfo', then 'evfShowVol (SSID#)', then 'vdmShowOosPieces'}

For a hardware failure, 'vdmShowDriveList' would report a drive state of "Acc/GrA/F" or "Acc/GrA/NP" for the bad disk drive.

For a piece failure, the new "vdmRecoverAllRAIDVols" shell command is a shell script that automates the way this was previously handled (pre-08.10.13.00 f/w) → vdmShowVGInfo, evfShowVol, vdmShowOosPieces, etc.
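A minimal sketch of telling the two failure types apart, again off-array in Python, assuming the drive states and volume status flags have already been pulled out of the state-capture-data.txt outputs; the sample data at the bottom is made up. The recovery itself is what vdmRecoverAllRAIDVols automates.

# Drive states called out above as indicating a hardware failure.
FAILED_DRIVE_STATES = {"Acc/GrA/F", "Acc/GrA/NP"}

def classify_failure(drive_states, volume_flags):
    """drive_states: {wwn: state string}; volume_flags: {ssid: 'O' or 'F'}."""
    bad_drives = [w for w, s in drive_states.items() if s in FAILED_DRIVE_STATES]
    failed_vols = [ssid for ssid, flag in volume_flags.items() if flag == "F"]

    if bad_drives:
        return f"Hardware failure: failed/not-present drive(s) {bad_drives}"
    if failed_vols:
        return f"Piece failure: drives optimal but volume(s) {failed_vols} are [F]"
    return "No failed volumes detected"

# Made-up example mirroring the 'pretend' piece failure walked through below:
print(classify_failure(
    {"5000c50035035ab3": "Acc/GrA/Opt", "5000c500350332070": "Acc/GrA/Opt"},
    {2: "F"},
))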

 

Examples of the shell commands mentioned above:

login: shellUsr

Password:

-> vdmShowDriveList

Total Drives: 20

 

DriveAddr  Devnum      T/S  Secure  PI Med State       Ord/VG# Vols WWN

====================================================================================================

0x127c32dc 0x00010018 99/25 Capable  2 HDD Acc/GrA/Opt   2/2      1 5000c50034a4232f0000000000000000

0x127c3730 0x00010027 99/40 Capable  2 HDD Acc/GrA/Opt   2/5      1 5000c50034a614830000000000000000

0x127c2578 0x00010033 99/52 Capable  2 HDD Acc/GrA/Opt   1/5      1 5000c50034ad97170000000000000000

0x127c35e4 0x0001001e 99/31 Capable  2 HDD Acc/GrA/Opt   2/7      1 5000c50034b213b70000000000000000

0x127c20ec 0x0001002a 99/43 Capable  2 HDD Acc/GrA/Opt   1/7      1 5000c5003502d16b0000000000000000

0x127c2848 0x00010039 99/58 Capable  2 HDD Acc/GrA/Opt   1/10     1 5000c5003502d3070000000000000000

0x132542ac 0x0001000c 99/13 Capable  2 HDD Acc/GrA/Opt   1/1      1 5000c5003502d65f0000000000000000

0x127c23f4 0x00010000 99/1  Capable  2 HDD Acc/GrA/Opt   2/1      1 5000c5003502fbab0000000000000000

0x127c3120 0x00010015 99/22 Capable  2 HDD Acc/GrA/Opt   2/9      1 5000c50035030ba70000000000000000

0x127c1fc4 0x00010030 99/49 Capable  2 HDD Acc/GrA/Opt   2/3      1 5000c500350332070000000000000000

0x127c1e1c 0x00010024 99/37 Capable  2 HDD Acc/GrA/Opt   1/2      1 5000c5003503320f0000000000000000

0x127c3a70 0x00010012 99/19 Capable  2 HDD Acc/GrA/Opt   1/6      1 5000c5003503389b0000000000000000

0x12770f88 0x00010003 99/4  Capable  2 HDD Acc/GrA/Opt   1/3      1 5000c50035035ab30000000000000000

0x127c38b4 0x0001001b 99/28 Capable  2 HDD Acc/GrA/Opt   1/4      1 5000c5003503b62b0000000000000000

0x127c2df8 0x0001000f 99/16 Capable  2 HDD Acc/GrA/Opt   2/4      1 5000c5003503cf630000000000000000

0x13254be4 0x00010036 99/55 Capable  2 HDD Acc/GrA/Opt   2/8      1 5000c500350401a70000000000000000

0x127c2fd4 0x00010009 99/10 Capable  2 HDD Acc/GrA/Opt   1/8      1 5000c50040a0e1bf0000000000000000

0x127c26fc 0x00010006 99/7  Capable  2 HDD Acc/GrA/Opt   2/6      1 5000c50040a0f1fb0000000000000000

0x127c3bbc 0x00010021 99/34 Capable  2 HDD Acc/GrA/Opt   1/9      1 5000c50040ace4570000000000000000

0x127c3d40 0x0001002d 99/46 Capable  2 HDD Acc/GrA/Opt   2/10     1 5000c50040acefd30000000000000000

value = 1 = 0x1

→ (State column) The alternate controller MUST report the same state for each individual drive for the controllers to be 'synchronized'.

 

 

-> vdmShowVGInfo

Total Volume Groups: ............. 10

 

Seq:1 / RAID 1 / VGCompleteState / TLP:F / DLP:T / SSM:T / ActDrv:2 / InActDrv:0 / VolCnt:1 / Secure:Capable

BlockSize:512 / PI Capable:T - 2 / Label:0 / VGWwn:60080e50001f9d90000004fe4f5dbf8d

(Active) Drive:0x132542ac devnum:0x0001000c seqNum:1 Tray/Slot:99/13  State:Acc/GrA/Opt

(Active) Drive:0x127c23f4 devnum:0x00010000 seqNum:2 Tray/Slot:99/01  State:Acc/GrA/Opt

Volumes: 0x00000 [O]           → this volume's SSID is "0"

Seq:2 / RAID 1 / VGCompleteState / TLP:F / DLP:T / SSM:T / ActDrv:2 / InActDrv:0 / VolCnt:1 / Secure:Capable

BlockSize:512 / PI Capable:T - 2 / Label:1 / VGWwn:60080e50001f9d90000005014f5dbfe2

(Active) Drive:0x127c1e1c devnum:0x00010024 seqNum:1 Tray/Slot:99/37  State:Acc/GrA/Opt

(Active) Drive:0x127c32dc devnum:0x00010018 seqNum:2 Tray/Slot:99/25  State:Acc/GrA/Opt

Volumes: 0x00001 [O]          → this volume's SSID is "1"

Seq:3 / RAID 1 / VGCompleteState / TLP:F / DLP:T / SSM:T / ActDrv:2 / InActDrv:0 / VolCnt:1 / Secure:Capable

BlockSize:512 / PI Capable:T - 2 / Label:2 / VGWwn:60080e50001f9d90000005024f5dc018

(Active) Drive:0x12770f88 devnum:0x00010003 seqNum:1 Tray/Slot:99/04  State:Acc/GrA/Opt

(Active) Drive:0x127c1fc4 devnum:0x00010030 seqNum:2 Tray/Slot:99/49  State:Acc/GrA/Opt

Volumes: 0x00002 [O]      → this volume's SSID is "2".  If this were the failed volume, the [O] would be [F].  {Side note: you can have more than one volume listed in a Volume Group.}

                                              Both disk drives would be listed as "Optimal", but one or more volumes would be listed as [F] and the others as [O].

<<balance of output snipped>>
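A minimal off-array sketch, assuming the full vdmShowVGInfo output has been saved to a text file (the file name is hypothetical): it pulls every SSID whose status flag is [F] from the "Volumes:" lines, i.e. the SSIDs you would then feed to evfShowVol.

import re

VOL = re.compile(r"0x([0-9a-fA-F]{5})\s*\[([OF])\]")

failed_ssids = []
with open("vdmShowVGInfo.txt") as f:               # hypothetical file name
    for line in f:
        if line.strip().startswith("Volumes:"):
            for ssid_hex, flag in VOL.findall(line):
                if flag == "F":
                    failed_ssids.append(int(ssid_hex, 16))

print("Failed-volume SSIDs to feed to evfShowVol:", failed_ssids)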

 

 

-> evfShowVol 2

Volume 0x2(RAIDVolume)

4 VolumeListener(s)

     N3vdm11VolumeGroupE

     N3vdm17RAIDVolumeManagerE

     N3evf10CmdHandlerE

     N3evf22VolumeUserLabelManagerE

0 Children

Volume 0x2 Attributes:

     Volume Type:       1+1 RAID 1

     User Label:        3

     WWN:               60080e50001f9d90000005034f5dc02e

     Address:           0x130020a0

     Devnum:            0x10000002

     Capacity:          5859482533 blocks

     BlockSize:         512 bytes

     LargeIoSize:       4096 blocks

     Has Extents:       false

     Ownership:         Alternate

     PreferredPath:     Alternate

     Transferring:      false

     StopIOInProgress:  false

     Suspended:         0

     Quiesce Count:     0

     Quiesce Both Ctl:  0 (Local: 0/Alt: 0)

     State:             RV_OPTIMAL

     Unreadable Sectors:Not Present

     App Tag:           0xffff

     App Tag Owned:     false

     Protection Type:   1

     Permissions:       CONFIG=Y            CPYSRC=Y            CPYTGT=Y

<<snipped>>

*** Volume Group Info ***

 

    VG Label   : 2

    WWN        : 60080e50001f9d90000005024f5dc018

    Address    : 0x2ccbfbc0

    RAID Level : RAID 1

    Drive Count: 2

    Boundary   : 0  0x0

    Secure     : False

    PI Capable : True - 2

    Media Type : HDD

 

     Volumes on this Group:

Volume Count:                      1

0x00002 [O]    → our 'pretend' failed volume; here it would show [F]

 

*** Pieces ***

 

    Offset: 0x0                    Length:     0x15d40a000 (5859483648 dec)

    Count : 2                      Data Count: 1

 

    Piece     Devnum        Address  Tray/Slot   State

        0  0x00010003    0x2cce4540    99,4      PieceOptimalState 

        1  0x00010030    0x2cce42ec    99,49     PieceOptimalState     → one of these pieces {0 or 1} would be failed

 

<<balance of output snipped>>
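A minimal off-array sketch, assuming the evfShowVol output for one SSID has been saved to a text file (the file name is hypothetical): it lists any entry in the *** Pieces *** table whose state is not PieceOptimalState, i.e. the failed piece(s) behind a piece failure.

import re

# Piece rows look like: "0  0x00010003    0x2cce4540    99,4   PieceOptimalState"
PIECE = re.compile(r"^\s*(\d+)\s+(0x[0-9a-f]+)\s+0x[0-9a-f]+\s+(\d+,\d+)\s+(\S+)", re.I)

with open("evfShowVol_2.txt") as f:                # hypothetical file name
    for line in f:
        m = PIECE.match(line)
        if m and m.group(4) != "PieceOptimalState":
            piece, devnum, tray_slot, state = m.group(1), m.group(2), m.group(3), m.group(4)
            print(f"Failed piece {piece}: devnum {devnum} at tray/slot {tray_slot} -> {state}")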

 

-> vdmShowOosPieces

 

*************************** PieceManager OOS Pieces **********************

 

**************************************************************************

value = 76 = 0x4c = 'L'

->

Since we don't have any failed "pieces" on this array → nothing is listed, but this command would list the OOS (Out Of Service) timestamp for each failure.

The next step would be to bring the failed pieces back online in reverse order, but with the new "vdmRecoverAllRAIDVols" shell command this procedure is now automated!!! :-)
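For illustration only, a minimal sketch of that ordering rule ("reverse order" = most recent OOS timestamp first), using made-up OOS entries transcribed by hand from a vdmShowOosPieces listing. It does not invoke any revive command; that step is exactly what vdmRecoverAllRAIDVols now automates.

from datetime import datetime

# Hypothetical (piece, OOS timestamp) pairs; real values come from vdmShowOosPieces.
oos_pieces = [
    ("piece 0 / devnum 0x00010003", datetime(2015, 12, 1, 10, 15)),
    ("piece 1 / devnum 0x00010030", datetime(2015, 12, 1, 10, 42)),
]

# Most recent OOS timestamp first = the reverse of the order in which the pieces failed.
for piece, when in sorted(oos_pieces, key=lambda p: p[1], reverse=True):
    print(f"Recover {piece} (went OOS at {when})")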

 

Attachments

Title: vdmRecoverAllRAIDVols
Last Updated: 12/23/2015 04:19 PM
Updated By: Mamoon Ansari

