New command vdmRecoverAllRAIDVols to recover a seemingly "failed" volume:
From Lou Poletti and Robby Robertson.
See attachment
Two points I see here:
A) Who to believe???
I have been burned many times over my years of supporting LSI/NetApp disk arrays by “trusting” what the SANtricity GUI is reporting AND/OR what the ‘storage-array-profile.txt’ file is reporting.
BOTH of these can lie to you!!!!
To get the truth → you must look at the ‘state-capture-data.txt’ file, and even there you have to perform a reality check on the file:
1) Make sure there are shell command outputs from both controllers of the disk array in this file.
2) Are the controllers synchronized?? The quickest way to perform this test is to see if both controllers are reporting the same state/status for all disk drives on the array.
a) Look at the ‘vdmShowDriveList’ outputs for both controllers in the state-capture-data.txt file → if all drives report exactly the same state on both controllers, then you “can trust” the file as the truth.
b) If not, the controllers are not synchronized and you must perform staggered controller reboots (assuming the customer has failover implemented on all attached hosts), because there is most likely a “masked condition” on the disk array that has to be brought to the surface (via the controller reboots) before you proceed with working on the array. (A rough parsing sketch of this synchronization check follows below.)
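To make that synchronization check less error-prone when the drive list is long, here is a minimal Python sketch (mine, not a supported tool) that compares the per-drive states reported by the two controllers. It assumes you have copied each controller's ‘vdmShowDriveList’ section out of state-capture-data.txt into its own text string, and the column parsing is based only on the example output shown further down this page.

import re

# Minimal sketch (not a supported tool): compare the per-drive state reported by
# controller A and controller B in their vdmShowDriveList outputs. The two input
# strings are assumed to be copied by hand out of state-capture-data.txt.
DRIVE_LINE = re.compile(
    r"^0x[0-9a-fA-F]+\s+0x[0-9a-fA-F]+\s+(?P<tray_slot>\S+)\s+\S+\s+\d+\s+\S+\s+"
    r"(?P<state>\S+)\s+\S+\s+\d+\s+(?P<wwn>[0-9a-fA-F]{32})\s*$"
)

def drive_states(drive_list_output):
    """Map drive WWN -> (tray/slot, state) from one controller's vdmShowDriveList dump."""
    states = {}
    for line in drive_list_output.splitlines():
        m = DRIVE_LINE.match(line.strip())
        if m:
            states[m.group("wwn")] = (m.group("tray_slot"), m.group("state"))
    return states

def controllers_synchronized(ctrl_a_output, ctrl_b_output):
    """Return True only if both controllers report the same state for every drive."""
    a, b = drive_states(ctrl_a_output), drive_states(ctrl_b_output)
    if a.keys() != b.keys():
        return False  # one controller is missing drives -> treat as not synchronized
    ok = True
    for wwn in a:
        if a[wwn][1] != b[wwn][1]:
            print("MISMATCH %s: A=%s (%s)  B=%s (%s)"
                  % (wwn, a[wwn][1], a[wwn][0], b[wwn][1], b[wwn][0]))
            ok = False
    return ok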
B) There are two types of volume failures on these disk arrays.
1) Hardware failure → a failed disk drive (depending on the RAID type, it typically takes more than one failed disk drive in the same volume group) {look at ‘vdmShowDriveList’}
2) Piece failure → all disk drive members in that volume group are online/optimal, yet the volume is failed because a piece member of the volume is in a failed state. (This is kind of a ‘logical’ failure, if you will.)
{look at ‘vdmShowVGInfo’, then ‘evfShowVol (SSID#)’, then ‘vdmShowOosPieces’}
For a hardware failure, ‘vdmShowDriveList’ would report the state of the bad disk drive as “Acc/GrA/F” or “Acc/GrA/NP”.
For a piece failure, the new shell command “vdmRecoverAllRAIDVols” automates what previously (pre-08.10.13.00 f/w) had to be worked through by hand → vdmShowVGInfo, evfShowVol, vdmShowOosPieces, etc. (A sketch of this hardware-vs-piece decision follows below.)
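For what it is worth, the decision between the two failure types boils down to a very small piece of logic. The sketch below is illustrative only (the function name and input shapes are mine, nothing here comes from the controller firmware): feed it the drive state strings from ‘vdmShowDriveList’ and the [O]/[F] flags from the ‘vdmShowVGInfo’ "Volumes:" lines and it tells you which case you are in.

# Illustrative only -- not part of the firmware or any shipped tool.
# drive_states:  list of state strings from vdmShowDriveList, e.g. "Acc/GrA/Opt"
# volume_flags:  dict of SSID -> "O" or "F", from the vdmShowVGInfo "Volumes:" lines
FAILED_DRIVE_STATES = {"Acc/GrA/F", "Acc/GrA/NP"}  # the states quoted in the note above

def classify_volume_failure(drive_states, volume_flags):
    failed_volumes = [ssid for ssid, flag in volume_flags.items() if flag == "F"]
    failed_drives = [s for s in drive_states if s in FAILED_DRIVE_STATES]
    if not failed_volumes:
        return "no failed volumes"
    if failed_drives:
        # One or more member drives are failed / not present -> hardware failure.
        return "hardware failure"
    # Every drive is optimal yet a volume is failed -> piece ('logical') failure,
    # i.e. the case vdmRecoverAllRAIDVols is meant to automate.
    return "piece failure (candidate for vdmRecoverAllRAIDVols)"

# Example using the 'pretend' failure walked through later on this page:
print(classify_volume_failure(["Acc/GrA/Opt"] * 20, {"0x00002": "F"}))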
Examples of the shell commands mentioned above:
login: shellUsr
Password:
-> vdmShowDriveList
Total Drives: 20
DriveAddr Devnum T/S Secure PI Med State Ord/VG# Vols WWN
====================================================================================================
0x127c32dc 0x00010018 99/25 Capable 2 HDD Acc/GrA/Opt 2/2 1 5000c50034a4232f0000000000000000
0x127c3730 0x00010027 99/40 Capable 2 HDD Acc/GrA/Opt 2/5 1 5000c50034a614830000000000000000
0x127c2578 0x00010033 99/52 Capable 2 HDD Acc/GrA/Opt 1/5 1 5000c50034ad97170000000000000000
0x127c35e4 0x0001001e 99/31 Capable 2 HDD Acc/GrA/Opt 2/7 1 5000c50034b213b70000000000000000
0x127c20ec 0x0001002a 99/43 Capable 2 HDD Acc/GrA/Opt 1/7 1 5000c5003502d16b0000000000000000
0x127c2848 0x00010039 99/58 Capable 2 HDD Acc/GrA/Opt 1/10 1 5000c5003502d3070000000000000000
0x132542ac 0x0001000c 99/13 Capable 2 HDD Acc/GrA/Opt 1/1 1 5000c5003502d65f0000000000000000
0x127c23f4 0x00010000 99/1 Capable 2 HDD Acc/GrA/Opt 2/1 1 5000c5003502fbab0000000000000000
0x127c3120 0x00010015 99/22 Capable 2 HDD Acc/GrA/Opt 2/9 1 5000c50035030ba70000000000000000
0x127c1fc4 0x00010030 99/49 Capable 2 HDD Acc/GrA/Opt 2/3 1 5000c500350332070000000000000000
0x127c1e1c 0x00010024 99/37 Capable 2 HDD Acc/GrA/Opt 1/2 1 5000c5003503320f0000000000000000
0x127c3a70 0x00010012 99/19 Capable 2 HDD Acc/GrA/Opt 1/6 1 5000c5003503389b0000000000000000
0x12770f88 0x00010003 99/4 Capable 2 HDD Acc/GrA/Opt 1/3 1 5000c50035035ab30000000000000000
0x127c38b4 0x0001001b 99/28 Capable 2 HDD Acc/GrA/Opt 1/4 1 5000c5003503b62b0000000000000000
0x127c2df8 0x0001000f 99/16 Capable 2 HDD Acc/GrA/Opt 2/4 1 5000c5003503cf630000000000000000
0x13254be4 0x00010036 99/55 Capable 2 HDD Acc/GrA/Opt 2/8 1 5000c500350401a70000000000000000
0x127c2fd4 0x00010009 99/10 Capable 2 HDD Acc/GrA/Opt 1/8 1 5000c50040a0e1bf0000000000000000
0x127c26fc 0x00010006 99/7 Capable 2 HDD Acc/GrA/Opt 2/6 1 5000c50040a0f1fb0000000000000000
0x127c3bbc 0x00010021 99/34 Capable 2 HDD Acc/GrA/Opt 1/9 1 5000c50040ace4570000000000000000
0x127c3d40 0x0001002d 99/46 Capable 2 HDD Acc/GrA/Opt 2/10 1 5000c50040acefd30000000000000000
value = 1 = 0x1

→ The alternate controller MUST report the same state (the “State” column above) for each individual drive for the controllers to be ‘synchronized’.
-> vdmShowVGInfo
Total Volume Groups: ............. 10
Seq:1 / RAID 1 / VGCompleteState / TLP:F / DLP:T / SSM:T / ActDrv:2 / InActDrv:0 / VolCnt:1 / Secure:Capable
BlockSize:512 / PI Capable:T - 2 / Label:0 / VGWwn:60080e50001f9d90000004fe4f5dbf8d
(Active) Drive:0x132542ac devnum:0x0001000c seqNum:1 Tray/Slot:99/13 State:Acc/GrA/Opt
(Active) Drive:0x127c23f4 devnum:0x00010000 seqNum:2 Tray/Slot:99/01 State:Acc/GrA/Opt
Volumes: 0x00000 [O]  → this volume’s SSID is “0”
Seq:2 / RAID 1 / VGCompleteState / TLP:F / DLP:T / SSM:T / ActDrv:2 / InActDrv:0 / VolCnt:1 / Secure:Capable
BlockSize:512 / PI Capable:T - 2 / Label:1 / VGWwn:60080e50001f9d90000005014f5dbfe2
(Active) Drive:0x127c1e1c devnum:0x00010024 seqNum:1 Tray/Slot:99/37 State:Acc/GrA/Opt
(Active) Drive:0x127c32dc devnum:0x00010018 seqNum:2 Tray/Slot:99/25 State:Acc/GrA/Opt
Volumes: 0x00001 [O]  → this volume’s SSID is “1”
Seq:3 / RAID 1 / VGCompleteState / TLP:F / DLP:T / SSM:T / ActDrv:2 / InActDrv:0 / VolCnt:1 / Secure:Capable
BlockSize:512 / PI Capable:T - 2 / Label:2 / VGWwn:60080e50001f9d90000005024f5dc018
(Active) Drive:0x12770f88 devnum:0x00010003 seqNum:1 Tray/Slot:99/04 State:Acc/GrA/Opt
(Active) Drive:0x127c1fc4 devnum:0x00010030 seqNum:2 Tray/Slot:99/49 State:Acc/GrA/Opt
Volumes: 0x00002 [O]  → this volume’s SSID is “2”. If this were the failed volume, the [O] would be [F]. {Side note: you can have more than one volume listed in a Volume Group.}
In a piece failure, both disk drives would still be listed as “Optimal”, but one or more volumes would be listed as [F] while the others remain [O]. (A small parsing sketch for these flags follows the snipped output below.)
<<balance of output snipped>>
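If you are staring at a long vdmShowVGInfo dump, a few lines of Python can pull those flags out for you. This is only a sketch based on the "Volumes: 0x00002 [O]" line format shown above; it is not part of any shipped tooling.

import re

# Sketch: list the SSIDs of any volumes flagged [F] in a vdmShowVGInfo dump.
# Parsing is based only on the "Volumes: 0x00002 [O]" line format shown above.
VOLUME_LINE = re.compile(r"Volumes:\s+((?:0x[0-9a-fA-F]+\s+\[[OF]\]\s*)+)")
VOLUME_ENTRY = re.compile(r"(0x[0-9a-fA-F]+)\s+\[([OF])\]")

def failed_volume_ssids(vg_info_output):
    failed = []
    for m in VOLUME_LINE.finditer(vg_info_output):
        for ssid_hex, flag in VOLUME_ENTRY.findall(m.group(1)):
            if flag == "F":
                failed.append(int(ssid_hex, 16))
    return failed

# On the healthy output above this returns [] because every volume is [O];
# in a piece failure it returns the SSID(s) to feed to evfShowVol.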
-> evfShowVol 2
Volume 0x2(RAIDVolume)
4 VolumeListener(s)
N3vdm11VolumeGroupE
N3vdm17RAIDVolumeManagerE
N3evf10CmdHandlerE
N3evf22VolumeUserLabelManagerE
0 Children
Volume 0x2 Attributes:
Volume Type: 1+1 RAID 1
User Label: 3
WWN: 60080e50001f9d90000005034f5dc02e
Address: 0x130020a0
Devnum: 0x10000002
Capacity: 5859482533 blocks
BlockSize: 512 bytes
LargeIoSize: 4096 blocks
Has Extents: false
Ownership: Alternate
PreferredPath: Alternate
Transferring: false
StopIOInProgress: false
Suspended: 0
Quiesce Count: 0
Quiesce Both Ctl: 0 (Local: 0/Alt: 0)
State: RV_OPTIMAL
Unreadable Sectors:Not Present
App Tag: 0xffff
App Tag Owned: false
Protection Type: 1
Permissions: CONFIG=Y CPYSRC=Y CPYTGT=Y
<<snipped>>
*** Volume Group Info ***
VG Label : 2
WWN : 60080e50001f9d90000005024f5dc018
Address : 0x2ccbfbc0
RAID Level : RAID 1
Drive Count: 2
Boundary : 0 0x0
Secure : False
PI Capable : True - 2
Media Type : HDD
Volumes on this Group:
Volume Count: 1
0x00002 [O]  → our ‘pretend’ failed volume (in a real piece failure this would show [F])
*** Pieces ***
Offset: 0x0 Length: 0x15d40a000 (5859483648 dec)
Count : 2 Data Count: 1
Piece Devnum Address Tray/Slot State
0 0x00010003 0x2cce4540 99,4 PieceOptimalState
1 0x00010030 0x2cce42ec 99,49 PieceOptimalState  → one of these pieces {0 or 1} would be failed in a real piece failure
<<balance of output snipped>>
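Similarly, the "*** Pieces ***" table tells you which piece member is the bad one. The sketch below (again mine, not a shipped tool) flags any piece whose state is not PieceOptimalState, using the column layout shown above.

import re

# Sketch: report non-optimal pieces from the "*** Pieces ***" table of an
# evfShowVol dump, using the column layout shown above (index, devnum,
# address, tray,slot, state). Anything other than PieceOptimalState is flagged.
PIECE_LINE = re.compile(
    r"^\s*(\d+)\s+(0x[0-9a-fA-F]+)\s+0x[0-9a-fA-F]+\s+(\d+,\d+)\s+(\S+)\s*$")

def non_optimal_pieces(evf_show_vol_output):
    bad = []
    for line in evf_show_vol_output.splitlines():
        m = PIECE_LINE.match(line)
        if m and m.group(4) != "PieceOptimalState":
            piece, devnum, tray_slot, state = m.groups()
            bad.append((int(piece), devnum, tray_slot, state))
    return bad  # empty on the healthy example above; the failed piece shows up here in a real failure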
-> vdmShowOosPieces
*************************** PieceManager OOS Pieces **********************
**************************************************************************
value = 76 = 0x4c = 'L'
->
Since we don’t have any failed “pieces” on this array → nothing is listed, but this command would list the OOS (Out Of Service) timestamp for each failed piece.
The next step would be to bring the failed pieces back online in reverse order, but with the new “vdmRecoverAllRAIDVols” shell command → this procedure is now automated!!! ☺ (A tiny ordering sketch follows for reference.)
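Before 08.10.13.00 that “reverse order” had to be worked out by hand, which I read as “the piece that went OOS last gets revived first” (double-check that reading against your recovery documentation). The snippet below only shows that ordering step on hand-copied (SSID, piece, timestamp) tuples with made-up example timestamps; it deliberately issues no shell commands, since vdmRecoverAllRAIDVols now does the whole thing for you.

from datetime import datetime

# Conceptual sketch only: sort hand-copied OOS entries so the most recently
# failed piece comes first. No revive command is issued or implied here.
def recovery_order(oos_pieces):
    # oos_pieces: iterable of (ssid, piece_index, oos_timestamp)
    return sorted(oos_pieces, key=lambda entry: entry[2], reverse=True)

example = [
    (2, 1, datetime(2018, 3, 1, 10, 15)),  # went OOS first
    (2, 0, datetime(2018, 3, 1, 10, 42)),  # went OOS last -> handle this one first
]
print(recovery_order(example))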