SR3425526 GUI Filesystem Expansion failed |
SR Information: 3425526 GWDG / Max Planck
Problem Description: GUI Filesystem Expansion failed
Product / Software Version:
MDC: SNMS 4.7.0.1 SuSE 10 SP3 64 bit (2.6.16.60-0.54.5) |
Overview
Superblock not updated after failed SG Expansion
Customer added 1SG/LUN via WebGui, after the required start/stop, we can see Allocation latency messages within the cvlog. 5 min later customer added 2 more SG which failed, ending in a failover.
All furhter attemps to expand the Filesystem will fail.
Symptoms & Identifying the problem
## 1 ## Log Review:
## GWDU 160 ## ( primary )
SnGui.log
1te Expansion -> OK
[0729 07:29:48.808857 3160] Enter: SNAdmin_FSMStopAndWait: nshandle=0 name=UNI tmout=60
[0729 07:29:48.815562 3160] WaitForStoppedFsm: Found UNI at gwdu161s.fs.gwdg.de. stopped = 0
[0729 07:29:52.834579 3160] Enter: SNAdmin_GetFSConfig: fsname=UNI autoconvert=1 msgfmt=1
[0729 07:29:52.852743 3160] Exit: OK: SNAdmin_GetFSConfig: fsname=UNI
[0729 07:29:52.871410 3160] SNAdmin_Init: Done: dbg_tracks=0x8
[0729 07:29:52.871503 3160] Enter: SNAdmin_ValidateFSConfig: fsname=UNI msgfmt=1
[0729 07:29:52.883846 3160] Exit: OK: SNAdmin_ValidateFSConfig: For fsname=UNI
[0729 07:29:52.885667 3160] SNAdmin_Init: Done: dbg_tracks=0x8
[0729 07:29:52.888418 3160] Enter: SNAdmin_PutFSConfig: fsname=UNI msgfmt=1
[0729 07:29:52.910810 3160] Exit: OK: SNAdmin_PutFSConfig: fsname=UNI
[0729 07:32:07.564480 3160] Enter: SNAdmin_FSMStartAndWaitHA: nshandle=0 fsname=UNI timeout=30
[0729 07:32:09.595359 3160] Enter: SNAdmin_FSMStart: nshandle=0 fsname='UNI' hostname=10.108.2.61
2nd Expansion failed
[0729 07:37:07.862898 25734] Enter: SNAdmin_ValidateFSConfig: fsname=UNI msgfmt=1
[0729 07:37:07.877283 25734] Exit: OK: SNAdmin_ValidateFSConfig: For fsname=UNI
[0729 07:37:07.878872 25734] SNAdmin_Init: Done: dbg_tracks=0x8
[0729 07:37:07.881167 25734] Enter: SNAdmin_PutFSConfig: fsname=UNI msgfmt=1
[0729 07:37:07.899711 25734] Exit: OK: SNAdmin_PutFSConfig: fsname=UNI
==> no furhter logging, why it failed .. seems timeout is 5min between PutFsConfig & RemoceFSConfig
[0729 07:42:07.908637 25734] Enter: SNAdmin_RemoveFSConfig: fsname=UNI msgfmt=2
[0729 07:42:07.909614 25734] Exit: OK: SNAdmin_RemoveFSConfig: fsname=UNI
[0729 07:42:07.929067 25734] SNAdmin_Init: Done: dbg_tracks=0x8
[0729 07:42:07.929162 25734] Enter: SNAdmin_ValidateFSConfig: fsname=UNI msgfmt=1
[0729 07:42:07.935447 25734] Exit: OK: SNAdmin_ValidateFSConfig: For fsname=UNI
[0729 07:42:07.936952 25734] SNAdmin_Init: Done: dbg_tracks=0x8
[0729 07:42:07.938905 25734] Enter: SNAdmin_PutFSConfig: fsname=UNI msgfmt=1
==> cleanup done ok RemoveFSConfig
tomcat logs:
2014-07-29 07:42:07,908 ERROR com.quantum.qutosgui.app.cli.ProcessThread - InterruptedException occurred when executing /usr/cvfs/bin/cvupdatefs java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:503)
at java.lang.UNIXProcess.waitFor(UNIXProcess.java:210)
at com.quantum.qutosgui.app.cli.ProcessThread.run(ProcessThread.java:128)
Output: Execution of command '[/usr/cvfs/bin/cvupdatefs, -F, UNI]' timed out after 300 seconds.
Error: Execution of command '[/usr/cvfs/bin/cvupdatefs, -F, UNI]' timed out after 300 seconds.
at com.quantum.qutosgui.app.cli.AdicInvokeCLI.invoke(AdicInvokeCLI.java:160)
at com.quantum.qutosgui.app.cli.AdicInvokeCLI.invoke(AdicInvokeCLI.java:109)
at com.quantum.qutosgui.app.fs.FileSystemCLI.updateFilesystem(FileSystemCLI.java:282)
... 11 more
==> From the traces you can see that the timestamp when the update has been run is:
cvupdatefs-07_29_2014-07_37_07, exactly the 5 min noticed and it was due to a timeout.
[root]# /usr/adic/java/jre/bin/java -jar /usr/adic/tomcat/lib/derbyrun.jar ij
ij version 10.5
ij> connect 'jdbc:derby://localhost:1527//usr/adic/gui/database/guiPrivateDb';
ij> select * from GUI_USER;
USERNAME |PASSWORD |TIMEOUT
-------------------------------------------------------------------------------
admin |5f4dcc3b5aa765d61d8327deb882cf99 |1800
service |4201fb824bbe97fea89d4f31e4914dae |1800
wsuser |ebf1d446ef32f4745169e440608e5d1d |1800
CvLog
==> we hit latency problems after the first start/stop due to the Expansion. Filesystem being very busy during the Migration due to User Activity
[0729 07:32:27.955463] 0x41616940 (Debug) Alloc max us latency 169529 inode 0x180031979cb40 byte hint 0xb actual 0x100000
[0729 07:35:19.677215] 0x4179c940 (Debug) Alloc max us latency 887859 inode 0x128000000e30c737 byte hint 0x3000 actual 0x3000
[0729 07:35:19.677257] 0x4140e940 (Debug) Alloc max us latency 887889 inode 0x128000000e30c795 byte hint 0x1000 actual 0x1000
[0729 07:35:19.677377] 0x41206940 (Debug) Alloc max us latency 887928 inode 0x128000000e30c724 byte hint 0x1000 actual 0x1000
[0729 07:35:19.677388] 0x41288940 (Debug) Alloc max us latency 888310 inode 0x128000000e30c74d byte hint 0x2000 actual 0x2000
[0729 07:49:38.586559] 0x41963940 (Debug) Alloc max us latency 252628 inode 0x314372cdb byte hint 0x1000 actual 0x1000
[0729 07:51:42.954793] 0x41698940 (Debug) Alloc max us latency 987269 inode 0x180020904a52a byte hint 0x1000 actual 0x1000
[0729 08:02:59.135016] 0x41512940 (Debug) Alloc max us latency 257154 inode 0x180020904a52b byte hint 0x1000 actual 0x1000
[0729 08:03:21.968633] 0x41657940 (Debug) Alloc max us latency 856274 inode 0x128000000e2c5523 byte hint 0x1000 actual 0x1000
[0729 08:05:59.484853] 0x413cd940 (Debug) Alloc max us latency 1052025 inode 0x180020904a52a byte hint 0x1000 actual 0x100
/usr/adic/DSM/data/UNI/trace/cvupdatefs*
Note: cvupdatefs trace was trimmed to only reflect the changes
1# Time: 7:29 am ==> OK
Stripe Group Name Stripe Status MetaData Journal
================= ============= ======== =======
UNIData37 Create
Flushing journal entries... done
Initializing stripe group UNIData37 information... ok
Flushing buffers...
Updating ICB information...
Updating SuperBlock information...
*Warning*: File system 'UNI' was modified.
2# Time: 7:37 am ==> Failed, presumably timed out, ICB not changed, failover due to election/vote during startup
UNIData37 No Change
UNIData38 Create
UNIData39 Create
Flushing journal entries... done
Initializing stripe group UNIData38 information... ok
Initializing stripe group UNIData39 information... ok
Flushing buffers...
Note: Log ends here
3# Time 7:53 am ==> Trial with only 1 SG Failed not being able to initialize the SG
UNIData37 No Change
UNIData38 Create
Flushing journal entries... done
Initializing stripe group UNIData38 information... *Error*: ERROR
*Fatal*: Failed to initialize stripe group
Check configuration and try again
4# Test to use a different LUN for the next StripeGroup fails ==> indicates *parti name* already exist with different value
cat cvupdatefs-07_31_2014-07_55_49
*Fatal*: The name of stripe group 40 changed from 'UNIData38' to 'UNIData39'
*Warning*: Error initializing shared metadata stripe group 'UNIData39' - Invalid argument
## 2 ## Troubleshooting:
From the log analysis above, we could assume that the second expansion failed but initialized SG 40/SG41 called UNIData38/UniData39.
The Superblock doesn’t seem to be updated with the new Stripegroups but the Partitioninformation were.
So lets take a look into cvfsdb and review the stripegroup information
cvfsdb> show stripegroup 39
Parti Block for StripeGroup 39
parti_marker = 0x5061527449243230 [PaRtI$20]
parti_off = 0xdc2739000
parti_size = 0xf423e00
parti_inode = 0x6c12d4fc
parti_base = 0xdc2739
parti_end = 0xdd1b5c
parti_depth = 0x1
parti_breadth = 0x400
parti_sum = 0x3835e746
parti_sg = 0x0
parti_rsvd1 = 0x0
parti_name = UNIData37 ==> LAST SUCCESSFULLY ADDED SG/LUN
cvfsdb> show stripegroup 40
ShowSg: Bad stripe group number
Note: The Superblock hasn’t been updated during the failed SG expansion
cvfsdb> show icb
Info Control Block
icb_fs_marker = 0x4376467324243232 [CvFs$$22]
icb_fsname = UNI
icb_fsbsize = 0x1000
icb_numparts = 0x28 ç /* Number of stripe groups */
icb_meta_root = 0x0
icb_journal_root = 0x0
icb_inode_offs = 0x0
icb_sb_offs = 0x80
icb_arb_offs = 0x81
icb_cfgi_offs = 0x82
icb_diski_offs = 0x92
icb_parti_offs = 0xd2 ç /* Partition information (Parti) Offset */
icb_fl_offs = 0x112
icb_journal_offs = 0x113
icb_data_offs = 0x114
icb_sum = 0x43172813 (okay)
Note: ICB reflects 40StripeGroups [icb_numparts]
cvfsdb> raw
RAW:s[0]blk[0]> block 0xd2
RAW:s[0]blk[210]> show | more
StripeGroup 0 Disk Block 210:
00000000000d2000: 5061 5274 4924 3230 0000 0000 0211 4000 |PaRtI$20......@.
00000000000d2010: 0000 0000 001d 4ba0 0000 0000 0000 0003 |......K.........
00000000000d2020: 0000 0000 0000 2114 0000 0000 0000 22e8 |......!.......*.
00000000000d2030: 0000 0001 0100 0000 0000 0040 f3a8 eff0 |...........@....
00000000000d2040: 0000 0000 0000 0000 4d65 7461 4a6f 7572 |........MetaJour
00000000000d2050: 6e61 6c46 696c 6573 0000 0000 0000 0000 |nalFiles........
..
00000000000d6e00: 5061 5274 4924 3230 0000 000d c273 9000 |PaRtI$20.....s..
00000000000d6e10: 0000 0000 0f42 3e00 0000 0000 6c12 d4fc |.....B>.....l...
00000000000d6e20: 0000 0000 00dc 2739 0000 0000 00dd 1b5c |......'9.......\
00000000000d6e30: 0000 0001 0100 0000 0000 0400 3835 e746 |............85.F
00000000000d6e40: 0000 0000 0000 0000 554e 4944 6174 6133 |........UNIData3
00000000000d6e50: 3700 0000 0000 0000 0000 0000 0000 0000 |7...............
..
00000000000d7000: 5061 5274 4924 3230 0000 000d d1b5 d000 |PaRtI$20........
00000000000d7010: 0000 0000 0f42 3e00 0000 0000 6cd8 c38f |.....B>.....l...
00000000000d7020: 0000 0000 00dd 1b5d 0000 0000 00de 0f80 |.......]........
00000000000d7030: 0000 0001 0100 0000 0000 0400 272b d06b |............'+.k
00000000000d7040: 0000 0000 0000 0000 554e 4944 6174 6133 |........UNIData3
00000000000d7050: 3800 0000 0000 0000 0000 0000 0000 0000 |8...............
…
00000000000d7200: 5061 5274 4924 3230 0000 000d e0f8 1000 |PaRtI$20........
00000000000d7210: 0000 0000 0f42 3e00 0000 0000 6d25 9ecc |.....B>.....m%..
00000000000d7220: 0000 0000 00de 0f81 0000 0000 00df 03a4 |................
00000000000d7230: 0000 0001 0100 0000 0000 0400 169a cce6 |................
00000000000d7240: 0000 0000 0000 0000 554e 4944 6174 6133 |........UNIData3
00000000000d7250: 3900 0000 0000 0000 0000 0000 0000 0000 |9...............
The Partition Information contains information about the failed Expansion for SG41/41, LUN UNIData38 & 39.
Resolutions/workarounds/fixes:
This scenario is a potential Bug match for
- Bug 27723 - Failed cvupdatefs can prevent subsequent cvupdatefs run
- Bug 31411 - cvupdatefs fails after GUI config change fails
Custome hit a Gui Timeout while expanding the filesystem, this left the customer with partially updated filesystem. The number of stripe groups will be the old value and putting the old config file back will allow the filesystem to start, but attempting to run cvupdatefs will fail.
We used the undocumented -f option to force cvupdatefs and it updated the configuration successfull
Action Plan:
- STOP FS
- Apply Journals
- run “snmetadump” ( its unmanaged but we dont expect the MD to change from this stage )
- run Disk-Snapshot from FalconStor on MetaDataLun
- re-use org. FS Config File which contains the new SG40&41
- run “cvupdatefs –f”
What we learn from this case:
- Methodology of troubleshooting Filesystem Expansion failures
- Using cvfsdb to check for Partition Information / SG Information
This page was generated by the BrainKeeper Enterprise Wiki, © 2018 |