How to Handle Failing (Degraded) Disk Drives
SR Information: No specific SRs. This is a general troubleshooting procedure.
Product / Software Version: Lattus 3.3.x and later
Problem Description: This topic describes how to handle failing (degraded) disk drives in all types of Storage Nodes and Controller Nodes.
This troubleshooting methodology includes the following procedures: identifying the failing disk, confirming the failure with SMART data and kernel logs, decommissioning the disk via the CMC, unmounting the disk's filesystem, removing the device from the OS, and abandoning the decommissioned block store.
Identify the suspect disk with lsscsi. In this example the degraded disk is /dev/sdk:
# lsscsi | grep sdk
[12:0:0:0] disk ATA WDC WD30EZRX-00M 80.0 /dev/sdk
Map each block device to its controller by listing the sysfs block-device symlinks; devices that share a PCI path are attached to the same controller:
# ls -l /sys/block/ | grep sd
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sda -> ../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdb -> ../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sdb
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdc -> ../devices/pci0000:00/0000:00:1f.2/host2/target2:0:0/2:0:0:0/block/sdc
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdd -> ../devices/pci0000:00/0000:00:1f.2/host3/target3:0:0/3:0:0:0/block/sdd
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sde -> ../devices/pci0000:00/0000:00:1f.2/host4/target4:0:0/4:0:0:0/block/sde
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdf -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sdf
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdg -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/host7/target7:0:0/7:0:0:0/block/sdg
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdh -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/host8/target8:0:0/8:0:0:0/block/sdh
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdi -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host10/target10:0:0/10:0:0:0/block/sdi
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdj -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host11/target11:0:0/11:0:0:0/block/sdj
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdk -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host12/target12:0:0/12:0:0:0/block/sdk
lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdl -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host13/target13:0:0/13:0:0:0/block/sdl
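To isolate the suspect disk's entry from the listing above, filter with grep, mirroring the lsscsi example:
# ls -l /sys/block/ | grep sdk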
Dump all SMART information for the suspect disk:
# smartctl -x /dev/sdk
You will find events on the CMC, and/or the disk may appear in the CMC's "degraded disk" section.
Note: The following steps are performed from the node that contains the erroring disk.
Check the kernel log for I/O errors on the device:
# less /var/log/kern.log
Review the drive's SMART error log:
# smartctl -l error /dev/sdk
Run a short SMART self-test, then review the self-test log after it completes:
# smartctl -t short /dev/sdk
# smartctl -l selftest /dev/sdk
If results are inconclusive, run a long (extended) self-test and review the self-test log again when it finishes:
# smartctl -t long /dev/sdk
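Optionally, smartctl's -H flag gives a quick overall health assessment of the drive (using the same example device, /dev/sdk):
# smartctl -H /dev/sdk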
Note: Multiple disk failures on the same SATA controller could require chassis replacement, because the SATA controller is not a field-replaceable unit (FRU).
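To see which other disks share the suspect disk's SATA controller, grep the sysfs listing for the controller's PCI address; in the example output above, /dev/sdi through /dev/sdl all sit behind 0000:05:00.0:
# ls -l /sys/block/ | grep '05:00.0'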
Decommission the disk via the CMC.
Order a replacement disk and dispatch an engineer per the customer's disk replacement agreement. Disks are typically left decommissioned until several are ready to be replaced at once, or replaced sooner if the decommissioned disk is part of the node's internal software array.
Collect details of the failed action from the decommission job via the CMC.
Note: See CR55445 and CR44192. This issue is fixed in 3.5.1.
The decommission job can fail to unmount the disk's filesystem when the dss daemon still holds files open on it. First, confirm the filesystem is still mounted:
root@storage:~# df | grep dss6
/dev/sdf1 2907046556 2042450340 835293564 71% /mnt/dss/dss6
Identify the process holding files open on the mount point:
root@storage:~# lsof /mnt/dss/dss6
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dss.bin 2358 dssdaemon 727w REG 8,81 7319697 99876877 /mnt/dss/dss6/blockstore/ns_v2/0/0/_tlog/14.tlog
Use the PID from the lsof output to identify which storage daemon owns the process; the daemon's configuration GUID appears in its command line:
root@storage:~# ps aux | grep 2358
1002 2358 0.4 12.6 3388728 1027872 ? Ssl Feb10 246:55 /opt/qbase3/bin/dss -d --storagedaemon /opt/qbase3/cfg/dss/storagedaemons/43f51b31-0c77-4f24-b17d-bd42dbc720b3.cfg
root@storage:~# qshell -c "print q.dss.storagedaemons.getStatus()"
{'0fd0057a-88ce-45d8-905f-d7e941d11a58': running, '43f51b31-0c77-4f24-b17d-bd42dbc720b3': running}
root@storage:~# qshell -c "q.dss.storagedaemons.restartOne('43f51b31-0c77-4f24-b17d-bd42dbc720b3')"
Verify that nothing still holds files open on the mount point; no output means no open files:
root@storage:~# lsof /mnt/dss/dss6
Unmount the decommissioned disk's filesystem (this example uses /dev/sdc2; substitute the partition for your disk):
root@storage:~# umount /dev/sdc2
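To confirm the unmount, re-run df; the decommissioned mount point should no longer be listed:
root@storage:~# df | grep dss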
If the umount fails because files are still open, see the steps above for a disk that did not get unmounted.
Confirm the device's SCSI address, then remove the device from the OS so the drive can be physically pulled:
root@storage:~# lsscsi
root@storage:~# echo 1 > /sys/block/sdc/device/delete
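To verify the removal, re-run lsscsi; the deleted device (here /dev/sdc) should no longer appear, so the grep should return no output:
root@storage:~# lsscsi | grep sdc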
Finally, abandon the decommissioned block store. From qshell, list the block stores currently marked DECOMMISSIONED:
In [1]: [ x for x in q.dss.manage.showLocationHierarchy().split('\n') if 'DECOMMISSIONED' in x ]
Out[1]: [' | | +- node 74: 10.15.22.81:23520, bs 453: DECOMMISSIONED, online']
Inspect the block store to confirm its ID, path, and status:
In [2]: q.dss.manage.showBlockStore(453)
Out[2]: {'custom': {'decommission_date': 1415912152.9695449},
'id': 453,
'node_id': 74,
'path': '/mnt/dss/dss9/blockstore',
'seq': 3,
'status': 'DECOMMISSIONED',
'version': 1}
Abandon the decommissioned block store:
In [3]: q.dss.manage.abandonBlockStore(453, force=True)
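To verify, re-run the hierarchy query from step [1]; the abandoned block store should no longer be reported as DECOMMISSIONED:
In [4]: [ x for x in q.dss.manage.showLocationHierarchy().split('\n') if 'DECOMMISSIONED' in x ]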