How to Handle Failing (Degraded) Disk Drives

Overview

SR Information: No specific SRs. This is a general troubleshooting procedure.

 

Product / Software Version: Lattus 3.3.x and later 

 

Problem Description: This topic describes how to handle failing (degraded) disk drives in all types of Storage Nodes and Controller Nodes. 

 

 

This troubleshooting methodology includes the following procedures:

Information to Collect About a Device
View or Check for Errors
What to Do After Confirming that the Disk Needs to be Replaced
What to Do if the Decommission Job Fails or the Disk Continues to Generate Events After It Has Been Decommissioned

 


Information to Collect About a Device

  1. List the SCSI device attributes for the suspect drive (sdk in this example):

# lsscsi | grep sdk

[12:0:0:0]   disk    ATA      WDC WD30EZRX-00M 80.0  /dev/sdk
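
If the drive's serial number is needed (for example, to match the physical drive when dispatching a replacement), it can be read with smartctl; sdk is simply the example device used throughout this article:

# smartctl -i /dev/sdk | grep -i serial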

 

  2. Determine which device is attached to which SATA controller and identify ATA number:

# ls -l /sys/block/ | grep sd

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sda -> ../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdb -> ../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sdb

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdc -> ../devices/pci0000:00/0000:00:1f.2/host2/target2:0:0/2:0:0:0/block/sdc

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdd -> ../devices/pci0000:00/0000:00:1f.2/host3/target3:0:0/3:0:0:0/block/sdd

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sde -> ../devices/pci0000:00/0000:00:1f.2/host4/target4:0:0/4:0:0:0/block/sde

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdf -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sdf

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdg -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/host7/target7:0:0/7:0:0:0/block/sdg

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdh -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/host8/target8:0:0/8:0:0:0/block/sdh

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdi -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host10/target10:0:0/10:0:0:0/block/sdi

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdj -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host11/target11:0:0/11:0:0:0/block/sdj

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdk -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host12/target12:0:0/12:0:0:0/block/sdk

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdl -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host13/target13:0:0/13:0:0:0/block/sdl
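
The sysfs paths above show which SCSI host and which SATA controller (PCI device) each disk hangs off. One way to tie this to the ataN numbering used in kernel error messages is to search the kernel ring buffer for the drive model reported by lsscsi (the model string below is taken from the example output in step 1); the matching lines are of the form "ataN.00: ...":

# dmesg | grep -i 'WD30EZRX'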

 

  3. Show all information for device:

# smartctl -x /dev/sdk
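
The -x output is long. If you only want the attributes that most commonly indicate a failing drive, the vendor attribute table can be filtered (attribute names vary slightly between drive vendors):

# smartctl -A /dev/sdk | grep -Ei 'Reallocated|Pending|Uncorrectable|CRC'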

 


View or Check for Errors

Events for the failing drive are reported on the CMC, and the disk may also appear in the "degraded disk" section of the CMC.

 

Note: The following steps are performed on the node that contains the erroring disk.

 

  1. View kernel log for information about the disk drive and its errors:

# less /var/log/kern.log
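
To narrow the log to the suspect drive, grep for its device name and, if identified in the previous section, its ATA link number (ata13 below is only an illustration; substitute the number found for your drive):

# grep -iE 'sdk|ata13' /var/log/kern.log | tail -n 50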

 

  2. Print the summary from SMART error log:

# smartctl -l error /dev/sdk
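
The drive's overall SMART health verdict can also be checked; note that a drive can report PASSED while still logging errors, so treat this as one data point rather than a final verdict:

# smartctl -H /dev/sdk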

 

  3. Run "short" selftest on device (2 minutes):

# smartctl -t short /dev/sdk

 

  4. Get the results of the "short" test from device log:

# smartctl -l selftest /dev/sdk

 

  5. Run "long" selftest on device (255 minutes):

# smartctl -t long /dev/sdk
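
The long test runs in the background on the drive, and the estimated duration varies by model (255 minutes is what the example drive reports). Progress can be checked while the test runs, and the result is read afterwards in the same way as for the short test:

# smartctl -c /dev/sdk | grep -A 1 'Self-test execution status'

# smartctl -l selftest /dev/sdk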

 

Note: Multiple disk failures on the same SATA controller could require chassis replacement, as the SATA controller is not a FRU (field-replaceable unit).

 


What to Do After Confirming that the Disk Needs to be Replaced

Decommission the disk via CMC.

 

Order a replacement disk and dispatch an engineer per the customer's disk replacement agreement. Disks are typically left decommissioned until several of them are ready to be replaced; a disk is replaced sooner if it is part of the node's internal software array.
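
A quick way to check whether the decommissioned disk is a member of the node's internal software array (assuming the internal array is Linux md software RAID; adjust the check if your nodes use a different mechanism) is to look for its device name in /proc/mdstat:

# cat /proc/mdstat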

 


What to Do if the Decommission Job Fails or the Disk Continues to Generate Events After It Has Been Decommissioned

Collect details of the failed action from the decommission job via CMC.

Problem #1: The Disk Unmount Fails During the Decommission Job

Note: See CR55445 and CR44192. This is fixed in 3.5.1.

 

  1. Collect the device path for the affected mount point (dss6 in this example):

root@storage:~# df | grep dss6

/dev/sdf1      2907046556 2042450340 835293564  71% /mnt/dss/dss6

 

  2. Check for files remaining open on the device:

root@storage:~# lsof /mnt/dss/dss6

COMMAND  PID      USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME

dss.bin 2358 dssdaemon  727w   REG   8,81  7319697 99876877 /mnt/dss/dss6/blockstore/ns_v2/0/0/_tlog/14.tlog

 

  3. Capture the storagedaemon GUID:

root@storage:~# ps aux |grep 2358

1002      2358  0.4 12.6 3388728 1027872 ?     Ssl  Feb10 246:55 /opt/qbase3/bin/dss -d --storagedaemon /opt/qbase3/cfg/dss/storagedaemons/43f51b31-0c77-4f24-b17d-bd42dbc720b3.cfg
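
Since lsof already identified the owning PID (2358 in this example), the full command line, whose .cfg basename is the storagedaemon GUID, can also be printed for just that process:

root@storage:~# ps -p 2358 -o args=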
 

 

  4. Check the status of the storagedaemon:

root@storage:~# qshell -c "print q.dss.storagedaemons.getStatus()"

{'0fd0057a-88ce-45d8-905f-d7e941d11a58': running, '43f51b31-0c77-4f24-b17d-bd42dbc720b3': running}

 

  5. Restart the storagedaemon:

root@storage:~# qshell -c "q.dss.storagedaemons.restartOne('43f51b31-0c77-4f24-b17d-bd42dbc720b3')"
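
After the restart, the status check from step 4 can be repeated to confirm that the daemon came back up before moving on:

root@storage:~# qshell -c "print q.dss.storagedaemons.getStatus()"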

 

  6. Confirm that there are no more open files on the device:

root@storage:~# lsof /mnt/dss/dss6

 

  7. Unmount the disk device (the partition identified in step 1):

root@storage:~# umount /dev/sdf1
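
If the unmount succeeds, the filesystem should no longer appear in the df output from step 1 (the command below should return nothing):

root@storage:~# df | grep dss6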

Problem #2: Message: Kernel dmesg Errors Detected

See the steps under Problem #1 above if the disk did not get unmounted.
 

  1. Check whether the disk device is still displayed by lsscsi:

root@storage:~# lsscsi
 

  2. If the disk (sdc in this example) is still visible, delete it from the SCSI subsystem:

root@storage:~# echo 1 > /sys/block/sdc/device/delete
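
After the delete, lsscsi can be re-run to confirm that the sdc entry is gone (the command below should return nothing):

root@storage:~# lsscsi | grep sdc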

Problem #3: Message: Blockstore '/mnt/dss/dss8/blockstore' has status DECOMMISSIONED For More than 10 Days
  1. Use qshell on the management controller node to manually abandon a decommissioned block store:

In [1]: [ x for x in q.dss.manage.showLocationHierarchy().split('\n') if 'DECOMMISSIONED' in x ]

Out[1]: ['      |  |  +- node 74: 10.15.22.81:23520, bs 453: DECOMMISSIONED, online']
 

In [2]: q.dss.manage.showBlockStore(453)
Out[2]: {'custom': {'decommission_date': 1415912152.9695449},
'id': 453,
'node_id': 74,
'path': '/mnt/dss/dss9/blockstore',
'seq': 3,
'status': 'DECOMMISSIONED',
'version': 1}
 

In [3]: q.dss.manage.abandonBlockStore(453, force=True) 
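
The listing from the first step can then be repeated to confirm that the block store no longer appears as DECOMMISSIONED (assuming no other block stores are decommissioned, it returns an empty list):

In [4]: [ x for x in q.dss.manage.showLocationHierarchy().split('\n') if 'DECOMMISSIONED' in x ]

Out[4]: []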

 

 


 


