How to Handle Failing (Degraded) Disk Drives

Overview

SR Information: No specific SRs. This is a general troubleshooting procedure.

 

Product / Software Version: Lattus 3.3.x and later 

 

Problem Description: This topic describes how to handle failing (degraded) disk drives in all types of Storage Nodes and Controller Nodes. 

 

 

This troubleshooting methodology includes the following procedures:

Information to Collect About a Device
View or Check for Errors
What to Do After Confirming that the Disk Needs to be Replaced
What to Do if the Decommission Job Fails or the Disk Continues to Generate Events After It Has Been Decommissioned

 


Information to Collect About a Device

  1. List the SCSI device attributes for the suspect drive (sdk in this example):

# lsscsi | grep sdk

[12:0:0:0]   disk    ATA      WDC WD30EZRX-00M 80.0  /dev/sdk
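
If the drive's serial number is needed (for example, to match the physical drive when dispatching a replacement), it can be read with smartctl; sdk is simply the example device used throughout this article:

# smartctl -i /dev/sdk | grep -i serial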

 

  2. Determine which device is attached to which SATA controller and identify ATA number:

# ls -l /sys/block/ | grep sd

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sda -> ../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdb -> ../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sdb

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdc -> ../devices/pci0000:00/0000:00:1f.2/host2/target2:0:0/2:0:0:0/block/sdc

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdd -> ../devices/pci0000:00/0000:00:1f.2/host3/target3:0:0/3:0:0:0/block/sdd

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sde -> ../devices/pci0000:00/0000:00:1f.2/host4/target4:0:0/4:0:0:0/block/sde

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdf -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sdf

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdg -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/host7/target7:0:0/7:0:0:0/block/sdg

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdh -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/host8/target8:0:0/8:0:0:0/block/sdh

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdi -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host10/target10:0:0/10:0:0:0/block/sdi

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdj -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host11/target11:0:0/11:0:0:0/block/sdj

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdk -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host12/target12:0:0/12:0:0:0/block/sdk

lrwxrwxrwx 1 root root 0 Mar 10 17:06 sdl -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:05.0/0000:05:00.0/host13/target13:0:0/13:0:0:0/block/sdl
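
The sysfs paths above show which SCSI host and which SATA controller (PCI device) each disk hangs off. One way to tie this to the ataN numbering used in kernel error messages is to search the kernel ring buffer for the drive model reported by lsscsi (the model string below is taken from the example output in step 1); the matching lines are of the form "ataN.00: ...":

# dmesg | grep -i 'WD30EZRX'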

 

  3. Show all information for device:

# smartctl -x /dev/sdk
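
The -x output is long. If you only want the attributes that most commonly indicate a failing drive, the vendor attribute table can be filtered (attribute names vary slightly between drive vendors):

# smartctl -A /dev/sdk | grep -Ei 'Reallocated|Pending|Uncorrectable|CRC'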

 


View or Check for Errors

Events for the failing drive are reported on the CMC, and the disk may also appear in the "degraded disk" section of the CMC.

 

Note: The following steps are performed on the node that contains the erroring disk.

 

  1. View kernel log for information about the disk drive and its errors:

# less /var/log/kern.log
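
To narrow the log to the suspect drive, grep for its device name and, if identified in the previous section, its ATA link number (ata13 below is only an illustration; substitute the number found for your drive):

# grep -iE 'sdk|ata13' /var/log/kern.log | tail -n 50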

 

  2. Print the summary from SMART error log:

# smartctl -l error /dev/sdk
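
The drive's overall SMART health verdict can also be checked; note that a drive can report PASSED while still logging errors, so treat this as one data point rather than a final verdict:

# smartctl -H /dev/sdk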

 

  3. Run "short" selftest on device (2 minutes):

# smartctl -t short /dev/sdk

 

  4. Get the results of the "short" test from device log:

# smartctl -l selftest /dev/sdk

 

  5. Run "long" selftest on device (255 minutes):

# smartctl -t long /dev/sdk
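
The long test runs in the background on the drive, and the estimated duration varies by model (255 minutes is what the example drive reports). Progress can be checked while the test runs, and the result is read afterwards in the same way as for the short test:

# smartctl -c /dev/sdk | grep -A 1 'Self-test execution status'

# smartctl -l selftest /dev/sdk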

 

Note: Multiple disk failures on the same SATA controller could require chassis replacement, as the SATA controller is not a FRU (field-replaceable unit).

 


What to Do After Confirming that the Disk Needs to be Replaced

Decommission the disk via CMC.

 

Order a replacement disk and dispatch an engineer per the customer's disk replacement agreement. Disks are typically left decommissioned until several of them are ready to be replaced; a disk is replaced sooner if it is part of the node's internal software array.
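
A quick way to check whether the decommissioned disk is a member of the node's internal software array (assuming the internal array is Linux md software RAID; adjust the check if your nodes use a different mechanism) is to look for its device name in /proc/mdstat:

# cat /proc/mdstat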

 


What to Do if the Decommission Job Fails or the Disk Continues to Generate Events After It Has Been Decommissioned

Collect details of the failed action from the decommission job via CMC.

Problem #1: The Disk Unmount Fails During the Decommission Job

Note: See CR55445 and CR44192. This is fixed in 3.5.1.

 

  1. Collect the device path for the affected mount point (dss6 in this example):

root@storage:~# df | grep dss6

/dev/sdf1      2907046556 2042450340 835293564  71% /mnt/dss/dss6

 

  2. Check for files remaining open on the device:

root@storage:~# lsof /mnt/dss/dss6

COMMAND  PID      USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME

dss.bin 2358 dssdaemon  727w   REG   8,81  7319697 99876877 /mnt/dss/dss6/blockstore/ns_v2/0/0/_tlog/14.tlog

 

  3. Capture the storagedaemon GUID:

root@storage:~# ps aux |grep 2358

1002      2358  0.4 12.6 3388728 1027872 ?     Ssl  Feb10 246:55 /opt/qbase3/bin/dss -d --storagedaemon /opt/qbase3/cfg/dss/storagedaemons/43f51b31-0c77-4f24-b17d-bd42dbc720b3.cfg
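
Since lsof already identified the owning PID (2358 in this example), the full command line, whose .cfg basename is the storagedaemon GUID, can also be printed for just that process:

root@storage:~# ps -p 2358 -o args=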
 

 

  4. Check the status of the storagedaemon:

root@storage:~# qshell -c "print q.dss.storagedaemons.getStatus()"

{'0fd0057a-88ce-45d8-905f-d7e941d11a58': running, '43f51b31-0c77-4f24-b17d-bd42dbc720b3': running}

 

  5. Restart the storagedaemon:

root@storage:~# qshell -c "q.dss.storagedaemons.restartOne('43f51b31-0c77-4f24-b17d-bd42dbc720b3')"
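
After the restart, the status check from step 4 can be repeated to confirm that the daemon came back up before moving on:

root@storage:~# qshell -c "print q.dss.storagedaemons.getStatus()"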

 

  6. Confirm that there are no more open files on the device:

root@storage:~# lsof /mnt/dss/dss6

 

  7. Unmount the disk device (the partition identified in step 1):

root@storage:~# umount /dev/sdf1
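
If the unmount succeeds, the filesystem should no longer appear in the df output from step 1 (the command below should return nothing):

root@storage:~# df | grep dss6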

Problem #2: Message: Kernel dmesg Errors Detected

See the steps under Problem #1 above if the disk did not get unmounted.
 

  1. Check whether the disk device is still displayed by lsscsi:

root@storage:~# lsscsi
 

  2. If the disk (sdc in this example) is still visible, delete it from the SCSI subsystem:

root@storage:~# echo 1 > /sys/block/sdc/device/delete
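
After the delete, lsscsi can be re-run to confirm that the sdc entry is gone (the command below should return nothing):

root@storage:~# lsscsi | grep sdc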

Problem #3: Message: Blockstore '/mnt/dss/dss8/blockstore' has status DECOMMISSIONED For More than 10 Days
  1. Use qshell on the management controller node to manually abandon a decommissioned block store:

In [1]: [ x for x in q.dss.manage.showLocationHierarchy().split('\n') if 'DECOMMISSIONED' in x ]

Out[1]: ['      |  |  +- node 74: 10.15.22.81:23520, bs 453: DECOMMISSIONED, online']
 

In [2]: q.dss.manage.showBlockStore(453)
Out[2]: {'custom': {'decommission_date': 1415912152.9695449},
'id': 453,
'node_id': 74,
'path': '/mnt/dss/dss9/blockstore',
'seq': 3,
'status': 'DECOMMISSIONED',
'version': 1}
 

In [3]: q.dss.manage.abandonBlockStore(453, force=True) 
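
The listing from the first step can then be repeated to confirm that the block store no longer appears as DECOMMISSIONED (assuming no other block stores are decommissioned, it returns an empty list):

In [4]: [ x for x in q.dss.manage.showLocationHierarchy().split('\n') if 'DECOMMISSIONED' in x ]

Out[4]: []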

 

 


 


