Handling disk failures (DRAFT)

Normally, when a disk on a controller or storage node in a Lattus environment shows errors, it is marked as degraded in the CMC.

From there you can run a S.M.A.R.T. test to check for errors, reset the disk, or decommission it.

In some cases a disk shows signs of impending failure that the CMC does not report.

In that case the disk has to be decommissioned manually so that it can be replaced with a working disk.

Here are the steps to mark the disk as degraded and decommission it using qshell:

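# Get the 'main' cloud API connection.
api = i.config.cloudApiConnection.find('main')
# Look up the GUID of the node that holds the bad disk.
mguid = api.machine.find(name='<name_of_node_with_bad_disk>')['result'][0]
# Look up the GUID of the bad disk on that node by its serial number.
dguid = api.disk.list(machineguid=mguid, serial_number='<serial_number_of_bad_disk>')['result'][0]['guid']
# Mark the disk as degraded, then decommission it.
api.disk.updateModelProperties(dguid, status='DEGRADED')
api.machine.decommission_disks(mguid, [dguid])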
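
The steps above take the first search result without checking it, so a mistyped node name or serial number either raises an IndexError or, if several entries match, acts on whichever comes first. A minimal sketch of a safer wrapper follows. The function name decommission_disk_by_serial is hypothetical and not part of qshell, and the result shapes are assumed from how the snippet above indexes them. It uses only the API calls shown above and refuses to continue unless exactly one node and exactly one disk match.

```python
def decommission_disk_by_serial(api, node_name, serial_number):
    """Mark one disk as DEGRADED and decommission it.

    Hypothetical helper: `api` is the connection returned by
    i.config.cloudApiConnection.find('main') in qshell.
    Returns (mguid, dguid) of the node and disk that were acted on.
    """
    # Assumed from the steps above: ['result'] is a list of machine GUIDs.
    machines = api.machine.find(name=node_name)['result']
    if len(machines) != 1:
        raise LookupError('expected exactly one node named %r, found %d'
                          % (node_name, len(machines)))
    mguid = machines[0]

    # Assumed from the steps above: ['result'] is a list of dicts with a 'guid' key.
    disks = api.disk.list(machineguid=mguid, serial_number=serial_number)['result']
    if len(disks) != 1:
        raise LookupError('expected exactly one disk with serial %r, found %d'
                          % (serial_number, len(disks)))
    dguid = disks[0]['guid']

    # Same two calls as in the manual procedure.
    api.disk.updateModelProperties(dguid, status='DEGRADED')
    api.machine.decommission_disks(mguid, [dguid])
    return mguid, dguid

# Usage inside qshell (not runnable outside it):
#   api = i.config.cloudApiConnection.find('main')
#   decommission_disk_by_serial(api, '<name_of_node_with_bad_disk>',
#                               '<serial_number_of_bad_disk>')
```

Raising before any state is changed means a failed lookup leaves the disk status untouched.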

You can get the serial number either with hdparm on the node itself:

hdparm -I /dev/...

or from qshell, by looking up the disk by name:

api = i.config.cloudApiConnection.find('main')
mguid = api.machine.find(name='<name_of_node_with_bad_disk>')['result'][0]
api.disk.list(machineguid=mguid, name='<name_of_bad_disk>')
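
If you take the serial number from hdparm, copying it by hand from the long identification output is error-prone. The sketch below pulls the value out of the "Serial Number:" line that hdparm -I prints. The function name parse_hdparm_serial is hypothetical, and the sample text is made up for illustration rather than taken from a real disk.

```python
import re

def parse_hdparm_serial(hdparm_output):
    """Return the value of the 'Serial Number:' line in `hdparm -I` output.

    Returns None if no such line is present.
    """
    match = re.search(r'^[ \t]*Serial Number:[ \t]*(\S+)',
                      hdparm_output, re.MULTILINE)
    return match.group(1) if match else None

# Illustrative, invented sample in the layout printed by `hdparm -I`.
sample = """
/dev/sdX:

ATA device, with non-removable media
\tModel Number:       EXAMPLE MODEL
\tSerial Number:      EXAMPLE-SERIAL-123
\tFirmware Revision:  0001
"""

print(parse_hdparm_serial(sample))
```

The returned string can then be used as the serial_number value when looking up the disk in qshell.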

Now you can replace the disk.