How to Handle Kernel dmesg Events

Overview

This topic describes how to address some of the kernel dmesg events that Technical Support sees in Lattus systems. A kernel dmesg event is a generic event that has numerous causes, and the required action varies with the cause. Below are some root causes and recommended actions.

Event: Error count since last fsck - Run fsck on the identified filesystem

Details:

EXT4-fs (sdc2): error count since last fsck: 1
EXT4-fs (sdc2): initial error at time 1492861768: __ext4_get_inode_loc:3955: inode 153899646: block 615514803
EXT4-fs (sdc2): last error at time 1492861768: __ext4_get_inode_loc:3955: inode 153899646: block 615514803
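
When a node reports many of these messages, it can help to list the affected devices before starting the procedure. The following is a minimal sketch, not part of the supported tooling; the function name `ext4_error_devices` is hypothetical, and it assumes the messages use the `EXT4-fs (sdXN):` or `EXT4-fs error (device sdXN):` forms shown in this topic.

```shell
# Print each distinct device named in EXT4-fs messages read from stdin.
# Handles both "EXT4-fs (sdc2): ..." and "EXT4-fs error (device sdc2): ...".
ext4_error_devices() {
    sed -n 's/^.*EXT4-fs[^(]*(\(device \)\{0,1\}\([^)]*\)).*$/\2/p' | sort -u
}

# Example usage on a storage node (not run here):
#   dmesg | ext4_error_devices
```

Each device name printed is a candidate for the fsck procedure below.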

Actions:

Identify any files open in the filesystem.

# lsof /dev/sdc2 | grep dss.bin | awk '{print $2}'
2166

Identify the storage daemon associated with the open file.

# ps -ef | grep 2166 | grep -v grep | awk -F "/" '{print $11}' | awk -F "." '{print $1}'
a80279ba-c969-483a-981f-704baa76f644

Restart the storage daemon.

# qshell -c "q.dss.storagedaemons.restartOne('a80279ba-c969-483a-981f-704baa76f644')"

Verify the mounted filesystem.

# df -h | grep sdc2
/dev/sdc2 2.7T 2.1T 574G 79% /mnt/dss/dss3

Unmount the filesystem.

# umount /dev/sdc2

Run fsck on the filesystem to repair it.

# fsck -Ma /dev/sdc2
fsck from util-linux 2.20.1
dss3 contains a file system with errors, check forced.
dss3: 480825/179871744 files (82.6% non-contiguous), 561977731/719458929 blocks

Run fsck again to verify the status of the filesystem.

# fsck -Ma /dev/sdc2
fsck from util-linux 2.20.1
dss3: clean, 480825/179871744 files, 561977731/719458929 blocks

Remount the filesystem.

# mount /dev/sdc2

Verify the mounted filesystem.

# df -h | grep sdc
/dev/sdc2 2.7T 2.1T 574G 79% /mnt/dss/dss3
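
The two fsck runs above can also be judged by exit status rather than by reading the output. The sketch below decodes the status according to the bitmask documented in the fsck(8) manual page; the function name `describe_fsck_status` is hypothetical and is not part of the Lattus tooling.

```shell
# Translate an fsck exit status (a bitmask, per fsck(8)) into readable text.
describe_fsck_status() {
    if [ "$1" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    [ $(($1 & 1)) -ne 0 ] && echo "filesystem errors corrected"
    [ $(($1 & 2)) -ne 0 ] && echo "system should be rebooted"
    [ $(($1 & 4)) -ne 0 ] && echo "filesystem errors left uncorrected"
    [ $(($1 & 8)) -ne 0 ] && echo "operational error"
    [ $(($1 & 16)) -ne 0 ] && echo "usage or syntax error"
    [ $(($1 & 32)) -ne 0 ] && echo "checking canceled by user request"
    [ $(($1 & 128)) -ne 0 ] && echo "shared-library error"
    return 0
}

# Example usage after the unmount step (not run here):
#   fsck -Ma /dev/sdc2; describe_fsck_status $?
```

In the transcript above, the first run would be expected to report corrected errors and the second run no errors.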

Note: Use the following command with great CAUTION! It performs the steps above in a single line for one or more disks. Before running it, change the disk letters in the "for n in a b" list and the partition number (1 in this example) in every place where the device name appears.

# for n in a b; do (for x in $(for i in $(lsof /dev/sd${n}1 | grep dss.bin | awk '{print $2}'); do (ps -ef | grep $i | grep -v grep | awk -F "/" '{print $11}')| awk -F "." '{print $1}'; done); do (qshell -c "q.dss.storagedaemons.restartOne('$x')"); done; df -h | grep sd${n}; umount /dev/sd${n}1; fsck -Ma /dev/sd${n}1; fsck -Ma /dev/sd${n}1; mount /dev/sd${n}1; df -h | grep sd${n}); done
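
The one-liner above finds the storage daemon GUID by position: it takes the 11th "/"-separated field of the ps output, which depends on the installation path depth. As a hedged alternative, the sketch below matches the GUID by its shape instead. The function name `first_guid` is hypothetical, and the sketch assumes the daemon's command line contains its GUID and that it is the first GUID-shaped token there.

```shell
# Print the first GUID-shaped token (8-4-4-4-12 hex digits) found on stdin.
first_guid() {
    grep -oE '[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}' | sed -n '1p'
}

# Example usage with the PID returned by lsof (not run here):
#   ps -p 2166 -o args= | first_guid
```

Compare the result with the output of the original awk pipeline before relying on it for a restart.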

Event: ext4_journal_check_start - Run fsck on the identified filesystem

Details:

EXT4-fs (sdg1): error count: 2
EXT4-fs (sdg1): initial error at 1447102452: ext4_journal_check_start:56
EXT4-fs (sdg1): last error at 1447102452: ext4_journal_check_start:56

Actions:

Complete the procedure described under "Event: Error count since last fsck - Run fsck on the identified filesystem".

Event: Unrecovered read error - No action required

Details:

res 41/40:00:98:9b:8b/00:00:96:01:00/40 Emask 0x409 (media error) <F>
ata1.00: error: { UNC }
Sense Key : Medium Error [current] [descriptor]
Add. Sense: Unrecovered read error - auto reallocate failed
res 41/40:00:08:9c:8b/00:00:96:01:00/40 Emask 0x409 (media error) <F>
ata1.00: error: { UNC }
Sense Key : Medium Error [current] [descriptor]
Add. Sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 6820699144
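
To see which devices these messages refer to, the "I/O error, dev ..., sector ..." lines can be tallied per device. This is only an illustrative sketch; the function name `io_errors_by_device` is hypothetical, and it assumes the message format shown above.

```shell
# Count "I/O error, dev <name>, sector <n>" lines per device from stdin.
# Output format: "<device> <count>", one device per line.
io_errors_by_device() {
    sed -n 's/^.*I\/O error, dev \([^,]*\), sector .*$/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}

# Example usage on a storage node (not run here):
#   dmesg | io_errors_by_device
```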

Actions:

No action required.

Event: Call Trace - No action required

Details:

Call Trace:
Call Trace:
Call Trace:

Actions:

No action required.

Event: Interface fatal error - Manually set disk to degraded and decommission

Details:

ata9.00: irq_stat 0x08000008, interface fatal error
ata9: SError: { 10B8B Dispar BadCRC }
res 40/00:d4:e0:21:5b/00:00:c8:00:00/40 Emask 0x10 (ATA bus error)
res 40/00:d4:e0:21:5b/00:00:c8:00:00/40 Emask 0x10 (ATA bus error)
machinename:storage111

Actions:

Log on to the affected storage node.
Determine the disk from the ATA bus number.

# ls -l /sys/block/ | grep ata9
lrwxrwxrwx 1 root root 0 Apr 19 13:13 sdh -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/ata9/host8/target8:0:0/8:0:0:0/block/sdh

Return to the management controller node.
Enter qshell.

# qshell

Connect to the API database.

In [1]: ca=i.config.cloudApiConnection.find('main')

Set the variable for the machine guid.

In [2]: mguid=ca.machine.find('storage111')['result'][0]

Set the variable for the disk guid.

In [3]: dguid=ca.disk.find(machineguid=mguid, name='sdh')['result'][0]

Update the model to set disk status to degraded.

In [4]: ca.disk.updateModelProperties(dguid, status=str(q.enumerators.diskstatustype.DEGRADED))
Out[4]: {'jobguid': None, 'result': '5bf2162e-a44c-4fa0-b534-06de92d29a25'}

Decommission the disk using the CMC.
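
In the disk lookup step of this procedure, a plain "grep ata9" also matches ports such as ata90. The sketch below matches the port as an exact path component instead. The function name `disk_for_ata_port` is hypothetical; it assumes the "ls -l /sys/block/" output format shown above, where the symlink target is the last field.

```shell
# Read `ls -l /sys/block/` output on stdin and print the block device(s)
# whose sysfs path contains the exact ATA port component, e.g. "ata9".
disk_for_ata_port() {
    awk -v port="$1" '
        / -> / {
            n = split($NF, parts, "/")          # split the symlink target
            for (i = 1; i < n; i++)
                if (parts[i] == port) { print parts[n]; break }
        }'
}

# Example usage on the storage node (not run here):
#   ls -l /sys/block/ | disk_for_ata_port ata9
```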

Event: Hardware error from APEI Generic Hardware Error

Details:

{8}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
{8}[Hardware Error]: It has been corrected by h/w and requires no further action
{8}[Hardware Error]: event severity: corrected
{8}[Hardware Error]: Error 0, type: corrected
{8}[Hardware Error]: fru_text: CorrectedErr
{8}[Hardware Error]: section_type: memory error
[Firmware Warn]: error section length is too small
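
The "no action required" guidance applies to events reported with a severity of corrected, as in the sample above. A quick way to confirm that every such event in a log has that severity is sketched below; the function name `apei_all_corrected` is hypothetical, and the check assumes the "event severity:" line format shown in this topic.

```shell
# Succeed (exit 0) only if stdin contains at least one
# "[Hardware Error]: event severity:" line and every one says "corrected".
apei_all_corrected() {
    awk '
        /\[Hardware Error\]: event severity:/ {
            seen = 1
            if ($NF != "corrected") bad = 1
        }
        END { exit (seen && !bad) ? 0 : 1 }'
}

# Example usage on the affected node (not run here):
#   dmesg | apei_all_corrected && echo "all APEI events were corrected"
```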

Actions:

No action required.

Event: Deleted inode referenced

Details:

EXT4-fs error (device sdc2): ext4_lookup:1430: inode #153755541: comm dss.bin: deleted inode referenced: 153755715
machinename:storage120

Actions:

The disk should be listed in the Degraded Disks area of the CMC. Decommission the disk and replace it following the usual procedures.