How to Handle Kernel dmesg Events

Overview

This topic describes how to address some of the kernel dmesg events that Technical Support sees in Lattus systems. A kernel dmesg event is a generic event with many possible causes, and the required action varies by cause. Below are some root causes and recommended actions.
 


Event: Error count since last fsck - Run fsck on the identified filesystem

Details:


EXT4-fs (sdc2): error count since last fsck: 1
EXT4-fs (sdc2): initial error at time 1492861768: __ext4_get_inode_loc:3955: inode 153899646: block 615514803
EXT4-fs (sdc2): last error at time 1492861768: __ext4_get_inode_loc:3955: inode 153899646: block 615514803
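
The messages above carry the two facts needed before starting: the affected device and the time of the error. The following is a minimal sketch, not part of the supported procedure, that extracts both. The helper names ext4_error_device and ext4_error_epoch are hypothetical, and the sketch assumes that the number after "error at" is seconds since the Unix epoch and that GNU date is installed on the node.

```shell
# Sketch: pull the device name and error time out of an EXT4-fs dmesg line.
# ext4_error_device and ext4_error_epoch are hypothetical helper names.

# Print the device inside "EXT4-fs (...)", for example sdc2.
ext4_error_device() {
    printf '%s\n' "$1" | sed -n 's/^.*EXT4-fs (\([^)]*\)).*$/\1/p'
}

# Print the number that follows "error at" or "error at time", if present.
ext4_error_epoch() {
    printf '%s\n' "$1" |
        sed -n 's/^.*error at \(time \)\{0,1\}\([0-9][0-9]*\):.*$/\2/p'
}

line='EXT4-fs (sdc2): initial error at time 1492861768: __ext4_get_inode_loc:3955: inode 153899646: block 615514803'
dev=$(ext4_error_device "$line")
epoch=$(ext4_error_epoch "$line")
echo "device=/dev/$dev epoch=$epoch"

# Assumption: the value is epoch seconds and GNU date is available.
date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S UTC' 2>/dev/null ||
    echo "GNU date not available; convert $epoch manually"
```

The device name printed here is the one to substitute for sdc2 in the actions that follow.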

 

Actions:


  1. Identify any files open in the filesystem. The second column of the lsof output is the process ID (PID).

# lsof /dev/sdc2 | grep dss.bin | awk '{print $2}'
2166

  2. Identify the storage daemon associated with the open file.

# ps -ef | grep 2166 | grep -v grep | awk -F "/" '{print $11}' | awk -F "." '{print $1}'
a80279ba-c969-483a-981f-704baa76f644

  3. Restart the storage daemon.

# qshell -c "q.dss.storagedaemons.restartOne('a80279ba-c969-483a-981f-704baa76f644')"

  4. Verify that the filesystem is mounted.

# df -h | grep sdc2
/dev/sdc2       2.7T  2.1T  574G  79% /mnt/dss/dss3

  5. Unmount the filesystem.

# umount /dev/sdc2

  6. Run fsck on the filesystem to repair it.

# fsck -Ma /dev/sdc2
fsck from util-linux 2.20.1
dss3 contains a file system with errors, check forced.
dss3: 480825/179871744 files (82.6% non-contiguous), 561977731/719458929 blocks

  7. Run fsck again to verify the status of the filesystem. The filesystem should now be reported as clean.

# fsck -Ma /dev/sdc2
fsck from util-linux 2.20.1
dss3: clean, 480825/179871744 files, 561977731/719458929 blocks

  8. Remount the filesystem.

# mount /dev/sdc2

  9. Verify that the filesystem is mounted again.

# df -h | grep sdc
/dev/sdc2       2.7T  2.1T  574G  79% /mnt/dss/dss3

 

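Note: The procedure above can be run against a single disk or multiple disks with the following one-line command. Use the command with great CAUTION! The drive letters in "for n in a b" and the partition number (1 in this example), which appears in multiple places, must be changed to match the affected filesystems.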

 

# for n in a b; do (for x in $(for i in $(lsof /dev/sd${n}1 | grep dss.bin | awk '{print $2}'); do (ps -ef | grep $i | grep -v grep | awk -F "/" '{print $11}')| awk -F "." '{print $1}'; done); do (qshell -c "q.dss.storagedaemons.restartOne('$x')"); done; df -h | grep sd${n}; umount /dev/sd${n}1; fsck -Ma /dev/sd${n}1; fsck -Ma /dev/sd${n}1; mount /dev/sd${n}1; df -h | grep sd${n}); done
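
Because the one-line command is hard to review before it is run, the following sketch prints the sequence of commands it would issue for each disk without executing any of them. It is an illustration only: plan_fsck_repair is a hypothetical helper name, and the daemon GUID is left as a placeholder because it can only be resolved on the live node.

```shell
# Sketch: print, without running them, the commands the one-line command
# would issue for each drive letter. plan_fsck_repair is a hypothetical name.
plan_fsck_repair() {
    part=$1   # partition number, for example 1
    shift     # remaining arguments are drive letters, for example a b
    for n in "$@"; do
        dev="/dev/sd${n}${part}"
        echo "# --- $dev ---"
        echo "lsof $dev | grep dss.bin | awk '{print \$2}'   # PIDs with open files"
        echo "qshell -c \"q.dss.storagedaemons.restartOne('<daemon guid>')\"   # once per daemon"
        echo "df -h | grep sd${n}"
        echo "umount $dev"
        echo "fsck -Ma $dev   # repair"
        echo "fsck -Ma $dev   # verify clean"
        echo "mount $dev"
        echo "df -h | grep sd${n}"
    done
}

# Same disks as the one-line command above: sda1 and sdb1.
plan_fsck_repair 1 a b
```

Reviewing the printed plan first makes it easier to confirm that every device name was changed consistently before the real command is run.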

 


Event: ext4_journal_check_start - Run fsck on the identified filesystem

Details:


EXT4-fs (sdg1): error count: 2
EXT4-fs (sdg1): initial error at 1447102452: ext4_journal_check_start:56
EXT4-fs (sdg1): last error at 1447102452: ext4_journal_check_start:56

 

Actions:


Complete the procedure in "Event: Error count since last fsck - Run fsck on the identified filesystem", using the device named in the message (sdg1 in this example) in place of sdc2.
 


Event: Unrecovered read error - No action required

Details:


res 41/40:00:98:9b:8b/00:00:96:01:00/40 Emask 0x409 (media error) <F>
 ata1.00: error: { UNC }
 Sense Key : Medium Error [current] [descriptor]
 Add. Sense: Unrecovered read error - auto reallocate failed
        res 41/40:00:08:9c:8b/00:00:96:01:00/40 Emask 0x409 (media error) <F>
 ata1.00: error: { UNC }
 Sense Key : Medium Error [current] [descriptor]
 Add. Sense: Unrecovered read error - auto reallocate failed
 end_request: I/O error, dev sda, sector 6820699144

 

Actions:


No action required.
 


Event: Call Trace - No action required

Details:


 Call Trace:
 Call Trace:
 Call Trace:

 

Actions:


No action required.
 


Event: Interface fatal error - Manually set disk to degraded and decommission

Details: 


 ata9.00: irq_stat 0x08000008, interface fatal error
 ata9: SError: { 10B8B Dispar BadCRC }
          res 40/00:d4:e0:21:5b/00:00:c8:00:00/40 Emask 0x10 (ATA bus error)
          res 40/00:d4:e0:21:5b/00:00:c8:00:00/40 Emask 0x10 (ATA bus error)
 machinename:storage111

 

 

Actions:

 

  1. Log on to the affected storage node.

  2. Determine the disk from the ATA bus number.

# ls -l /sys/block/ | grep ata9
lrwxrwxrwx 1 root root 0 Apr 19 13:13 sdh -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/ata9/host8/target8:0:0/8:0:0:0/block/sdh

  3. Return to the management controller node.

  4. Enter qshell.

# qshell

  5. Connect to the API database.

In [1]: ca=i.config.cloudApiConnection.find('main')

  6. Set the variable for the machine GUID.

In [2]: mguid=ca.machine.find('storage111')['result'][0]

  7. Set the variable for the disk GUID.

In [3]: dguid=ca.disk.find(machineguid=mguid, name='sdh')['result'][0]

  8. Update the model to set the disk status to degraded.

In [4]: ca.disk.updateModelProperties(dguid, status=str(q.enumerators.diskstatustype.DEGRADED))
Out[4]: {'jobguid': None, 'result': '5bf2162e-a44c-4fa0-b534-06de92d29a25'}

  9. Decommission the disk using the CMC.
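
The lookup in the "Determine the disk from the ATA bus number" step can be made a little stricter. A plain grep for ata9 would also match a port such as ata90 on a node with many ports. The sketch below, which uses a hypothetical helper name disk_for_ata_port, reads the ls -l listing on standard input and prints only the device whose sysfs path contains the exact port.

```shell
# Sketch: read `ls -l /sys/block/` output on stdin and print the block device
# whose sysfs path contains the given ATA port.
# disk_for_ata_port is a hypothetical helper name.
disk_for_ata_port() {
    port=$1   # for example 9 for ata9
    awk -v port="$port" '
        # The link target is the last field. Match "/ata<port>/" exactly so
        # that ata9 does not also match ata90.
        index($NF, "/ata" port "/") {
            n = split($NF, parts, "/")
            print parts[n]   # last path component is the device name
        }'
}

# Sample line taken from the procedure above.
sample='lrwxrwxrwx 1 root root 0 Apr 19 13:13 sdh -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/ata9/host8/target8:0:0/8:0:0:0/block/sdh'
printf '%s\n' "$sample" | disk_for_ata_port 9
```

On a storage node the same function would be fed from the live listing, for example: ls -l /sys/block/ | disk_for_ata_port 9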
     

Event: Hardware error from APEI Generic Hardware Error - No action required

Details:

 

{8}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
{8}[Hardware Error]: It has been corrected by h/w and requires no further action
{8}[Hardware Error]: event severity: corrected
{8}[Hardware Error]:  Error 0, type: corrected
{8}[Hardware Error]:  fru_text: CorrectedErr
{8}[Hardware Error]:   section_type: memory error
[Firmware Warn]: error section length is too small

 

 

Actions:

No action required.



Event: Deleted inode referenced - Decommission and replace the disk

Details: 

 

EXT4-fs error (device sdc2): ext4_lookup:1430: inode #153755541: comm dss.bin: deleted inode referenced: 153755715
machinename:storage120

 

Actions:

 

The disk should be listed in the Degraded Disks area of the CMC. Decommission the disk and replace it following the usual procedures.
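
The events in this topic can be summarized as a lookup from message text to recommended action. The sketch below is one way to express that lookup for quick triage. recommended_action is a hypothetical helper name, and the match strings are taken from the example messages in this topic, so real messages with different wording fall through to the default case.

```shell
# Sketch: map a dmesg line to the action this topic recommends.
# recommended_action is a hypothetical helper name.
recommended_action() {
    case $1 in
        *"error count since last fsck"*|*ext4_journal_check_start*)
            echo "Run fsck on the identified filesystem" ;;
        *"interface fatal error"*)
            echo "Manually set disk to degraded and decommission" ;;
        *"deleted inode referenced"*)
            echo "Decommission and replace the disk" ;;
        *"Unrecovered read error"*|*"Call Trace"*|*"Hardware error from APEI"*)
            echo "No action required" ;;
        *)
            echo "Not covered in this topic" ;;
    esac
}

recommended_action ' ata9.00: irq_stat 0x08000008, interface fatal error'
```

The function checks one line at a time, so a multi-line event should be classified by the line that names the error.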


 


 

 


