Tracking Down Disk Errors in the Collect Log (DRAFT)

This topic describes how to track down disk errors in the collect log. Important fields are shown in RED text.

Go to /os-info/messages, which lists the disk errors reported by the kernel. Search the file for sense and note the device name of the error you want to trace (sdaq in this example).

-bash-3.2$ grep sense messages

Jan 26 15:55:50 SM74D001 kernel: sdw: Current: sense key: Recovered Error

Jan 29 14:28:41 SM74D001 kernel: sdw: Current: sense key: Recovered Error

Jan 30 16:50:40 SM74D001 kernel: sdax: Current: sense key: Recovered Error

Jan 30 17:22:18 SM74D001 kernel: sdn: Current: sense key: Recovered Error

Feb 1 18:15:13 SM74D001 kernel: sdah: Current: sense key: Recovered Error

Feb 9 06:39:31 SM74D001 kernel: sdn: Current: sense key: Recovered Error

Feb 9 18:53:44 SM74D001 kernel: sdak: Current: sense key: Recovered Error

Feb 9 21:22:02 SM74D001 kernel: sdm: Current: sense key: Recovered Error

Feb 11 16:20:31 SM74D001 kernel: sdaq: Current: sense key: Recovered Error
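When the messages file is long, it can help to tally the sense-key errors per device before choosing one to trace. The following is a minimal sketch, assuming the syslog layout shown above, where the device name is the field right after "kernel:". The function names count_sense_by_device and last_sense_device are hypothetical helpers, not part of any collect tooling.

```shell
# Count "sense key" lines per device, most frequent first.
# $1: path to the messages file (for example os-info/messages)
count_sense_by_device() {
    grep 'sense key' "$1" |
        awk '{
            # The device name is assumed to follow the "kernel:" field.
            for (i = 1; i < NF; i++) {
                if ($i == "kernel:") {
                    dev = $(i + 1)
                    sub(/:$/, "", dev)
                    count[dev]++
                    break
                }
            }
        }
        END { for (d in count) print count[d], d }' |
        sort -k1,1nr -k2,2
}

# Print the device named in the most recent "sense key" line.
# $1: path to the messages file
last_sense_device() {
    grep 'sense key' "$1" | tail -n 1 |
        awk '{
            for (i = 1; i < NF; i++) {
                if ($i == "kernel:") {
                    dev = $(i + 1)
                    sub(/:$/, "", dev)
                    print dev
                    break
                }
            }
        }'
}
```

Run against the excerpt above, count_sense_by_device messages lists sdn first with two errors, and last_sense_device messages prints sdaq.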

Check /snfs-info/nssdbg.out. Go to the bottom of the file and search backward for the last instance of the device name, in this case sdaq.

sdaq:

[1024 12:29:08] 0x2b93ff0b7b00 NOTICE PortMapper: CVFS Volume Cvfs_200600A0B850D8B2_4_19 on device: /dev/sdaq (blk 0x42a0 raw 0x42a0) con: 14 lun: 4 state: 0x204 inquiry [LSI VirtualDisk 0760] controller # '200400A0B850D912' serial # '600A0B800050D8B200000A3E4A8C0121' Size: 11500910559 Sector Size: 512
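The same lookup can be done from the command line. This is a sketch only, assuming the PortMapper line format shown above (a "device: /dev/NAME " token, a "lun: N" token, and a quoted controller number). The helper names last_device_entry, entry_lun, and entry_controller are made up for illustration.

```shell
# Print the last PortMapper line for a device.
# $1: path to the name-server debug file from /snfs-info
# $2: device name, for example sdaq
last_device_entry() {
    # The trailing space keeps "sda" from also matching "sdaq".
    grep -F "device: /dev/$2 " "$1" | tail -n 1
}

# Read one PortMapper line on stdin and print its LUN number.
entry_lun() {
    sed -n 's/.* lun: \([0-9][0-9]*\) .*/\1/p'
}

# Read one PortMapper line on stdin and print its controller number.
entry_controller() {
    sed -n "s/.*controller # '\([0-9A-Fa-f]*\)'.*/\1/p"
}
```

For example, last_device_entry FILE sdaq | entry_controller prints 200400A0B850D912 for the entry above, and piping the same line through entry_lun prints 4.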

Take the last four hex digits of the controller number above (D912) and search the /hw-info/array* files for them in lowercase, colon-separated form (d9:12) to find the array in question.

-bash-3.2$ grep d9:12 array*

array4a: World-wide node identifier: 20:04:00:a0:b8:50:d9:12

array4a: MAC address: 00:a0:b8:50:d9:12
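Searching on only the last two bytes could match more than one array on a large site. As a hedged alternative, the whole controller number can be converted to the colon-separated form used in the array files and searched for in full. The function name controller_to_wwn is a hypothetical helper, and the sketch assumes the controller number is a plain string of hex digits.

```shell
# Convert a controller number such as 200400A0B850D912 into the
# lowercase, colon-separated form used in the array files.
# $1: controller number (hex digits only)
controller_to_wwn() {
    printf '%s\n' "$1" | tr 'A-F' 'a-f' | sed 's/../&:/g; s/:$//'
}

# Example use from the hw-info directory:
#   grep "$(controller_to_wwn 200400A0B850D912)" array*
```

For the controller number in this example, the function prints 20:04:00:a0:b8:50:d9:12, which matches the World-wide node identifier line of array4a.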

The nssdbg.out entry reports this device as lun: 4. Search the mappings section of the array4a configuration file in /hw-info to find the volume at LUN 4.

LUN 4:

Volume Name      LUN   Controller   Accessible by    Volume status
Access Volume    7     A,B          Storage Array    Optimal
TRAY_0_VOL_1     3     B            Storage Array    Optimal
TRAY_0_VOL_2     4     B            Storage Array    Optimal
TRAY_0_VOL_3     5     B            Storage Array    Optimal
TRAY_1_VOL_1     6     A            Storage Array    Optimal
TRAY_1_VOL_2     8     A            Storage Array    Optimal
TRAY_1_VOL_3     9     A            Storage Array    Optimal
TRAY_2_VOL_1     10    B            Storage Array    Optimal
TRAY_2_VOL_2     11    B            Storage Array    Optimal
TRAY_2_VOL_3     12    B            Storage Array    Optimal
TRAY_85_VOL_1    0     A            Storage Array    Optimal
TRAY_85_VOL_2    1     A            Storage Array    Optimal
TRAY_85_VOL_3    2     A            Storage Array    Optimal
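The mapping lookup can also be scripted. The sketch below assumes mapping rows are whitespace-separated with a single-word volume name in the first column, the LUN in the second, and the owning controller (A or B) in the third, as in the table above. The controller check is there so that other two-number rows in the configuration file are not mistaken for mappings. The function name lun_to_volume is hypothetical.

```shell
# Print the name of the volume mapped at a given LUN.
# $1: array configuration file (for example hw-info/array4a)
# $2: LUN number
lun_to_volume() {
    awk -v lun="$2" '$2 == lun && $3 ~ /^[AB]$/ { print $1 }' "$1"
}
```

With the mappings above, lun_to_volume array4a 4 prints TRAY_0_VOL_2. The Access Volume row is skipped because its name spans two words, so this sketch only resolves data volumes.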

The array4a configuration file in /hw-info also shows the drives associated with TRAY_0_VOL_2.

TRAY_0_VOL_2:

Associated volumes and free capacity

Volume          Capacity
TRAY_0_VOL_1    102.000 GB
TRAY_0_VOL_2    5.356 TB

Associated drives - present (in piece order)

Tray    Slot
0       2
0       3
0       4
0       5
0       6
0       7
0       8
0       9

Search the array-4.log file (majorEventLog.txt in the LSI collect) for errors logged around the time of the disk errors against any of the drives in the volume, here Tray 0, Slots 2 through 9.

Date/Time: 2/12/12 5:26:17 AM

Sequence number: 5769

Event type: 100A

Event category: Error

Priority: Informational

Description: Drive returned CHECK CONDITION

Event specific codes: b/88/3

Component type: Drive

Component location: Tray 0, Slot 9 <-----------

Logged by: Controller in slot B
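Paging through the event log by hand is slow when it is large. As a rough sketch, the records can be filtered down to the drives in the volume. This assumes each record uses the field labels shown above ("Date/Time:", "Description:", "Component location: Tray N, Slot M") and that the location line comes after the other two within a record. The function name drive_events is a hypothetical helper.

```shell
# List events whose component location is one of the given drives.
# $1: major event log file (for example array-4.log)
# $2: tray number
# $3...: slot numbers to match in that tray
drive_events() {
    log=$1
    tray=$2
    shift 2
    awk -v tray="$tray" -v slots="$*" '
        BEGIN {
            n = split(slots, s, " ")
            for (i = 1; i <= n; i++) want[s[i]] = 1
        }
        /^[ \t]*Date\/Time:/ {
            when = $0
            sub(/^[ \t]*Date\/Time:[ \t]*/, "", when)
        }
        /^[ \t]*Description:/ {
            desc = $0
            sub(/^[ \t]*Description:[ \t]*/, "", desc)
        }
        /^[ \t]*Component location:/ {
            # Expected form: "Component location: Tray 0, Slot 9"
            t = $4
            sub(/,$/, "", t)
            if ($3 == "Tray" && t == tray && $5 == "Slot" && ($6 in want))
                print when " | Tray " t ", Slot " $6 " | " desc
        }
    ' "$log"
}
```

For this volume, drive_events array-4.log 0 2 3 4 5 6 7 8 9 prints one line per matching event, giving the timestamp, the drive location, and the description. Compare the timestamps against the dates of the sense-key errors in the messages file.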