Hardware: 3ware Log Guide for Analysis and Identification of Events

 

Overview

 

This article provides information that may help with 3ware log analysis, troubleshooting, and diagnostics.

 

 

Quick access to all controller events consolidated by type

 

3ware controller logs are available under collect and 3ware collect. The difference is that 3ware collect has specific files, produced by filters, that separate out specific events such as AEN timeouts, rebuilds, disk failures, etc.

 

After you unzip the 3ware log, an easy place to look at separated events and system info is:

 

/3wcollect-<Dxi_SN#>-YYYY-MM-DD-hh-mm/node1-3wcollection/lsigetlunix/lsi.linux.*.*.*

 

 

Example:

 

$ cd /3wcollect-CX1331BVE00423-2014-07-29-09-22-48/node1-3wcollection/lsigetlunix/lsi.linux.950PDX01.072914.092254/3ware

 

$ ls -1

3ware_driver_messages.txt

3ware_driver_x04_9000_EVENT.txt

3ware_driver_x04_x0002_Degraded_unit.txt

3ware_driver_x04_x0005_Rebuild_completed.txt

3ware_driver_x04_x0007_Initialize_completed.txt

3ware_driver_x04_x0009_Drive_timeout_detected.txt

3ware_driver_x04_x000B_Rebuild_started.txt

3ware_driver_x04_x000C_Initialize_started.txt

3ware_driver_x04_x0019_Drive_removed.txt

3ware_driver_x04_x001A_Drive_inserted.txt

3ware_driver_x04_x0023_Sector_repair_completed.txt

3ware_driver_x04_x0026_Drive_ECC_error_reported.txt

3ware_driver_x04_x0029_Verify_started.txt

3ware_driver_x04_x002B_Verify_completed.txt

3ware_driver_x04_x0053_Battery_capacity_test_is_overdue.txt

3ware_driver_x04_x0055_Battery_charging_started.txt

3ware_driver_x04_x0056_Battery_charging_completed.txt

3w-sas.ko

Controller_C0.txt

Controller_C1.txt

SMARTCTL

tw_mgmt.log

Versions_3w.txt

 

 

This makes analysis easy when you need to quickly check for events that you suspect occurred during an issue. For example, suppose a customer has been reporting performance issues for a while, and you have several messages files to filter and three controllers to check.

 

In this case, you can verify whether any drive timeout event was detected on one of the controllers by looking at the file 3ware_driver_x04_x0009_Drive_timeout_detected.txt:

 

$ cat 3ware_driver_x04_x0009_Drive_timeout_detected.txt

Nov 13 20:06:48 950PDX01 kernel: 3w-sas: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:encl=0, slot=4.

Nov 29 23:11:09 950PDX01 kernel: 3w-sas: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:encl=0, slot=4.

Jan 30 13:45:11 950PDX01 kernel: 3w-sas: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:encl=0, slot=7.
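
To quickly see which controllers reported timeouts, you can tally the scsi instances in the same file (an illustrative one-liner; the counts below follow from the three lines above):

$ grep -o 'scsi[0-9]*' 3ware_driver_x04_x0009_Drive_timeout_detected.txt | sort | uniq -c
      2 scsi0
      1 scsi1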

 

 

Or, in cases where corruption was detected, you may want to check whether ECC errors or sector repairs were reported (note in the example below that the ECC error and the sector repair share the same timestamp and slot, so they refer to the same event):

 

$ cat 3ware_driver_x04_x0026_Drive_ECC_error_reported.txt

Jul 10 07:56:07 950PDX01 kernel: 3w-sas: scsi1: AEN: ERROR (0x04:0x0026): Drive ECC error reported:encl=0, slot=9, unit=1.

$ cat 3ware_driver_x04_x0023_Sector_repair_completed.txt

Jul 10 07:56:07 950PDX01 kernel: 3w-sas: scsi1: AEN: WARNING (0x04:0x0023): Sector repair completed:encl=0, slot=9, LBA=0x6874BA03.

 

The Controller_C#.txt files contain the output of most of the tw_cli commands (all, diag, aens, events, etc.) generated at the time the 3ware log bundle was collected.
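
On a live system, the same information can be pulled directly with tw_cli (illustrative; the controller ID varies per system):

$ tw_cli /c0 show all      # configuration and status snapshot for controller 0
$ tw_cli /c0 show diag     # controller diagnostic log
$ tw_cli /c0 show alarms   # AEN history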

 

Note that the diag and event logs will be short because they are saved on the controller, which has limited storage. So, if you need to look for past events, you may not find them under those files. In that case, the minicollect comes in handy.

 

 

Minicollects

 

Due to the controller's limited storage for log history, a 3ware script is executed on the DXi that generates controller logs and stores them on the DXi local disk.

 

The minicollect directory is included under the 3ware log bundle as well as under the DXi collect log:

 

/var/log/DXi/minicollect

 

/3wcollect-CX1331BVE00423-2014-07-29-09-22-48/node1-3wcollection

 

/scratch/collect/node1-collection/hw-info/minicollect

 

 

How minicollects are generated

 

You'll also notice a variable number of logs per controller. This is because controller logs are not always gathered. See the HW Engineering clarification about minicollect:

 

"minicollects are run by 3ware_getstatus.pl, the script that polls the 3ware arrays (you can run it by hand, /opt/DXi/scripts/3ware_getstatus.pl.) It runs a minicollect at the end if an ALARM is generated or if the status of any component (drives, volumes, BBUs, controllers etc.) isn't OK"

 

Based on the information above, you may want to analyze more closely the logs from a controller that has generated a large number of minicollects.

 

Remember that a disk showing a failed state will sometimes keep triggering minicollect collection for that controller until it is replaced and the status returns to normal. So a quick filter using UNIX commands can help you identify an issue, as shown below.
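
For example, counting the minicollect logs per controller can quickly show which controller keeps triggering collections (illustrative; minicollect file names begin with the controller ID, as in the c0_... examples later in this article):

$ ls /var/log/DXi/minicollect | cut -d_ -f1 | sort | uniq -c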

 

AEN events

 

AENs (Asynchronous Event Notifications) are controller events reported to the operating system.

 

Because not all events registered under the 3ware event and diagnostic logs are reported as AENs, make sure to analyze the diagnostic and event logs under the minicollect as well as the latest logs collected under the log bundle.

 

Toward the end of this article you'll find information about events that are not reported as AEN but are critical and can help you take proactive actions to avoid issues and even data corruption or downtime.

 

AEN events may be cleared by maintenance performed on the controller, one example being a controller replacement. After you install a new controller, all AEN events from the old one are lost unless the 3ware script saved a minicollect prior to the replacement. Because of that, always remember to gather a DXi collect and a storage collect prior to any controller or disk maintenance.

 

 

SMART DATA

 

Other information you may find helpful, but hard to analyze, is the SMART data available under the 3ware and collect log bundles.

 

SMART data provides disk statistics that may help in troubleshooting the health of a disk. The file name is:

 

RAIDstatus_3ware-smartctl-ia-dev_twa#-device_3ware_##

 

 

This is some of the information you can find under the SMART file:

 

Model Family:     Hitachi Ultrastar A7K2000

Device Model:     Hitachi HUA722010CLA330

Serial Number:    JPW9J0J82SW4JV

Firmware Version: JP4OA3EA

User Capacity:    1,000,204,886,016 bytes

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Fri Nov  7 09:43:20 2014 EST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

. . .

The section you want to look at is toward the end:

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000b   095   095   016    Pre-fail  Always       -       327688

  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       93

  3 Spin_Up_Time            0x0007   118   118   024    Pre-fail  Always       -       320 (Average 320)

  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       20

  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0

  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       30

  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       21442

 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20

192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       176

193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       176

194 Temperature_Celsius     0x0002   193   193   000    Old_age   Always       -       31 (Min/Max 20/34)

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
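
When you have many SMART files to scan, a quick filter on the error-related attributes can help you spot a suspect disk (an illustrative command; the file names follow the RAIDstatus_3ware-smartctl pattern shown earlier):

$ grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count' RAIDstatus_3ware-smartctl-ia-dev_twa*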

 

 

The type, format, and thresholds of the data depend on the disk manufacturer.

 

For most disks, you will find basic SMART data such as power-on hours, temperature, CRC error count, etc. When it comes to SSDs, the fields are different.

 

In order to identify whether a disk has an issue based on a SMART number, you need the disk manufacturer's specification for that disk. For example, on SSDs there is a wearout field which can be used to measure the lifetime of the disk:

 

233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0

 

Depending on the disk and manufacturer, the wearout counter can start at a high number and decrease over time and usage until it reaches a manufacturer-determined threshold at which replacement of the SSD is recommended. Other manufacturers start the wearout counter at 0 (zero), and the number increases until it reaches the threshold specified for replacement. For example, under the count-down convention, the value 098 shown above would suggest roughly 2% of the rated endurance has been consumed, assuming a starting value of 100.

 

The disk manufacturers' specifications for SMART data are not available for Quantum analysis, and you will only rarely find related manufacturer documentation.

 

For now, this is the information we have, provided by HW Engineering:

 

"The official limit for sector reallocations is drive specific and on the order of 1,000. It is defined in Quantum.conf"

 

http://en.wikipedia.org/wiki/S.M.A.R.T. 

 

 

So the conclusion about SMART is: you may try to find information on your own, but it's better to involve HW Engineering, who can reach out to the disk manufacturer for analysis of the data.

 

 

Logs and Events

 

This section covers some of the events that need special attention because they can help you be proactive and avoid disastrous upgrades, failed capacity expansions, or data corruption.

 

 

UPS enable/disable

 

Where it's reported: diag and event logs (no AEN or alerts are issued)

 

This is something I hadn't given much attention to before starting to work with a large number of 3ware cases and log analyses.

 

UPS events are BBU commands that disable or enable power failure protection. Once, I got a case where UPS was bouncing between enable/disable every second on a production machine. I started to pay attention to other cases and confirmed that some controllers would not present this behavior, while others would present it in moderate amounts. You may find different behavior on different machines, or even on different controllers in the same machine.

 

In cases where you find UPS bouncing between enable/disable every second on an up-and-running machine, you will want to troubleshoot by first reseating the BBU connection. If the issue persists, replace the BBU; in the worst scenario, replace the controller.

 

Engineering was consulted about the UPS events and they provided clarification:

 

"Those UPS events are benign.

Response from Avago technologies:
To minimize the exposure to unclean shutdowns the controller has an activity called Autoclean that will flush all uncommitted data to disk, mark the array as clean and cause the BBU to be disabled if there have been no writes for 40 seconds.  You can confirm this by looking at the logs while writes are occurring and then letting the controller sit idle for a minute – you should see “Disable UPS”.

 

 from <http://bugzilla.quantum.com/bugzilla/show_bug.cgi?id=40925> "

 

 

What you need to observe

 

Unfortunately, the 3ware log does not provide a timestamp for the UPS events, making this more difficult to troubleshoot.

In this case you will have to rely on the other events reported by the controller before and after the UPS disable/enable messages in order to determine when the UPS was disabled and enabled. The log below is a good example, where the events were occurring during high I/O activity:

 

DcbMgr::UpdateStatus:         (finish 16:41:14)

UPS is disabled.

UPS is enabled.

DcbMgr::UpdateStatus: UNIT  2 (time   16:41:19)

DcbMgr::UpdateStatus:         (finish 16:41:19)

DcbMgr::UpdateStatus: UNIT  3 (time   16:41:19)

DcbMgr::UpdateStatus:         (finish 16:41:19)

Auto Clean: 2

DcbMgr::UpdateStatus: UNIT  2 (time   16:41:54)

DcbMgr::UpdateStatus:         (finish 16:41:54)

Auto Clean: 3

DcbMgr::UpdateStatus: UNIT  3 (time   16:42:09)

DcbMgr::UpdateStatus:         (finish 16:42:09)

UPS is disabled.

UPS is enabled.

DcbMgr::UpdateStatus: UNIT  3 (time   16:42:34)

DcbMgr::UpdateStatus:         (finish 16:42:34)

Auto Clean: 3

DcbMgr::UpdateStatus: UNIT  3 (time   16:43:09)

DcbMgr::UpdateStatus:         (finish 16:43:09)

UPS is disabled.

UPS is enabled.

DcbMgr::UpdateStatus: UNIT  3 (time   16:43:34)

DcbMgr::UpdateStatus:         (finish 16:43:34)

Auto Clean: 3

DcbMgr::UpdateStatus: UNIT  3 (time   16:44:09)

DcbMgr::UpdateStatus:         (finish 16:44:09)

UPS is disabled.

[K:7d]UPS is enabled.

DcbMgr::UpdateStatus: UNIT  3 (time   16:44:34)

DcbMgr::UpdateStatus:         (finish 16:44:34)
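
On a large diag log, you can pull the UPS messages together with the timestamped events around them (an illustrative command; point it at whichever file holds the diag output, such as Controller_C0.txt from the bundle):

$ grep -B 3 -A 3 "UPS is" Controller_C0.txt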

 

Troubleshooting gets more difficult in cases where the UPS events occur at a moderate or low rate, especially because you can't tail the 3ware diagnostic log.

In this case you can try to execute a script that periodically collects the diag log and iostat output (with timestamps) while the system is running and doing I/O, as sketched below.
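
A minimal sketch of such a script (it assumes tw_cli is in the PATH and the controller is /c0; the output directory and intervals are placeholders to adjust):

#!/bin/bash
# Periodically capture the controller diag log and iostat output, each
# tagged with a timestamp, so UPS enable/disable events can later be
# correlated with I/O activity.
OUTDIR=/var/log/DXi/ups_debug        # hypothetical output location
mkdir -p "$OUTDIR"
while true; do
    TS=$(date +%Y%m%d-%H%M%S)
    tw_cli /c0 show diag > "$OUTDIR/diag_$TS.txt"
    iostat -xt 5 2 > "$OUTDIR/iostat_$TS.txt"   # -t adds timestamps
    sleep 300
done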

 

 

UPS not ready

 

Where it's reported: diag and event logs (no AEN or alerts are issued)

 

This is the worst you can get when it comes to UPS. If a controller reports "UPS not ready" while the system is up, it means there is no power failure protection.

 

In this case the troubleshooting is the same as mentioned before: reseat the BBU connection. If the issue persists, replace the BBU; in the worst scenario, replace the controller.

 

 

Flash issues

 

Where it's reported: diagnostic and event logs; depending on the type of flash event, it may also be reported as an AEN. No alerts are issued.

 

When the controller flash is lost, the controller will try to recover the flash file system from its backup. When the recovery is successful, an AEN message is issued and posted to the OS:

 

Messages log:

Nov  4 00:54:37 griffon6 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0040): Flash file system repaired:.

 

3ware AEN:

c0   [Tue Nov 04 2014 01:54:37]  INFO      Flash file system repaired

 

Diag log:

Repair completed successfully

Send AEN (code, time): 0040h, 02/18/2014 12:44:14

Flash file system repaired

 

Although the message "flash file system repaired" may sound like good news, you need to check the history of this controller under minicollect in order to verify whether more flash events occurred. In this example we can determine that the controller had several repairs, so in this case the controller needs to be replaced immediately.
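
A grep across the controller's minicollect AEN logs pulls this history together (illustrative command; its output follows):

$ grep "Flash file system repaired" c0_*.log.txt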

 

c0_20140210-093135_0d3227ff210758b5e6da773256c28175.log.txt:c0   [Thu Sep 19 2013 16:40:44]  INFO      Flash file system repaired

c0_20140210-093135_0d3227ff210758b5e6da773256c28175.log.txt:c0   [Mon Nov 04 2013 23:46:52]  INFO      Flash file system repaired

c0_20140218-124806_356b8fb990242452762916c3c08270ea.log.txt:c0   [Tue Feb 18 2014 10:55:58]  INFO      Flash file system repaired

c0_20140218-124806_356b8fb990242452762916c3c08270ea.log.txt:c0   [Tue Feb 18 2014 12:44:16]  INFO      Flash file system repaired

c0_20140610-140333_0255c8d8d67395cf01b5988a69bfddd3.log.txt:c0   [Mon Apr 21 2014 10:02:36]  INFO      Flash file system repaired

c0_20140630-094712_96a50b2a3f16731c6f78544d9da74c23.log.txt:c0   [Thu Jun 26 2014 01:24:16]  INFO      Flash file system repaired

c0_20141104-005854_148f9ac9414a6e354b383a2e899bd8ab.log.txt:c0   [Tue Nov 04 2014 01:54:37]  INFO      Flash file system repaired

 

 

Suppose you had only one or two "flash repaired" AEN events; you might think the case isn't that bad, until you confirm that the controller reported several times, under the diagnostic and event logs, that the repair was unsuccessful. Unsuccessful repair events are not registered as AENs.

 

This is a condition where there were several unsuccessful attempts (the FlashWrite errors below) before the controller was finally able to recover the flash:

 

Repairing the flash file system..Flash Log: Header

   Version: 0x 0

   Status: Valid

No  BackupSector BackedUpSector

 1  25           30

!!!FlashWrite Error!!! retry# 0

!!!FlashWrite Error!!! retry# 1

!!!FlashWrite Error!!! retry# 2

!!!FlashWrite Error!!! retry# 3

 Done

Repair completed successfully

Send AEN (code, time): 0040h, 02/18/2014 12:44:14

Flash file system repaired

(EC:0x40, SK=0x00, ASC=0x00, ASCQ=0x00, SEV=03, Type=0x71)

 

 

This is another example where the controller repeatedly tries to do a repair, but because its backup directory is corrupt, the repair is unsuccessful. Note how many times the repair failed until, in the next log, the flash was finally repaired:

 

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..

c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted

c0_20140110-062730_b332fd9c189715174ed0587b339a0821.log.txt:c0   [Thu Sep 19 2013 16:40:44]  INFO      Flash file system repaired

c0_20140110-062730_b332fd9c189715174ed0587b339a0821.log.txt:c0   [Mon Nov 04 2013 23:46:52]  INFO      Flash file system repaired

 

 

Cases like the above require proactive controller replacement. But before you shut down the machine for the replacement, make sure to reset the controller following TSB 4496: https://qsweb.quantum.com/wiki/tiki-read_article.php?articleId=4496

 

The TSB procedure will help avoid issues during the boot process. Note that this is not a fix: if the controller is unstable due to flash corruption, the controller must be replaced.

 

Resetting a controller will load a fresh flash image, but this doesn't guarantee that the issue won't recur.

 

Engineering is working to have the hardware monitor check for flash issues and possibly issue an alert. If you encounter a case, please make sure to document your SR under the following bug:

 

Bug 39763 - 3ware HW monitor should check for Flash problems