Hardware: 3ware Log Guide for Analysis and Identification of Events
Overview
This article provides information to help with 3ware log analysis, troubleshooting, and diagnostics.
Quick access to all controller events consolidated by type
3ware controller logs are available under both the collect and the 3ware collect. The difference is that the 3ware collect includes files generated by filters that separate out specific events such as AEN timeouts, rebuilds, disk failures, etc.
After you unzip the 3ware log, an easy place to look at separated events and system info is:
/3wcollect-<Dxi_SN#>-YYYY-MM-DD-hh-mm/node1-3wcollection/lsigetlunix/lsi.linux.*.*.*
Example:
$ cd /3wcollect-CX1331BVE00423-2014-07-29-09-22-48/node1-3wcollection/lsigetlunix/lsi.linux.950PDX01.072914.092254/3ware
$ ls -1
3ware_driver_messages.txt
3ware_driver_x04_9000_EVENT.txt
3ware_driver_x04_x0002_Degraded_unit.txt
3ware_driver_x04_x0005_Rebuild_completed.txt
3ware_driver_x04_x0007_Initialize_completed.txt
3ware_driver_x04_x0009_Drive_timeout_detected.txt
3ware_driver_x04_x000B_Rebuild_started.txt
3ware_driver_x04_x000C_Initialize_started.txt
3ware_driver_x04_x0019_Drive_removed.txt
3ware_driver_x04_x001A_Drive_inserted.txt
3ware_driver_x04_x0023_Sector_repair_completed.txt
3ware_driver_x04_x0026_Drive_ECC_error_reported.txt
3ware_driver_x04_x0029_Verify_started.txt
3ware_driver_x04_x002B_Verify_completed.txt
3ware_driver_x04_x0053_Battery_capacity_test_is_overdue.txt
3ware_driver_x04_x0055_Battery_charging_started.txt
3ware_driver_x04_x0056_Battery_charging_completed.txt
3w-sas.ko
Controller_C0.txt
Controller_C1.txt
SMARTCTL
tw_mgmt.log
Versions_3w.txt
This makes analysis easy when you need to quickly check for events that you suspect occurred during an issue. For example, suppose a customer has reported intermittent performance issues and you have several messages files to filter and three controllers to check.
In this case, you can verify whether any drive timeout event was detected on one of the controllers by looking at the file 3ware_driver_x04_x0009_Drive_timeout_detected.txt:
$ cat 3ware_driver_x04_x0009_Drive_timeout_detected.txt
Nov 13 20:06:48 950PDX01 kernel: 3w-sas: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:encl=0, slot=4.
Nov 29 23:11:09 950PDX01 kernel: 3w-sas: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:encl=0, slot=4.
Jan 30 13:45:11 950PDX01 kernel: 3w-sas: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:encl=0, slot=7.
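When several controllers are involved, a quick pipeline can summarize the timeout events per controller and slot so a repeat offender stands out. This is a minimal sketch: the sample file below simply reproduces the example lines above (on a real system you would run the pipeline against the actual 3ware_driver_x04_x0009_Drive_timeout_detected.txt), and the /tmp path is only for illustration.

```shell
# Hypothetical sample reproducing the timeout lines shown above
cat > /tmp/3ware_timeouts.txt <<'EOF'
Nov 13 20:06:48 950PDX01 kernel: 3w-sas: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:encl=0, slot=4.
Nov 29 23:11:09 950PDX01 kernel: 3w-sas: scsi0: AEN: ERROR (0x04:0x0009): Drive timeout detected:encl=0, slot=4.
Jan 30 13:45:11 950PDX01 kernel: 3w-sas: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:encl=0, slot=7.
EOF
# Count timeouts per controller (scsiN) and slot, most frequent first
grep -o 'scsi[0-9]*: .*slot=[0-9]*' /tmp/3ware_timeouts.txt |
  sed 's/: AEN.*slot=/ slot=/' | sort | uniq -c | sort -rn
```

A slot that accumulates far more timeouts than its neighbors is usually the drive to investigate first.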
Or, in cases where corruption was detected, you may want to check whether sector repairs or ECC errors were reported:
$ cat 3ware_driver_x04_x0026_Drive_ECC_error_reported.txt
Jul 10 07:56:07 950PDX01 kernel: 3w-sas: scsi1: AEN: ERROR (0x04:0x0026): Drive ECC error reported:encl=0, slot=9, unit=1.
$ cat 3ware_driver_x04_x0023_Sector_repair_completed.txt
Jul 10 07:56:07 950PDX01 kernel: 3w-sas: scsi1: AEN: WARNING (0x04:0x0023): Sector repair completed:encl=0, slot=9, LBA=0x6874BA03.
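Because these two event files are already separated, you can cross-check them: for every slot that reported an ECC error, confirm whether a matching sector repair was logged. The sketch below uses hypothetical sample files reproducing the lines above; against a real collect you would point the loop at the two 3ware_driver_x04_* files.

```shell
# Hypothetical samples reproducing the ECC / sector-repair lines shown above
cat > /tmp/3w_ecc.txt <<'EOF'
Jul 10 07:56:07 950PDX01 kernel: 3w-sas: scsi1: AEN: ERROR (0x04:0x0026): Drive ECC error reported:encl=0, slot=9, unit=1.
EOF
cat > /tmp/3w_repair.txt <<'EOF'
Jul 10 07:56:07 950PDX01 kernel: 3w-sas: scsi1: AEN: WARNING (0x04:0x0023): Sector repair completed:encl=0, slot=9, LBA=0x6874BA03.
EOF
# For each slot with an ECC error, check for a corresponding sector repair
for slot in $(grep -o 'slot=[0-9]*' /tmp/3w_ecc.txt | sort -u); do
  if grep -q "$slot," /tmp/3w_repair.txt; then
    echo "$slot: ECC error followed by sector repair"
  else
    echo "$slot: ECC error with NO sector repair logged"
  fi
done
```

A slot with ECC errors but no completed repair is worth a closer look in the diag logs.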
The Controller_C#.txt files contain the output of most of the tw_cli commands (all, diag, aens, events, etc.) generated at the time the 3ware log bundle was collected.
Note that the diag and event logs will be short because they are saved on the controller, which has limited storage. So if you need to look for past events, you may not find them in those files. In that case, the minicollect comes in handy.
Minicollects
Because the controller has limited storage for log history, a 3ware script runs on the DXi that gathers controller logs and stores them on the DXi local disk.
The minicollect directory is included in the 3ware log bundle as well as in the DXi collect log:
/var/log/DXi/minicollect
/3wcollect-CX1331BVE00423-2014-07-29-09-22-48/node1-3wcollection
/scratch/collect/node1-collection/hw-info/minicollect
How minicollects are generated
You'll also notice a variable number of logs per controller. This is because controller logs are not always gathered. See the HW Engineering clarification about minicollect:
"minicollects are run by 3ware_getstatus.pl, the script that polls the 3ware arrays (you can run it by hand, /opt/DXi/scripts/3ware_getstatus.pl.) It runs a minicollect at the end if an ALARM is generated or if the status of any component (drives, volumes, BBUs, controllers etc.) isn't OK"
Based on the information above, you may want to look more closely at the logs from a controller that has generated a large number of minicollects.
Remember that a disk in a failed state will keep triggering minicollect collection for that controller until it is replaced and the status returns to normal. So a quick filter using UNIX commands can often help you identify an issue.
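One such quick filter: minicollect filenames start with the controller ID (c0_, c1_, ...), so counting files per prefix shows at a glance which controller keeps triggering collections. This sketch creates a throwaway sample directory with hypothetical filenames; on a real system you would run the last line in /var/log/DXi/minicollect instead.

```shell
# Hypothetical sample minicollect directory (filenames are illustrative)
mkdir -p /tmp/minicollect && cd /tmp/minicollect
touch c0_20140110-062310_aaa.log.txt \
      c0_20140218-124806_bbb.log.txt \
      c1_20140630-094712_ccc.log.txt
# Count minicollect logs per controller, noisiest controller first
ls | cut -d_ -f1 | sort | uniq -c | sort -rn
```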
AEN events
AENs (Asynchronous Event Notifications) are controller events reported to the operating system.
Because not all events registered in the 3ware event and diagnostic logs are reported as AENs, make sure to analyze the diagnostic and event logs under the minicollect as well as the latest log collected in the log bundle.
Toward the end of this article you'll find information about events that are not reported as AEN but are critical and can help you take proactive actions to avoid issues and even data corruption or downtime.
AEN events may be cleared by maintenance performed on the controller, one example being a controller replacement. After you add a new controller, all the AEN events from the old one are lost unless the 3ware script saved a minicollect prior to the replacement. Because of that, always remember to gather a DXi and storage collect prior to any controller or disk maintenance.
SMART DATA
Other information you may find helpful, but hard to analyze, is the SMART data available in the 3ware and collect log bundles.
SMART data records disk statistics that can help in troubleshooting the health of a disk. The file name is:
RAIDstatus_3ware-smartctl-ia-dev_twa#-device_3ware_##
This is some of the information you can find under the SMART file:
Model Family: Hitachi Ultrastar A7K2000
Device Model: Hitachi HUA722010CLA330
Serial Number: JPW9J0J82SW4JV
Firmware Version: JP4OA3EA
User Capacity: 1,000,204,886,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Fri Nov 7 09:43:20 2014 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
. . .
The section you want to look at is toward the end:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 095 095 016 Pre-fail Always - 327688
2 Throughput_Performance 0x0005 136 136 054 Pre-fail Offline - 93
3 Spin_Up_Time 0x0007 118 118 024 Pre-fail Always - 320 (Average 320)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 20
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 140 140 020 Pre-fail Offline - 30
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 21442
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 20
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 176
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 176
194 Temperature_Celsius 0x0002 193 193 000 Old_age Always - 31 (Min/Max 20/34)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
The type, format, and thresholds of the data depend on the disk manufacturer.
For most disks you will find basic SMART data such as power cycles, power-on hours, temperature, CRC error count, etc. For SSDs the fields are different.
To identify whether a disk has an issue based on a SMART number, you need the disk manufacturer's specification for that disk. For example, SSDs have a wearout field that can be used to measure the remaining life of the disk:
233 Media_Wearout_Indicator 0x0032 098 098 000 Old_age Always - 0
Depending on the disk and manufacturer, the wearout counter may start at a high value and decrease with time and usage down to a threshold at which the manufacturer recommends replacing the SSD. Other manufacturers start the wearout counter at 0 (zero), and the number increases until it reaches the threshold specified for replacement.
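Whatever the vendor conventions, a quick way to surface the attributes most often tied to failing media is to filter the smartctl attribute table for non-zero raw values. This is a sketch only: the sample file below reuses a few rows of the table shown earlier, and on a real system you would run the awk line against the RAIDstatus_3ware-smartctl-* file instead.

```shell
# Hypothetical sample of a smartctl attribute table (rows taken from the example above)
cat > /tmp/smart_sample.txt <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       12
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
EOF
# Print the media-health attributes whose raw value (last column) is non-zero
awk '$2 ~ /Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count/ && $NF > 0 {print $2, $NF}' /tmp/smart_sample.txt
```

Non-zero pending or uncorrectable sector counts are worth flagging regardless of the vendor-specific thresholds.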
The disk manufacturers' SMART data specifications are generally not available to Quantum for analysis, and you will only rarely find related manufacturer documentation.
For now, this is the information provided by HW Engineering:
"The official limit for sector reallocations is drive specific and on the order of 1,000. It is defined in Quantum.conf"
http://en.wikipedia.org/wiki/S.M.A.R.T.
So the conclusion about SMART is: you may try to find information on your own, but it's better to involve HW Engineering, who can reach out to the disk manufacturer for analysis of the data.
Logs and Events
This section covers some events that deserve special attention because they can help you be proactive and avoid disastrous upgrades, failed capacity expansions, or data corruption.
UPS enable/disable
Where it's reported: diag and event logs (no AEN or alerts are issued)
This is something I hadn't given much attention to before starting to work with a large number of 3ware cases and log analysis.
UPS enable/disable are BBU commands that disable or enable power failure protection. I once had a case where the UPS was bouncing between enable/disable every second on a production machine. I started paying attention to other cases and confirmed that some controllers do not show this behavior while others show it in moderate amounts. You may find different behavior on different machines, or even on different controllers in the same machine.
For cases where the UPS bounces between enable/disable every second on an up-and-running machine, troubleshoot by first reseating the BBU connection. If the issue persists, replace the BBU; in the worst case, replace the controller.
Engineering was consulted about the UPS events and they provided clarification:
"Those UPS events are benign.
Response from Avago technologies:
To minimize the exposure to unclean shutdowns the controller has an activity called Autoclean that will flush all uncommitted data to disk, mark the array as clean and cause the BBU to be disabled if there have been no writes for 40 seconds. You can confirm this by looking at the logs while writes are occurring and then letting the controller sit idle for a minute – you should see “Disable UPS”.
from <http://bugzilla.quantum.com/bugzilla/show_bug.cgi?id=40925> "
What you need to observe
Unfortunately, the 3ware log does not provide timestamps for the UPS events, which makes this more difficult to troubleshoot.
In this case you will have to rely on the other events reported by the controller before and after the UPS disable/enable messages to determine when the UPS was disabled and enabled. The log below is a good example where the events were occurring during high I/O activity:
DcbMgr::UpdateStatus: (finish 16:41:14)
UPS is disabled.
UPS is enabled.
DcbMgr::UpdateStatus: UNIT 2 (time 16:41:19)
DcbMgr::UpdateStatus: (finish 16:41:19)
DcbMgr::UpdateStatus: UNIT 3 (time 16:41:19)
DcbMgr::UpdateStatus: (finish 16:41:19)
Auto Clean: 2
DcbMgr::UpdateStatus: UNIT 2 (time 16:41:54)
DcbMgr::UpdateStatus: (finish 16:41:54)
Auto Clean: 3
DcbMgr::UpdateStatus: UNIT 3 (time 16:42:09)
DcbMgr::UpdateStatus: (finish 16:42:09)
UPS is disabled.
UPS is enabled.
DcbMgr::UpdateStatus: UNIT 3 (time 16:42:34)
DcbMgr::UpdateStatus: (finish 16:42:34)
Auto Clean: 3
DcbMgr::UpdateStatus: UNIT 3 (time 16:43:09)
DcbMgr::UpdateStatus: (finish 16:43:09)
UPS is disabled.
UPS is enabled.
DcbMgr::UpdateStatus: UNIT 3 (time 16:43:34)
DcbMgr::UpdateStatus: (finish 16:43:34)
Auto Clean: 3
DcbMgr::UpdateStatus: UNIT 3 (time 16:44:09)
DcbMgr::UpdateStatus: (finish 16:44:09)
UPS is disabled.
[K:7d]UPS is enabled.
DcbMgr::UpdateStatus: UNIT 3 (time 16:44:34)
DcbMgr::UpdateStatus: (finish 16:44:34)
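Since the UPS lines carry no timestamp of their own, one simple way to bracket them in time is to print each UPS event together with the timestamped DcbMgr lines around it. The sketch below uses a hypothetical sample reproducing a few lines of the log above; on a real diag log you would run the grep against the saved diagnostic file.

```shell
# Hypothetical sample reproducing a fragment of the diag log above
cat > /tmp/3w_diag.txt <<'EOF'
DcbMgr::UpdateStatus: (finish 16:41:14)
UPS is disabled.
UPS is enabled.
DcbMgr::UpdateStatus: UNIT 2 (time 16:41:19)
EOF
# Show each UPS event with one timestamped line before and after it
grep -B1 -A1 'UPS is' /tmp/3w_diag.txt
```

The surrounding DcbMgr timestamps then give you a window of a few seconds in which the UPS bounce occurred.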
Troubleshooting gets more difficult when the UPS events occur only occasionally, especially because you can't tail the 3ware diagnostic log.
In that case you can try to run a script that collects the diag log and iostat (with timestamps) periodically while the system is running and doing I/O.
UPS not ready
Where it's reported: diag and event logs (no AEN or alerts are issued)
This is the worst UPS-related condition you can encounter. If a controller reports "UPS not ready" while the system is up, it means there is no power failure protection.
In this case the troubleshooting is the same as mentioned before: reseat the BBU connection; if the issue persists, replace the BBU; in the worst case, replace the controller.
Flash issues
Where it's reported: diagnostic and event logs; depending on the type of flash event, it may also be reported as an AEN. No alerts are issued.
When the controller flash is lost, the controller will try to recover the flash filesystem from its backup. When the recovery is successful, an AEN message is issued and posted to the OS:
Messages log:
Nov 4 00:54:37 griffon6 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0040): Flash file system repaired:.
3ware AEN:
c0 [Tue Nov 04 2014 01:54:37] INFO Flash file system repaired
Diag log:
Repair completed successfully
Send AEN (code, time): 0040h, 02/18/2014 12:44:14
Flash file system repaired
Although the message "Flash file system repaired" may sound like good news, you need to check the history of this controller under minicollect to verify whether more flash events occurred. In this example we can determine the controller had several repairs, so the controller needs to be replaced immediately:
c0_20140210-093135_0d3227ff210758b5e6da773256c28175.log.txt:c0 [Thu Sep 19 2013 16:40:44] INFO Flash file system repaired
c0_20140210-093135_0d3227ff210758b5e6da773256c28175.log.txt:c0 [Mon Nov 04 2013 23:46:52] INFO Flash file system repaired
c0_20140218-124806_356b8fb990242452762916c3c08270ea.log.txt:c0 [Tue Feb 18 2014 10:55:58] INFO Flash file system repaired
c0_20140218-124806_356b8fb990242452762916c3c08270ea.log.txt:c0 [Tue Feb 18 2014 12:44:16] INFO Flash file system repaired
c0_20140610-140333_0255c8d8d67395cf01b5988a69bfddd3.log.txt:c0 [Mon Apr 21 2014 10:02:36] INFO Flash file system repaired
c0_20140630-094712_96a50b2a3f16731c6f78544d9da74c23.log.txt:c0 [Thu Jun 26 2014 01:24:16] INFO Flash file system repaired
c0_20141104-005854_148f9ac9414a6e354b383a2e899bd8ab.log.txt:c0 [Tue Nov 04 2014 01:54:37] INFO Flash file system repaired
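A history like the one above is quick to produce with grep -c across the minicollect logs. The sketch below builds a throwaway sample directory with hypothetical filenames and log lines modeled on the output above; against a real system you would run the final grep in /var/log/DXi/minicollect.

```shell
# Hypothetical sample minicollect logs (filenames and lines are illustrative)
mkdir -p /tmp/mc && cd /tmp/mc
printf '%s\n%s\n' \
  'c0 [Thu Sep 19 2013 16:40:44] INFO Flash file system repaired' \
  'c0 [Mon Nov 04 2013 23:46:52] INFO Flash file system repaired' \
  > c0_20140210-093135_aaa.log.txt
printf '%s\n' 'c0 [Tue Nov 04 2014 01:54:37] INFO Flash file system repaired' \
  > c0_20141104-005854_bbb.log.txt
# Count flash repairs per minicollect log
grep -c 'Flash file system repaired' *.log.txt
```

The same approach works for counting the "Bad directory, repair aborted" lines discussed below, which never surface as AENs.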
Supposing you had only one or two "flash repaired" AEN events, you might think the case isn't that bad, until you confirm in the diagnostic and event logs that the controller reported the repair as unsuccessful several times. Unsuccessful repair events are not registered as AENs.
This is a condition where there were several unsuccessful attempts (the FlashWrite error lines below) before the controller was finally able to recover the flash:
Repairing the flash file system..Flash Log: Header
Version: 0x 0
Status: Valid
No BackupSector BackedUpSector
1 25 30
!!!FlashWrite Error!!! retry# 0
!!!FlashWrite Error!!! retry# 1
!!!FlashWrite Error!!! retry# 2
!!!FlashWrite Error!!! retry# 3
Done
Repair completed successfully
Send AEN (code, time): 0040h, 02/18/2014 12:44:14
Flash file system repaired
(EC:0x40, SK=0x00, ASC=0x00, ASCQ=0x00, SEV=03, Type=0x71)
This is another example where the controller repeatedly tries to do a repair, but because its backup directory is corrupt, the repair is unsuccessful. Note how many times the repair failed until, in the next log, the flash was finally repaired:
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Repairing the flash file system..
c0_20140110-062310_93e835794361d24471514e1a89841a51.log.txt:Bad directory, repair aborted
c0_20140110-062730_b332fd9c189715174ed0587b339a0821.log.txt:c0 [Thu Sep 19 2013 16:40:44] INFO Flash file system repaired
c0_20140110-062730_b332fd9c189715174ed0587b339a0821.log.txt:c0 [Mon Nov 04 2013 23:46:52] INFO Flash file system repaired
Cases like the above require proactive controller replacement. But before you shut down the machine for the replacement, make sure to reset the controller following TSB 4496: https://qsweb.quantum.com/wiki/tiki-read_article.php?articleId=4496
The TSB procedure will help avoid issues during the boot process. Note that this is not a fix. If the controller is unstable due to flash corruption, the controller must be replaced.
Resetting a controller will load a fresh flash, but this doesn't guarantee that the issue won't re-occur.
Engineering is working to have the hardware monitor check for flash issues and possibly issue an alert. If you encounter a case, please make sure to document your SR under the following bug:
Bug 39763 - 3ware HW monitor should check for Flash problems
This page was generated by the BrainKeeper Enterprise Wiki, © 2018