Buffer ECC Error Reported on 3ware RAID Controller (DRAFT)
SR Information: 1624600
Product / Software Version and Configuration: DXi 2.2.1 on DXi6701, 3ware hardware Array Controller
Problem Description: "Buffer ECC error corrected" reported on Controller
Related PTR: 14677
A customer was concerned after receiving a number of "Buffer ECC error corrected" RAS alerts for a controller.
From the RAS alerts (see srvcLog.hist below), we determined that the affected controller was CONTROLLER_C3.
1) Searching the DXi logs (srvcLog.hist and the messages file) with grep shows the following:
a) $ grep "Buffer ECC" /usr/adic/SRVCLOG/logs/srvcLog.hist (on live system) or node1-collection/app-info/srvcLog.hist (from DXi collect logs)
//srvcLog.hist:
Oct 8 03:54:58 2013 1 UNKNOWN 77 C3ALARM0 CONTROLLER_C3 77 124 Buffer ECC error corrected: address=0xD034820
Oct 8 03:54:58 2013 1 UNKNOWN 77 C3ALARM1 CONTROLLER_C3 77 124 Buffer ECC error corrected: address=0xD034820
Oct 12 10:12:58 2013 1 UNKNOWN 77 C3ALARM0 CONTROLLER_C3 77 124 Buffer ECC error corrected: address=0xD034820
Oct 14 12:14:58 2013 1 UNKNOWN 77 C3ALARM0 CONTROLLER_C3 77 124 Buffer ECC error corrected: address=0xD034800
Oct 14 12:14:58 2013 1 UNKNOWN 77 C3ALARM1 CONTROLLER_C3 77 124 Buffer ECC error corrected: address=0xD034820
b) $ grep "Buffer ECC" /var/log/messages (on live system) or node1-collection/os-info/messages (from DXi collect logs)
//messages:
Oct 8 03:47:27 B1TSGPI6K01 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0xD034820.
Oct 8 03:47:27 B1TSGPI6K01 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0xD034820.
Oct 8 03:54:58 B1TSGPI6K01 hwmond: E0000(1)<00077>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 77 VINST: C3ALARM0 VPINST: CONTROLLER_C3 EVENT: 124 TEXT: Buffer ECC error corrected: address=0xD034820
Oct 8 03:54:58 B1TSGPI6K01 hwmond: E0000(1)<00077>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 77 VINST: C3ALARM1 VPINST: CONTROLLER_C3 EVENT: 124 TEXT: Buffer ECC error corrected: address=0xD034820
Oct 12 10:05:44 B1TSGPI6K01 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0xD034820.
Oct 12 10:12:58 B1TSGPI6K01 hwmond: E0000(1)<00077>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 77 VINST: C3ALARM0 VPINST: CONTROLLER_C3 EVENT: 124 TEXT: Buffer ECC error corrected: address=0xD034820
Oct 14 12:05:42 B1TSGPI6K01 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0xD034800.
Oct 14 12:05:43 B1TSGPI6K01 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0xD034820.
Oct 14 12:14:58 B1TSGPI6K01 hwmond: E0000(1)<00077>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 77 VINST: C3ALARM0 VPINST: CONTROLLER_C3 EVENT: 124 TEXT: Buffer ECC error corrected: address=0xD034800
Oct 14 12:14:58 B1TSGPI6K01 hwmond: E0000(1)<00077>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 77 VINST: C3ALARM1 VPINST: CONTROLLER_C3 EVENT: 124 TEXT: Buffer ECC error corrected: address=0xD034820
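The grep output above can be condensed into a tally per controller and per buffer address, which makes it easier to see that the corrected errors keep recurring at the same few addresses. The following is a minimal sketch, assuming the log line formats shown above; the function names count_ecc_controllers and count_ecc_addresses are hypothetical helpers, not part of the DXi software.

```shell
# Hypothetical helpers (not part of the DXi software).
# Both take one argument: the path to srvcLog.hist or the messages file.

# Tally "Buffer ECC error corrected" log lines per controller name.
# Assumes controllers appear as CONTROLLER_C<N>, as in the excerpts above.
count_ecc_controllers() {
    grep 'Buffer ECC error corrected' "$1" \
        | grep -o 'CONTROLLER_C[0-9]*' \
        | sort | uniq -c \
        | awk '{print $1, $2}'    # normalize the padding that uniq -c adds
}

# Tally "Buffer ECC error corrected" log lines per buffer address.
count_ecc_addresses() {
    grep 'Buffer ECC error corrected' "$1" \
        | grep -o 'address=0x[0-9A-Fa-f]*' \
        | sort | uniq -c \
        | awk '{print $1, $2}'
}
```

For example, count_ecc_addresses /usr/adic/SRVCLOG/logs/srvcLog.hist prints one line per address, prefixed by the number of matching log lines. The counts are log lines, not distinct events: in the messages file the same event can appear in both a kernel line and a hwmond line.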
2) Examining the 3ware logs and searching for "Buffer ECC" should also confirm which controller has the problem. In this case, Controller 3 (C3) is the controller at fault.
a) $ grep "Buffer ECC error corrected" node1-3wcollection/3wdiags*
node1-3wcollection/3wdiags_c3.log:c3 [Tue Oct 08 2013 03:47:27] WARNING Buffer ECC error corrected: address=0xD034820
node1-3wcollection/3wdiags_c3.log:c3 [Tue Oct 08 2013 03:47:27] WARNING Buffer ECC error corrected: address=0xD034820
node1-3wcollection/3wdiags_c3.log:c3 [Sat Oct 12 2013 10:05:44] WARNING Buffer ECC error corrected: address=0xD034820
node1-3wcollection/3wdiags_c3.log:c3 [Mon Oct 14 2013 12:05:42] WARNING Buffer ECC error corrected: address=0xD034800
node1-3wcollection/3wdiags_c3.log:c3 [Mon Oct 14 2013 12:05:42] WARNING Buffer ECC error corrected: address=0xD034820
node1-3wcollection/3wdiags_c3.log:Buffer ECC error corrected
node1-3wcollection/3wdiags_c3.log:Buffer ECC error corrected
b) You can then directly examine the Controller 3 log (3wdiags_c3.log). The log is located at /var/log/DXi/3wdiags_c3.log (on a live system) or ../node1-3wcollection/3wdiags_c3.log (from a 3ware collect).
From the log we see the same error messages.
//3wdiags_c3.log:
...
/opt/DXi/3ware/tw_cli /c3/bbu show all
Ctl Date Severity AEN Message
------------------------------------------------------------------------------
c3 [Mon Oct 07 2013 13:24:30] INFO Battery capacity test is overdue
c3 [Tue Oct 08 2013 03:47:27] WARNING Buffer ECC error corrected: address=0xD034820
c3 [Tue Oct 08 2013 03:47:27] WARNING Buffer ECC error corrected: address=0xD034820
c3 [Thu Oct 10 2013 00:05:23] INFO Verify started: unit=0
c3 [Thu Oct 10 2013 00:05:24] INFO Verify started: unit=1
c3 [Thu Oct 10 2013 01:18:42] INFO Verify completed: unit=0
c3 [Fri Oct 11 2013 00:02:08] INFO Verify started: unit=1
c3 [Fri Oct 11 2013 02:51:51] INFO Verify completed: unit=1
c3 [Sat Oct 12 2013 10:05:44] WARNING Buffer ECC error corrected: address=0xD034820
c3 [Mon Oct 14 2013 12:05:42] WARNING Buffer ECC error corrected: address=0xD034800
c3 [Mon Oct 14 2013 12:05:42] WARNING Buffer ECC error corrected: address=0xD034820
c3 [Mon Oct 14 2013 13:13:24] INFO Battery charging started
c3 [Mon Oct 14 2013 13:13:26] INFO Battery charging completed
c3 [Mon Oct 14 2013 13:21:55] INFO Battery capacity test is overdue
Diagnostic Information on Controller //B1TSGPI6K01/c3 ...
---------------------------------------------------------
…
E=0216 T=12:05:33 : Corrected SBUF ECC
E=0216 T=12:05:33 : Buffer address 0x0D034800
Send AEN (code, time): 0039h, 10/14/2013 12:05:33
Buffer ECC error corrected
(EC:0x39, SK=0x01, ASC=0x40, ASCQ=0x00, SEV=02, Type=0x71)
address=0xD034800
DQS FC=110004
DQS LE=1201004
E=0216 T=12:05:33 : Recovery complete without retries
E=0216 T=12:05:33 : Corrected SBUF ECC
E=0216 T=12:05:33 : Buffer address 0x0D034820
Send AEN (code, time): 0039h, 10/14/2013 12:05:33
Buffer ECC error corrected
(EC:0x39, SK=0x01, ASC=0x40, ASCQ=0x00, SEV=02, Type=0x71)
address=0xD034820
DQS FC=110004
DQS LE=1201004
E=0216 T=12:05:33 : Recovery complete without retries
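When a collection contains diagnostics for several controllers, the search from step 2a can be reduced to a per-file count so that the faulty controller stands out. This is a sketch only: ecc_warnings_per_controller is a hypothetical helper, and it assumes every controller log follows the 3wdiags_c<N>.log naming seen above.

```shell
# Hypothetical helper (not part of the DXi or 3ware tooling).
# Prints "<log file name> <count>" for each controller diagnostics log,
# where <count> is the number of Buffer ECC WARNING lines in that file.
ecc_warnings_per_controller() {
    # $1: directory holding the 3wdiags_c<N>.log files,
    #     e.g. node1-3wcollection or /var/log/DXi
    for f in "$1"/3wdiags_c*.log; do
        [ -e "$f" ] || continue    # skip if the glob matched nothing
        # grep -c prints 0 but exits non-zero when nothing matches,
        # so "|| true" keeps the loop going under "set -e".
        n=$(grep -c 'WARNING.*Buffer ECC error corrected' "$f" || true)
        printf '%s %s\n' "$(basename "$f")" "$n"
    done
}
```

Matching on WARNING restricts the count to the AEN lines and skips the bare "Buffer ECC error corrected" lines from the firmware diagnostic section, which would otherwise count the same events a second time.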
This is a serious enough issue that the controller card should be replaced.
Refer to comment 3 of PTR 14677 and this LSI (3ware) KB article: https://www.3ware.com/3warekb/article.aspx?id=15424
A support ticket should probably also be opened with LSI (3ware), and/or the card should be sent for failure analysis.