Buffer ECC Error Reported on 3ware RAID Controller (DRAFT)

SR Information: 1624600

 

Product / Software Version and Configuration: DXi 2.2.1 on DXi6701, 3ware hardware Array Controller

 

Problem Description: "Buffer ECC error corrected" reported on Controller

 

Related PTR: 14677

Overview

A customer raised a concern after receiving several "Buffer ECC error corrected" RAS alerts for a 3ware RAID controller.

 

 


Symptoms

The RAS alerts (see the srvcLog.hist output below) identify the affected controller as CONTROLLER_C3.

 

1) Searching the DXi logs (srvcLog.hist and the messages file) with grep shows the following:

 

a) $ grep "Buffer ECC" /usr/adic/SRVCLOG/logs/srvcLog.hist (on a live system) or node1-collection/app-info/srvcLog.hist (from DXi collect logs)
 

 

//srvcLog.hist:
Oct  8 03:54:58 2013 1 UNKNOWN 77 C3ALARM0 CONTROLLER_C3 77 124 Buffer ECC error corrected: address=0xD034820
Oct  8 03:54:58 2013 1 UNKNOWN 77 C3ALARM1 CONTROLLER_C3 77 124 Buffer ECC error corrected: address=0xD034820
Oct 12 10:12:58 2013 1 UNKNOWN 77 C3ALARM0 CONTROLLER_C3 77 124 Buffer ECC error corrected: address=0xD034820
Oct 14 12:14:58 2013 1 UNKNOWN 77 C3ALARM0 CONTROLLER_C3 77 124 Buffer ECC error corrected: address=0xD034800
Oct 14 12:14:58 2013 1 UNKNOWN 77 C3ALARM1 CONTROLLER_C3 77 124 Buffer ECC error corrected: address=0xD034820
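
When the history file is long, it may help to tally the alerts per controller instead of reading them line by line. The following is a minimal sketch, not part of the DXi tooling: the function name ecc_alerts_by_controller is hypothetical, and it assumes that controllers always appear as CONTROLLER_C<number>, as in the lines above, and that grep supports -o (GNU grep does).

```shell
# Hypothetical helper (not a DXi command): tally "Buffer ECC" alerts per controller.
# Assumes controllers are logged as CONTROLLER_C<number>, as in the excerpt above.
ecc_alerts_by_controller() {
    # $1: path to srvcLog.hist (live system or collect bundle)
    grep "Buffer ECC" "$1" \
        | grep -o 'CONTROLLER_C[0-9][0-9]*' \
        | sort \
        | uniq -c
}

# Example: ecc_alerts_by_controller /usr/adic/SRVCLOG/logs/srvcLog.hist
```

For the five srvcLog.hist lines shown above, this would print a count of 5 next to CONTROLLER_C3.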

 

 

b) $ grep "Buffer ECC" /var/log/messages (on a live system) or node1-collection/os-info/messages (from DXi collect logs)
 

//messages:
 

Oct  8 03:47:27 B1TSGPI6K01 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0xD034820.
Oct  8 03:47:27 B1TSGPI6K01 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0xD034820.
Oct  8 03:54:58 B1TSGPI6K01 hwmond: E0000(1)<00077>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 77 VINST: C3ALARM0 VPINST: CONTROLLER_C3 EVENT: 124 TEXT: Buffer ECC error corrected: address=0xD034820
Oct  8 03:54:58 B1TSGPI6K01 hwmond: E0000(1)<00077>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 77 VINST: C3ALARM1 VPINST: CONTROLLER_C3 EVENT: 124 TEXT: Buffer ECC error corrected: address=0xD034820
Oct 12 10:05:44 B1TSGPI6K01 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0xD034820.
Oct 12 10:12:58 B1TSGPI6K01 hwmond: E0000(1)<00077>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 77 VINST: C3ALARM0 VPINST: CONTROLLER_C3 EVENT: 124 TEXT: Buffer ECC error corrected: address=0xD034820
Oct 14 12:05:42 B1TSGPI6K01 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0xD034800.
Oct 14 12:05:43 B1TSGPI6K01 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0xD034820.
Oct 14 12:14:58 B1TSGPI6K01 hwmond: E0000(1)<00077>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 77 VINST: C3ALARM0 VPINST: CONTROLLER_C3 EVENT: 124 TEXT: Buffer ECC error corrected: address=0xD034800
Oct 14 12:14:58 B1TSGPI6K01 hwmond: E0000(1)<00077>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 77 VINST: C3ALARM1 VPINST: CONTROLLER_C3 EVENT: 124 TEXT: Buffer ECC error corrected: address=0xD034820
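
The messages file reports each correction twice: once from the kernel driver and again, a few minutes later, from hwmond. To see how often each buffer address was corrected without double counting, one option is to restrict the search to the kernel lines. This is only a sketch: the function name ecc_addresses is hypothetical, it assumes the driver tag is 3w-9xxx as in the excerpt above, and it relies on grep -o.

```shell
# Hypothetical helper: count corrected buffer addresses reported by the kernel driver.
# Only "3w-9xxx" lines are used so the later hwmond copies are not counted again.
ecc_addresses() {
    # $1: path to the messages file (live system or collect bundle)
    grep "3w-9xxx" "$1" \
        | grep "Buffer ECC error corrected" \
        | grep -o 'address=0x[0-9A-Fa-f]*' \
        | sort \
        | uniq -c
}

# Example: ecc_addresses /var/log/messages
```

Against the excerpt above, this would report one correction at 0xD034800 and four at 0xD034820, which makes it easier to see whether the corrections keep recurring at the same locations.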

 

 

2) Searching the 3ware logs for "Buffer ECC" should confirm which controller has the problem. In this case, Controller 3 (C3) is the controller at fault.

 

a) $ grep "Buffer ECC error corrected" node1-3wcollection/3wdiags*
 

node1-3wcollection/3wdiags_c3.log:c3   [Tue Oct 08 2013 03:47:27]  WARNING   Buffer ECC error corrected: address=0xD034820
node1-3wcollection/3wdiags_c3.log:c3   [Tue Oct 08 2013 03:47:27]  WARNING   Buffer ECC error corrected: address=0xD034820
node1-3wcollection/3wdiags_c3.log:c3   [Sat Oct 12 2013 10:05:44]  WARNING   Buffer ECC error corrected: address=0xD034820
node1-3wcollection/3wdiags_c3.log:c3   [Mon Oct 14 2013 12:05:42] WARNING   Buffer ECC error corrected: address=0xD034800
node1-3wcollection/3wdiags_c3.log:c3   [Mon Oct 14 2013 12:05:42] WARNING   Buffer ECC error corrected: address=0xD034820
node1-3wcollection/3wdiags_c3.log:Buffer ECC error corrected
node1-3wcollection/3wdiags_c3.log:Buffer ECC error corrected
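
If only the name of the affected controller log is needed, grep's standard -l option prints the matching file names instead of the matching lines. A minimal sketch, with the hypothetical function name controllers_with_ecc and the assumption that every controller log in the collection matches 3wdiags*, as in the command above:

```shell
# Hypothetical helper: list the 3ware diagnostic logs that mention the error.
controllers_with_ecc() {
    # $1: directory holding the 3wdiags* files (e.g. node1-3wcollection)
    grep -l "Buffer ECC error corrected" "$1"/3wdiags*
}

# Example: controllers_with_ecc node1-3wcollection
```

Given the grep output shown above, only 3wdiags_c3.log would be listed for this collection.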

 

b) You can then directly examine the Controller 3 log (3wdiags_c3.log). The log is located at /var/log/DXi/3wdiags_c3.log (on a live system) or ../node1-3wcollection/3wdiags_c3.log (from a 3ware collect).

 

The log shows the same error messages, and its diagnostic section records each correction as "Recovery complete without retries".

 

//3wdiags_c3.log:

... 

/opt/DXi/3ware/tw_cli /c3/bbu show all


Ctl  Date                        Severity  AEN Message
------------------------------------------------------------------------------
c3   [Mon Oct 07 2013 13:24:30]  INFO      Battery capacity test is overdue
c3   [Tue Oct 08 2013 03:47:27]  WARNING   Buffer ECC error corrected: address=0xD034820
c3   [Tue Oct 08 2013 03:47:27]  WARNING   Buffer ECC error corrected: address=0xD034820
c3   [Thu Oct 10 2013 00:05:23]  INFO      Verify started: unit=0
c3   [Thu Oct 10 2013 00:05:24]  INFO      Verify started: unit=1
c3   [Thu Oct 10 2013 01:18:42]  INFO      Verify completed: unit=0
c3   [Fri Oct 11 2013 00:02:08]  INFO      Verify started: unit=1
c3   [Fri Oct 11 2013 02:51:51]  INFO      Verify completed: unit=1
c3   [Sat Oct 12 2013 10:05:44]  WARNING   Buffer ECC error corrected: address=0xD034820
c3   [Mon Oct 14 2013 12:05:42]  WARNING   Buffer ECC error corrected: address=0xD034800
c3   [Mon Oct 14 2013 12:05:42]  WARNING   Buffer ECC error corrected: address=0xD034820
c3   [Mon Oct 14 2013 13:13:24]  INFO      Battery charging started
c3   [Mon Oct 14 2013 13:13:26]  INFO      Battery charging completed
c3   [Mon Oct 14 2013 13:21:55]  INFO      Battery capacity test is overdue

 

Diagnostic Information on Controller //B1TSGPI6K01/c3 ...
---------------------------------------------------------

E=0216 T=12:05:33     : Corrected SBUF ECC
E=0216 T=12:05:33     : Buffer address 0x0D034800
Send AEN (code, time): 0039h, 10/14/2013 12:05:33
Buffer ECC error corrected
(EC:0x39, SK=0x01, ASC=0x40, ASCQ=0x00, SEV=02, Type=0x71)
address=0xD034800
DQS FC=110004
DQS LE=1201004
E=0216 T=12:05:33     : Recovery complete without retries

E=0216 T=12:05:33     : Corrected SBUF ECC
E=0216 T=12:05:33     : Buffer address 0x0D034820
Send AEN (code, time): 0039h, 10/14/2013 12:05:33
Buffer ECC error corrected
(EC:0x39, SK=0x01, ASC=0x40, ASCQ=0x00, SEV=02, Type=0x71)
address=0xD034820
DQS FC=110004
DQS LE=1201004
E=0216 T=12:05:33     : Recovery complete without retries

 


Solution

Although the errors were corrected, the issue is serious enough that the controller card should be replaced.

 

Refer to comment 3 of PTR 14677 and this LSI (3Ware) KB article: https://www.3ware.com/3warekb/article.aspx?id=15424
 
A support ticket should probably also be opened with LSI (3Ware), the card should be sent for failure analysis, or both.

 


 

 


