Deleting an LSI/3ware "Ghost" RAID Unit After a Controller Malfunction

SR Information: SR1609118. 

 

Product / Software Version: Issue found on DXi6500 with 2.2.1.2 software. However, it can affect other DXi platforms that use the LSI/3ware controllers, such as the 9690 and 9750. 

 

Problem Description: Five 1-TB drives show degraded state across multiple controllers after monthly verification(s) were completed, and hwmond was having problems talking to the 3ware controllers. 

 

Reference (possible): PTRs 30615, 32638, 32818, and 3480

Solution Summary 

Correct the RAID array “ghost/invalid” units and upgrade the FW version from 2.2.1.2 to 2.2.1.3, as detailed below.

 


Special Note

Working with RAID sets can be a very sensitive task. If you need assistance for some reason, or you have any questions, please contact Service Engineering before running commands like the ones in this article. Take extra precautions when deleting units.

 


Overview

This article gives procedures for analyzing and solving a problem created on a DXi system when an LSI/3ware "ghost" RAID unit was created after a controller malfunction. The main sections are as follows:

 

1.0  Case Scenario

1.1  Identifying the Problem
1.2  DXi Reboot and Sample Error Listings
1.3  Communication Errors in the Logs
1.4  3ware Controller Errors and Drive Errors
1.5  RAS Alerts Issued After Bootup

 

2.0  Identifying and Fixing the Problem

2.1  Identifying the Correct (Valid Units) RAIDs for Each Controller
2.2  Identifying the Correct Number of Volumes (Valid Units) per LSI/3ware Controller Card
2.3  Identifying and Fixing Incorrect or Foreign Volumes, Testing Drives, and Adding Back the Good Drives
2.4  Checking to Ensure That All Is OK

 

3.0  Requesting Additional Assistance
 


1.0 Case Scenario

A “near” DCB problem was encountered because the 3ware controller was severely busy, and hwmond failed to communicate with the 3ware controllers.

 

  1. The DXi 3ware controller was doing its monthly verify, which runs on the 10th of every month.
  2. hwmond encountered problems communicating with the 3ware controllers, and some “tw_cli: page allocation failures” occurred.
  3. The DXi then received a Termination signal to halt, due to the problems resulting from the miscommunication.
  4. Manual corrective action had to be taken to correct this problem.

1.1 Identifying the Problem

Our priorities are the following:

 

  1. Identify what components failed.
  2. Determine if the components actually failed and/or are bad, or if they were “faulted” due to other issues such as a bad or faulty controller or a FW defect (as in this case).
  3. Correct the problem by replacing the bad parts, or by “reviving” the parts that are deemed good and usable.

First, make sure the DXi is not in a reboot loop:

 

  1. Log onto the DXi via the serial connection, or via ssh.

     

  2. Run the command "uptime", or run "tail -f /var/log/messages", to see if the DXi is powering up or down.

     

  3. If the DXi is in a reboot loop, give the command "chkconfig heartbeat off". This prevents any possible data corruption and even full data loss.

     

  4. When you have finished fixing all of the 3ware-related problems by following the procedures in this article, give the command "chkconfig heartbeat on" before you reboot the DXi. This will ensure that everything comes up normally and will ensure that when the DXi is rebooted by the customer, it will come online all 

1.2 DXi Reboot and Sample Error Listings

The DXi will now reboot several times. The short lead-in lines below explain the detailed listings that follow them.

 

Before the DXi reboots, you will see several tw_cli page allocation errors:

 

Sep 10 14:40:24 si-bkupdedup05 kernel: tw_cli: page allocation failure. order:0, mode:0x10d0

Sep 10 14:40:24 si-bkupdedup05 kernel:

Sep 10 14:40:24 si-bkupdedup05 kernel: Call Trace:

Sep 10 14:40:24 si-bkupdedup05 kernel:  [<ffffffff8000f504>] __alloc_pages+0x2b5/0x2ce

Sep 10 14:40:24 si-bkupdedup05 kernel:  [<ffffffff800728fb>] dma_alloc_pages+0xa3/0x106

Sep 10 14:40:24 si-bkupdedup05 kernel:  [<ffffffff8002207f>] dma_alloc_coherent+0x79/0x1c3

Sep 10 14:40:24 si-bkupdedup05 kernel:  [<ffffffff880b7fa4>] :3w_9xxx:twa_chrdev_ioctl+0xc6/0x674

Sep 10 14:40:24 si-bkupdedup05 kernel:  [<ffffffff8015b635>] list_add+0xc/0xe

Sep 10 14:40:24 si-bkupdedup05 kernel:  [<ffffffff800496a1>] chrdev_open+0x0/0x183

Sep 10 14:40:24 si-bkupdedup05 kernel:  [<ffffffff80042262>] do_ioctl+0x55/0x6b

Sep 10 14:40:25 si-bkupdedup05 kernel:  [<ffffffff80030306>] vfs_ioctl+0x457/0x4b9

Sep 10 14:40:25 si-bkupdedup05 kernel:  [<ffffffff800b85fd>] audit_syscall_entry+0x180/0x1b3

Sep 10 14:40:25 si-bkupdedup05 kernel:  [<ffffffff8004c97d>] sys_ioctl+0x59/0x78

Sep 10 14:40:25 si-bkupdedup05 kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0  
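To gauge how often this happened before the reboot, the failures can be counted per process. The following is only a minimal sketch: the pattern is taken from the sample listing above, the function name is hypothetical, and other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Pattern copied from the sample listing above (an assumption for other kernels).
PAGE_ALLOC = re.compile(r"kernel: (\S+): page allocation failure")


def count_page_alloc_failures(lines):
    """Return a Counter mapping process name to number of failures seen."""
    counts = Counter()
    for line in lines:
        match = PAGE_ALLOC.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Sep 10 14:40:24 si-bkupdedup05 kernel: tw_cli: page allocation failure. order:0, mode:0x10d0",
    "Sep 10 14:40:24 si-bkupdedup05 kernel: Call Trace:",
]
print(count_page_alloc_failures(sample))
```

In practice, the lines would come from /var/log/messages rather than from an inline sample.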

1.3 Communication Errors in the Logs

Due to the communication errors, you will start seeing errors in the logs:

 

mountd[30827]: export request from 127.0.0.1 fails.

Sep 10 15:01:03 si-bkupdedup05 kernel: Kernel logging (proc) stopped.

Sep 10 15:01:03 si-bkupdedup05 kernel: Kernel log daemon terminating.

Sep 10 15:01:04 si-bkupdedup05 exiting on signal 15

 

The last 3ware verification is now complete:

 

Sep 10 19:01:03 si-bkupdedup05 kernel: klogd 1.4.1, log source = /proc/kmsg started.

Sep 10 19:27:05 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x002B): Verify completed:unit=1.

Sep 10 19:30:05 si-bkupdedup05 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x002B): Verify completed:unit=3.

Sep 10 19:40:31 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x002B): Verify completed:unit=1.


The DXi gets a request to shut down:

 

Sep 10 20:01:03 si-bkupdedup05 kernel: Kernel logging (proc) stopped.

Sep 10 20:01:03 si-bkupdedup05 kernel: Kernel log daemon terminating.

Sep 10 20:01:04 si-bkupdedup05 exiting on signal 15
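The stop and start markers in the listings above can also be pulled out of the log to see how many times the DXi went down and came back up. This is a sketch only: it assumes the marker strings look exactly like the samples above, and the helper name is hypothetical. How many stops in a given period count as a reboot loop is a judgment call that this sketch does not make.

```python
import re

# Marker strings as they appear in the sample listings above.
STOP = re.compile(r"exiting on signal 15")
START = re.compile(r"klogd .* started")


def restart_events(lines):
    """Return (timestamp, "stop" | "start") tuples in log order."""
    events = []
    for line in lines:
        stamp = line[:15]  # syslog timestamp, e.g. "Sep 10 20:01:04"
        if STOP.search(line):
            events.append((stamp, "stop"))
        elif START.search(line):
            events.append((stamp, "start"))
    return events


sample = [
    "Sep 10 15:01:04 si-bkupdedup05 exiting on signal 15",
    "Sep 10 19:01:03 si-bkupdedup05 kernel: klogd 1.4.1, log source = /proc/kmsg started.",
    "Sep 10 20:01:04 si-bkupdedup05 exiting on signal 15",
]
for stamp, kind in restart_events(sample):
    print(stamp, kind)
```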

1.4 3ware Controller Errors and Drive Errors

When the DXi reboots, many errors are seen on several 3ware controllers (C1, C2 and C3) and their drives:

 

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=0.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=1.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=2.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=3.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=4.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=5.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=6.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=7.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=8.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=9.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=10.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=11.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0062): Enclosure removed:encl=0.

Sep 11 11:33:50 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=2, vport=30.

Sep 11 11:33:50 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=2, vport=25.

Sep 11 11:33:50 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, vport=9.

Sep 11 11:33:55 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=0.

Sep 11 11:33:55 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=1.

Sep 11 11:33:55 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=2.

Sep 11 11:33:55 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=3.

Sep 11 11:33:55 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=4.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=5.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=6.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=7.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=8.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=9.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=10.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=11.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0062): Enclosure removed:encl=0.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, vport=19.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, vport=18.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, vport=9.

Sep 11 11:34:02 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=0.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=1.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=2.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=3.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=4.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=5.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=6.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=7.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=8.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=9.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=10.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=11.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0062): Enclosure removed:encl=0.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, vport=19.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, vport=18.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, vport=9.

Sep 11 11:34:10 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=0.

Sep 11 11:34:10 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=2.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=0.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=1.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=2.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=3.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=4.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=5.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=6.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=7.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=8.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=9.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=10.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=11.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0062): Enclosure removed:encl=1.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=3, vport=31.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=3, vport=29.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, vport=12.

Sep 11 11:34:16 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=0.

Sep 11 11:34:16 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=1.

Sep 11 11:34:23 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=0.

Sep 11 11:34:23 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=1.

 

As the DXi continues to boot, additional drives are detected by the 3ware controller(s). This can be an indication that the drives were not ready when the controller scanned, a cable connectivity problem, a possible bad/slow drive, or a drive with errors that may need to be replaced.  You can look at the 3ware Controller event logs to determine if a drive needs to be replaced.

 

Sep 11 11:37:17 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=11.

Sep 11 11:37:19 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=3.
 

Then the DXi rescans and finds the original units, plus additional units. These are drives that carry a signature but not enough information for the controller to determine whether they belong to an existing RAID/unit or to a foreign unit, so it identifies them as foreign and assigns them the next available unit number (Ux). These extra units will not have enough drives to make up a RAID6 or a 1x2 mirror, so they will be identified as “inoperable” because they are incomplete.

 

Sep 11 11:35:25 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0063): Enclosure added:encl=0.

Sep 11 11:35:26 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=0.

Sep 11 11:35:27 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=1.

Sep 11 11:35:27 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=0.

Sep 11 11:35:33 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=2.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=3.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=4.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=5.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=7.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=8.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=9.

Sep 11 11:35:37 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=10.

Sep 11 11:35:37 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=11.

Sep 11 11:35:37 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=2.

Sep 11 11:35:41 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=6.

Sep 11 11:35:43 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=2.

Sep 11 11:35:43 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:35:53 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:00 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x0063): Enclosure added:encl=0.

Sep 11 11:36:02 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=0.

Sep 11 11:36:02 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=1.

Sep 11 11:36:02 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001F): Unit operational:unit=0.

Sep 11 11:36:03 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=2.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=3.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=5.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=6.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=7.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=9.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=10.

Sep 11 11:36:11 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=11.

Sep 11 11:36:11 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:36:13 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:16 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=8.

Sep 11 11:36:17 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:36:18 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=4.

Sep 11 11:36:20 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:36:23 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:30 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x0063): Enclosure added:encl=0.

Sep 11 11:36:32 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=0.

Sep 11 11:36:32 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=1.

Sep 11 11:36:32 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001F): Unit operational:unit=0.

Sep 11 11:36:33 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=2.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=3.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=4.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=5.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=6.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=7.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=9.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=10.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=11.

Sep 11 11:36:41 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:36:43 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:49 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=8.

Sep 11 11:36:50 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:36:53 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:37:02 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0063): Enclosure added:encl=1.

Sep 11 11:37:03 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=0.

Sep 11 11:37:03 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=1.

Sep 11 11:37:03 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:37:04 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=2.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=3.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=4.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=5.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=6.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=7.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=8.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=9.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=10.

Sep 11 11:37:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=3.
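With this many events, it is easy to miss a slot that was removed and never came back. As a rough sketch, the "Drive removed" and "Drive inserted" events can be paired up per SCSI host, enclosure and slot. The regular expression is derived from the sample lines above, and the function name is hypothetical.

```python
import re

# Derived from the AEN sample lines above; finditer copes with fused entries.
AEN_DRIVE = re.compile(
    r"3w-9xxx: (scsi\d+): AEN: \w+ \([^)]*\): "
    r"Drive (removed|inserted):encl=(\d+), slot=(\d+)"
)


def drives_not_reinserted(lines):
    """Return sorted (host, encl, slot) tuples removed but not seen again."""
    missing = set()
    for line in lines:
        for match in AEN_DRIVE.finditer(line):
            key = (match.group(1), int(match.group(3)), int(match.group(4)))
            if match.group(2) == "removed":
                missing.add(key)
            else:
                missing.discard(key)
    return sorted(missing)


sample = [
    "Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=0.",
    "Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=1.",
    "Sep 11 11:35:26 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=0.",
]
print(drives_not_reinserted(sample))
```

An empty result only means every removed slot was reported as inserted again. It says nothing about which unit the controller assigned the drive to, so the unit listings in section 2 still need to be checked.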

 

When this happened, the DXi was shut down to get some manual assistance. We had the QFE reseat all 3ware HBAs, to ensure a good connection.

 

Sep 11 11:39:12 si-bkupdedup05 shutdown[11273]: shutting down for system halt

Sep 11 11:39:12 si-bkupdedup05 init: Switching to runlevel: 0

Sep 11 11:39:13 si-bkupdedup05 xinetd[10044]: Exiting... 

1.5 RAS Alerts Issued After Bootup

After bootup, the following RAS alerts are issued, indicating multiple RAID/unit failures and drive failures:


Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00021>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 21 VINST: C1E0 VPINST: C1E0 EVENT: 7 TEXT: The RAID chassis C1E0 has failed.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C1E0SLT6 VPINST: C1E0 EVENT: 118 TEXT: [Hitachi HUA722010CLA330] Needs replacement or has been replaced and is being rebuilt.
 

Some drives are seen as foreign units, so the raidsets that the drives belonged to show up as “degraded” or “inoperable” ...

 

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C1U1V0 VPINST: C1E0 EVENT: 115 TEXT: DEGRADED

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C1U3V0 VPINST: C1E0 EVENT: 115 TEXT: DEGRADED

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00021>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 21 VINST: C1E1 VPINST: C1E1 EVENT: 7 TEXT: The RAID chassis C1E1 has failed.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C1E1SLT11 VPINST: C1E1 EVENT: 118 TEXT: [WDC WD1002FBYS-02A6B0] Needs replacement or has been replaced and is being rebuilt.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C1U1V0 VPINST: C1E1 EVENT: 115 TEXT: DEGRADED

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C1U3V0 VPINST: C1E1 EVENT: 115 TEXT: DEGRADED

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00021>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 21 VINST: C2E0 VPINST: C2E0 EVENT: 7 TEXT: The RAID chassis C2E0 has failed.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C2E0SLT8 VPINST: C2E0 EVENT: 118 TEXT: [Hitachi HUA722010CLA330] Needs replacement or has been replaced and is being rebuilt.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C2U1V0 VPINST: C2E0 EVENT: 115 TEXT: DEGRADED

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00021>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 21 VINST: C3E0 VPINST: C3E0 EVENT: 7 TEXT: The RAID chassis C3E0 has failed.

 

...and some foreign drives are identified:
 

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C3E0SLT4 VPINST: C3E0 EVENT: 118 TEXT: [Hitachi HUA722010CLA330] Needs replacement or has been replaced and is being rebuilt.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C3E0SLT8 VPINST: C3E0 EVENT: 118 TEXT: [WDC WD1002FBYS-02A6B0] Needs replacement or has been replaced and is being rebuilt.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C3U1V0 VPINST: C3E0 EVENT: 115 TEXT: DEGRADED
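The alerts above can be grouped by event number to get a quick list of the affected instances. This is a sketch under the assumption that every hwmond record sits on its own line and follows the field layout of the samples above. The function name is hypothetical, and no meaning is assigned to the event numbers beyond what the TEXT fields in the samples show.

```python
import re
from collections import defaultdict

# Field layout taken from the hwmond SRVCLOG samples above.
ALERT = re.compile(
    r"hwmond: .*? VINST: (\S+) VPINST: (\S+) EVENT: (\d+) TEXT: (.*)"
)


def summarize_alerts(lines):
    """Return {event number: sorted list of VINST values}."""
    summary = defaultdict(set)
    for line in lines:
        match = ALERT.search(line)
        if match:
            vinst, _vpinst, event, _text = match.groups()
            summary[int(event)].add(vinst)
    return {event: sorted(instances) for event, instances in summary.items()}


sample = [
    "Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C3E0SLT4 VPINST: C3E0 EVENT: 118 TEXT: [Hitachi HUA722010CLA330] Needs replacement or has been replaced and is being rebuilt.",
    "Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C3E0SLT8 VPINST: C3E0 EVENT: 118 TEXT: [WDC WD1002FBYS-02A6B0] Needs replacement or has been replaced and is being rebuilt.",
    "Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C3U1V0 VPINST: C3E0 EVENT: 115 TEXT: DEGRADED",
]
print(summarize_alerts(sample))
```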

 


2.0 Identifying and Fixing the Problem

This section shows how you can identify the hardware that is causing the problem, and apply a fix.

2.1 Identifying the Correct (Valid Units) RAIDs for Each Controller

On any DXi6xxx with 3ware controllers, the node will have the four controllers shown below (C0 through C3), by default. You can see the units on C0 by giving the command:

 

 /opt/DXi/3ware/tw_cli /c0 show

 

Results:

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

-----------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       931.312   RiW    OFF   

u1    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u2    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u3    RAID-6    OK             -       -       256K    7450.5    RiW    OFF   
 

NOTE: All drives are on C0E0

 

Controllers 1, 2 and 3 may have multiple enclosures, as in the examples below.

 

C1 has 4 units, 2  for each array/EM.

 

You can see this for C1 by giving the following command:

 

 /opt/DXi/3ware/tw_cli /c1 show

 

Results:

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u1    RAID-6    OK             -       -       256K    7450.5    RiW    OFF   

u2    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u3    RAID-6    OK             -       -       256K    7450.5    RiW    OFF 

 

NOTE: This controller has two enclosures, E0 and E1, so any foreign incomplete/inoperable units would follow with U4, U5 etc.

 

NOTE: All port listings have been cut to make this document shorter.

 

VPort Status         Unit Size      Type  Phy Encl-Slot    Model

p8    OK             u0   59.62 GB  SATA  -   /c1/e0/slt0  SSDSA2SH064G1GC INT

p10   OK             u2   59.62 GB  SATA  -   /c1/e1/slt0  SSDSA2SH064G1GC INT

p12   OK             u1   931.51 GB SATA  -   /c1/e0/slt2  Hitachi HUA722010CL

p13   OK             u3   931.51 GB SATA  -   /c1/e1/slt2  Hitachi HUA722010CL

 
You can see this for C2 by giving the following command:

 

/opt/DXi/3ware/tw_cli /c2 show

 

Results:  One Array/EM, only two units

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u1    RAID-6    OK             -       -       256K    7450.5    RiW    OFF   
 

You can see this for C3 by giving the following command:


/opt/DXi/3ware/tw_cli /c3 show

 

Results: One Array/EM , only two units

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u1    RAID-6    OK             -       -       256K    7450.5    RiW    OFF   

2.2 Identifying the Correct Number of Volumes (Valid Units) per LSI/3ware Controller Card

Please note the following:

 

  1. The NODE will always have 4 units: U0 (BOOT), U1 (SSD1), U2 (SSD2) and U3 (DATA).

  2. The IDs and /dev numbers can be seen in the /opt/DXi/3ware/mapfile.txt file.

  3. Each Array/EM on C1, C2 and C3 will have the following:

     

    ● 1 (one) 1x2 mirror (60G, 100G or 200G SSD drives) (SSD)

    ● 1  (one) 1x10 RAID6 (1TB, 2TB or 3TB drives) (DATA)  

  4. A controller with 1 Array/EM will have U0 and U1.

  5. A controller with 2 Arrays/EMs will have:

     

    ● Array/EM #1 U0 and U1

    ● Array/EM #2 U2 and U3  
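The rules above can be turned into a simple check against the unit table printed by "tw_cli /cX show". The following is a minimal sketch, not a supported tool: the function names are hypothetical, the row pattern is derived from the sample outputs in this article, and the number of Arrays/EMs per controller has to be supplied by the person running it.

```python
import re

# Matches unit rows such as "u4    RAID-6    INOPERABLE ..." in the show output.
UNIT_ROW = re.compile(r"^\s*(u\d+)\s+(RAID-\d+)\s+(\S+)", re.IGNORECASE)


def expected_units(controller, enclosures):
    """Return the unit names the rules above allow for one controller."""
    if controller == 0:
        # The node controller always carries BOOT, SSD1, SSD2 and DATA.
        return ["u0", "u1", "u2", "u3"]
    # C1 to C3: one mirror plus one RAID6 per Array/EM.
    return ["u%d" % n for n in range(2 * enclosures)]


def find_ghost_units(show_output, controller, enclosures):
    """List (unit, type, status) rows that fall outside the expected set."""
    expected = set(expected_units(controller, enclosures))
    ghosts = []
    for line in show_output.splitlines():
        match = UNIT_ROW.match(line)
        if match and match.group(1).lower() not in expected:
            ghosts.append((match.group(1).lower(), match.group(2), match.group(3)))
    return ghosts


sample = """\
u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF
u1    RAID-6    OK             -       -       256K    7450.5    RiW    OFF
u2    RAID-1    OK             -       -       -       55.8691   RiW    OFF
u3    RAID-6    DEGRADED       -       -       256K    7450.5    RiW    OFF
u4    RAID-6    INOPERABLE     -       -       256K    7450.5    Ri     OFF
"""
print(find_ghost_units(sample, controller=1, enclosures=2))
```

A unit flagged this way is only a candidate. Confirm it against the port listing and the controller logs before removing anything.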

2.3 Identifying and Fixing Incorrect or Foreign Volumes, Testing Drives, and Adding Back the Good Drives

At this point, we must identify and fix the incorrect or foreign volumes (invalid units) per LSI/3ware controller card, determine if the drives are still good, and if so, add them back to the corresponding raidset(s).

 

In this SR example, the following volumes per 3ware controller card were identified to be foreign:

 

Controller: C1

 

Problem: P30 and P31 became U4

Action:  Need to delete unit U4 and make drives part of U3

Command given (before fix):

 

 /opt/DXi/3ware/tw_cli /c1 show

 

Results:

 

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -        -      55.8691   RiW    OFF   

u1    RAID-6    OK             -       -       256K    7450.5    RiW    OFF   

u2    RAID-1    OK             -       -        -      55.8691   RiW    OFF   

u3    RAID-6    DEGRADED       -       -       256K    7450.5    RiW    OFF   #two drives missing P30 and P31

u4    RAID-6    INOPERABLE     -       -       256K    7450.5    Ri     OFF   #should not exist

      ~

 

P30 and P31 should have been part of U3, but they were identified as a “foreign unit” and were labeled as U4:

 

p30   OK             u4   931.51 GB SATA  -   /c1/e0/slt6  Hitachi HUA722010CL

p31   OK             u4   931.51 GB SATA  -   /c1/e1/slt11 WDC WD1002FBYS-02A6

 

From the output above, we know the following:

 

● C1 has two EMs on it, so there should only be U0, U1, U2 and U3.
● U4 has ONLY two drives in a RAID 6 configuration, so this unit is NOT part of the original units or raidsets.
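The second point can be checked by counting the drives listed for each unit in the port table. This is a sketch only: the full sizes (2 drives for a 1x2 mirror, 10 for a 1x10 RAID6) come from section 2.2, the row pattern is derived from the port listings in this article, and the function names are hypothetical. A degraded but valid unit will also come up short, so the result has to be read together with the unit table.

```python
import re
from collections import Counter

# Matches port rows such as "p30   OK   u4   931.51 GB SATA ...".
PORT_ROW = re.compile(r"^\s*(p\d+)\s+\S+\s+(u\d+|-)\s")
# Drive counts for a complete unit, per section 2.2.
FULL_SIZE = {"RAID-1": 2, "RAID-6": 10}


def drives_per_unit(port_listing):
    """Count how many ports are assigned to each unit."""
    counts = Counter()
    for line in port_listing.splitlines():
        match = PORT_ROW.match(line)
        if match and match.group(2) != "-":
            counts[match.group(2)] += 1
    return counts


def missing_drives(unit_types, counts):
    """Return {unit: number of drives short of a complete unit}."""
    return {
        unit: FULL_SIZE[kind] - counts.get(unit, 0)
        for unit, kind in unit_types.items()
        if counts.get(unit, 0) < FULL_SIZE[kind]
    }


ports = """\
p30   OK             u4   931.51 GB SATA  -   /c1/e0/slt6  Hitachi HUA722010CL
p31   OK             u4   931.51 GB SATA  -   /c1/e1/slt11 WDC WD1002FBYS-02A6
"""
counts = drives_per_unit(ports)
print(missing_drives({"u4": "RAID-6"}, counts))
```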

 
To troubleshoot this issue, first look at the 3ware logs and make sure that there were no errors or issues with the two drives on ports p30 and p31. From the RCA done on this SR, we determined that the drives were incorrectly identified because the 3ware controllers had FW v22. This problem was corrected in FW 2.2.1.3, so a FW upgrade was requested and performed.

 

Knowing this, and since no errors were found in the logs, we then did the following:

 

1. Remove the drive in question – this will remove the drive from the controller and will NOT keep the DCB information, meaning that it will become a “new” drive.

 

/opt/DXi/3ware/tw_cli /c1/p30 remove

/opt/DXi/3ware/tw_cli /c1 show

/opt/DXi/3ware/tw_cli /c1 rescan

 

The drive will then show up as follows:

 

p30   OK             -   931.51 GB SATA  -   /c1/e0/slt6  Hitachi HUA722010CL

 

2. Add the drive back into the RAID that it belongs to:

 

/opt/DXi/3ware/tw_cli /c1/u3 start rebuild disk=30

 

3. Rescan to ensure that no other drives become foreign or fail:

 

/opt/DXi/3ware/tw_cli /c1 rescan

/opt/DXi/3ware/tw_cli /c1 show

 

4. When you see the drive “rebuilding” into U3, you can do the same thing with the drive in p31:

 

/opt/DXi/3ware/tw_cli /c1/p31 remove

/opt/DXi/3ware/tw_cli /c1 show

 

The drive will show up as follows:

 

p31   OK             -   931.51 GB SATA  -   /c1/e1/slt11 WDC WD1002FBYS-02A6

 

5. Add the drive back into the RAID that it belongs to:

 

/opt/DXi/3ware/tw_cli /c1/u3 start rebuild disk=31

 

6. Rescan to ensure that no other drives become foreign or fail, and that the two drives are “rebuilding”.

 

/opt/DXi/3ware/tw_cli /c1 rescan

 

7. Take a look at the results. 

 

/opt/DXi/3ware/tw_cli /c1 show

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u1    RAID-6    OK             -       -       256K    7450.5    RiW    OFF   

u2    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u3    RAID-6    REBUILDING     2%       -      256K    7450.5    RiW    OFF  

 

Note: U4 is no longer present.

 

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

 

p30   DEGRADED           u3   931.51 GB SATA  -   /c1/e0/slt6  Hitachi HUA722010CL

 

p31   DEGRADED           u3   931.51 GB SATA  -   /c1/e1/slt11 WDC WD1002FBYS-02A6
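The per-drive sequence used on C1 can be laid out as a reviewable list before anything is typed at the console. This is a sketch only: `readd_commands` is a hypothetical helper that builds strings from the commands shown in this article and executes nothing. The example values (controller 1, port 30, unit 3) follow the Action line for C1.

```python
TW_CLI = "/opt/DXi/3ware/tw_cli"


def readd_commands(controller, port, unit):
    """Return the command lines for moving one drive back into its unit.

    Nothing is executed; the list is meant to be reviewed by a person first.
    """
    c = f"/c{controller}"
    return [
        f"{TW_CLI} {c}/p{port} remove",                     # drop the drive (DCB is not kept)
        f"{TW_CLI} {c} show",                               # confirm the port now shows no unit
        f"{TW_CLI} {c} rescan",
        f"{TW_CLI} {c}/u{unit} start rebuild disk={port}",  # add the drive to the intended unit
        f"{TW_CLI} {c} rescan",                             # check that nothing else went foreign
        f"{TW_CLI} {c} show",
    ]


for line in readd_commands(controller=1, port=30, unit=3):
    print(line)
```

Generating the list for one drive at a time matches the procedure above, where the second drive is handled only after the first is seen rebuilding.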

 

 

Controller: C2

 

Problem: P19 became U2

Action: Need to remove the U2 drive and make it part of U1

 

BEFORE:

 

/opt/DXi/3ware/tw_cli /c2 show

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u1    RAID-6    DEGRADED       -       -       256K    7450.5    RiW    OFF 

u2    RAID-6    INOPERABLE     -       -       256K    7450.5    RiW    OFF 

------------------------------------------------------------------------------

p18   OK            u1   931.51 GB SATA  -   /c2/e0/slt11 Hitachi HUA722010CL

p19   OK            u2   931.51 GB SATA  -   /c2/e0/slt8  Hitachi HUA722010CL


1.   Remove the drive in question – this will remove the drive from the controller and will NOT keep the DCB information, meaning that this drive will become a “new” drive.

 

/opt/DXi/3ware/tw_cli /c2/p19 remove

 

/opt/DXi/3ware/tw_cli /c2 show

 

Drive will show up as:

 

p19   OK             -   931.51 GB SATA  -   /c2/e0/slt8  Hitachi HUA722010CL


2.   Add the drive back into the RAID that it belongs to.

 

/opt/DXi/3ware/tw_cli /c2/u1 start rebuild disk=19


3.   Rescan to ensure that no other drives become foreign or fail and that the drive is “rebuilding”.

 

/opt/DXi/3ware/tw_cli /c2 rescan


AFTER:

 

/opt/DXi/3ware/tw_cli /c2 show

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u1    RAID-6    REBUILDING     2%       -       256K    7450.5    RiW    OFF 

 

Note: U2 is no longer present 

 

------------------------------------------------------------------------------

p18   OK             u1   931.51 GB SATA  -   /c2/e0/slt11 Hitachi HUA722010CL

p19   DEGRADED       u1   931.51 GB SATA  -   /c2/e0/slt8  Hitachi HUA722010CL
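The BEFORE/AFTER comparison for C2 can also be expressed as a simple check. The sketch below assumes the two listings have been saved as text; the function names are hypothetical, and the sample lines are taken from the C2 unit tables above.

```python
import re

UNIT_LINE = re.compile(r"^\s*(u\d+)\s+\S+\s+(\S+)")


def unit_status(text):
    """Map unit name -> status for each unit line in a "show" listing."""
    return {m.group(1): m.group(2)
            for m in map(UNIT_LINE.match, text.splitlines()) if m}


def fix_took_effect(before, after, ghost, target):
    """True if ghost existed before, is gone after, and target is rebuilding or OK."""
    was = unit_status(before)
    now = unit_status(after)
    return (ghost in was
            and ghost not in now
            and now.get(target) in ("REBUILDING", "OK"))


BEFORE = """\
u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF
u1    RAID-6    DEGRADED       -       -       256K    7450.5    RiW    OFF
u2    RAID-6    INOPERABLE     -       -       256K    7450.5    RiW    OFF
"""
AFTER = """\
u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF
u1    RAID-6    REBUILDING     2%      -       256K    7450.5    RiW    OFF
"""

print(fix_took_effect(BEFORE, AFTER, ghost="u2", target="u1"))  # True
```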

 

Controller: C3

 

Problem: P18 and P19 became U3

Action:  Need to remove U3 disks and make them part of U1

 

BEFORE:

 

/opt/DXi/3ware/tw_cli /c3 show

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u1    RAID-6    OK             -       -       256K    7450.5    RiW    OFF   

u3    RAID-6    INOPERABLE     -       -       256K    7450.5    Ri     OFF

    ~

 

P18 and P19 should have been part of U1, but they were identified as a “foreign unit” and labeled as U3:

 

p18   OK             u3   931.51 GB SATA  -   /c3/e0/slt8  WDC WD1002FBYS-02A6

p19   OK             u3   931.51 GB SATA  -   /c3/e0/slt4  Hitachi HUA722010CL

 

From the output above, we know that C3 has one EM on it, so there should only be U0 and U1. U3 has ONLY two drives in a RAID 6 configuration, so this unit is NOT part of the original units or raid sets.
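The second clue used here (a RAID 6 unit with only two member drives) comes from the port listing. A minimal sketch of counting members per unit follows; the helper name and the regular expression are assumptions, and the sample lines are the two C3 port lines shown above.

```python
import re
from collections import Counter

# Port lines look like: p18   OK   u3   931.51 GB SATA ...
# The unit column is "-" when the drive belongs to no unit.
PORT_LINE = re.compile(r"^\s*(p\d+)\s+(\S+)\s+(u\d+|-)\s")


def drives_per_unit(text):
    """Count how many ports report membership in each unit."""
    counts = Counter()
    for line in text.splitlines():
        m = PORT_LINE.match(line)
        if m and m.group(3) != "-":
            counts[m.group(3)] += 1
    return counts


PORTS = """\
p18   OK             u3   931.51 GB SATA  -   /c3/e0/slt8  WDC WD1002FBYS-02A6
p19   OK             u3   931.51 GB SATA  -   /c3/e0/slt4  Hitachi HUA722010CL
"""

counts = drives_per_unit(PORTS)
print(counts["u3"])  # 2
```

The count has to be read together with the unit type: two members is normal for a mirror unit, but not for a RAID 6 data unit in the layout described in section 2.2.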

 

First, look at the 3ware logs, and make sure that there were no errors or issues with the two drives on ports p18 and p19. From the RCA done on this SR, we determined that the drives were incorrectly identified because the 3ware controllers had FW v22. This problem was corrected in FW 2.2.1.3, so a FW upgrade was requested and performed. Knowing this, and since no errors were found in the logs, we proceeded to do the following.

 

1. Remove the drive in question – this will remove the drive from the controller and will NOT keep the DCB information, so this drive will become a “new” drive.

/opt/DXi/3ware/tw_cli /c3/p18 remove

/opt/DXi/3ware/tw_cli /c3 show

/opt/DXi/3ware/tw_cli /c3 rescan

The drive will show up as:

p18   OK             -   931.51 GB SATA  -   /c3/e0/slt8  WDC WD1002FBYS-02A6

2. Add the drive back into the RAID that it belongs to.

/opt/DXi/3ware/tw_cli /c3/u1 start rebuild disk=18

3. Rescan to ensure that no other drives become foreign or fail.

/opt/DXi/3ware/tw_cli /c3 rescan

/opt/DXi/3ware/tw_cli /c3 show

 

4. Once you see the drive “rebuilding” into U1, you can do the same thing with p19.

 

/opt/DXi/3ware/tw_cli /c3/p19 remove

/opt/DXi/3ware/tw_cli /c3 show

 

The drive will show up as:

 

p19   OK             -   931.51 GB SATA  -   /c3/e0/slt4  Hitachi HUA722010CL

 

5.   Add the drive back into the RAID it belongs to.

 

/opt/DXi/3ware/tw_cli /c3/u1 start rebuild disk=19

 

6.   Rescan to ensure that no other drives become foreign or fail, and that the two drives are “rebuilding”.

 

/opt/DXi/3ware/tw_cli /c3 rescan
 

AFTER:

 

/opt/DXi/3ware/tw_cli /c3 show

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u1    RAID-6    REBUILDING     2%      -       256K    7450.5    RiW    OFF

 

Note: U3 is no longer present

 

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

p18   DEGRADED             u1   931.51 GB SATA  -   /c3/e0/slt8  WDC WD1002FBYS-02A6

p19   DEGRADED             u1   931.51 GB SATA  -   /c3/e0/slt4  Hitachi HUA722010CL 

2.4 Checking to Ensure That All Is OK

As a final check, run a “show” command against all controllers to ensure that all is OK:


/opt/DXi/3ware/tw_cli /c0 show

               

NOTE: Make sure that all units have an "OK", "REBUILDING", or "INITIALIZING" status.

                 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       931.312   RiW    OFF   

u1    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u2    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u3    RAID-6    OK             -       -       256K    7450.5    RiW    OFF  
 

All drives should show an “OK” status (Only port P8 is shown here. There can be as many as 31 ports, depending on the number of Arrays/EMs):

 

VPort Status         Unit Size      Type  Phy Encl-Slot    Model

------------------------------------------------------------------------------

p8    OK             u1   59.62 GB  SATA  -   /c0/e0/slt5  SSDSA2SH064G1GC INT

~

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest

---------------------------------------------------------------------------

bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx 
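The unit-status rule from the NOTE above can be checked mechanically against a saved listing. This is a sketch under assumptions: the helper name is hypothetical, and the spelling "INITIALIZING" for the initialize state is an assumption about how tw_cli reports it.

```python
import re

# Statuses treated as acceptable for the final check.
# "INITIALIZING" is an assumed spelling of the initialize state.
ACCEPTABLE = {"OK", "REBUILDING", "INITIALIZING"}
UNIT_LINE = re.compile(r"^\s*(u\d+)\s+\S+\s+(\S+)")


def units_needing_attention(text):
    """Return (unit, status) pairs whose status is not in ACCEPTABLE."""
    problems = []
    for line in text.splitlines():
        m = UNIT_LINE.match(line)
        if m and m.group(2) not in ACCEPTABLE:
            problems.append((m.group(1), m.group(2)))
    return problems


# Unit lines from the C0 listing above.
C0_UNITS = """\
u0    RAID-1    OK             -       -       -       931.312   RiW    OFF
u1    RAID-1    OK             -       -       -       55.8691   RiW    OFF
u2    RAID-1    OK             -       -       -       55.8691   RiW    OFF
u3    RAID-6    OK             -       -       256K    7450.5    RiW    OFF
"""

print(units_needing_attention(C0_UNITS))  # []
```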

 

Look at the status for C1:

 

/opt/DXi/3ware/tw_cli /c1 show

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy


------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u1    RAID-6    OK             -       -       256K    7450.5    RiW    OFF   

u2    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u3    RAID-6    OK             -       -       256K    7450.5    RiW    OFF  

 

All drives should show an “OK” status, or “DEGRADED” if the unit is being rebuilt (Only port P8 is shown here. There can be as many as 31 ports, depending on the number of Arrays/EMs):

 

VPort Status         Unit Size      Type  Phy Encl-Slot    Model

------------------------------------------------------------------------------

p8    OK             u0   59.62 GB  SATA  -   /c1/e0/slt0  SSDSA2SH064G1GC INT

~

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest

---------------------------------------------------------------------------

bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx 

 

Look at the status for C2:

 

/opt/DXi/3ware/tw_cli /c2 show

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u1    RAID-6    OK             -       -       256K    7450.5    RiW    OFF   

                   

All drives should show an “OK” status (Only port P8 is shown here. There can be as many as 31 ports, depending on the number of Arrays/EMs):

VPort Status         Unit Size      Type  Phy Encl-Slot    Model

------------------------------------------------------------------------------

p8    OK             u0   59.62 GB  SATA  -   /c2/e0/slt0  SSDSA2SH064G1GC INT

~

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest

---------------------------------------------------------------------------

bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx 

 

Look at the status for C3:

 

/opt/DXi/3ware/tw_cli /c3 show

 

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy

------------------------------------------------------------------------------

u0    RAID-1    OK             -       -       -       55.8691   RiW    OFF   

u1    RAID-6    OK             -       -       256K    7450.5    RiW    OFF   

                   

All drives should show an “OK” status (Only port P8 is shown here. There can be as many as 31 ports, depending on the number of Arrays/EMs):

VPort Status         Unit Size      Type  Phy Encl-Slot    Model

------------------------------------------------------------------------------

p8    OK             u0   59.62 GB  SATA  -   /c3/e0/slt0  SSDSA2SH064G1GC INT

~

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest

---------------------------------------------------------------------------

bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx 
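The drive and BBU lines in these listings can be screened in the same way as the unit lines. The sketch below is illustrative only: the helper name is hypothetical, and treating "On / Yes / OK" as the healthy BBU state is an assumption based on the sample listings in this section.

```python
import re

PORT_LINE = re.compile(r"^\s*(p\d+)\s+(\S+)\s+(u\d+|-)\s")
BBU_LINE = re.compile(r"^\s*bbu\s+(\S+)\s+(\S+)\s+(\S+)")


def port_and_bbu_problems(text):
    """List ports that are neither OK nor DEGRADED, and a BBU that is not On/Yes/OK."""
    problems = []
    for line in text.splitlines():
        port = PORT_LINE.match(line)
        if port and port.group(2) not in ("OK", "DEGRADED"):
            problems.append(f"{port.group(1)}: {port.group(2)}")
        bbu = BBU_LINE.match(line)
        # Assumption: OnlineState=On, BBUReady=Yes, Status=OK means healthy.
        if bbu and bbu.groups() != ("On", "Yes", "OK"):
            problems.append("bbu: " + " ".join(bbu.groups()))
    return problems


# Port and BBU lines from the C3 listing above.
SAMPLE = """\
p8    OK             u0   59.62 GB  SATA  -   /c3/e0/slt0  SSDSA2SH064G1GC INT
bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx
"""

print(port_and_bbu_problems(SAMPLE))  # []
```

DEGRADED is accepted for ports because a drive shows that status while its unit is rebuilding. Once all rebuilds finish, every port is expected to read OK.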

 

NOTE: Don’t forget to issue the command “chkconfig heartbeat on” before you reboot the DXi. This ensures that when the DXi is rebooted by the customer, it comes fully online and does not go into diagnostics mode.

 


3.0 Requesting Additional Assistance

If you need further assistance, please contact Service Engineering before running any commands in question. Take extra precautions when you delete units. If you accidentally delete a unit, it will be NON-RECOVERABLE, and data loss will occur!

 

 

 


