Deleting an LSI/3ware "Ghost" RAID Unit After a Controller Malfunction

SR Information: SR1609118.

Product / Software Version: Issue found on DXi6500 with 2.2.1.2 software. However, it can affect other DXi platforms that use the LSI/3ware controllers, such as the 9690 and 9750.

Problem Description: Five 1-TB drives showed a degraded state across multiple controllers after the monthly verification completed, and hwmond had problems communicating with the 3ware controllers.

Reference (possible): PTRs 30615, 32638, 32818, and 3480

Solution Summary

Correct the RAID array “ghost/invalid” units and upgrade the FW version from 2.2.1.2 to 2.2.1.3, as detailed below.

Special Note

Working with RAID sets can be a very sensitive task. If you need assistance for some reason, or you have any questions, please contact Service Engineering before running commands like the ones in this article. Take extra precautions when deleting units.

If you accidentally DELETE a unit, it will be NON-RECOVERABLE, and DATA LOSS WILL OCCUR!
Please make sure to delete a unit ONLY when you are 100% sure that it is a foreign unit/RAID, and it is NOT part of the original configuration.
As an example, a RAID6 configuration for the DXi takes 10 drives, so any RAID6 unit with fewer than 10 drives may be a questionable unit.
Quantum strongly recommends that you contact Service Engineering with any questions BEFORE you delete a unit, or if you are uncertain about anything.

Overview

This article gives procedures for analyzing and solving a problem created on a DXi system when an LSI/3ware "ghost" RAID unit was created after a controller malfunction. The main sections are as follows:

1.0 Case Scenario

1.1 Identifying the Problem
1.2 DXi Reboot and Sample Error Listings
1.3 Communication Errors in the Logs
1.4 3ware Controller Errors and Drive Errors
1.5 RAS Alerts Issued After Bootup

2.0 Identifying and Fixing the Problem

2.1 Identifying the Correct (Valid Units) RAIDs for Each Controller
2.2 Identifying the Correct Number of Volumes (Valid Units) per LSI/3ware Controller Card
2.3 Identifying and Fixing Incorrect or Foreign Volumes, Testing Drives, and Adding Back the Good Drives
2.4 Checking to Ensure That All Is OK

3.0 Requesting Additional Assistance

1.0 Case Scenario

A “near” DCB problem was encountered because the 3ware controller was severely busy and hwmond failed to communicate with the 3ware controllers.

The DXi 3ware controller was doing its monthly verify, which runs on the 10th of every month.
Hwmond encountered problems communicating with the 3ware controllers, and some “tw_cli: page allocation failure” errors occurred.
The DXi then received a Termination signal to halt, due to the problems resulting from the miscommunication.
Manual corrective action had to be taken to correct this problem.

1.1 Identifying the Problem

Our priorities are the following:

Identify what components failed.
Determine if the components actually failed and/or are bad, or if they were “faulted” due to other issues such as a bad or faulty controller or a FW defect (as in this case).
Correct the problem by replacing the bad parts, or by “reviving” the parts that are deemed good and usable.

First, make sure the DXi is not in a reboot loop:

Log onto the DXi via the serial connection, or via ssh.
Run the command "uptime", or run "tail -f /var/log/messages", to see if the DXi is powering up or down.
If the DXi is in a reboot loop, give the command "chkconfig heartbeat off". This prevents any possible data corruption and even full data loss.
When you have finished fixing all of the 3ware-related problems by following the procedures in this article, give the command "chkconfig heartbeat on" before you reboot the DXi. This ensures that everything comes up normally and that the DXi comes back online when the customer reboots it.

1.2 DXi Reboot and Sample Error Listings

The DXi will now reboot several times. The descriptive lines below introduce the detailed listings that follow them.

Before the DXi reboots, you will see several tw_cli page allocation errors:

Sep 10 14:40:24 si-bkupdedup05 kernel: tw_cli: page allocation failure. order:0, mode:0x10d0

Sep 10 14:40:24 si-bkupdedup05 kernel:

Sep 10 14:40:24 si-bkupdedup05 kernel: Call Trace:

Sep 10 14:40:24 si-bkupdedup05 kernel: [<ffffffff8000f504>] __alloc_pages+0x2b5/0x2ce

Sep 10 14:40:24 si-bkupdedup05 kernel: [<ffffffff800728fb>] dma_alloc_pages+0xa3/0x106

Sep 10 14:40:24 si-bkupdedup05 kernel: [<ffffffff8002207f>] dma_alloc_coherent+0x79/0x1c3

Sep 10 14:40:24 si-bkupdedup05 kernel: [<ffffffff880b7fa4>] :3w_9xxx:twa_chrdev_ioctl+0xc6/0x674

Sep 10 14:40:24 si-bkupdedup05 kernel: [<ffffffff8015b635>] list_add+0xc/0xe

Sep 10 14:40:24 si-bkupdedup05 kernel: [<ffffffff800496a1>] chrdev_open+0x0/0x183

Sep 10 14:40:24 si-bkupdedup05 kernel: [<ffffffff80042262>] do_ioctl+0x55/0x6b

Sep 10 14:40:25 si-bkupdedup05 kernel: [<ffffffff80030306>] vfs_ioctl+0x457/0x4b9

Sep 10 14:40:25 si-bkupdedup05 kernel: [<ffffffff800b85fd>] audit_syscall_entry+0x180/0x1b3

Sep 10 14:40:25 si-bkupdedup05 kernel: [<ffffffff8004c97d>] sys_ioctl+0x59/0x78

Sep 10 14:40:25 si-bkupdedup05 kernel: [<ffffffff8005e28d>] tracesys+0xd5/0xe0
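A quick way to see how often these failures occurred is to count them per process in the syslog. The following is a minimal sketch in Python, assuming the line format shown in the listing above; the function name `count_page_alloc_failures` is hypothetical and is not part of any DXi tooling.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: tw_cli: page allocation failure. order:0, mode:0x10d0"
PAGE_FAIL = re.compile(
    r"kernel: (?P<proc>\S+): page allocation failure\. order:(?P<order>\d+)"
)

def count_page_alloc_failures(lines):
    """Return a Counter mapping process name -> number of failure lines."""
    counts = Counter()
    for line in lines:
        match = PAGE_FAIL.search(line)
        if match:
            counts[match.group("proc")] += 1
    return counts

sample = [
    "Sep 10 14:40:24 si-bkupdedup05 kernel: tw_cli: page allocation failure. order:0, mode:0x10d0",
    "Sep 10 14:40:24 si-bkupdedup05 kernel: Call Trace:",
]
print(count_page_alloc_failures(sample))
```

To run it against a live system, pass an open file handle for /var/log/messages instead of the sample list.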

1.3 Communication Errors in the Logs

Due to the communication errors, you will start seeing errors in the logs:

mountd[30827]: export request from 127.0.0.1 fails.

Sep 10 15:01:03 si-bkupdedup05 kernel: Kernel logging (proc) stopped.

Sep 10 15:01:03 si-bkupdedup05 kernel: Kernel log daemon terminating.

Sep 10 15:01:04 si-bkupdedup05 exiting on signal 15

The last 3ware verification is now complete:

Sep 10 19:01:03 si-bkupdedup05 kernel: klogd 1.4.1, log source = /proc/kmsg started.

Sep 10 19:27:05 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x002B): Verify completed:unit=1.

Sep 10 19:30:05 si-bkupdedup05 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x002B): Verify completed:unit=3.

Sep 10 19:40:31 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x002B): Verify completed:unit=1.

The DXi gets a request to shut down:

Sep 10 20:01:03 si-bkupdedup05 kernel: Kernel logging (proc) stopped.

Sep 10 20:01:03 si-bkupdedup05 kernel: Kernel log daemon terminating.

Sep 10 20:01:04 si-bkupdedup05 exiting on signal 15

1.4 3ware Controller Errors and Drive Errors

When the DXi reboots, many errors are seen on several 3ware controllers and their drives (C1, C2 and C3):

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=0.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=1.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=2.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=3.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=4.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=5.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=6.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=7.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=8.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=9.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=10.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=11.

Sep 11 11:33:49 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0062): Enclosure removed:encl=0.

Sep 11 11:33:50 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=2, vport=30.

Sep 11 11:33:50 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=2, vport=25.

Sep 11 11:33:50 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, vport=9.

Sep 11 11:33:55 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=0.

Sep 11 11:33:55 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=1.

Sep 11 11:33:55 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=2.

Sep 11 11:33:55 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=3.

Sep 11 11:33:55 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=4.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=5.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=6.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=7.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=8.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=9.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=10.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=11.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: WARNING (0x04:0x0062): Enclosure removed:encl=0.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, vport=19.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, vport=18.

Sep 11 11:33:56 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, vport=9.

Sep 11 11:34:02 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=0.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=1.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=2.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=3.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=4.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=5.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=6.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=7.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=8.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=9.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=10.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=11.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: WARNING (0x04:0x0062): Enclosure removed:encl=0.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, vport=19.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, vport=18.

Sep 11 11:34:03 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, vport=9.

Sep 11 11:34:10 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=0.

Sep 11 11:34:10 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=2.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=0.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=1.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=2.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=3.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=4.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=5.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=6.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=7.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=8.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=9.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=10.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=1, slot=11.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0062): Enclosure removed:encl=1.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=3, vport=31.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=3, vport=29.

Sep 11 11:34:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit:unit=1, vport=12.

Sep 11 11:34:16 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=0.

Sep 11 11:34:16 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=1.

Sep 11 11:34:23 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=0.

Sep 11 11:34:23 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x001E): Unit inoperable:unit=1.

As the DXi continues to boot, additional drives are detected by the 3ware controller(s). This can be an indication that the drives were not ready when the controller scanned, a cable connectivity problem, a possible bad/slow drive, or a drive with errors that may need to be replaced. You can look at the 3ware Controller event logs to determine if a drive needs to be replaced.

Sep 11 11:37:17 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=11.

Sep 11 11:37:19 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=3.

Then the DXi rescans and finds the original units, PLUS additional units. These come from drives that have a signature but not enough information for the controller to determine whether they belong to an existing RAID/unit or to a foreign unit, so it identifies them as foreign and assigns them the next unit number (Ux). These extra units will not have enough drives to make a RAID6 or a 1x2 mirror, so they will be identified as “inoperable” because they are incomplete!

Sep 11 11:35:25 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0063): Enclosure added:encl=0.

Sep 11 11:35:26 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=0.

Sep 11 11:35:27 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=1.

Sep 11 11:35:27 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=0.

Sep 11 11:35:33 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=2.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=3.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=4.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=5.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=7.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=8.

Sep 11 11:35:36 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=9.

Sep 11 11:35:37 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=10.

Sep 11 11:35:37 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=11.

Sep 11 11:35:37 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=2.

Sep 11 11:35:41 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=6.

Sep 11 11:35:43 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=2.

Sep 11 11:35:43 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:35:53 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:00 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x0063): Enclosure added:encl=0.

Sep 11 11:36:02 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=0.

Sep 11 11:36:02 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=1.

Sep 11 11:36:02 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001F): Unit operational:unit=0.

Sep 11 11:36:03 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=2.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=3.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=5.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=6.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=7.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=9.

Sep 11 11:36:10 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=10.

Sep 11 11:36:11 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=11.

Sep 11 11:36:11 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:36:13 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:16 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=8.

Sep 11 11:36:17 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:36:18 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=4.

Sep 11 11:36:20 si-bkupdedup05 kernel: 3w-9xxx: scsi3: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:36:23 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:30 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x0063): Enclosure added:encl=0.

Sep 11 11:36:32 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=0.

Sep 11 11:36:32 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=1.

Sep 11 11:36:32 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001F): Unit operational:unit=0.

Sep 11 11:36:33 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=2.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=3.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=4.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=5.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=6.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=7.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=9.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=10.

Sep 11 11:36:40 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=11.

Sep 11 11:36:41 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:36:43 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:36:49 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001A): Drive inserted:encl=0, slot=8.

Sep 11 11:36:50 si-bkupdedup05 kernel: 3w-9xxx: scsi2: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:36:53 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:37:02 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0063): Enclosure added:encl=1.

Sep 11 11:37:03 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=0.

Sep 11 11:37:03 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=1.

Sep 11 11:37:03 si-bkupdedup05 cvlabel: using /usr/cvfs/config/raid-strings for raid type information

Sep 11 11:37:04 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=1.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=2.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=3.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=4.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=5.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=6.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=7.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=8.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=9.

Sep 11 11:37:12 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001A): Drive inserted:encl=1, slot=10.

Sep 11 11:37:13 si-bkupdedup05 kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x001F): Unit operational:unit=3.

When this happened, the DXi was shut down to get some manual assistance. We had the QFE reseat all 3ware HBAs, to ensure a good connection.

Sep 11 11:39:12 si-bkupdedup05 shutdown[11273]: shutting down for system halt

Sep 11 11:39:12 si-bkupdedup05 init: Switching to runlevel: 0

Sep 11 11:39:13 si-bkupdedup05 xinetd[10044]: Exiting...
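Reading long AEN listings by eye is error-prone. The following is a minimal Python sketch that parses the 3w-9xxx AEN lines in the format shown in this section and reports any drive that was logged as removed but never logged as inserted again. The helper names `parse_aen` and `drives_still_missing` are hypothetical, and the sketch assumes one AEN entry per line with integer values such as encl=0, slot=6.

```python
import re

# Example line:
#   "... kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0019): Drive removed:encl=0, slot=0."
AEN = re.compile(
    r"3w-9xxx: (?P<host>scsi\d+): AEN: (?P<severity>\w+) "
    r"\((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\): "
    r"(?P<message>[^:]+):(?P<args>.*?)\.?\s*$"
)

def parse_aen(line):
    """Return a dict describing one AEN line, or None if the line is not an AEN."""
    match = AEN.search(line)
    if match is None:
        return None
    event = match.groupdict()
    args = {}
    for pair in event.pop("args").split(","):
        key, sep, value = pair.strip().partition("=")
        if sep:
            args[key] = int(value)
    event["args"] = args
    return event

def drives_still_missing(lines):
    """Return the set of (host, encl, slot) removed and not re-inserted later."""
    missing = set()
    for line in lines:
        event = parse_aen(line)
        if event is None or "slot" not in event["args"]:
            continue
        drive = (event["host"], event["args"]["encl"], event["args"]["slot"])
        if event["message"] == "Drive removed":
            missing.add(drive)
        elif event["message"] == "Drive inserted":
            missing.discard(drive)
    return missing
```

An empty result only means every removed drive was seen again in the log. It does not show that the drive rejoined its original unit, so the unit listings from tw_cli still need to be checked.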

1.5 RAS Alerts Issued After Bootup

After bootup, the following RAS alerts are issued, indicating multiple RAID/unit failures and drive failures:

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00021>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 21 VINST: C1E0 VPINST: C1E0 EVENT: 7 TEXT: The RAID chassis C1E0 has failed.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C1E0SLT6 VPINST: C1E0 EVENT: 118 TEXT: [Hitachi HUA722010CLA330] Needs replacement or has been replaced and is being rebuilt.

Some drives are seen as foreign units, so the raidsets that the drives belonged to show up as “degraded” or “inoperable” ...

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C1U1V0 VPINST: C1E0 EVENT: 115 TEXT: DEGRADED

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C1U3V0 VPINST: C1E0 EVENT: 115 TEXT: DEGRADED

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00021>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 21 VINST: C1E1 VPINST: C1E1 EVENT: 7 TEXT: The RAID chassis C1E1 has failed.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C1E1SLT11 VPINST: C1E1 EVENT: 118 TEXT: [WDC WD1002FBYS-02A6B0] Needs replacement or has been replaced and is being rebuilt.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C1U1V0 VPINST: C1E1 EVENT: 115 TEXT: DEGRADED

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C1U3V0 VPINST: C1E1 EVENT: 115 TEXT: DEGRADED

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00021>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 21 VINST: C2E0 VPINST: C2E0 EVENT: 7 TEXT: The RAID chassis C2E0 has failed.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C2E0SLT8 VPINST: C2E0 EVENT: 118 TEXT: [Hitachi HUA722010CLA330] Needs replacement or has been replaced and is being rebuilt.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C2U1V0 VPINST: C2E0 EVENT: 115 TEXT: DEGRADED

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00021>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 21 VINST: C3E0 VPINST: C3E0 EVENT: 7 TEXT: The RAID chassis C3E0 has failed.

...and some foreign drives are identified:

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C3E0SLT4 VPINST: C3E0 EVENT: 118 TEXT: [Hitachi HUA722010CLA330] Needs replacement or has been replaced and is being rebuilt.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00023>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 23 VINST: C3E0SLT8 VPINST: C3E0 EVENT: 118 TEXT: [WDC WD1002FBYS-02A6B0] Needs replacement or has been replaced and is being rebuilt.

Sep 11 12:15:35 si-bkupdedup05 hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70 VINST: C3U1V0 VPINST: C3E0 EVENT: 115 TEXT: DEGRADED
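To see at a glance which drives and units were reported against each enclosure, the hwmond alerts can be grouped by their VPINST field. The following is a minimal Python sketch, assuming the SRVCLOG line format shown above with one alert per line; the function name `alerts_by_enclosure` is hypothetical, and the event numbers are reported as-is without interpretation.

```python
import re
from collections import defaultdict

# Example line:
#   "... hwmond: E0000(1)<00070>:SRVCLOG RCOMP: 1 RINST: UNKNOWN VCOMP: 70
#    VINST: C3U1V0 VPINST: C3E0 EVENT: 115 TEXT: DEGRADED"
RAS = re.compile(
    r"hwmond: .*?VINST: (?P<vinst>\S+) VPINST: (?P<vpinst>\S+) "
    r"EVENT: (?P<event>\d+) TEXT: (?P<text>.*)$"
)

def alerts_by_enclosure(lines):
    """Group alerts as {VPINST: [(VINST, event number, text), ...]}."""
    grouped = defaultdict(list)
    for line in lines:
        match = RAS.search(line.strip())
        if match:
            grouped[match.group("vpinst")].append(
                (match.group("vinst"), int(match.group("event")), match.group("text"))
            )
    return dict(grouped)
```

In this SR, for example, the C3E0 entry would list the drives in slots 4 and 8 followed by the DEGRADED report for C3U1V0.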

2.0 Identifying and Fixing the Problem

This section shows how you can identify the hardware that is causing the problem, and apply a fix.

2.1 Identifying the Correct (Valid Units) RAIDs for Each Controller

On any DXi6xxx with 3ware controllers, the node will have the four controllers shown below, by default. You can see this by giving the command:

/opt/DXi/3ware/tw_cli /c0 show

Results:

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

-----------------------------------------------------------------------------

u0 RAID-1 OK - - - 931.312 RiW OFF

u1 RAID-1 OK - - - 55.8691 RiW OFF

u2 RAID-1 OK - - - 55.8691 RiW OFF

u3 RAID-6 OK - - 256K 7450.5 RiW OFF

NOTE: All drives are on C0E0

Controllers 1, 2 and 3 may have multiple enclosures, as in the examples below.

C1 has 4 units, 2 for each array/EM.

You can see this for C1 by giving the following command:

/opt/DXi/3ware/tw_cli /c1 show

Results:

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 OK - - 256K 7450.5 RiW OFF

u2 RAID-1 OK - - - 55.8691 RiW OFF

u3 RAID-6 OK - - 256K 7450.5 RiW OFF

NOTE: This controller has two enclosures, E0 and E1, so any foreign incomplete/inoperable units would follow with U4, U5 etc.

NOTE: All port listings have been cut to make this document shorter.

VPort Status Unit Size Type Phy Encl-Slot Model

p8 OK u0 59.62 GB SATA - /c1/e0/slt0 SSDSA2SH064G1GC INT

p10 OK u2 59.62 GB SATA - /c1/e1/slt0 SSDSA2SH064G1GC INT

p12 OK u1 931.51 GB SATA - /c1/e0/slt2 Hitachi HUA722010CL

p13 OK u3 931.51 GB SATA - /c1/e1/slt2 Hitachi HUA722010CL

You can see this for C2 by giving the following command:

/opt/DXi/3ware/tw_cli /c2 show

Results: One Array/EM, only two units

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 OK - - 256K 7450.5 RiW OFF

You can see this for C3 by giving the following command:

/opt/DXi/3ware/tw_cli /c3 show

Results: One Array/EM, only two units

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 OK - - 256K 7450.5 RiW OFF

2.2 Identifying the Correct Number of Volumes (Valid Units) per LSI/3ware Controller Card

Please note the following:

The NODE will always have 4 units: U0 (BOOT), U1 (SSD1), U2 (SSD2) and U3 (DATA).
The IDs and /dev numbers can be seen in the /opt/DXi/3ware/mapfile.txt file.
Each Array/EM on C1, C2 and C3 will have the following:

● 1 (one) 1x2 mirror (60G, 100G or 200G SSD drives) (SSD)

● 1 (one) 1x10 RAID6 (1TB, 2TB or 3TB drives) (DATA)

A controller with 1 Array/EM will have U0 and U1.

A controller with 2 Arrays/EMs will have:

● Array/EM #1 U0 and U1

● Array/EM #2 U2 and U3
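The rule above (two units per Array/EM on C1, C2 and C3) can be expressed as a small check. The following is a minimal Python sketch under that assumption only; the function names are hypothetical, and the sketch does not apply to the node controller (C0), which always has four units.

```python
def expected_units(array_count):
    """Unit names expected on a C1/C2/C3 controller with `array_count` Arrays/EMs.

    Each Array/EM contributes one 1x2 SSD mirror and one 1x10 RAID6 data unit.
    """
    return [f"u{n}" for n in range(2 * array_count)]

def unexpected_units(seen_units, array_count):
    """Return the unit names that fall outside the expected set."""
    expected = set(expected_units(array_count))
    return sorted(u for u in seen_units if u.lower() not in expected)

# A controller with two Arrays/EMs that reports five units:
print(unexpected_units(["u0", "u1", "u2", "u3", "u4"], array_count=2))
```

A unit returned by `unexpected_units` is only a candidate for review. Confirm it against the drive listing, and with Service Engineering if in doubt, before deleting anything.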

2.3 Identifying and Fixing Incorrect or Foreign Volumes, Testing Drives, and Adding Back the Good Drives

At this point, we must identify and fix the incorrect or foreign volumes (invalid units) per LSI/3ware controller card, determine if the drives are still good, and if so, add them back to the corresponding raidset(s).

In this SR example, the following volumes per 3ware controller card were identified to be foreign:

Controller: C1

Problem: P30 and P31 became U4

Action: Need to delete unit U4 and make drives part of U3

Command given (before fix):

/opt/DXi/3ware/tw_cli /c1 show

Results:

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 OK - - 256K 7450.5 RiW OFF

u2 RAID-1 OK - - - 55.8691 RiW OFF

u3 RAID-6 DEGRADED - 256K 7450.5 RiW OFF #two drives missing P30 and P31

u4 RAID-6 INOPERABLE - 256K 7450.5 Ri OFF #should not exist

P30 and P31 should have been part of U3, but they were identified as a “foreign unit” and were labeled as U4:

p30 OK u4 931.51 GB SATA - /c1/e0/slt6 Hitachi HUA722010CL

p31 OK u4 931.51 GB SATA - /c1/e1/slt11 WDC WD1002FBYS-02A6

From the output above, we know the following:

● C1 has two EMs on it, so there should only be U0, U1, U2 and U3.
● U4 has ONLY two drives in a RAID 6 configuration, so this unit is NOT part of the original units or raidsets.
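The same reasoning can be applied to saved "tw_cli /cX show" output. The following is a minimal Python sketch that reads the unit rows and port rows in the layout shown above and lists RAID-6 units that are INOPERABLE and have fewer than 10 member drives. The function name `suspect_units` and the `raid6_drives` parameter are hypothetical; the sketch only reports candidates and never runs tw_cli or deletes anything. It needs the full port listing, not the shortened listings used in this article.

```python
import re
from collections import Counter

# Unit rows look like: "u4 RAID-6 INOPERABLE - 256K 7450.5 Ri OFF"
UNIT_ROW = re.compile(r"^(u\d+)\s+(RAID-\d+)\s+(\S+)", re.IGNORECASE)
# Port rows look like: "p30 OK u4 931.51 GB SATA - /c1/e0/slt6 Hitachi HUA722010CL"
PORT_ROW = re.compile(r"^(p\d+)\s+\S+\s+(u\d+)\s", re.IGNORECASE)

def suspect_units(show_output, raid6_drives=10):
    """Return [(unit, member_drive_count)] for incomplete, inoperable RAID-6 units."""
    units = {}
    members = Counter()
    for raw in show_output.splitlines():
        line = raw.strip()
        unit = UNIT_ROW.match(line)
        if unit:
            units[unit.group(1).lower()] = (unit.group(2).upper(), unit.group(3).upper())
            continue
        port = PORT_ROW.match(line)
        if port:
            members[port.group(2).lower()] += 1
    suspects = []
    for name, (raid_type, status) in units.items():
        incomplete = raid_type == "RAID-6" and members[name] < raid6_drives
        if status == "INOPERABLE" and incomplete:
            suspects.append((name, members[name]))
    return suspects

sample = """\
u3 RAID-6 DEGRADED - 256K 7450.5 RiW OFF
u4 RAID-6 INOPERABLE - 256K 7450.5 Ri OFF
p30 OK u4 931.51 GB SATA - /c1/e0/slt6 Hitachi HUA722010CL
p31 OK u4 931.51 GB SATA - /c1/e1/slt11 WDC WD1002FBYS-02A6
"""
print(suspect_units(sample))
```

Requiring both conditions keeps a merely degraded original unit, such as U3 here, off the list.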

To troubleshoot this issue, first look at the 3ware logs and make sure that there were no errors or issues with the two drives on ports p30 and p31. From the RCA done on this SR, we determined that this was due to the FW version of v22 on the 3ware controllers, which was corrected in FW 2.2.1.3. So, a FW upgrade was requested and performed.

Knowing this, and since no errors were found in the logs, we then did the following:

1. Remove the drive in question – this will remove the drive from the controller and will NOT keep the DCB information, meaning that it will become a “new” drive.

/opt/DXi/3ware/tw_cli /c1/p30 remove

/opt/DXi/3ware/tw_cli /c1 show

/opt/DXi/3ware/tw_cli /c1 rescan

The drive will then show up as follows:

p30 - u1 931.51 GB SATA - /c1/e0/slt6 Hitachi HUA722010CL

2. Add the drive back into the RAID that it belongs to:

/opt/DXi/3ware/tw_cli /c1/u3 start rebuild disk=30

3. Rescan to ensure that no other drives become foreign or fail:

/opt/DXi/3ware/tw_cli /c1 rescan

/opt/DXi/3ware/tw_cli /c1 show

4. When you see the drive “rebuilding” into U3, you can do the same thing with the drive in p31:

/opt/DXi/3ware/tw_cli /c1/p31 remove

/opt/DXi/3ware/tw_cli /c1 show

The drive will show up as follows:

p31 OK - 931.51 GB SATA - /c1/e1/slt11 WDC WD1002FBYS-02A6

5. Add the drive back into the raid that it belongs to.

/opt/DXi/3ware/tw_cli /c1/u3 start rebuild disk=31

6. Rescan to ensure that no other drives become foreign or fail, and that the two drives are “rebuilding”.

/opt/DXi/3ware/tw_cli /c1 rescan

7. Take a look at the results.

/opt/DXi/3ware/tw_cli /c1 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 OK - - 256K 7450.5 RiW OFF

u2 RAID-1 OK - - - 55.8691 RiW OFF

u3 RAID-6 REBUILDING 2% - 256K 7450.5 RiW OFF

Note: U4 is no longer present

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

p30 DEGRADED u3 931.51 GB SATA - /c1/e0/slt6 Hitachi HUA722010CL

p31 DEGRADED u3 931.51 GB SATA - /c1/e1/slt11 WDC WD1002FBYS-02A6
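The per-drive sequence used for C1 (remove, show, rescan, start rebuild, rescan, show) is the same for every drive, with only the controller, target unit, and port changing. As a planning aid, the following Python sketch builds that sequence as text so it can be reviewed before anything is typed. The helper name recovery_commands is hypothetical, and the sketch deliberately does not execute anything. Each command is run by hand, and the second drive is started only after the first shows as rebuilding.

```python
TW_CLI = "/opt/DXi/3ware/tw_cli"


def recovery_commands(controller, target_unit, port):
    """Return the command lines that move one drive into target_unit.

    Nothing is executed here; the list is meant to be reviewed and then
    entered manually, one line at a time.
    """
    ctl = f"/c{controller}"
    return [
        f"{TW_CLI} {ctl}/p{port} remove",   # drive leaves the ghost unit
        f"{TW_CLI} {ctl} show",             # confirm the port has no unit
        f"{TW_CLI} {ctl} rescan",
        f"{TW_CLI} {ctl}/u{target_unit} start rebuild disk={port}",
        f"{TW_CLI} {ctl} rescan",
        f"{TW_CLI} {ctl} show",             # target unit should be REBUILDING
    ]


if __name__ == "__main__":
    # Values from the C1 case: drive on p30 belongs in U3.
    for command in recovery_commands(controller=1, target_unit=3, port=30):
        print(command)
```

Printing the list first makes it easier to catch a wrong unit number before a rebuild is started on the wrong RAID set.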

Controller: C2

Problem: P19 became U2

Action: Need to remove the U2 drive and make it part of U1

BEFORE:

/opt/DXi/3ware/tw_cli /c2 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 DEGRADED - - 256K 7450.5 RiW OFF

u2 RAID-6 INOPERABLE - - 256K 7450.5 RiW OFF

------------------------------------------------------------------------------

p18 OK u1 931.51 GB SATA - /c2/e0/slt11 Hitachi HUA722010CL

p19 OK u2 931.51 GB SATA - /c2/e0/slt8 Hitachi HUA722010CL

1. Remove the drive in question – this will remove the drive from the controller and will NOT keep the DCB information, meaning that this drive will become a “new” drive.

/opt/DXi/3ware/tw_cli /c2/p19 remove

/opt/DXi/3ware/tw_cli /c2 show

The drive will show up as:

p19 OK - 931.51 GB SATA - /c2/e0/slt8 Hitachi HUA722010CL

2. Add the drive back into the raid that it belongs to.

/opt/DXi/3ware/tw_cli /c2/u1 start rebuild disk=19

3. Rescan to ensure that no other drives become foreign or fail, and that the drive is “rebuilding”.

/opt/DXi/3ware/tw_cli /c2 rescan

AFTER:

/opt/DXi/3ware/tw_cli /c2 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 REBUILDING 2% - 256K 7450.5 RiW OFF

Note: U2 is no longer present

------------------------------------------------------------------------------

p18 OK u1 931.51 GB SATA - /c2/e0/slt11 Hitachi HUA722010CL

p19 DEGRADED u1 931.51 GB SATA - /c2/e0/slt8 Hitachi HUA722010CL
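Comparing the BEFORE and AFTER listings by eye is easy to get wrong when several controllers are involved. The following Python sketch is an illustration under the assumption that both listings were saved as text. The function names unit_statuses and compare_listings are hypothetical. It reports which units disappeared and which units changed status between the two listings.

```python
import re

# Unit rows start with "uN"; the third column is the unit status.
UNIT_ROW = re.compile(r"^(u\d+)\s+\S+\s+(\S+)")

# Unit rows from the C2 BEFORE and AFTER listings in this article.
BEFORE_C2 = """\
u0 RAID-1 OK - - - 55.8691 RiW OFF
u1 RAID-6 DEGRADED - - 256K 7450.5 RiW OFF
u2 RAID-6 INOPERABLE - - 256K 7450.5 RiW OFF
"""

AFTER_C2 = """\
u0 RAID-1 OK - - - 55.8691 RiW OFF
u1 RAID-6 REBUILDING 2% - 256K 7450.5 RiW OFF
"""


def unit_statuses(show_output):
    """Return {unit: status} for the unit rows of a saved listing."""
    statuses = {}
    for raw in show_output.splitlines():
        match = UNIT_ROW.match(raw.strip())
        if match:
            statuses[match.group(1)] = match.group(2)
    return statuses


def compare_listings(before, after):
    """Return (units that disappeared, {unit: (old status, new status)})."""
    old = unit_statuses(before)
    new = unit_statuses(after)
    gone = sorted(set(old) - set(new))
    changed = {
        unit: (old[unit], new[unit])
        for unit in old
        if unit in new and old[unit] != new[unit]
    }
    return gone, changed


if __name__ == "__main__":
    print(compare_listings(BEFORE_C2, AFTER_C2))
```

For the C2 case, the expected outcome is that the ghost unit is gone and the degraded RAID-6 unit has moved to REBUILDING.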

Controller: C3

Problem: P18 and P19 became U3

Action: Need to remove the U3 disks and make them part of U1

BEFORE:

/opt/DXi/3ware/tw_cli /c3 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 OK - - 256K 7450.5 RiW OFF

u3 RAID-6 INOPERABLE - - 256K 7450.5 Ri OFF

P18 and P19 should have been part of U1, but because they were identified as a “foreign unit”, they were labeled as U3:

p18 OK u3 931.51 GB SATA - /c3/e0/slt8 WDC WD1002FBYS-02A6

p19 OK u3 931.51 GB SATA - /c3/e0/slt4 Hitachi HUA722010CL

From the output above, we know that C3 has one EM on it, so there should only be U0 and U1. U3 has ONLY two drives in a RAID 6 configuration, so this unit is NOT part of the original units or raid sets.

First, look at the 3ware logs, and make sure that there were no errors or issues with the two drives on ports p18 and p19. From the RCA done on this SR, we determined that the drives were incorrectly identified because the 3ware controllers had FW v22. This problem was corrected in FW 2.2.1.3, so a FW upgrade was requested and performed. Knowing this, and since no errors were found in the logs, we proceeded to do the following.

1. Remove the drive in question – this will remove the drive from the controller and will NOT keep the DCB information, so this drive will become a “new” drive.

/opt/DXi/3ware/tw_cli /c3/p18 remove

/opt/DXi/3ware/tw_cli /c3 show

/opt/DXi/3ware/tw_cli /c3 rescan

The drive will show up as:

p18 OK - 931.51 GB SATA - /c3/e0/slt8 WDC WD1002FBYS-02A6

2. Add the drive back into the RAID that it belongs to.

/opt/DXi/3ware/tw_cli /c3/u1 start rebuild disk=18

3. Rescan to ensure that no other drives become foreign or fail.

/opt/DXi/3ware/tw_cli /c3 rescan

/opt/DXi/3ware/tw_cli /c3 show

4. Once you see the drive “rebuilding” into U1, you can do the same thing with p19.

/opt/DXi/3ware/tw_cli /c3/p19 remove

/opt/DXi/3ware/tw_cli /c3 show

The drive will show up as:

p19 OK - 931.51 GB SATA - /c3/e0/slt4 Hitachi HUA722010CL

5. Add the drive back into the RAID it belongs to.

/opt/DXi/3ware/tw_cli /c3/u1 start rebuild disk=19

6. Rescan to ensure that no other drives become foreign or fail, and that the two drives are “rebuilding”.

/opt/DXi/3ware/tw_cli /c3 rescan

AFTER:

/opt/DXi/3ware/tw_cli /c3 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 REBUILDING 2% - 256K 7450.5 RiW OFF

Note: U3 is no longer present

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

p18 DEGRADED u1 931.51 GB SATA - /c3/e0/slt8 WDC WD1002FBYS-02A6

p19 DEGRADED u1 931.51 GB SATA - /c3/e0/slt4 Hitachi HUA722010CL
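When two drives have to be returned to the same unit, the second drive is handled only after the first is seen rebuilding. The following Python sketch shows one way to express that check against a saved listing. The function name drive_is_rebuilding is hypothetical, and the sample text is an assumed intermediate state modeled on the rows in this article, not captured output.

```python
import re

# Unit rows: name, type, status.  Port rows: name, status, owning unit.
UNIT_ROW = re.compile(r"^(u\d+)\s+\S+\s+(\S+)")
PORT_ROW = re.compile(r"^(p\d+)\s+\S+\s+(\S+)")

# Assumed state after p18 was added back and before p19 is touched.
SAMPLE_C3 = """\
u0 RAID-1 OK - - - 55.8691 RiW OFF
u1 RAID-6 REBUILDING 2% - 256K 7450.5 RiW OFF
p18 DEGRADED u1 931.51 GB SATA - /c3/e0/slt8 WDC WD1002FBYS-02A6
p19 OK - 931.51 GB SATA - /c3/e0/slt4 Hitachi HUA722010CL
"""


def drive_is_rebuilding(show_output, port, unit):
    """True when `port` is assigned to `unit` and `unit` reports REBUILDING."""
    unit_status = None
    port_unit = None
    for raw in show_output.splitlines():
        line = raw.strip()
        unit_row = UNIT_ROW.match(line)
        if unit_row and unit_row.group(1) == unit:
            unit_status = unit_row.group(2)
        port_row = PORT_ROW.match(line)
        if port_row and port_row.group(1) == port:
            port_unit = port_row.group(2)
    return unit_status == "REBUILDING" and port_unit == unit


if __name__ == "__main__":
    print(drive_is_rebuilding(SAMPLE_C3, "p18", "u1"))  # True
    print(drive_is_rebuilding(SAMPLE_C3, "p19", "u1"))  # False
```

A False result for the first drive means the listing should be checked again before the next drive is removed.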

2.4 Checking to Ensure That All Is OK

As a final check, run a “show” command against all controllers to ensure that all is OK:

/opt/DXi/3ware/tw_cli /c0 show

NOTE: Make sure that all units have an "OK", "REBUILDING", or "INITIALIZING" status.

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 931.312 RiW OFF

u1 RAID-1 OK - - - 55.8691 RiW OFF

u2 RAID-1 OK - - - 55.8691 RiW OFF

u3 RAID-6 OK - - 256K 7450.5 RiW OFF

All drives should show an “OK” status (only port P8 is shown here; there can be as many as 31 ports, depending on the number of arrays/EMs):

VPort Status Unit Size Type Phy Encl-Slot Model

------------------------------------------------------------------------------

p8 OK u1 59.62 GB SATA - /c0/e0/slt5 SSDSA2SH064G1GC INT

Name OnlineState BBUReady Status Volt Temp Hours LastCapTest

---------------------------------------------------------------------------

bbu On Yes OK OK OK 0 xx-xxx-xxxx

Look at the status for C1:

/opt/DXi/3ware/tw_cli /c1 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 OK - - 256K 7450.5 RiW OFF

u2 RAID-1 OK - - - 55.8691 RiW OFF

u3 RAID-6 OK - - 256K 7450.5 RiW OFF

All drives should show an “OK” status, or "DEGRADED" if the unit is still being rebuilt (only port P8 is shown here; there can be as many as 31 ports, depending on the number of arrays/EMs):

VPort Status Unit Size Type Phy Encl-Slot Model

------------------------------------------------------------------------------

p8 OK u0 59.62 GB SATA - /c1/e0/slt0 SSDSA2SH064G1GC INT

Name OnlineState BBUReady Status Volt Temp Hours LastCapTest

---------------------------------------------------------------------------

bbu On Yes OK OK OK 0 xx-xxx-xxxx

Look at the status for C2:

/opt/DXi/3ware/tw_cli /c2 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 OK - - 256K 7450.5 RiW OFF

All drives should show an “OK” status (only port P8 is shown here; there can be as many as 31 ports, depending on the number of arrays/EMs):

VPort Status Unit Size Type Phy Encl-Slot Model

------------------------------------------------------------------------------

p8 OK u0 59.62 GB SATA - /c2/e0/slt0 SSDSA2SH064G1GC INT

Name OnlineState BBUReady Status Volt Temp Hours LastCapTest

---------------------------------------------------------------------------

bbu On Yes OK OK OK 0 xx-xxx-xxxx

Look at the status for C3:

/opt/DXi/3ware/tw_cli /c3 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy

------------------------------------------------------------------------------

u0 RAID-1 OK - - - 55.8691 RiW OFF

u1 RAID-6 OK - - 256K 7450.5 RiW OFF

All drives should show an “OK” status (only port P8 is shown here; there can be as many as 31 ports, depending on the number of arrays/EMs):

VPort Status Unit Size Type Phy Encl-Slot Model

------------------------------------------------------------------------------

p8 OK u0 59.62 GB SATA - /c3/e0/slt0 SSDSA2SH064G1GC INT

Name OnlineState BBUReady Status Volt Temp Hours LastCapTest

---------------------------------------------------------------------------

bbu On Yes OK OK OK 0 xx-xxx-xxxx
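The final check applies the same rule to every controller: units must be OK, rebuilding, or initializing, and drives must be OK, or DEGRADED only while their unit is rebuilding. The following Python sketch applies that rule to a saved listing. It is an illustration, not a supported tool. The name find_problems is hypothetical, and the status spelling INITIALIZING is an assumption about how the controller reports the initialize state.

```python
import re

UNIT_ROW = re.compile(r"^(u\d+)\s+\S+\s+(\S+)")
PORT_ROW = re.compile(r"^(p\d+)\s+(\S+)\s+(\S+)")

# Unit statuses treated as acceptable (INITIALIZING spelling is assumed).
UNIT_OK = {"OK", "REBUILDING", "INITIALIZING"}

# Rows taken from the final C2 listing and the C1 "before fix" listing.
HEALTHY = """\
u0 RAID-1 OK - - - 55.8691 RiW OFF
u1 RAID-6 OK - - 256K 7450.5 RiW OFF
p8 OK u0 59.62 GB SATA - /c2/e0/slt0 SSDSA2SH064G1GC INT
bbu On Yes OK OK OK 0 xx-xxx-xxxx
"""

BROKEN = """\
u3 RAID-6 DEGRADED - - 256K 7450.5 RiW OFF
u4 RAID-6 INOPERABLE - - 256K 7450.5 Ri OFF
p30 OK u4 931.51 GB SATA - /c1/e0/slt6 Hitachi HUA722010CL
"""


def find_problems(show_output):
    """Return (name, status) pairs for units and drives that need attention."""
    units = {}
    ports = []
    for raw in show_output.splitlines():
        line = raw.strip()
        unit_row = UNIT_ROW.match(line)
        if unit_row:
            units[unit_row.group(1)] = unit_row.group(2)
            continue
        port_row = PORT_ROW.match(line)
        if port_row:
            ports.append(port_row.groups())  # (port, status, unit)
    problems = [(name, st) for name, st in units.items() if st not in UNIT_OK]
    for port, status, unit in ports:
        # A DEGRADED drive is tolerated only while its unit is rebuilding.
        rebuilding = status == "DEGRADED" and units.get(unit) == "REBUILDING"
        if status != "OK" and not rebuilding:
            problems.append((port, status))
    return problems


if __name__ == "__main__":
    print(find_problems(HEALTHY))
    print(find_problems(BROKEN))
```

An empty result for each of C0 through C3 corresponds to the all-OK state shown in the listings above. The BBU line is not evaluated by this sketch and still needs to be read manually.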

NOTE: Don’t forget to issue the command “chkconfig heartbeat on” before you reboot the DXi. This ensures that when the customer reboots the DXi, it comes fully online and does not go into diagnostics mode.

3.0 Requesting Additional Assistance

If you need further assistance, please contact Service Engineering before running any commands in question. Take extra precautions when you delete units. If you accidentally delete a unit, it will be NON-RECOVERABLE, and data loss will occur!