Duplicate Volume Found During Hardware Expansion

SR Information: 1576782

Product / Software Version: 2.2.1 - Issue found on DXi 45xx and possibly detectable in other platforms (excluding issue related to bug 33488)

Problem Description: Data Loss during rebuild due to puncture stripe

Related PTRs:

Bug 33225 - JBOD cleaning script can delete the wrong devices

Overview

The DXi will fail to execute the expansion if it finds existing labels on the newly added disks.

This article shows how you can identify and handle the issue, as well as what data you need to collect before you escalate to your backline in case assistance is needed.

Note: The log information, command output, and data collection outlined below were gathered from an existing SR. You may see different device information or results. If in doubt, we recommend that you escalate the case to your backline team.

This article covers the following topics:

Additional Information
Symptom (How to Identify the Problem)
Resolution (Workaround or Fix)

Additional Information

Before and after any expansion, it is highly recommended that you collect the following information, which will help diagnose issues that may occur:

Gather the output of the following commands:

# /usr/cvfs/bin/cvlabel -L

# /opt/DXi/3ware/3waretool.sh --map

(applicable for DXi systems with a 3ware controller)

# find /sys -name block:sd\*

# hexdump -C device -s 00000200 -n 20000 > device.out

(where device is the device node of the disk, /dev/sd##, used for the snfs filesystem. The command must be run once per device, so depending on the number of devices you may want to write a shell script to collect this information)

Example:

hexdump -C /dev/sdm -s 00000200 -n 20000 > sdm.out
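
Because the number of snfs devices varies from system to system, the hexdump step can be wrapped in a small loop. The following is a minimal sketch, not a supported tool: the function name dump_label_area and the output directory are illustrative, and the hexdump arguments are the same ones used in the example above.

```shell
#!/bin/bash
# Sketch: dump the start of each listed device into <outdir>/<name>.out.
# dump_label_area is a hypothetical helper name; the hexdump arguments
# (-C, -s 00000200, -n 20000) are the ones used in this article.
dump_label_area() {
    local outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    local dev name
    for dev in "$@"; do
        name=$(basename "$dev")
        if ! hexdump -C -s 00000200 -n 20000 "$dev" > "$outdir/$name.out"; then
            # Keep going so one unreadable device does not stop the collection.
            echo "WARNING: could not dump $dev" >&2
        fi
    done
}

# Example (the device list is site-specific):
#   dump_label_area /tmp/labels-before /dev/sdm /dev/sdq
```

Running it once before and once after the expansion, with different output directories, leaves two sets of files that can be compared with diff.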

Gather the following logs:
- 3ware collect (or the equivalent storage collect that is available on the DXi platform you are working with)
- DXi collect

Gather copies of the following files:
- /usr/cvfs/config/vol0.cfg
- /opt/DXi/3ware/mapfile* (applicable for DXi systems with a 3ware controller)
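
The command output and files listed in this section can be gathered into a single directory, once before and once after the expansion. The sketch below is only an illustration: save_output and collect_expansion_info are hypothetical helper names, the output directory is up to you, and the 3ware items apply only to systems with a 3ware controller. It does not replace the 3ware collect or the DXi collect.

```shell
#!/bin/bash
# Sketch: run one command and keep its output (stdout and stderr) in a file.
# save_output is a hypothetical helper; a failing command is recorded in the
# output file instead of stopping the collection.
save_output() {
    local outfile=$1
    shift
    if ! "$@" > "$outfile" 2>&1; then
        echo "command failed: $*" >> "$outfile"
        return 1
    fi
}

# Sketch: collect the command output and files named in this article.
collect_expansion_info() {
    local outdir=$1
    mkdir -p "$outdir" || return 1
    save_output "$outdir/cvlabel-L.txt" /usr/cvfs/bin/cvlabel -L
    # Only meaningful on DXi systems with a 3ware controller.
    save_output "$outdir/3ware-map.txt" /opt/DXi/3ware/3waretool.sh --map
    save_output "$outdir/sys-block.txt" find /sys -name 'block:sd*'
    # Missing files are silently skipped here; check the directory afterwards.
    cp -p /usr/cvfs/config/vol0.cfg /opt/DXi/3ware/mapfile* "$outdir"/ 2>/dev/null
    return 0
}

# Example:
#   collect_expansion_info /tmp/expansion-before
```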

Symptom (How to Identify the Problem)

The messages log will present the following event:

Note that the log information was collected from an existing SR. Your device information may be different.

(32484) Jun 24 15:15:02 ustchqtm1 systemupgrade: [CheckVolumes] : disable_foreign_volumes() - Disabling volume: /dev/sdm /c2/u0 0017522C9B7C2500F160 SSD

(32485) Jun 24 15:15:03 ustchqtm1 srvclogcli: E0000(1)<00008>:SRVCLOG RCOMP: 8 RINST: CheckVolumes VCOMP: 8 VINST: CheckVolumes VPINST: UNKNOWN EVENT: 41 TEXT: Found duplicate volume /c2/u0, disabling volume. Ticket creation time: 06/24 15:15:03 CDT

(32486) Jun 24 15:15:03 ustchqtm1 systemupgrade: [CheckVolumes] : disable_foreign_volumes() - Disabling volume: /dev/sdq /c2/u2 YGJA3Z1A9B7C3400A650 DATA

(32489) Jun 24 15:15:04 ustchqtm1 srvclogcli: E0000(1)<00008>:SRVCLOG RCOMP: 8 RINST: CheckVolumes VCOMP: 8 VINST: CheckVolumes VPINST: UNKNOWN EVENT: 41 TEXT: Found duplicate volume /c2/u2, disabling volume. Ticket creation time: 06/24 15:15:04 CDT
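
To pull the affected device nodes and units out of a large messages log, the CheckVolumes lines can be filtered with awk. This is a minimal sketch that assumes the log lines have the same layout as the sample above; list_disabled_volumes is a hypothetical helper name, and the default log path /var/log/messages is an assumption.

```shell
#!/bin/bash
# Sketch: list the volumes that CheckVolumes disabled.
# Prints "<device node> <controller/unit>" for each matching log line.
# list_disabled_volumes is a hypothetical helper; the default path is assumed.
list_disabled_volumes() {
    local logfile=${1:-/var/log/messages}
    awk '/disable_foreign_volumes\(\) - Disabling volume:/ {
        # The device node and the unit are the two fields after "volume:".
        for (i = 1; i < NF; i++)
            if ($i == "volume:") { print $(i + 1), $(i + 2); break }
    }' "$logfile"
}

# Example:
#   list_disabled_volumes /var/log/messages
```

Against the four sample lines above, this prints "/dev/sdm /c2/u0" and "/dev/sdq /c2/u2"; the srvclogcli lines are ignored because they do not contain the disable_foreign_volumes() text.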

Here's an additional symptom that may or may not be found on a 3ware array, depending on the port to which the EM is connected:

For this case, please consult PTR 33225 for the workaround/fix.

Before the rescan: Output of "tw_cli /c2 show" reports Units U1 and U3 under Enclosure 0

Unit  UnitType  Status  %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u1    RAID-1    OK      -       -       -       186.254   RiW    OFF
u3    RAID-6    OK      -       -       256K    14901.1   RiW    OFF

VPort  Status  Unit  Size       Type  Phy  Encl-Slot      Model
------------------------------------------------------------------------------
p10    OK      u1    186.31 GB  SATA  -    /c2/e0/slt0    INTEL SSDSA2BZ200G3
p11    OK      u1    186.31 GB  SATA  -    /c2/e0/slt1    INTEL SSDSA2BZ200G3
p12    OK      u3    1.82 TB    SATA  -    /c2/e0/slt2    Hitachi HUA723020AL
p13    OK      u3    1.82 TB    SATA  -    /c2/e0/slt3    Hitachi HUA723020AL
p14    OK      u3    1.82 TB    SATA  -    /c2/e0/slt4    Hitachi HUA723020AL
p15    OK      u3    1.82 TB    SATA  -    /c2/e0/slt5    Hitachi HUA723020AL
p16    OK      u3    1.82 TB    SATA  -    /c2/e0/slt6    Hitachi HUA723020AL
p17    OK      u3    1.82 TB    SATA  -    /c2/e0/slt7    Hitachi HUA723020AL
p18    OK      u3    1.82 TB    SATA  -    /c2/e0/slt8    Hitachi HUA723020AL
p19    OK      u3    1.82 TB    SATA  -    /c2/e0/slt9    Hitachi HUA723020AL
p20    OK      u3    1.82 TB    SATA  -    /c2/e0/slt10   Hitachi HUA723020AL
p21    OK      u3    1.82 TB    SATA  -    /c2/e0/slt11   Hitachi HUA723020AL

After the Rescan: Output of "tw_cli /c2 show" now reports Units U1 and U3 under Enclosure 1 and the new Units U0 and U2 under Enclosure 0

Unit  UnitType  Status  %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK      -       -       -       186.254   RiW    OFF
u1    RAID-1    OK      -       -       -       186.254   RiW    OFF
u2    RAID-6    OK      -       -       256K    14901.1   RiW    OFF
u3    RAID-6    OK      -       -       256K    14901.1   RiW    OFF

VPort  Status  Unit  Size       Type  Phy  Encl-Slot      Model
------------------------------------------------------------------------------
p8     OK      u0    186.31 GB  SATA  -    /c2/e0/slt0    STEC SSDSA M16ISD2-
p9     OK      u0    186.31 GB  SATA  -    /c2/e0/slt1    STEC SSDSA M16ISD2-
p10    OK      u2    1.82 TB    SATA  -    /c2/e0/slt2    Hitachi HUA723020AL
p11    OK      u1    186.31 GB  SATA  -    /c2/e1/slt0    INTEL SSDSA2BZ200G3
p12    OK      u1    186.31 GB  SATA  -    /c2/e1/slt1    INTEL SSDSA2BZ200G3
p13    OK      u3    1.82 TB    SATA  -    /c2/e1/slt2    Hitachi HUA723020AL
p14    OK      u2    1.82 TB    SATA  -    /c2/e0/slt3    Hitachi HUA723020AL
p15    OK      u2    1.82 TB    SATA  -    /c2/e0/slt4    Hitachi HUA723020AL
p16    OK      u2    1.82 TB    SATA  -    /c2/e0/slt5    Hitachi HUA723020AL
p17    OK      u2    1.82 TB    SATA  -    /c2/e0/slt6    Hitachi HUA723020AL
p18    OK      u3    1.82 TB    SATA  -    /c2/e1/slt3    Hitachi HUA723020AL
p19    OK      u3    1.82 TB    SATA  -    /c2/e1/slt4    Hitachi HUA723020AL
p20    OK      u3    1.82 TB    SATA  -    /c2/e1/slt5    Hitachi HUA723020AL
p21    OK      u3    1.82 TB    SATA  -    /c2/e1/slt6    Hitachi HUA723020AL
p22    OK      u2    1.82 TB    SATA  -    /c2/e0/slt7    Hitachi HUA723020AL
p23    OK      u2    1.82 TB    SATA  -    /c2/e0/slt8    Hitachi HUA723020AL
p24    OK      u2    1.82 TB    SATA  -    /c2/e0/slt9    Hitachi HUA723020AL
p25    OK      u2    1.82 TB    SATA  -    /c2/e0/slt10   Hitachi HUA723020AL
p26    OK      u3    1.82 TB    SATA  -    /c2/e1/slt7    Hitachi HUA723020AL
p27    OK      u3    1.82 TB    SATA  -    /c2/e1/slt8    Hitachi HUA723020AL
p28    OK      u3    1.82 TB    SATA  -    /c2/e1/slt9    Hitachi HUA723020AL
p29    OK      u3    1.82 TB    SATA  -    /c2/e1/slt10   Hitachi HUA723020AL
p30    OK      u2    1.82 TB    SATA  -    /c2/e0/slt11   Hitachi HUA723020AL
p31    OK      u3    1.82 TB    SATA  -    /c2/e1/slt11   Hitachi HUA723020AL
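
The enclosure change is easier to spot when the VPort table is reduced to one line per unit and enclosure. The sketch below assumes the "tw_cli /c2 show" output was saved to a text file before and after the rescan and has the column layout shown above; unit_enclosures is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: summarize which enclosure each unit's drives sit in, based on a
# saved copy of the "tw_cli /cX show" output.
# unit_enclosures is a hypothetical helper name.
unit_enclosures() {
    awk '$1 ~ /^p[0-9]+$/ {
        # Find the Encl-Slot field, for example /c2/e0/slt0.
        for (i = 4; i <= NF; i++)
            if ($i ~ /^\/c[0-9]+\/e[0-9]+\/slt[0-9]+$/) {
                split($i, part, "/")    # part[2]=c2, part[3]=e0, part[4]=slt0
                print $3, part[3]       # unit and enclosure
            }
    }' "$1" | sort -u
}

# Example (file names are placeholders):
#   diff <(unit_enclosures before.txt) <(unit_enclosures after.txt)
```

For the two outputs shown above, the "before" summary is "u1 e0" and "u3 e0", and the "after" summary is "u0 e0", "u1 e1", "u2 e0", and "u3 e1", which makes the move of the existing units to Enclosure 1 visible at a glance.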

******

Resolution (Workaround or Fix)

There are two options to solve this issue.

Option 1:

If allowed, you may request another EM module to perform the expansion.

Option 2:

Important Notes Before You Use This Option:

This workaround was executed specifically for a 67xx. If you find similar issues and cannot apply Option 1, please escalate to your backline.
If you also face a condition or symptom other than those covered in this article, please escalate the case to your backline.

The cvlabel command can remove the label from a LUN. Before removing a label, make sure of the following:

- Be certain about which LUNs you will be removing the label from.
- Make sure that the LUNs (from the EM) have not yet been added to the snfs configuration (vol0.cfg).

If you have doubts, please consult your senior engineer, or escalate to a backline engineer, providing the information that you collected as advised above.

How to check that the LUNs (from the EM) have not been added yet:

The binary that expands the snfs is cvupdatefs. Information about the expansion is available in the file expandFS.log (gathered by collect in the app-info directory).

1. Look at the expandFS.log file and confirm that the expansion has not been done. If it was done, please escalate the case.

2. Check the vol0.cfg file (gathered by collect and available in the snfs-info directory) to confirm that the LUNs of the EM have not been added to the snfs.

3. After you confirm that the snfs is clean (steps 1 and 2) and there is no reference to the LUNs of the new EM, continue with the steps below.

4. Check whether mapfile.txt lists the new LUNs with labels. There are two possible conditions:

4.1. The new LUNs of the new EM have labels in mapfile.txt, but the output of cvlabel -L shows no labels on those LUNs.

-- If this is the case, you can remove the new LUN entries, reboot the DXi, and try to expand again.

4.2. The new LUNs of the new EM have labels in mapfile.txt, and the output of cvlabel -L also shows labels on those LUNs.

-- If this is the case, proceed with the next step.

5. Remove the snfs label.

ATTENTION: Do not execute this command on the wrong LUN:

# cvlabel -U /dev/sd##

where /dev/sd## is the device node of the new LUN (from the new EM) that already had a label during the expansion process.

6. At this point, you may reboot and retry the expansion. If the expansion still does not work, repeat the steps above and then clean the JBOD (new EM). The procedure to clean the JBOD is available in the Field Service Manual (but make sure you do not hit the issue reported under bug 33225).
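
Because removing a label from the wrong LUN is destructive, a small pre-check can help before the cvlabel -U step. The sketch below covers only the vol0.cfg check; it does not inspect expandFS.log or mapfile.txt, and it never runs cvlabel itself. check_before_unlabel is a hypothetical helper name, and the label argument is the snfs label that cvlabel -L reports for the device, which you supply yourself.

```shell
#!/bin/bash
# Sketch: stop if the label is referenced in the snfs configuration.
# Nothing here runs cvlabel; the command is only printed for the operator.
# check_before_unlabel is a hypothetical helper; the label is supplied by the
# operator from the "cvlabel -L" output.
check_before_unlabel() {
    local dev=$1 label=$2 cfg=${3:-/usr/cvfs/config/vol0.cfg}
    if [ ! -r "$cfg" ]; then
        echo "STOP: cannot read $cfg" >&2
        return 2
    fi
    # -F: literal match, -w: whole word, so label_1 does not match label_10.
    if grep -qwF -- "$label" "$cfg"; then
        echo "STOP: $label is referenced in $cfg; escalate to backline" >&2
        return 1
    fi
    echo "No reference to $label in $cfg. After verifying the device, run:"
    echo "  cvlabel -U $dev"
}

# Example (device node and label are placeholders):
#   check_before_unlabel /dev/sd## LABEL
```

Printing the command instead of executing it keeps the final decision with the engineer, which matches the ATTENTION note above.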

Additional Note:

On the DXi, the following file may be created during the expansion process:

/opt/DXi/FilesystemExpansionInProgress

It is an empty file.

If the expansion process was interrupted without any other error and you need to continue the expansion, you can create this file (using the touch command); the DXi will then try to resume the expansion on the next reboot.

When following the procedure above, you may want to check whether this file exists.
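
A quick way to check for the file is shown below. This is a minimal sketch; expansion_flag_status is a hypothetical helper name, and the default path is the one given in this article.

```shell
#!/bin/bash
# Sketch: report whether the expansion-in-progress flag file exists.
# expansion_flag_status is a hypothetical helper; the default path is the one
# named in this article.
expansion_flag_status() {
    local flag=${1:-/opt/DXi/FilesystemExpansionInProgress}
    if [ -e "$flag" ]; then
        echo "present: $flag"
    else
        echo "absent: $flag"
        return 1
    fi
}

# Example:
#   expansion_flag_status
```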