SR3569222 Volume not on preferred path due to AVT/RDAC failover

SR Information: 3569222 Sky Creative

 

Problem Description: Volume Not On Preferred Path, MD Array has Amber light

 

Product / Software Version:

 

MDC:

SNFS 5.1.1

M330 Appliance

 

 Overview

The MD array has its amber fault light on because volumes are not on their preferred path.

 

I/O Data Path Protection

 

When designing a storage area network (SAN), the duplication of host bus adapters (HBAs), cables, switches, controllers, and other components provides redundancy and can prevent loss of data access in the event of a component failure. This redundancy means that the host has one or more paths to each controller.

 

When creating volumes, a controller is assigned to own the volume and is referred to as the preferred owner. The preferred owner may be selected to achieve load balancing across controllers. Most host multi-path drivers will attempt to access each volume on a path to its preferred controller. However, if this preferred path becomes unavailable, the multi-path driver on the host will fail over to an alternate path. This failover might cause the volume ownership to change to the alternate controller.
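
On a Linux host using device-mapper-multipath (an assumption; the multipath driver actually in use depends on how the host is configured), the current path state per LUN can be inspected from the host side, for example:

    # List all multipath devices with their paths and path states (active/enabled/failed).
    multipath -ll

    # Optionally limit the output to a single device (mpathb is a placeholder name).
    multipath -ll mpathb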

 

 

 

Symptoms & Identifying the problem

 

## 1 ## Log Review:

 

Note: Usually the LSI collect is included in the Capture State of the MDC, located at /usr/adic/tmp/platform/hw-info/. However, this platform version is affected by Bug 51867 - Collect script on M-series is not gathering the LSI array collect information anymore.

 

The LSI collect therefore has to be run manually with the following command from either node:

 

    /usr/bin/SMcli -n Qarray1 -c "save storageArray supportData file=\"/tmp/array-1-supportData\";" > /dev/null 2>&1

 

The output of this command will be a 7z file called /tmp/array-1-supportData.7z.
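
To inspect the bundle on the MDC, the archive has to be extracted first. A minimal sketch, assuming the p7zip tools (7za) are available on the node; the Recovery Guru and major event log files referenced below are contained in the extracted bundle:

    # Extract the support bundle into a working directory (paths are examples).
    mkdir -p /tmp/array-1-supportData.d
    cd /tmp/array-1-supportData.d
    7za x /tmp/array-1-supportData.7z

    # The files reviewed below should now be present:
    ls recovery-guru-procedures.html major-event-log.txt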

 

 

recovery-guru-procedures.html

 

Failure Entry 1: NON_PREFERRED_PATH-Recovery Failure Type Code: 10


Storage array: Qarray1
Preferred owner: Controller in slot B
Current owner: Controller in slot A
Affected volume group: 1
Volume(s): TRAY_85_VOL_3, TRAY_85_VOL_4
Affected volume group: 3
Volume(s): TRAY_85_VOL_6

 

The Recovery Guru provides insight into which volumes are affected and how to resolve the problem, but it does not provide a detailed root cause.

 

major-event-log.txt

 

Using the MELD Parser gives a more granular view of the events.

 

[Screenshot: PrefferedPath2.JPG - MEL parser view of the events preceding the volume transfer]

 

 

We can see that two events were logged prior to the volume transfer (a raw-log search sketch follows the list):

 

  • Event 210A: Controller cache not enabled or internally disabled
  • IO Shipping: implicit volume transfer
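
If the parsed view is not at hand, the raw major-event-log.txt from the support bundle can be searched for these two event types directly; a rough sketch, assuming the decoded MEL text contains the event descriptions in plain text:

    # Search the raw major event log for the cache event (210A) and the IO shipping transfers.
    grep -i -E "210A|cache not enabled|implicit volume transfer|io shipping" major-event-log.txt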

 

Unfortunately the system messages do not date back that far, so we cannot analyze what happened on the host side.
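
Should the issue happen again while the host logs are still available, the host-side view can be correlated from the system messages; a hedged sketch, assuming a RHEL-style host with /var/log/messages and device-mapper-multipath:

    # Search current and rotated messages for path failures and failover events around the transfer time.
    grep -i -E "multipath|failover" /var/log/messages*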

 

 

 

## 2 ## Troubleshooting:

 

Verify the volume distribution and redistribute it using SMcli:

 

[root@ostcssnmdcp01 stornext]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show storageArray volumeDistribution;"
Volume name: TRAY_85_VOL_1
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_2
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_3
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_4
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_5
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_6
              Current owner is controller in slot: A

 

All Volumes are currently owned by Controller in Slot A

 

[root@ostcssnmdcp01 stornext]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show storageArray preferredVolumeOwners;"
Volume name: TRAY_85_VOL_1
              Preferred owner is controller in slot: A
Volume name: TRAY_85_VOL_2
              Preferred owner is controller in slot: A
Volume name: TRAY_85_VOL_3
              Preferred owner is controller in slot: B
Volume name: TRAY_85_VOL_4
              Preferred owner is controller in slot: B
Volume name: TRAY_85_VOL_5
              Preferred owner is controller in slot: A
Volume name: TRAY_85_VOL_6
              Preferred owner is controller in slot: B

 

We can list the preferred owner of each volume using "show storageArray preferredVolumeOwners;".

 

[root@ostcssnmdcp01 stornext]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "reset storageArray volumeDistribution;"
[root@ostcssnmdcp01 stornext]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show storageArray volumeDistribution;"
Volume name: TRAY_85_VOL_1
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_2
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_3
              Current owner is controller in slot: B
Volume name: TRAY_85_VOL_4
              Current owner is controller in slot: B
Volume name: TRAY_85_VOL_5
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_6
              Current owner is controller in slot: B

 

After resetting the volume distribution and reviewing it again, all volumes are back on their preferred owners.
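
After the redistribution, the overall array status can be re-checked to confirm the Recovery Guru condition has cleared and the amber fault LED goes out; a sketch assuming the "show storageArray healthStatus" SMcli command is supported by this firmware release (replace the password placeholder with the array password):

    # Confirm the array reports an optimal status after the volumes moved back.
    SMcli Qarray1a -n Qarray1 -p <password> -S -c "show storageArray healthStatus;"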

 

 

## 3 ## Root Cause:

What Caused the Problem?

There is a problem accessing the controller listed in the Recovery Guru Details area. Any volumes that have this controller assigned as their preferred path will be moved to the non-preferred path (alternate controller). 

Possible causes include:

  • The controller failed a manually initiated diagnostic test and was placed Offline.
  • The controller was manually placed Offline using the Hardware > Controller > Advanced > Place > Offline menu option.
  • There are disconnected or faulty cables.
  • A Hub or Fabric switch is not functioning properly.
  • A host adapter has failed.
  • The storage array contains a defective RAID controller.

 

Let's take a look at the NetApp KB article:

EF540 errors 'Volume not on preferred path due to AVT/RDAC failover'

Symptoms

EF540 reports the following error on three occasions and removes the volume from the preferred controller:

'Volume not on preferred path due to AVT/RDAC failover'

Cause

IO shipping feature on storage causes the volumes to transfer. Lack of the MPIO driver prevents proper handling of IO down multiple paths. There are other instances that can cause this issue and this is just one of them. Proper investigation of storage and host side logs helps narrow down to this conclusion.
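
To check the host against this cause, we can verify that the multipath driver is actually installed and running; a minimal sketch, assuming a RHEL 6-era host using device-mapper-multipath (the driver actually in use depends on the host configuration):

    # Is the DM-Multipath package installed?
    rpm -q device-mapper-multipath

    # Is the multipath daemon running and enabled at boot?
    service multipathd status
    chkconfig --list multipathd

    # Do all LUNs show healthy paths to both controllers?
    multipath -ll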

 

With the information available, we can't clearly state what the root cause was, since the host logs for that date and time are missing.

From the LSI collect we can rule out a faulty battery or a faulty controller; this appears to have been a one-time incident, most likely triggered by the IO shipping feature.

If this condition re-occurs, all relevant logs need to be collected in a timely manner, and if the root cause remains unknown, an escalation should be considered.

 

 

What we learn from this case:

  • How to manually run the LSI collect if it is missing from the Capture State
  • Using the Recovery Guru to identify and resolve the error condition
  • Showing and redistributing volumes to their preferred owner using SMcli


