SR3569222 Volume not on preferred path due to AVT/RDAC failover

SR Information: 3569222 Sky Creative

 

Problem Description: Volume Not On Preferred Path, MD Array has Amber light

 

Product / Software Version:

 

MDC:

SNFS 5.1.1

M330 Appliance

 

 Overview

The MD array has its amber fault light on because volumes are not on their preferred path.

 

I/O Data Path Protection

 

When designing a storage area network (SAN), the duplication of host bus adapters (HBAs), cables, switches, controllers, and other components provides redundancy and can prevent loss of data access in the event of a component failure. This redundancy means that the host has one or more paths to each controller.

 

When creating volumes, a controller is assigned to own the volume and is referred to as the preferred owner. The preferred owner may be selected to achieve load balancing across controllers. Most host multi-path drivers will attempt to access each volume on a path to its preferred controller. However, if this preferred path becomes unavailable, the multi-path driver on the host will fail over to an alternate path. This failover might cause the volume ownership to change to the alternate controller.
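
On a Linux host using device-mapper-multipath (an assumption; the multipath driver actually in use depends on how the host is configured), the current path state per LUN can be inspected from the host side, for example:

    # List all multipath devices with their paths and path states (active/enabled/failed).
    multipath -ll

    # Optionally limit the output to a single device (mpathb is a placeholder name).
    multipath -ll mpathb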

 

 

 

Symptoms & Identifying the problem

 

## 1 ## Log Review:

 

Note: Usually the LSI collect is included in the Capture State of the MDC, located at /usr/adic/tmp/platform/hw-info/. However, this platform version is affected by Bug 51867 - Collect script on M-series is not gathering the LSI array collect information anymore.

 

The LSI collect therefore has to be run manually with the following command from either node:

 

    /usr/bin/SMcli -n Qarray1 -c "save storageArray supportData file=\"/tmp/array-1-supportData\";" > /dev/null 2>&1

 

The output of this command will be a 7z file called /tmp/array-1-supportData.7z.
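
To inspect the bundle on the MDC, the archive has to be extracted first. A minimal sketch, assuming the p7zip tools (7za) are available on the node; the Recovery Guru and major event log files referenced below are contained in the extracted bundle:

    # Extract the support bundle into a working directory (paths are examples).
    mkdir -p /tmp/array-1-supportData.d
    cd /tmp/array-1-supportData.d
    7za x /tmp/array-1-supportData.7z

    # The files reviewed below should now be present:
    ls recovery-guru-procedures.html major-event-log.txt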

 

 

recovery-guru-procedures.html

 

Failure Entry 1: NON_PREFERRED_PATH-Recovery Failure Type Code: 10


Storage array: Qarray1
Preferred owner: Controller in slot B
Current owner: Controller in slot A
Affected volume group: 1
Volume(s): TRAY_85_VOL_3, TRAY_85_VOL_4
Affected volume group: 3
Volume(s): TRAY_85_VOL_6

 

The Recovery Guru provides insight into which volumes are affected and how to resolve the problem, but it does not provide a detailed root cause.

 

major-event-log.txt

 

Using the MELD Parser gives a more granular view of the events.

 

[Screenshot: PrefferedPath2.JPG - MEL parser view of the events preceding the volume transfer]

 

 

We can see that two events were logged prior to the volume transfer (a raw-log search sketch follows the list):

 

  • Event 210A: Controller cache not enabled or internally disabled
  • IO Shipping: implicit volume transfer
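
If the parsed view is not at hand, the raw major-event-log.txt from the support bundle can be searched for these two event types directly; a rough sketch, assuming the decoded MEL text contains the event descriptions in plain text:

    # Search the raw major event log for the cache event (210A) and the IO shipping transfers.
    grep -i -E "210A|cache not enabled|implicit volume transfer|io shipping" major-event-log.txt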

 

Unfortunately the system messages do not date back that far, so we cannot analyze what happened on the host side.
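
Should the issue happen again while the host logs are still available, the host-side view can be correlated from the system messages; a hedged sketch, assuming a RHEL-style host with /var/log/messages and device-mapper-multipath:

    # Search current and rotated messages for path failures and failover events around the transfer time.
    grep -i -E "multipath|failover" /var/log/messages*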

 

 

 

## 2 ## Troubleshooting:

 

Verify the volume distribution and redistribute it using SMcli:

 

[root@ostcssnmdcp01 stornext]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show storageArray volumeDistribution;"
Volume name: TRAY_85_VOL_1
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_2
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_3
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_4
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_5
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_6
              Current owner is controller in slot: A

 

All Volumes are currently owned by Controller in Slot A

 

[root@ostcssnmdcp01 stornext]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show storageArray preferredVolumeOwners;"
Volume name: TRAY_85_VOL_1
              Preferred owner is controller in slot: A
Volume name: TRAY_85_VOL_2
              Preferred owner is controller in slot: A
Volume name: TRAY_85_VOL_3
              Preferred owner is controller in slot: B
Volume name: TRAY_85_VOL_4
              Preferred owner is controller in slot: B
Volume name: TRAY_85_VOL_5
              Preferred owner is controller in slot: A
Volume name: TRAY_85_VOL_6
              Preferred owner is controller in slot: B

 

We can list the preferred owner of each volume using "show storageArray preferredVolumeOwners;".

 

[root@ostcssnmdcp01 stornext]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "reset storageArray volumeDistribution;"
[root@ostcssnmdcp01 stornext]# SMcli Qarray1a -n Qarray1 -p Qa@Ar39! -S -c "show storageArray volumeDistribution;"
Volume name: TRAY_85_VOL_1
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_2
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_3
              Current owner is controller in slot: B
Volume name: TRAY_85_VOL_4
              Current owner is controller in slot: B
Volume name: TRAY_85_VOL_5
              Current owner is controller in slot: A
Volume name: TRAY_85_VOL_6
              Current owner is controller in slot: B

 

After resetting the volume distribution and reviewing it again, all volumes are back on their preferred owners.
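
After the redistribution, the overall array status can be re-checked to confirm the Recovery Guru condition has cleared and the amber fault LED goes out; a sketch assuming the "show storageArray healthStatus" SMcli command is supported by this firmware release (replace the password placeholder with the array password):

    # Confirm the array reports an optimal status after the volumes moved back.
    SMcli Qarray1a -n Qarray1 -p <password> -S -c "show storageArray healthStatus;"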

 

 

## 3 ## Root Cause:

What Caused the Problem?

There is a problem accessing the controller listed in the Recovery Guru Details area. Any volumes that have this controller assigned as their preferred path will be moved to the non-preferred path (alternate controller). 

Possible causes include:

  • The controller failed a manually initiated diagnostic test and was placed Offline.
  • The controller was manually placed Offline using the Hardware > Controller > Advanced > Place > Offline menu option.
  • There are disconnected or faulty cables.
  • A Hub or Fabric switch is not functioning properly.
  • A host adapter has failed.
  • The storage array contains a defective RAID controller.

 

Let's take a look at the NetApp KB article:

EF540 errors 'Volume not on preferred path due to AVT/RDAC failover'

Symptoms

EF540 reports the following error on three occasions and removes the volume from the preferred controller:

'Volume not on preferred path due to AVT/RDAC failover'

Cause

IO shipping feature on storage causes the volumes to transfer. Lack of the MPIO driver prevents proper handling of IO down multiple paths. There are other instances that can cause this issue and this is just one of them. Proper investigation of storage and host side logs helps narrow down to this conclusion.
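
To check the host against this cause, we can verify that the multipath driver is actually installed and running; a minimal sketch, assuming a RHEL 6-era host using device-mapper-multipath (the driver actually in use depends on the host configuration):

    # Is the DM-Multipath package installed?
    rpm -q device-mapper-multipath

    # Is the multipath daemon running and enabled at boot?
    service multipathd status
    chkconfig --list multipathd

    # Do all LUNs show healthy paths to both controllers?
    multipath -ll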

 

With the information available, we can't clearly state what the root cause was, since the host logs for that date and time are missing.

From the LSI collect we can rule out a faulty battery or a faulty controller; this appears to have been a one-time incident, most likely triggered by the IO shipping feature.

If this condition re-occurs, all relevant logs need to be collected in a timely manner, and if the root cause remains unknown, an escalation should be considered.

 

 

What we learn from this case:

  • How to manually run the LSI collect if it is missing from the Capture State
  • Using the Recovery Guru to identify and resolve the error condition
  • Showing and redistributing volumes to their preferred owner using SMcli


