Troubleshooting HA

Question: How can I restart a file system without causing an HA failover?

Answer: To be clear, individual file-system failover must be distinguished from HA Reset of an entire MDC. When redundant FSMs are running on both MDCs in an HA Cluster, the active FSM can be on either MDC. In the case of managed file systems, the FSMs are started only on the Primary MDC, so these can be stopped and started at will without causing an HA Reset. Unmanaged file-system FSMs are started on both MDCs, so stopping an active unmanaged FSM will result in a single file system failover to the standby FSM on the peer MDC. An HA Reset occurs only when the failover is putting the file system in danger of data corruption from simultaneous write access to StorNext metadata and databases. This is typically the case for the HaShared file system, so take extra care with its FSM.

The recommended way for making configuration changes and restarting FSMs is to use the 'config' mode, which stops CVFS on one MDC and disables HA Reset on the other. CVFS will be restarted when returning to 'default' mode with both MDCs operating redundantly.

Run the following command at the CLI:

snhamgr config
Make your configuration changes, and then run the following command:

snhamgr start

If you are only restarting FSMs without making configuration changes, the following steps will restart an FSM. To restart an HaManaged FSM, use this cvadmin command:

fail <file system name>

To restart an HaUnmanaged FSM or the HaShared FSM:

snhamgr mode=locked # on the secondary

snhamgr mode=single # on the primary

cvadmin # on the primary

fail <file system name>

select # repeat until you observe the FSM has started and activated

snhamgr start # on the primary

Question: What Conditions Trigger a Failover in StorNext (File System only)

Answer: There could be several reasons why a failover is triggered. See HA Resets in the HA topic.

Question: What conditions trigger the voting process for StorNext file system failover?

Answer: Either a StorNext File System client or a Node Status Service (NSS) coordinator (the systems listed in the fsnameservers file) can initiate a vote.

An SNFS client triggers a vote when its TCP connection to a File System Manager (FSM) is disconnected. In many failure scenarios this loss of TCP connectivity is immediate, so it is often the primary initiator of a vote.

On Windows systems, StorNext provides a configuration option called Fast Failover that triggers a vote as a result of a 3 second FSM heartbeat loss. Occasionally, this is necessary because TCP disconnects can be delayed. There is also an NSS heartbeat between members and coordinators every half second. The NSS coordinator triggers a vote if the NSS heartbeat is absent for an FSM server for three seconds. Because the client triggers usually occur first, the coordinator trigger is not commonly seen.

Question: Why does the Primary MDC keep running without the HaShared file system failing over and without an HA Reset when I pull its only Ethernet cable? The HA Cluster appears to be hung.

In this situation the lab configuration is as follows:

MDC 1:

Hostname Shasta

10.35.1.110

MDC 2:

Hostname Tahoe

10.35.1.12

Two File Systems:

HaShared type: HAFS

HaManaged type: Reno3

There are no other client computers.

Shasta is the Primary MDC before the Ethernet cable is pulled.

At one point after the Ethernet was pulled, cvadmin on Tahoe showed:

Tahoe:/usr/cvfs/config # cvadmin

StorNext Administrator

Enter command(s)

For command help, enter "help" or "?".

List FSS

File System Services (* indicates service is in control of FS):

1>*HAFS[0] located on tahoe:50139 (pid 13326)

snadmin> select FSM "HAFS"

Admin Tap Connection to FSM failed: [errno 104]: Connection reset by peer

FSM may have too many connections active.

Cannot select FSS "HAFS"

snadmin> start reno3

Start FSS "reno3"

Cannot start FSS 'reno3' - failed (FSM cannot start on non-Primary server)

snadmin> activate reno3

Activate FSM "reno3"

Could not find File System Manager for "reno3" on Tahoe.

Cannot activate FSS reno3

Answer: The reason the failover and HA Reset did not occur is because the HaShared FSM on Shasta continues to be active, and this was detected in the ARB block through the SAN by the FSM on Tahoe.

Here's why. When the LAN connection is lost on Shasta, its active HaShared FSM continues to have one client: the Shasta MDC itself. On Tahoe, an election is held when the LAN heartbeats from Shasta's HAFS FSM stop, and Tahoe's FSM gets one vote from the client on Tahoe. The Tahoe FSM is told to activate, but cannot usurp the ARB with a 1-to-1 tie. So, it tries five times, then exits, and a new FSM is started in its place. You can observe this by running the cvadmin command and watching the FSM's PID change every 20 seconds or so.

In StorNext 4.x HA allows HaUnmanaged FSMs to failover without resetting the MDC if possible, and HaManaged FSMs do not fail over because they are only started on the primary MDC.

Starting with StorNext 4.x, HA requires configuring at least one more client (other than the MDCs) of the HaShared file system to break the tie. This allows StorNext to determine which MDC has LAN connectivity, and to elect its HaShared FSM with enough votes to usurp control. When an HA Cluster is configured this way, the disconnected MDC (Shasta) will reset because of the usurpation of the HaShared ARB.

After the reboot, CVFS will restart and attempt to elect its HaShared FSM because it is not getting heartbeats from its peer. However, these activation attempts fail to cause a second reset because the HaShared FSM never has enough votes to have a successful usurpation. (You can watch it repeatedly fail to usurp if you can get on the console and run the cvadmin command).

But what about the HaManaged Reno3 FSM? HaManaged FSMs are not started until the HaShared FSM activates and puts the MDC in Primary status. You can observe these blocked HaManaged FSMs with the cvadmin 'fsmlist' command, which displays the local FSMPM's internal FSM and process table. A remote FSMPM's table can also be viewed with 'fsmlist on <MDC name or address>'.

Finally, the message: 'Admin Tap Connection to FSM failed', is an intermittent response that occurred because the timing of the cvadmin select command was during the period after the FSM failed the fifth usurpation attempt and before the FSM was restarted (a ten-second delay). At other times, the response will show an activating FSM. Note that the cvadmin-displayed asterisk simply indicates that the FSM has been told to activate, not that is has been successful at usurpation and activation.

Question: Using the same configuration above (Shasta and Tahoe), an HA Reset occurs if I pull the fibre connection from Shasta when it is the Primary MDC, but it takes 30-40 seconds. Why does it take so long?

Answer: When the fibre connection is lost, Shasta's FSMs cannot maintain their brands on their ARB blocks, so the HA timers do not get restarted in the read, write, restart-timer ARB branding loop. After five seconds the timers would expire and reset the MDC. However, there is a second method for resetting the timers that uses the LAN.

Every two seconds, the FSMPM on an MDC with active HA monitored FSMs sends a request to its peer FSMPM with the list of locally active FSMs. The peer gives permission to reset those FSMs' HA timers if it is not activating them, and promises not to activate them for two seconds. When the reply returns within one second of the request, the timers are reset by the FSMPM. This allows the cluster to ride through brief periods when LUNs are too busy to maintain the ARB brand, but otherwise are operating correctly.

So why does the reset occur after 30-40 seconds? After this delay, the HBA returns errors to the FSM, and the FSM quits. When the HaShared FSM quits with the file system mounted locally, an HA Reset occurs to protect databases for the Storage Manager etc.

Question: How do I resolve a StorNext GUI login issue in my high availability environment?

Answer: When CVFS is running on only one MDC of an HA Cluster, attempting to connect a browser to the down MDC’s GUI produces a single page with a URL to the running MDC. Simply click the URL and login again.

When CVFS is down on both MDCs, the GUIs on both MDCs present a set of four troubleshooting pages. You can start CVFS from the CLI by running the following command: service cvfs start

Or, you can use the StorNext GUI’s Manage HA page and click the Enter Config Mode button, and then click the Exit Config Mode button. When the second step has completed, the HA Cluster will be running in Default-Default mode with MDC redundancy.

Question: The Secondary HA MDC system is in locked and stopped mode, as seen from the primary HA MDC. How can the secondary HA MDC be restored to the default mode of operation?

# /usr/adic/DSM/bin/snhamgr status

LocalMode=config

LocalStatus=primary

RemoteMode=locked

RemoteStatus=stopped

Answer: Do the following to restore a secondary HA MDC to the default mode of operation. If the primary HA MDC has status LocalMode=default and LocalStatus=primary, then go to Step 6.

On the primary MDC, verify the file system availability before exiting Config mode:

# /usr/adic/DSM/bin/cvadmin

Verify that all files systems listed in the fsmlist file are listed and have an asterisk (*) displayed to signify the primary MDC has activated its FSMs. If not, within cvadmin run:

snadmin> disks refresh

snadmin> select

If any of the file systems are not listed or do not display as activated with an asterisk (*), resolve this before making any other changes to the HA modes.
On the primary MDC, verify that the HaShared file system is mounted:

# /bin/mount | grep HAM

Example output:

/dev/cvfsctl1_HAFS on /usr/adic/HAM/shared type cvfs (rw,sparse=yes)
On the primary MDC, verify you can write and read to the HaShared file system:

# date > /usr/adic/HAM/shared/test_1.tmp

# cat /usr/adic/HAM/shared/test_1.tmp

Example output (the current date is displayed):

Wed Feb 17 10:15:10 CST 2021
On the primary MDC, remove the file from Step 3:

# rm /usr/adic/HAM/shared/test_1.tmp

Continue the steps below when you are able to successfully write and read to the file system.
On the primary MDC, set the HA mode back to default mode of operation (you might receive numerous output):

Note: This step restarts StorNext, and may prevent clients from working for an extended period of time.

# cd /usr/adic/DSM/bin/

# ./snhamgr mode=default

LocalMode=default

LocalStatus=stopped

RemoteMode=locked

RemoteStatus=stopped
1. On the primary MDC, start the SNFS services:
  
  # /sbin/service cvfs start
Repeat Step 1, Step 2, and Step 3 to verify file system functionality. If this passes then continue to next step.
On the secondary MDC, change the state of the backup server snhamgr to default (you might receive numerous output):

# cd /usr/adic/DSM/bin/

# ./snhamgr mode=default

LocalMode=default

LocalStatus=stopped

RemoteMode=default

RemoteStatus=primary
1. On the secondary MDC, start the SNFS services:
  
  # /sbin/service cvfs start
On the secondary MDC, verify the HA status:

# cd /usr/adic/DSM/bin

# ./snhamgr status

LocalMode=default

LocalStatus=running

RemoteMode=default

RemoteStatus=primary