Troubleshooting HA
This section contains troubleshooting suggestions for issues which pertain to StorNext HA (high availability) systems. For an in-depth look at HA systems and operation, see High Availability Systems.

Answer: To be clear, individual file-system failover must be distinguished from HA Reset of an entire MDC. When redundant FSMs are running on both MDCs in an HA Cluster, the active FSM can be on either MDC. In the case of managed file systems, the FSMs are started only on the Primary MDC, so these can be stopped and started at will without causing an HA Reset. Unmanaged file-system FSMs are started on both MDCs, so stopping an active unmanaged FSM will result in a single file system failover to the standby FSM on the peer MDC. An HA Reset occurs only when the failover is putting the file system in danger of data corruption from simultaneous write access to StorNext metadata and databases. This is typically the case for the HaShared file system, so take extra care with its FSM.
The recommended way for making configuration changes and restarting FSMs is to use the 'config' mode, which stops CVFS on one MDC and disables HA Reset on the other. CVFS will be restarted when returning to 'default' mode with both MDCs operating redundantly.
1. Run the following command at the CLI:
snhamgr config
2. Make your configuration changes, and then run the following command:
snhamgr start
If you are only restarting FSMs without making configuration changes, the following steps will restart an FSM.
To restart an HaManaged FSM, use this cvadmin command:
fail <file system name>
To restart an HaUnmanaged FSM or the HaShared FSM:
snhamgr mode=locked # on the secondary
snhamgr mode=single # on the primary
cvadmin # on the primary
fail <file system name>
select # repeat until you observe the FSM has started and activated
snhamgr start # on the primary
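After snhamgr start completes, you can confirm that both MDCs are operating redundantly again by checking the HA manager status. The output below is illustrative; the field values depend on which MDC you run the command from and which MDC activated as Primary:
# /usr/adic/DSM/bin/snhamgr status
LocalMode=default
LocalStatus=primary
RemoteMode=default
RemoteStatus=running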

Answer: There could be several reasons why a failover is triggered. See HA Resets in the HA topic.

Answer: Either a StorNext File System client or a Node Status Service (NSS) coordinator (the systems listed in the fsnameservers file) can initiate a vote.
An SNFS client triggers a vote when its TCP connection to a File System Manager (FSM) is disconnected. In many failure scenarios this loss of TCP connectivity is immediate, so it is often the primary initiator of a vote.
On Windows systems, StorNext provides a configuration option called Fast Failover that triggers a vote as a result of a 3-second FSM heartbeat loss. Occasionally, this is necessary because TCP disconnects can be delayed. There is also an NSS heartbeat between members and coordinators every half second. The NSS coordinator triggers a vote if the NSS heartbeat is absent for an FSM server for three seconds. Because the client triggers usually occur first, the coordinator trigger is not commonly seen.
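To see which systems are acting as NSS coordinators for a cluster, you can list the fsnameservers file on an MDC or client. The path below is the usual Linux location, and the two addresses shown are simply the MDC addresses from the lab configuration described next:
# cat /usr/cvfs/config/fsnameservers
10.35.1.110
10.35.1.12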

In this situation the lab configuration is as follows:
MDC 1:
Hostname Shasta
10.35.1.110
MDC 2:
Hostname Tahoe
10.35.1.12
Two File Systems:
HaShared type: HAFS
HaManaged type: Reno3
There are no other client computers.
Shasta is the Primary MDC before the Ethernet cable is pulled.
At one point after the Ethernet cable was pulled, cvadmin on Tahoe showed:
Tahoe:/usr/cvfs/config # cvadmin
StorNext Administrator
Enter command(s)
For command help, enter "help" or "?".
List FSS
File System Services (* indicates service is in control of FS):
1>*HAFS[0] located on tahoe:50139 (pid 13326)
snadmin> select FSM "HAFS"
Admin Tap Connection to FSM failed: [errno 104]: Connection reset by peer
FSM may have too many connections active.
Cannot select FSS "HAFS"
snadmin> start reno3
Start FSS "reno3"
Cannot start FSS 'reno3' - failed (FSM cannot start on non-Primary server)
snadmin> activate reno3
Activate FSM "reno3"
Could not find File System Manager for "reno3" on Tahoe.
Cannot activate FSS reno3
Answer: The failover and HA Reset did not occur because the HaShared FSM on Shasta remained active, and the FSM on Tahoe detected this in the ARB block through the SAN.
Here's why. When the LAN connection is lost on Shasta, its active HaShared FSM continues to have one client: the Shasta MDC itself. On Tahoe, an election is held when the LAN heartbeats from Shasta's HAFS FSM stop, and Tahoe's FSM gets one vote from the client on Tahoe. The Tahoe FSM is told to activate, but cannot usurp the ARB with a 1-to-1 tie. So, it tries five times, then exits, and a new FSM is started in its place. You can observe this by running the cvadmin command and watching the FSM's PID change every 20 seconds or so.
In StorNext 4.x and later, HA allows HaUnmanaged FSMs to fail over without resetting the MDC when possible, and HaManaged FSMs do not fail over because they are started only on the Primary MDC.
Starting with StorNext 4.x, HA requires configuring at least one more client (other than the MDCs) of the HaShared file system to break the tie. This allows StorNext to determine which MDC has LAN connectivity, and to elect its HaShared FSM with enough votes to usurp control. When an HA Cluster is configured this way, the disconnected MDC (Shasta) will reset because of the usurpation of the HaShared ARB.
After the reboot, CVFS will restart and attempt to elect its HaShared FSM because it is not getting heartbeats from its peer. However, these activation attempts fail to cause a second reset because the HaShared FSM never has enough votes for a successful usurpation. (You can watch it repeatedly fail to usurp if you can get on the console and run the cvadmin command.)
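One way to watch this cycle from the console of the disconnected MDC (a sketch based on the cvadmin commands already shown in this section) is to list the FSSes repeatedly and note the pid changing each time a replacement FSM process is started:
cvadmin # on the disconnected MDC (Shasta in this example)
select # the FSS listing includes each FSM's pid; repeat every 20 seconds or so and watch the pid change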
But what about the HaManaged Reno3 FSM? HaManaged FSMs are not started until the HaShared FSM activates and puts the MDC in Primary status. You can observe these blocked HaManaged FSMs with the cvadmin 'fsmlist' command, which displays the local FSMPM's internal FSM and process table. A remote FSMPM's table can also be viewed with 'fsmlist on <MDC name or address>'.
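For example, from either MDC in this lab configuration (hostnames as given above; the remote form assumes the peer's name resolves):
cvadmin
fsmlist # the local FSMPM's FSM and process table
fsmlist on tahoe # the same table from the peer MDC's FSMPM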
Finally, the message 'Admin Tap Connection to FSM failed' is an intermittent response that occurred because the cvadmin select command happened to be issued during the period after the FSM failed its fifth usurpation attempt and before the replacement FSM was started (a ten-second delay). At other times, the response will show an activating FSM. Note that the asterisk displayed by cvadmin simply indicates that the FSM has been told to activate, not that it has been successful at usurpation and activation.

Answer: When the fibre connection is lost, Shasta's FSMs cannot maintain their brands on their ARB blocks, so the HA timers do not get restarted in the read, write, restart-timer ARB branding loop. After five seconds the timers would expire and reset the MDC. However, there is a second method for resetting the timers that uses the LAN.
Every two seconds, the FSMPM on an MDC with active HA monitored FSMs sends a request to its peer FSMPM with the list of locally active FSMs. The peer gives permission to reset those FSMs' HA timers if it is not activating them, and promises not to activate them for two seconds. When the reply returns within one second of the request, the timers are reset by the FSMPM. This allows the cluster to ride through brief periods when LUNs are too busy to maintain the ARB brand, but otherwise are operating correctly.
So why does the reset occur after 30-40 seconds? After this delay, the HBA returns errors to the FSM, and the FSM quits. When the HaShared FSM quits with the file system mounted locally, an HA Reset occurs to protect the StorNext metadata and Storage Manager databases.

Answer: When CVFS is running on only one MDC of an HA Cluster, attempting to connect a browser to the down MDC’s GUI produces a single page with a URL to the running MDC. Simply click the URL and log in again.
When CVFS is down on both MDCs, the GUIs on both MDCs present a set of four troubleshooting pages. You can start CVFS from the CLI by running the following command:
service cvfs start
Or, you can use the StorNext GUI’s Manage HA page and click the Enter Config Mode button, and then click the Exit Config Mode button. When the second step has completed, the HA Cluster will be running in Default-Default mode with MDC redundancy.
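Based on the config-mode procedure earlier in this section, the CLI equivalent of those two GUI buttons appears to be entering config mode and then starting the cluster again; treat this as a sketch and verify against the snhamgr documentation for your release:
snhamgr config # corresponds to Enter Config Mode
snhamgr start # corresponds to Exit Config Mode; restarts CVFS and returns the cluster to Default-Default mode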

# /usr/adic/DSM/bin/snhamgr status
LocalMode=config
LocalStatus=primary
RemoteMode=locked
RemoteStatus=stopped
Answer: Do the following to restore a secondary HA MDC to the default mode of operation. If the primary HA MDC has status LocalMode=default and LocalStatus=primary, then go to Step 6.
1. On the primary MDC, verify the file system availability before exiting Config mode:
# /usr/adic/DSM/bin/cvadmin
Verify that all file systems listed in the fsmlist file are listed and have an asterisk (*) displayed, signifying that the primary MDC has activated its FSMs. If not, run the following within cvadmin:
snadmin> disks refresh
snadmin> select
If any of the file systems are not listed or do not display as activated with an asterisk (*), resolve this before making any other changes to the HA modes.
2. On the primary MDC, verify that the HaShared file system is mounted:
# /bin/mount | grep HAM
Example output:
/dev/cvfsctl1_HAFS on /usr/adic/HAM/shared type cvfs (rw,sparse=yes)
3. On the primary MDC, verify you can write to and read from the HaShared file system:
# date > /usr/adic/HAM/shared/test_1.tmp
# cat /usr/adic/HAM/shared/test_1.tmp
Example output (the current date is displayed):
Wed Feb 17 10:15:10 CST 2021
4. On the primary MDC, remove the file from Step 3:
# rm /usr/adic/HAM/shared/test_1.tmp
Continue with the steps below only when you can successfully write to and read from the file system.
5. On the primary MDC, set the HA mode back to the default mode of operation (this command might produce a large amount of output):
Note: This step restarts StorNext, and may prevent clients from working for an extended period of time.
# cd /usr/adic/DSM/bin/
# ./snhamgr mode=default
LocalMode=default
LocalStatus=stopped
RemoteMode=locked
RemoteStatus=stopped
6. On the primary MDC, start the SNFS services:
# /sbin/service cvfs start
7. Repeat Step 1, Step 2, and Step 3 to verify file system functionality. If these checks pass, continue to the next step.
8. On the secondary MDC, change the state of the backup server snhamgr to default (this command might produce a large amount of output):
# cd /usr/adic/DSM/bin/
# ./snhamgr mode=default
LocalMode=default
LocalStatus=stopped
RemoteMode=default
RemoteStatus=primary
9. On the secondary MDC, start the SNFS services:
# /sbin/service cvfs start
10. On the secondary MDC, verify the HA status:
# cd /usr/adic/DSM/bin
# ./snhamgr status
LocalMode=default
LocalStatus=running
RemoteMode=default
RemoteStatus=primary
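As a final check, you can also confirm from the secondary MDC that the HaShared file system is mounted there as well, using the same check as in Step 2; the output should resemble the example shown in that step:
# /bin/mount | grep HAM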