High Availability Troubleshooting
This section presents processes for troubleshooting various issues pertinent to HA conversion and management.
For an in-depth look at HA systems and operation, see the topic High Availability Systems in the StorNext 5 User's Guide.

In some circumstances (such as when the secondary machine is down or otherwise unreachable via the network), the Enter Config Mode button on the Tools > High Availability > Manage page may be unavailable. Specifically, StorNext grays out this button if a Failover license does not exist or the StorNext HA Manager (snhamgr) is not in the default mode.
Note: Beginning with StorNext 4.0, a Failover (HA) license is required for HA. If you do not have the Failover (HA) license installed/enabled, you cannot perform HA tasks.
If the Enter Config Mode button on the Tools > High Availability > Manage page is unavailable (grayed out), perform the following task to reactivate it:
Note: To reactivate the button, the snhamgr daemons on the two nodes must be able to reach each other, and the local node must be in "Default" mode or the cluster must be in "single-locked" mode.
Caution: The following task requires executing commands from the command line interface (CLI). If you are not familiar or comfortable with executing CLI commands, do not attempt this task without assistance from Quantum technical support.
| Reasons why snhamgr is not in Default mode | Task |
|---|---|
| The local snhamgr daemon is not running | Check the StorNext HA Manager status by running the command snhamgr status. If the primary node's mode is "Error," check whether the local snhamgr daemon is running. If it is not, start it by running the command /etc/init.d/snhamgr start. |
| The secondary node is powered off | If the secondary node is powered off, power it on. |
| The secondary node's snhamgr daemon is not running | If the secondary node's snhamgr daemon is not running, start it by running the command /etc/init.d/snhamgr start. |
| The communication between the primary and secondary nodes fails | Make sure network communication between the two nodes works by running the command ping from the primary node to the secondary, and then from the secondary to the primary. Resolve any network problems if a node is not reachable. |
| The current node is already in Config mode | |
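The checks in the table above can also be run from the CLI. The following is a minimal sketch; "peer-node" is a placeholder for the other node's host name or IP address.

```sh
# Check the StorNext HA Manager status; an "Error" mode on the primary
# often means the local snhamgr daemon is not running.
snhamgr status

# If the local snhamgr daemon is not running, start it.
/etc/init.d/snhamgr start

# Verify network connectivity between the nodes (run from each node;
# replace "peer-node" with the other node's host name or IP address).
ping -c 4 peer-node
```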

This section describes how to determine whether an HA conversion has failed, and what you can do to remedy the situation.

Normally, the response from the HA conversion operation indicates whether the conversion succeeded. When you perform the HA conversion through the StorNext GUI, the message "Node Converted." appears if the conversion completed successfully. There are typically two responses indicating that an HA conversion operation failed:
- An error message is displayed during the HA conversion process, similar to this: "Failed to convert node. Error in HA conversion process: Conversion failed".
- During the conversion, the GUI enters a special HA Report page which displays the content of the log file /usr/adic/tomcat/logs/stornextgui.log.

For failure case 1, it is very likely that the shared file system was not mounted, or that a wrong file system was mounted as the shared file system during the conversion of the secondary node, so the syncha.pl script failed.
There are several possible reasons that can cause a failed HA conversion, including the following (see the CLI sketch after this list):
- The LUNs used in the file system stripe groups are not visible. Run the command cvlabel -l to make sure all LUNs shown in the .cfgx configuration file are available.
- The fsnameservers configuration file doesn't have proper name servers, or the name servers are not reachable.
- The dpserver file is misconfigured. If a Distributed LAN (DLC) server and a virtual IP (vIP) are configured, make sure you have configured the IP address in the dpserver.fsname configuration file.
- The central control configuration does not allow the node to mount. Make sure it is allowed.
- A wrong file system was chosen for the conversion. This is unlikely when converting to HA through the StorNext GUI.
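The sketch below shows how a few of these checks might be run from the CLI. It assumes the default /usr/cvfs/config location for the fsnameservers file; paths may differ on your system.

```sh
# List the labeled LUNs visible to this node and compare them against the
# stripe group definitions in the file system's .cfgx configuration file.
cvlabel -l

# Review the configured name servers (assumes the default location).
cat /usr/cvfs/config/fsnameservers

# Check that each name server listed in the file is reachable;
# blank lines and comment lines are skipped.
while read -r ns _; do
  case "$ns" in ""|"#"*) continue ;; esac
  ping -c 2 "$ns"
done < /usr/cvfs/config/fsnameservers
```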
You can run the cvadmin command to review the file systems that have been started. If cvfs was not started, start it now, and then try to mount the shared file system and the other file systems. If you cannot view the file systems, check the fsnameservers configuration and whether the name servers are reachable. If you cannot mount the file systems, check the system log messages and nssdbg.out to determine the root cause, and fix it.
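A minimal CLI sketch of those checks follows; the cvadmin -e option for running a single command non-interactively and the mount point shown are assumptions to adapt to your environment.

```sh
# List the file systems (FSMs) that are currently started; if the -e option
# is not available in your release, run cvadmin interactively instead.
cvadmin -e select

# If cvfs was not started, start it, then try mounting the shared file system.
service cvfs start
mount /stornext/shared_fs    # placeholder mount point for the shared file system

# If activation or mounting fails, review the name server debug log.
less /usr/cvfs/debug/nssdbg.out
```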
In a rare situation, the LUNs might be slow to respond to disk I/O operations, so that a reset occurred during the copy operations in syncha.pl. You can determine this from the reset log file /usr/cvfs/debug/smithlog and the syncha log file /usr/adic/HAM/syncha.pl.log. The reset log file shows whether a reset occurred last time, and the syncha log file shows whether the HA conversion completed last time. In this case, you may need to increase the value of ha_smith_interval in the configuration file /usr/cvfs/config/ha_smith_interval on both servers.
Note: You should be very careful when changing this configuration file. Consult a Quantum technical representative before making any changes.
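If Quantum support advises increasing the interval, the change might look like the sketch below. It assumes the file holds a single numeric value (in seconds); the value shown is purely illustrative, not a recommendation.

```sh
# View the current setting, if the file exists (assumed to be a single
# numeric value, in seconds).
cat /usr/cvfs/config/ha_smith_interval

# Illustrative only: write a larger value on BOTH servers, using the value
# provided by Quantum technical support.
echo 60 > /usr/cvfs/config/ha_smith_interval
```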
These log files may help you figure out the root cause:

| Log | Description |
|---|---|
| /usr/adic/HA/cnvt2ha.log | This file has the log information generated by the cnvt2ha.sh script. The StorNext GUI mainly calls this script to perform HA conversion. |
| /usr/adic/HAM/syncha.pl.log | This file has the log information generated by the syncha.pl script. It is mainly called by cnvt2ha.sh to create a mount point for the shared file system; create cron jobs to perform periodic synchronization of StorNext configuration files to and from the mirror directory on the shared file system; mount the shared file system if it is not mounted; install and update the mirror directory; and relocate the shared configuration files. |
| /usr/adic/tomcat/logs/stornextgui.log | This is the log file generated by the StorNext GUI. |
| /usr/cvfs/debug/nssdbg.out | This is the log file generated by fsmpm and nss. |
These other log files may also be helpful:

| Log | Description |
|---|---|
| /usr/cvfs/debug/snactivated.fsname.log | This log contains information about when a file system is activated and the snactivated.pl script is called. |
| /usr/cvfs/debug/mount.fsname.out | This log contains the mount information for the file system. |
| /usr/cvfs/debug/smithlog | This log contains potential system reset (SMITH) information. |
| /usr/cvfs/data/fsname/log/cvlog | This log contains the file system (FSM) log information. |
| /var/log/messages | This log contains system information. |
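For example, the logs above can be reviewed from the CLI as sketched below; "fsname" is a placeholder for the shared file system's name.

```sh
# HA conversion and synchronization logs.
tail -n 100 /usr/adic/HA/cnvt2ha.log
tail -n 100 /usr/adic/HAM/syncha.pl.log

# StorNext GUI, name server, and reset (SMITH) logs.
grep -i error /usr/adic/tomcat/logs/stornextgui.log
tail -n 100 /usr/cvfs/debug/nssdbg.out
tail -n 50 /usr/cvfs/debug/smithlog

# Per-file-system logs; replace "fsname" with the file system's name.
tail -n 100 /usr/cvfs/debug/snactivated.fsname.log
tail -n 100 /usr/cvfs/data/fsname/log/cvlog

# General system messages.
tail -n 100 /var/log/messages
```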
For failure case 2, refer to the section "Troubleshooting HA Report page" to resolve the problem. If you still cannot find the root cause of the HA conversion or HA-related operation failure, contact Quantum technical support for assistance.

After you have identified the root cause of the failure and resolved it, go back to the HA conversion page and continue the HA conversion.

This section describes when the StorNext GUI enters a special HA Report page, and how to resolve the problem and exit from this state. You may encounter a special HA Report page that displays the message, "Your HA system is in a state requiring user intervention. Please click the following link to view the troubleshooting guide."
This can happen during HA conversion or after the HA conversion is complete. The StorNext GUI enters this page when it cannot determine which node is the primary server. Here are the possible cases that can cause this page to appear:
- Snhamgr communication failure, probably because the snhamgr daemon is not running.
- Both servers' cvfs have been stopped.
- One server's cvfs is stopped, and the other server is out of service or in peerdown mode.
- One server's cvfs failed to start and the touch file /usr/cvfs/install/.ha_idle_failed_startup was created, which blocks further startup of cvfs, while the other server is out of service or in peerdown mode.
- Both servers' cvfs failed to start due to the touch file /usr/cvfs/install/.ha_idle_failed_startup.

The touch file /usr/cvfs/install/.ha_idle_failed_startup is created when the shared file system activation fails and a reset occurs. This file blocks any subsequent StorNext startup attempts so that a system administrator can become involved and fix the problem. You can find this failed startup status by running the command snhamgr status. Executing the command should show the following:
LocalMode=failed_startup LocalStatus=unknown
There can be many reasons for the shared file system startup to fail. Below are a few of them (see the sketch after this list):
- The shared file system was not shut down properly beforehand.
- The file /usr/cvfs/config/license.dat lacks the proper licenses.
- A Storage Manager component failed to start.
- Certain file operations failed (for example, failure to create the touch file /usr/adic/install/.active_snfs_server).
- Virtual IP (vIP) startup failed.
- The command "snhamgr --primary" failed.
The log files snactivated.fsname.log, nssdbg.out, and smithlog can help you troubleshoot the root cause. After you resolve the problem, run the command snhamgr clear to remove the touch file /usr/cvfs/install/.ha_idle_failed_startup. Then run the command service cvfs start to start cvfs.
Run the command snhamgr status to check the cluster's status and mode, and bring the cluster to a state where one server is in the "primary" state. If snhamgr returns "error" mode, make sure the snhamgr daemon is running properly. If both servers were stopped, run the command service cvfs start to bring up at least one server. Once one server is in the "primary" state, the StorNext GUI displays the proper page after a short while. You may also click Refresh to exit the report page sooner.
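Assembled from the commands above, the overall recovery sequence might look like this sketch (run it only after the underlying problem has been fixed):

```sh
# Remove the failed-startup touch file once the root cause is resolved.
snhamgr clear

# Start cvfs on at least one server.
service cvfs start

# Confirm that one node reports the "primary" state.
snhamgr status
```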