High Availability Troubleshooting
This section presents processes for troubleshooting various issues pertinent to HA conversion and management.
For an in-depth look at HA systems and operation, see the topic High Availability Systems in the StorNext 5 User's Guide.

In some circumstances (such as when the secondary machine is down or otherwise unreachable via the network), the Enter Config Mode button on the Tools > High Availability > Manage page may be unavailable. Specifically, StorNext grays out this button if a Failover license does not exist or the StorNext HA Manager (snhamgr) is not in the default mode.
Note: Beginning with StorNext 4.0, a Failover (HA) license is required for HA. If you do not have the Failover (HA) license installed/enabled, you cannot perform HA tasks.
If the Enter Config Mode button on the Tools > High Availability > Manage page is unavailable (grayed out), perform the following task to reactivate it:
Note: To reactivate the button, the snhamgr daemons on the two nodes must be able to reach each other, and the local node must be in "Default" mode or the cluster must be in "single-locked" mode.
Caution: The following task requires executing commands from the command line interface (CLI). If you are not familiar or comfortable with executing CLI commands, do not attempt this task without assistance from Quantum technical support.
| Reasons why snhamgr is not in Default mode | Task |
|---|---|
| The local snhamgr daemon is not running | Check the StorNext HA Manager status by running the command snhamgr status. If the primary node's mode is "Error," check whether the local snhamgr daemon is running. If it is not, start it by running the command /etc/init.d/snhamgr start. |
| The secondary node is powered off | If the secondary node is powered off, power it on. |
| The secondary node's snhamgr daemon is not running | If the secondary node's snhamgr daemon is not running, start it by running the command /etc/init.d/snhamgr start. |
| The communication between the primary and secondary nodes fails | Make sure network communication between the two nodes works by running the command ping from the primary node to the secondary, and then from the secondary to the primary. Resolve any network problems if a node is not reachable. |
| The current node is already in Config mode | |
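The checks in the table above can also be run from the CLI. The following is a minimal sketch; "peer-node" is a placeholder for the other node's host name or IP address.

```sh
# Check the StorNext HA Manager status; an "Error" mode on the primary
# often means the local snhamgr daemon is not running.
snhamgr status

# If the local snhamgr daemon is not running, start it.
/etc/init.d/snhamgr start

# Verify network connectivity between the nodes (run from each node;
# replace "peer-node" with the other node's host name or IP address).
ping -c 4 peer-node
```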

This section describes how to determine whether an HA conversion has failed, and what you can do to remedy the situation.

Normally, the response from the HA conversion operation indicates whether the conversion succeeded. When you perform the HA conversion through the StorNext GUI, the message "Node Converted." appears if the conversion completed successfully. There are typically two responses indicating that an HA conversion operation failed:
- An error message is displayed during the HA conversion process, similar to this: "Failed to convert node. Error in HA conversion process: Conversion failed".
- During the conversion, the GUI enters a special HA Report page which displays the content of the log file /usr/adic/tomcat/logs/stornextgui.log.

For failure case 1, it is very likely that the shared file system was not mounted, or that a wrong file system was mounted as the shared file system during the conversion of the secondary node, so the syncha.pl script failed.
There are several possible reasons that can cause a failed HA conversion, including the following (see the CLI sketch after this list):
- The LUNs used in the file system stripe groups are not visible. Run the command cvlabel -l to make sure all LUNs shown in the .cfgx configuration file are available.
- The fsnameservers configuration file doesn't have proper name servers, or the name servers are not reachable.
- The dpserver file is misconfigured. If a Distributed LAN (DLC) server and a virtual IP (vIP) are configured, make sure you have configured the IP address in the dpserver.fsname configuration file.
- The central control configuration does not allow the node to mount. Make sure it is allowed.
- A wrong file system was chosen for the conversion. This is unlikely when converting to HA through the StorNext GUI.
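The sketch below shows how a few of these checks might be run from the CLI. It assumes the default /usr/cvfs/config location for the fsnameservers file; paths may differ on your system.

```sh
# List the labeled LUNs visible to this node and compare them against the
# stripe group definitions in the file system's .cfgx configuration file.
cvlabel -l

# Review the configured name servers (assumes the default location).
cat /usr/cvfs/config/fsnameservers

# Check that each name server listed in the file is reachable;
# blank lines and comment lines are skipped.
while read -r ns _; do
  case "$ns" in ""|"#"*) continue ;; esac
  ping -c 2 "$ns"
done < /usr/cvfs/config/fsnameservers
```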
You can run the cvadmin command to review the file systems that have been started. If cvfs was not started, start it now, and then try to mount the shared file system and the other file systems. If you cannot view the file systems, check the fsnameservers configuration and whether the name servers are reachable. If you cannot mount the file systems, check the system log messages and nssdbg.out to determine the root cause, and fix it.
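A minimal CLI sketch of those checks follows; the cvadmin -e option for running a single command non-interactively and the mount point shown are assumptions to adapt to your environment.

```sh
# List the file systems (FSMs) that are currently started; if the -e option
# is not available in your release, run cvadmin interactively instead.
cvadmin -e select

# If cvfs was not started, start it, then try mounting the shared file system.
service cvfs start
mount /stornext/shared_fs    # placeholder mount point for the shared file system

# If activation or mounting fails, review the name server debug log.
less /usr/cvfs/debug/nssdbg.out
```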
In a rare situation, the LUNs might be slow to respond to disk I/O operations, so that a reset occurred during the copy operations in syncha.pl. You can determine this from the reset log file /usr/cvfs/debug/smithlog and the syncha log file /usr/adic/HAM/syncha.pl.log. The reset log file shows whether a reset occurred last time, and the syncha log file shows whether the HA conversion completed last time. In this case, you may need to increase the value of ha_smith_interval in the configuration file /usr/cvfs/config/ha_smith_interval on both servers.
Note: You should be very careful when changing this configuration file. Consult a Quantum technical representative before making any changes.
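If Quantum support advises increasing the interval, the change might look like the sketch below. It assumes the file holds a single numeric value (in seconds); the value shown is purely illustrative, not a recommendation.

```sh
# View the current setting, if the file exists (assumed to be a single
# numeric value, in seconds).
cat /usr/cvfs/config/ha_smith_interval

# Illustrative only: write a larger value on BOTH servers, using the value
# provided by Quantum technical support.
echo 60 > /usr/cvfs/config/ha_smith_interval
```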
These log files may help you figure out the root cause:

| Log | Description |
|---|---|
| /usr/adic/HA/cnvt2ha.log | This file has the log information generated by the cnvt2ha.sh script. The StorNext GUI mainly calls this script to perform HA conversion. |
| /usr/adic/HAM/syncha.pl.log | This file has the log information generated by the syncha.pl script. It is mainly called by cnvt2ha.sh to create a mount point for the shared file system; create cron jobs to perform periodic synchronization of StorNext configuration files to and from the mirror directory on the shared file system; mount the shared file system if it is not mounted; install and update the mirror directory; and relocate the shared configuration files. |
| /usr/adic/tomcat/logs/stornextgui.log | This is the log file generated by the StorNext GUI. |
| /usr/cvfs/debug/nssdbg.out | This is the log file generated by fsmpm and nss. |
These other log files may also be helpful:

| Log | Description |
|---|---|
| /usr/cvfs/debug/snactivated.fsname.log | This log contains information about when a file system is activated and the snactivated.pl script is called. |
| /usr/cvfs/debug/mount.fsname.out | This log contains the mount information for the file system. |
| /usr/cvfs/debug/smithlog | This log contains potential system reset (SMITH) information. |
| /usr/cvfs/data/fsname/log/cvlog | This log contains the file system (FSM) log information. |
| /var/log/messages | This log contains system information. |
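For example, the logs above can be reviewed from the CLI as sketched below; "fsname" is a placeholder for the shared file system's name.

```sh
# HA conversion and synchronization logs.
tail -n 100 /usr/adic/HA/cnvt2ha.log
tail -n 100 /usr/adic/HAM/syncha.pl.log

# StorNext GUI, name server, and reset (SMITH) logs.
grep -i error /usr/adic/tomcat/logs/stornextgui.log
tail -n 100 /usr/cvfs/debug/nssdbg.out
tail -n 50 /usr/cvfs/debug/smithlog

# Per-file-system logs; replace "fsname" with the file system's name.
tail -n 100 /usr/cvfs/debug/snactivated.fsname.log
tail -n 100 /usr/cvfs/data/fsname/log/cvlog

# General system messages.
tail -n 100 /var/log/messages
```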
For failure case 2, refer to the section "Troubleshooting HA Report page" to resolve the problem. If you still cannot find the root cause of the HA conversion or HA-related operation failure, contact Quantum technical support for assistance.

After you have identified the root cause of the failure and resolved it, go back to the HA conversion page and continue the HA conversion.

This section describes when the StorNext GUI enters a special HA Report page, and how to resolve the problem and exit from this state. You may encounter a special HA Report page that displays the message, "Your HA system is in a state requiring user intervention. Please click the following link to view the troubleshooting guide."
This can happen during HA conversion or after the HA conversion is complete. The StorNext GUI enters this page when it cannot determine which node is the primary server. Here are the possible cases that can cause this page to appear:
- Snhamgr communication failure, probably because the snhamgr daemon is not running.
- Both servers' cvfs have been stopped.
- One server's cvfs is stopped, and the other server is out of service or in peerdown mode.
- One server's cvfs failed to start and the touch file /usr/cvfs/install/.ha_idle_failed_startup was created, which blocks further startup of cvfs, while the other server is out of service or in peerdown mode.
- Both servers' cvfs failed to start due to the touch file /usr/cvfs/install/.ha_idle_failed_startup.

The touch file /usr/cvfs/install/.ha_idle_failed_startup is created when the shared file system activation fails and a reset occurs. This file blocks any subsequent StorNext startup attempts so that a system administrator can become involved and fix the problem. You can find this failed startup status by running the command snhamgr status. Executing the command should show the following:
LocalMode=failed_startup LocalStatus=unknown
There can be many reasons for the shared file system startup to fail. Below are a few of them (see the sketch after this list):
- The shared file system was not shut down properly beforehand.
- The file /usr/cvfs/config/license.dat lacks the proper licenses.
- A Storage Manager component failed to start.
- Certain file operations failed (for example, failure to create the touch file /usr/adic/install/.active_snfs_server).
- Virtual IP (vIP) startup failed.
- The command "snhamgr --primary" failed.
The log files snactivated.fsname.log, nssdbg.out, and smithlog can help you troubleshoot the root cause. After you resolve the problem, run the command snhamgr clear to remove the touch file /usr/cvfs/install/.ha_idle_failed_startup. Then run the command service cvfs start to start cvfs.
Run the command snhamgr status to check the cluster's status and mode, and bring the cluster to a state where one server is in the "primary" state. If snhamgr returns "error" mode, make sure the snhamgr daemon is running properly. If both servers were stopped, run the command service cvfs start to bring up at least one server. Once one server is in the "primary" state, the StorNext GUI displays the proper page after a short while. You may also click Refresh to exit the report page sooner.
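Assembled from the commands above, the overall recovery sequence might look like this sketch (run it only after the underlying problem has been fixed):

```sh
# Remove the failed-startup touch file once the root cause is resolved.
snhamgr clear

# Start cvfs on at least one server.
service cvfs start

# Confirm that one node reports the "primary" state.
snhamgr status
```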