HA Manager
The HA Manager subsystem collects and reports the operating status of an HA cluster and uses that information to control operations. It is part of a Storage Manager installation that has been converted to HA with the cnvt2ha.sh script. For manually configured HA clusters where the cnvt2ha.sh script has not been run, the command-line interface (snhamgr) reports a default state that allows non-HA and File System Only HA configurations to operate.
The HA Manager supports non-default HA Cluster functionality such as suspending HA monitoring during administrative tasks. It attempts to communicate with its peer at every decision point, so it is mostly stateless and functions correctly regardless of what transpires between decision points. Following every command, the snhamgr command-line interface reports the modes and statuses of both servers in the cluster, which provide the information that the StorNext control scripts need.
HA Manager Modes and Statuses
The HA Manager relies on a set of administrator-configurable modes to override the default behaviors of HA. Modes persist across reboots. Following are the modes and descriptions of their purpose:
| Mode | Description |
|---|---|
| default | HA monitoring is turned on. When the peer server is not available for communication, it is assumed to be in default mode. |
| single | HA monitoring is turned off. The peer server must be communicating and in locked mode, or not communicating and certified as peerdown (recommended). This mode is meant for extended production operations without a redundant server such as when one server is being repaired or replaced. When the peer server is about to be restored to service, the operating server can be transitioned from single to default mode without stopping StorNext. |
| config | HA monitoring is turned off. The peer server must be communicating and in locked mode (recommended), or not communicating and certified as peerdown. The config mode is meant for re-configuration and other non-production service operations. When returning to production service and the default mode, StorNext must be stopped. This ensures that all StorNext processes are started correctly upon returning to default mode. |
| locked | StorNext is stopped and prevented from starting on the local server. This mode allows the HA Manager to actively query the peer server to ensure that it is stopped when the local peer is operating in single or config mode. Communication with the locked node must continue, so this mode is effective when StorNext is stopped for a short period and the node will not be rebooted. If communication is lost, the peer node assumes this node is in default mode, which is necessary for avoiding a split-brain scenario. Caution: If a secondary MDC in locked mode is rebooted or powered down while the primary MDC is in config or single mode, |
| peerdown | The peer server is turned off and must not be communicating with the local server's HA Manager subsystem, so this mode is effective when the server is powered down. The mode is declared by the peerdown command on a working server to give information about the non-working peer server. By setting this mode, the administrator certifies the off status of the peer, which the HA Manager cannot verify by itself. This allows the local peer to be in single or config mode. If the peer starts communicating while this mode is set, the setting is immediately erased, the local mode is set to default to restore HA Monitoring, and StorNext is shut down, which can trigger an HA reset. The peerdown mode is changed to default mode with the peerup command. The peerdown and peerup commands must never be automated because they require external knowledge about the peer server's condition and operator awareness of a requirement to keep the peer server turned off. |
| ha_idle_failed_startup | A previous attempt to start StorNext with "service cvfs start" has failed before completion. Attempts to start StorNext are blocked until this status has been cleared by running "snhamgr clear". |
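The rules in the table for what the local server believes about its peer can be summarized in a small sketch. This is an illustrative model only, not StorNext code: the function name `effective_peer_mode` and its parameters are hypothetical, and the logic simply restates the table (a silent peer is assumed to be in default mode unless the administrator has certified it as peerdown).

```python
from typing import Optional


def effective_peer_mode(communicating: bool,
                        reported_mode: Optional[str] = None,
                        peerdown_certified: bool = False) -> str:
    """Return the peer mode the local server would act on (illustrative only)."""
    if communicating:
        # A communicating peer reports its own mode. Per the table, a peerdown
        # setting is erased as soon as the peer starts communicating.
        if reported_mode is None:
            raise ValueError("a communicating peer must report a mode")
        return reported_mode
    if peerdown_certified:
        # The administrator has certified that the peer is powered off.
        return "peerdown"
    # No communication and no certification: the peer is assumed to be
    # in default mode, that is, with HA monitoring turned on.
    return "default"
```

Assuming the peer is in default mode is the conservative choice: it keeps the local server from entering single or config mode without either a locked, communicating peer or an explicit certification.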
The HA Manager subsystem collects server statuses along with the server modes to fully measure the operating condition of the HA Cluster. The possible statuses are as follows:
- stopped: Running the 'DSM_control status' command has returned a false code.
- running: Running the 'DSM_control status' command has returned a true code.
- primary: The server's status is running and the FSMPM is in the primary state, which indicates that the HaShared FSM has been activated.
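The three statuses form a simple decision chain, which the following sketch makes explicit. It is a hypothetical model: `server_status` and its two boolean inputs are names invented here to stand in for the result of 'DSM_control status' and the FSMPM primary state, and nothing in it calls StorNext.

```python
def server_status(dsm_control_ok: bool, fsmpm_is_primary: bool) -> str:
    """Map the two observations described above to a status name.

    dsm_control_ok   -- stands in for 'DSM_control status' returning a true code
    fsmpm_is_primary -- stands in for the FSMPM being in the primary state
    """
    if not dsm_control_ok:
        return "stopped"
    if fsmpm_is_primary:
        # primary implies running: the HaShared FSM has been activated.
        return "primary"
    return "running"
```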
The HA Manager allows the cluster to be in only the following restricted set of operating states. When a server is in default mode, HA monitoring is turned on.
- default-default
- default-locked
- default-peerdown
- single-peerdown
- single-locked
- config-peerdown
- config-locked
- locked-*
The following states are prohibited, and the HA Manager prevents them from occurring unless there is improper tampering. For example, the last state listed below (peerdown-*) is the case when a node that is designated as peerdown begins communicating with its peer. If the HA Manager discovers any of these states, it takes action to move the cluster to a valid state, which may trigger an HA reset.
- single-default
- single-single
- single-config
- config-default
- config-single
- config-config
- peerdown-*
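The allowed and prohibited lists can be read as a lookup over mode pairs. The sketch below encodes them that way. It is a model for illustration, not the HA Manager's implementation: the names `MODES`, `ALLOWED_STATES`, and `is_allowed` are hypothetical, and it assumes each state is written as first-server mode followed by second-server mode, with `*` meaning any mode.

```python
MODES = {"default", "single", "config", "locked", "peerdown"}

# Explicit pairs from the allowed list; "locked-*" is handled separately.
ALLOWED_STATES = {
    ("default", "default"),
    ("default", "locked"),
    ("default", "peerdown"),
    ("single", "peerdown"),
    ("single", "locked"),
    ("config", "peerdown"),
    ("config", "locked"),
}


def is_allowed(first: str, second: str) -> bool:
    """Return True if the mode pair appears in the allowed list."""
    for mode in (first, second):
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode!r}")
    if first == "locked":
        # "locked-*": a locked server may be paired with any mode.
        return True
    # Everything else, including every "peerdown-*" pair, must be listed.
    return (first, second) in ALLOWED_STATES
```

For example, `is_allowed("single", "default")` is false, matching the prohibited list: a server in single mode requires its peer to be locked or certified as peerdown.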
HA Manager Components
The following files and processes are some of the components of the HA Manager Subsystem:
| File/Process | Description |
|---|---|
| snhamgr_daemon | If the cnvt2ha.sh script has been run, this daemon is started after system boot and before StorNext, and immediately attempts to communicate with its peer. It is stopped after StorNext when Linux is shutting down. Otherwise, it should always be running. A watcher process attempts to restart it when it stops abnormally. Its status can be checked with 'service snhamgr status'. It can be restarted with 'service snhamgr start' or 'service snhamgr restart' if it is malfunctioning. |
| snhamgr | CLI that communicates with the daemon to deliver commands and report status, or to report a default status when the cnvt2ha.sh script has not been run. This is the interface for StorNext control scripts to regulate component starts. |
| /usr/cvfs/install/.ha_mgr | Stored mode value, which allows the single, config, locked, and peerdown modes to persist across reboots. |
| SNSM_HA_CONFIGURED | Environment variable that points to a touch file to indicate that cnvt2ha.sh has been run. |
| /etc/init.d/snhamgr | Service control script for starting the snhamgr_daemon. |
| HA_IDLE_FAILED_STARTUP | Environment variable that points to a touch file to indicate that a previous run of 'service cvfs start' failed before completion. This blocks startup attempts to prevent infinitely looping startup attempts. |
| /usr/cvfs/debug/smithlog | When an HA Reset is imminent, a descriptive line is added to the end of this file and the file is fsync'd in an attempt to ensure that the information is available for debugging the root cause of the reset. For example, when there is less than one second remaining on the HA Monitor timer, a notice is written in this file. It is likely that all other log files will lose some of the newest information at the time of the reset because it is in buffers that have not been written to disk. The fsmpm process writes the file, so the file may not have any diagnostic information in cases where the fsmpm itself fails and causes an HA reset. |
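The smithlog entry describes a general pattern: append a line, then fsync so the data survives an imminent reset. The sketch below shows that pattern in isolation. It is not the fsmpm's code; the function name `append_durably`, the demo file name, and the message text are all made up for illustration, and the demo writes to a temporary directory rather than to /usr/cvfs/debug/smithlog.

```python
import os
import tempfile


def append_durably(path: str, line: str) -> None:
    """Append one line to `path` and ask the OS to commit it to disk."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(line.rstrip("\n") + "\n")
        log.flush()             # move the data from Python's buffer to the OS
        os.fsync(log.fileno())  # ask the OS to write its buffers to the device


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "smithlog.demo")  # hypothetical file
        append_durably(demo_path, "example notice: HA reset imminent")
        with open(demo_path, encoding="utf-8") as f:
            print(f.read(), end="")
```

Without the explicit flush and fsync, the newest lines could still be sitting in memory buffers at the moment of the reset, which is the loss the table describes for other log files.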
HA Manager Operation
In addition to the setting of modes, there are some commands provided by the HA Manager to simplify and automate the operation of an HA Cluster. The commands are listed in the table below and the command syntax is as follows, where cmd is the command in the table:
| Command | Description |
|---|---|
| status | Return cluster modes and operating statuses. All commands return status; this one does nothing else. |
| stop | Safely stop both servers in the cluster without incurring an HA reset. The secondary server is placed in locked mode, which stops StorNext on that server, then the primary server is placed in config mode and stopped, and then both servers are put in default mode with StorNext stopped. |
| start | Stop each server when there is a need and transition both servers to default mode, then bring up the local server first followed by the peer server so that the local server becomes primary and the peer server becomes secondary. Note: Running |
| config | First, check that the peer server is in locked or peerdown mode. Then, place the local server in config mode. The command must be run on the primary server, or either server when CVFS is stopped on both. |
| clear | Remove the file referenced by the |
| force smith | Trigger an immediate HA reset if the local server is in default mode. This command is meant for use in health-monitoring scripts. The command is two words to make accidental firing less likely. Caution: It is not recommended to use the Wait for the secondary to become primary, then run: |
| peerdown | Certify that the peer server is powered off. This mode is used when the peer server is powered down. In the event that the peer returns to service and begins to communicate, the assertion that the peer is down becomes false. Immediate action may be taken by the local server to transition itself to a safe operating mode, which could trigger an HA reset. The best practice is to power off the server or uninstall StorNext before setting peerdown mode, and to unset the mode before powering on the server. |
| peerup | Undo the peerdown mode. The command will fail if the local mode is config or single. Run this command before powering on the peer server. The local server will assume the peer is in default mode until the peer starts |
| mode= | Set the mode of the local server to |
See the StorNext 5 Man Pages Reference Guide for additional details on the commands.
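The mode transitions that the stop command is described as performing can be traced step by step. The sketch below is a hypothetical model of that description, not the command's implementation: the names `Step`, `STOP_STEPS`, `ALLOWED_PAIRS`, `pair_allowed`, and `stop_steps_stay_valid` are invented here, and the allowed pairs are copied from the list of permitted operating states, assuming each pair is written as first-server mode followed by second-server mode.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    primary_mode: str
    secondary_mode: str
    note: str


# The three stages described for the stop command, in order.
STOP_STEPS: List[Step] = [
    Step("default", "locked", "secondary locked; StorNext stops on the secondary"),
    Step("config", "locked", "primary placed in config mode and stopped"),
    Step("default", "default", "both servers in default mode, StorNext stopped"),
]

# Mode pairs from the permitted operating states ("locked-*" handled below).
ALLOWED_PAIRS = {
    ("default", "default"), ("default", "locked"), ("default", "peerdown"),
    ("single", "peerdown"), ("single", "locked"),
    ("config", "peerdown"), ("config", "locked"),
}


def pair_allowed(first: str, second: str) -> bool:
    """True if the pair is permitted; a locked server pairs with any mode."""
    return first == "locked" or (first, second) in ALLOWED_PAIRS


def stop_steps_stay_valid() -> bool:
    """Check each stage from both servers' points of view."""
    return all(
        pair_allowed(s.primary_mode, s.secondary_mode)
        and pair_allowed(s.secondary_mode, s.primary_mode)
        for s in STOP_STEPS
    )
```

Under these assumptions every stage is a permitted state from either server's perspective, which is consistent with the command being described as a safe way to stop the cluster.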
