HA Resets

The first method of an HA Reset is explained by the following description of the FSM monitoring algorithm (patent pending). The terms usurp and usurpation refer to the process of taking control of a file system, either with or without contention. It involves the branding of the arbitration block on the metadata disk to take control, and then the timed rebranding of the block to maintain control. The HA Monitor algorithm places an upper bound on the timing of the ARB branding protocol to prevent two FSMs from simultaneously attempting to control the metadata, even for an instant.

When an activating HaUnmanaged or HaShared FSM usurps the ARB, create a five-second timer that resets the computer if it expires
Wait five seconds plus a small delta before completing usurpation
Immediately after every ARB Brand update (0.5-second period), reset the timer
Delete the timer when the FSM exits

When there is a SAN, LUN, or FSM process failure that delays updates of the ARB, the HA Monitor timer can run out. When it is less than one second from expiring, a one-line message describing this is written to the /usr/cvfs/debug/smithlog file.
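The timer behavior described above can be pictured with a small model. This is only an illustrative sketch, not StorNext code: the class name, method names, and status strings are hypothetical, and time is passed in explicitly so the logic can be followed without real clocks. The five-second interval and the one-second warning threshold come from the text.

```python
class HaMonitorTimer:
    """Toy model of the HA Monitor reset timer (hypothetical names)."""

    def __init__(self, interval=5.0, now=0.0):
        self.interval = interval          # seconds allowed between brand updates
        self.deadline = now + interval    # moment at which a reset would fire
        self.active = True

    def brand_updated(self, now):
        """Restart the countdown immediately after an ARB brand update."""
        if self.active:
            self.deadline = now + self.interval

    def delete(self):
        """The FSM has exited, so the timer can no longer fire."""
        self.active = False

    def status(self, now):
        """Classify the timer state at time `now`."""
        if not self.active:
            return "inactive"
        remaining = self.deadline - now
        if remaining <= 0:
            return "reset"    # timer expired: the computer would be reset
        if remaining < 1.0:
            return "warn"     # under one second left: smithlog message
        return "ok"
```

In this model, brand updates arriving every 0.5 seconds keep the status at "ok", while a stall of more than four seconds moves it to "warn" and a stall of five seconds moves it to "reset".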

If SAN or LUN delays are suspected of occurring regularly, the following test can be run. Running the test will significantly impact performance.

Increase the timer value (up to 999 seconds) by creating the /usr/cvfs/config/ha_smith_interval file on each MDC with only this line: 'ha_smith_interval=<integer>'. This will allow the delays to run their course without incurring a reset. The value must match on both MDCs.
Turn on debugging traces with 'cvdbset :ha'
Display debugging traces with 'cvdb -g -C -D 500'
Look for lines like this example: 'HAmonCheck PID #### FS "testfs" status delay = 1'
When the value is more than 1, abnormal delays are occurring. When a standby FSM is running and the LAN is working, the negotiated timer resets should limit the growth of this value to four. When the value reaches two times the ha_smith_interval (default of 5 x 2 = 10), an HA Reset occurs.
Turn off tracing with 'cvdbset - all'
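When there is a lot of trace output, the delay values can be picked out with a short script. The following is a minimal sketch: the function name and level labels are hypothetical, and the regular expression assumes every line follows the single example shown above, with the PID as one token without spaces. The thresholds are the ones stated in the steps.

```python
import re

# Pattern inferred from the one example line in the text (an assumption).
TRACE_RE = re.compile(r'HAmonCheck PID (\S+) FS "([^"]+)" status delay = (\d+)')

def classify_delay(line, ha_smith_interval=5):
    """Return (fs_name, delay, level) for a trace line, or None if no match."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    fs_name = match.group(2)
    delay = int(match.group(3))
    if delay >= 2 * ha_smith_interval:
        level = "reset-threshold"   # two times the interval: HA Reset occurs
    elif delay > 1:
        level = "abnormal"          # more than 1: abnormal delays
    else:
        level = "normal"
    return fs_name, delay, level
```

For instance, a line reporting 'status delay = 4' for "testfs" is classified as abnormal with the default interval of 5, and a value of 10 is at the reset threshold.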

HA Resets of the Second Kind

The second method of HA Reset can occur on shutdown of CVFS if there is an unkillable process or delayed process exit under the HaShared file system mount point. This will keep the file system from being unmounted. The smithlog entry indicates when this has happened, but does not identify the process.

HA Resets of the Third Kind

The third method of HA Reset is the most common. It occurs when the snactivated script for the HaShared FSM experiences an error during startup. The current implementation invokes the 'snhamgr force smith' command to allow the peer MDC an opportunity to start up StorNext if it can. A similar strategy was used in previous releases. In this release, the failure to start will cause the /usr/cvfs/install/.ha_idle_failed_startup touch file to be created, and this will prevent startup of CVFS on this MDC until the file is erased with the 'snhamgr clear' command.
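A monitoring script can check for the touch file to see whether startup is currently blocked on an MDC. This is a minimal sketch under that assumption: the function name is hypothetical, and only the file path comes from the text. It only reports the condition. Clearing it is still done with the 'snhamgr clear' command.

```python
import os

# Path taken from the text above; the function name is hypothetical.
FAILED_STARTUP_FLAG = "/usr/cvfs/install/.ha_idle_failed_startup"

def startup_blocked(flag_path=FAILED_STARTUP_FLAG):
    """Return True if the failed-startup touch file is present."""
    return os.path.exists(flag_path)
```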

Using HA Manager Modes

The snhamgr rules for mode pairings are easier to understand by following a BAAB strategy for transitioning into and out of config or single mode. In this strategy, B stands for the redundant node, and A stands for the node to be placed into config or single mode. Enter the desired cluster state by transitioning B's mode first, then A's. Reverse this when exiting the cluster state by transitioning A's mode, then B's.

For the configuration-session example, place B in locked mode, then place A in config mode to start a configuration session. At the end of the session, place A in default mode, then place B in default mode.

For the single-server cluster example, shut down Linux and power off B, then designate it peerdown with the 'snhamgr peerdown' command on A, then place A in single mode. At the end of the session, place A in default mode, then designate B as up with the 'snhamgr peerup' command on A, then power on B.
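The BAAB ordering in the two examples above can be written down as a small planning function. This is only a sketch of the ordering, not of snhamgr itself: the function name is hypothetical, and the tuples record which node changes state at each step. In the single-server case the peerdown and peerup designations are issued on A, and the power-off and power-on steps for B are left out.

```python
def baab_plan(b_enter, a_enter, a_exit="default", b_exit="default"):
    """Return ordered (node, state) steps: B then A to enter, A then B to exit."""
    entering = [("B", b_enter), ("A", a_enter)]
    exiting = [("A", a_exit), ("B", b_exit)]
    return entering + exiting

# Configuration session: lock B, then put A in config mode; undo in reverse.
config_session = baab_plan("locked", "config")

# Single-server cluster: mark B down, then put A in single mode; undo in reverse.
single_server = baab_plan("peerdown", "single", b_exit="peerup")
```

Reading the node column of either plan from top to bottom gives B, A, A, B, which is where the strategy gets its name.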