HA Resets of the First Kind
The first method of an HA Reset is explained by the following description of the FSM monitoring algorithm (patent pending). The terms usurp and usurpation refer to the process of taking control of a file system, either with or without contention. It involves the branding of the arbitration block on the metadata disk to take control, and then the timed rebranding of the block to maintain control. The HA Monitor algorithm places an upper bound on the timing of the ARB branding protocol to prevent two FSMs from simultaneously attempting to control the metadata, even for an instant.
- When an activating HaUnmanaged or HaShared FSM usurps the ARB, create a five-second timer that resets the computer if it expires
- Wait five seconds plus a small delta before completing usurpation
- Immediately after every ARB Brand update (0.5-second period), reset the timer
- Delete the timer when the FSM exits
When there is a SAN, LUN, or FSM process failure that delays updates of the ARB, the HA Monitor timer can run out. When it is less than one second from expiring, a one-line message describing this is written to the /usr/cvfs/debug/smithlog file.
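The timer behavior described above can be sketched as a small model. This is only an illustration under stated assumptions, not the FSM's actual implementation: the class name HaTimer, its methods, and the state labels are hypothetical, and time is passed in explicitly so that nothing is really reset. The five-second interval and the one-second warning margin come from the text.

```python
from dataclasses import dataclass

RESET_INTERVAL = 5.0  # seconds; default timer length from the text
WARN_MARGIN = 1.0     # seconds; a message is logged when less than this remains


@dataclass
class HaTimer:
    """Hypothetical model of the HA Monitor reset timer (illustration only)."""
    interval: float = RESET_INTERVAL
    deadline: float = 0.0
    armed: bool = False

    def arm(self, now: float) -> None:
        """Create the timer when the activating FSM usurps the ARB."""
        self.deadline = now + self.interval
        self.armed = True

    def rearm(self, now: float) -> None:
        """Reset the timer immediately after an ARB brand update."""
        if self.armed:
            self.deadline = now + self.interval

    def delete(self) -> None:
        """Delete the timer when the FSM exits."""
        self.armed = False

    def state(self, now: float) -> str:
        """Return 'idle', 'ok', 'warn' (under one second left), or 'reset'."""
        if not self.armed:
            return "idle"
        remaining = self.deadline - now
        if remaining <= 0:
            return "reset"
        if remaining < WARN_MARGIN:
            return "warn"
        return "ok"
```

In this model, a brand update that arrives on schedule keeps pushing the deadline forward, so the state stays "ok". A stalled update lets the state move to "warn" and then to "reset", which corresponds to the computer being reset.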
If SAN or LUN delays are suspected of occurring with regular frequency, the following test can be run. Be aware that running this test significantly impacts performance.
- Increase the timer value (up to 999 seconds) by creating the /usr/cvfs/config/ha_smith_interval file on each MDC with only this line: 'ha_smith_interval=<integer>'. This will allow the delays to run their course without incurring a reset. The value must match on both MDCs.
- Turn on debugging traces with 'cvdbset :ha'
- Display debugging traces with 'cvdb -g -C -D 500'
- Look for lines like this example: 'HAmonCheck PID #### FS "testfs" status delay = 1'
- When the value grows to more than 1, abnormal delays are occurring. When a standby FSM is running and the LAN is working, the negotiated timer resets should limit the growth of this value to four. When the value reaches two times the ha_smith_interval (default of 5 x 2 = 10), an HA Reset occurs.
- Turn off tracing with 'cvdbset - all'
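When reviewing a long trace, the thresholds above can be applied mechanically. The following is a rough sketch only: the function name classify_trace_line and the labels it returns are hypothetical, and the regular expression assumes that every trace line follows the single example shown above, which may not hold for all releases.

```python
import re

DEFAULT_INTERVAL = 5  # default ha_smith_interval in seconds, from the text

# Pattern inferred from the one example line in the text (an assumption).
TRACE_RE = re.compile(
    r'HAmonCheck PID (\S+) FS "([^"]+)" status delay = (\d+)'
)


def classify_trace_line(line, interval=DEFAULT_INTERVAL):
    """Return (fs_name, delay, label) for an HAmonCheck line, else None.

    Labels follow the thresholds in the text:
      delay <= 1              -> "normal"
      1 < delay < 2*interval  -> "abnormal"
      delay >= 2*interval     -> "reset" (an HA Reset occurs)
    """
    match = TRACE_RE.search(line)
    if match is None:
        return None
    fs_name = match.group(2)
    delay = int(match.group(3))
    if delay >= 2 * interval:
        label = "reset"
    elif delay > 1:
        label = "abnormal"
    else:
        label = "normal"
    return fs_name, delay, label
```

For example, the saved output of the trace display command could be read line by line and each line passed to classify_trace_line, with the same interval value that is configured in the ha_smith_interval file.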