How to Gracefully Fail Over a StorNext HA Node



Take care when failing over from the current MDC in an HA pair, particularly on a StorNext appliance. In particular, avoid using SMITH, because it risks corrupting the filesystem or the SNSM database.


Note that failing individual parts of the configuration, e.g. simply stopping the HA filesystem, will trigger a SMITH and should be avoided.





First, check that the alternate node is functioning and ready to take over by running snhamgr on the node you intend to fail over from:


# snhamgr status






Also ensure that any resources the alternate node requires are available and operational (e.g. check disks with cvlabel, and tape libraries and drives with fs_scsi).
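The readiness check can be scripted rather than read by eye. The following is a minimal sketch, not a definitive implementation. It assumes that snhamgr status prints KEY=value tokens named LocalStatus and RemoteStatus, and that "primary" and "running" are the values reported by a healthy pair; confirm both against the output on your own release. The helper names ha_field and peer_ready are hypothetical. The cvlabel and fs_scsi checks are left as manual steps because no output format is assumed for them here.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: `snhamgr status` prints KEY=value tokens such as
# LocalStatus=primary and RemoteStatus=running (one per line or space separated).
# ha_field and peer_ready are hypothetical helper names used for illustration.

# Print the value of one KEY from status text supplied on stdin.
ha_field() {
    tr ' \t' '\n\n' | sed -n "s/^$1=//p" | head -n 1
}

# Succeed only if this node is primary and the peer is running,
# i.e. the peer looks ready to take over.
peer_ready() {
    status_text=$1
    local_status=$(printf '%s\n' "$status_text" | ha_field LocalStatus)
    remote_status=$(printf '%s\n' "$status_text" | ha_field RemoteStatus)
    [ "$local_status" = "primary" ] && [ "$remote_status" = "running" ]
}
```

On the node you intend to fail over from, this could be used as: peer_ready "$(snhamgr status)" && echo "alternate node looks ready"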


You may also wish to open shell windows on both nodes and tail the system logs to monitor the failover in real time.





To initiate the failover, simply stop the StorNext filesystem service:


# service cvfs stop


Once the stop completes, snhamgr should show the alternate system as the primary and the current node as stopped:


# snhamgr status
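If the new state does not appear immediately, the status check above can be repeated in a loop. This is a hedged sketch under the same assumption about the output format (KEY=value tokens named LocalStatus and RemoteStatus), and it further assumes that "stopped" and "primary" are the literal values reported after a clean failover. STATUS_CMD, the try count and the delay are hypothetical parameters introduced for illustration.

```shell
#!/bin/sh
# Sketch only. ASSUMPTIONS: the status output contains LocalStatus=... and
# RemoteStatus=... tokens, and "stopped" / "primary" are the values shown
# after a clean failover. STATUS_CMD is a hypothetical override for testing.

STATUS_CMD=${STATUS_CMD:-"snhamgr status"}

# Poll until the peer reports primary and this node reports stopped.
# Usage: wait_for_failover [tries] [delay_seconds]
wait_for_failover() {
    tries=${1:-30}
    delay=${2:-10}
    while [ "$tries" -gt 0 ]; do
        status_text=$($STATUS_CMD 2>/dev/null)
        # Flatten to one space-separated line so tokens can be matched whole.
        flat=" $(printf '%s' "$status_text" | tr '\n\t' '  ') "
        case $flat in
            *" LocalStatus=stopped "*)
                case $flat in
                    *" RemoteStatus=primary "*) return 0 ;;
                esac
                ;;
        esac
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}
```

For example, wait_for_failover 30 10 checks every 10 seconds for up to 30 attempts and returns non-zero if the expected state never appears.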






Be sure to restart the service once the failover has completed, so that the local node is once again available to take over if the new primary fails.
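The stop, check and restart sequence described above can be wrapped so that a failed step halts the procedure instead of being followed by a blind restart. This is a sketch only: it assumes service cvfs stop and service cvfs start return a non-zero exit status on failure, which is the usual init-script convention but is not confirmed here for StorNext. DRY_RUN, run and failover_and_rejoin are hypothetical names introduced for illustration. The sketch does not handle a stop that hangs.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: `service cvfs stop|start` exit non-zero on failure.
# DRY_RUN is a hypothetical switch: when set to 1, commands are printed
# instead of executed, so the sequence can be reviewed safely first.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

failover_and_rejoin() {
    # Stop StorNext on the current primary; the alternate node takes over.
    run service cvfs stop || { echo "cvfs stop failed; not restarting" >&2; return 1; }
    # Show the HA state so the operator can confirm the peer is now primary.
    run snhamgr status || return 1
    # Restart so that this node is available as the standby again.
    run service cvfs start || { echo "cvfs start failed" >&2; return 1; }
}
```

Setting DRY_RUN=1 before calling failover_and_rejoin prints the three commands in order without touching the services.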



Example of Filesystem Service Successfully Stopping and Starting


# service cvfs stop


Initiating stop of StorNext SNAPI component

SNAPI software stopped.


Initiating stop of StorNext TSM component

FS0285 Tertiary Manager terminate requested.

FS0279 Tertiary Manager software successfully terminated.

FS0000 01 0001348744 /usr/adic/TSM/exec/fsconfig completed: Command Successful.


Initiating stop of StorNext MSM component

Media Manager Version 5.0.1 for Linux (Kernel:2618 OS:RHEL5) -- Copyright (C) 1992-2014 Quantum Corp.

Initiating the Media Manager shutdown

Setup environment variables ok

Shutting down the Media Manager system processes ... Done

System processes shut down ok

Shutting down the Media Manager servers ... Done

Servers shut down ok

Shutting down the Media Manager process server ... Done

Process server shut down ok

The Media Manager shutdown completed


Initiating stop of StorNext PSE component


Initiating stop of StorNext SRVCLOG component


Stopping sla with pid: 2211

Stopping ala with pid: 2221


Initiating stop of StorNext mysql component

Stopping mysqld

mysqld stopped


Initiating stop of StorNext DSM component

Stopping blockpool succeeded                               [  OK  ]

Terminating snpolicyd, this may take up to 300 seconds

Stopping snpolicyd succeeded                               [  OK  ]

Unmounting SNFS filesystems                                [  OK  ]


Stopping SNFS Daemons

Disabling vips

Running '/sbin/ifconfig bond0:ha down'

Stopping SNFS PortMapper

Waiting for FSMs to finish..


SNFS Stop                                                  [  OK  ]



# service cvfs start


Initiating start of StorNext DSM component


Checking maintenance license...

- The maintenance license status is: Good                  [  OK  ]


Initializing StorNext Filesystem (SNFS)

Loading SNFS modules

net.core.rmem_max = 1048576

Multipath enabled, waiting up to 500 seconds for multipath device creation

.                                                          [  OK  ]

Starting /usr/cvfs/bin/fsmpm .........

net.core.rmem_max = 131071

Starting /usr/cvfs/bin/cvfsd ...                           [  OK  ]

Mounting the shared file system: HA_shared

Waiting for primary

Waiting for CVFS mounts to complete                        [  OK  ]

SNFS Initialized                                           [  OK  ]





The problem I see here is the case where the shutdown hangs for some specific reason.


CR references discussing the best way to fail over manually, and other considerations:


 Bug 50159 - man cvfs_failover should not suggest "snhamgr force smith"


Bug 45669 - when HA decide it need to smith itself it should try to stop cvfs before doing a hardware reset



Bug 50159 was opened because a SMITH could cause MySQL corruption (Ref. Bug to be added). Another approach would be to stop Storage Manager, SMITH the active node, wait for Node 2 to take over, and then start Storage Manager. I guess any method of manual failover will be a case-by-case decision on how to approach it.



Note by Michael Richter on 12/02/2014 02:16 AM

This page was generated by the BrainKeeper Enterprise Wiki, © 2018