SR3747616: NAS Cluster broken after upgrade to 1.3.0
SR Information: 3747616 ERT
Problem Description: NAS cluster broken after upgrade to 1.3.0
Product / Software Version:
XCellis 5.3.2.1, SN-NAS 1.2.4
Overview
After the upgrade to SN-NAS 1.3.0, all nodes un-joined the cluster.
SN-NAS Cluster
A StorNext NAS cluster provides users the ability to access NAS shares located on any of the NAS cluster's nodes.
Through NAS clusters, you can also take advantage of additional features built into StorNext NAS, such as NAS failover to ensure that users can always access NAS shares, or G300 load-balancing to maintain a desirable level of network-response time.
Symptoms & Identifying the problem
## 1 ## Review:
The customer upgraded SN-NAS from 1.2.4 to 1.2.5 and then to 1.3.0.
">system show version" confirms that both XCellis nodes have been updated to NAS 1.3.0.
">nascluster show" revealed that both XCellis nodes had un-joined the NAS cluster.
/var/log/snnas_controller
2016-10-08 00:20:34: stdout/INFO:: nas_cluster_plugin.py:2901 master file (/stornext/XSAN_VOL/.StorNext/.snnas/ctdb/ctdb_nodes_0.master) updated with 192.168.20.42
2016-10-08 00:29:16: stdout/INFO:: sml_nascluster_utils.py:2149 NAS post-upgrade complete
2016-10-08 00:32:32: stdout/INFO:: controller_commands.py:79 New controller command: nascluster_join
2016-10-08 00:32:32: stdout/ERROR:: controller_commands.py:1211 Controller command 'nascluster_join' encountered error: NAS cluster version invalid: Master auth config: version 1, expected 2 (E-5064)
2016-10-08 01:05:25: stdout/INFO:: sml_nascluster_utils.py:1764 Previous NAS cluster join in-progress file /stornext/XSAN_VOL/.StorNext/.snnas/ctdb/ctdb_nodes_0.join-progress for 192.168.20.42 found, continuing ...
2016-10-08 01:05:25: stdout/ERROR:: controller_commands.py:1211 Controller command 'nascluster_join' encountered error: NAS cluster version invalid: Master auth config: version 1, expected 2 (E-5064)
The join appears to be looping: the "in-progress" file causes each retry to continue, and every attempt fails again with E-5064.
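The log lines above show the retry condition: as long as the join-progress marker file exists, the controller keeps re-attempting the join. A minimal sketch of that check, using a temp directory and a fake marker purely for illustration (on a real system the directory is /stornext/XSAN_VOL/.StorNext/.snnas/ctdb):

```shell
# Illustrative only: CTDB_DIR stands in for the real ctdb directory;
# the marker file is created here just to demonstrate the detection.
CTDB_DIR="$(mktemp -d)"
echo "192.168.20.42" > "$CTDB_DIR/ctdb_nodes_0.join-progress"

marker="$CTDB_DIR/ctdb_nodes_0.join-progress"
if [ -f "$marker" ]; then
    # matches the "Previous NAS cluster join in-progress file ... found" log line
    msg="previous join still in progress for $(cat "$marker")"
fi
echo "$msg"
rm -rf "$CTDB_DIR"
```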
Note: So why did the master leave the cluster?
## 2 ## Troubleshooting:
Let's review where the SN-NAS configuration files are located, using the SN-NAS shell to print the NAS registry:
HM:qtm-node1> reg show nas
nas.cluster.snfs_root = /stornext/XSAN_VOL
/stornext/XSAN_VOL/.StorNext/.snnas/ctdb/
-rw-rw-r-- 1 root root 28 Aug 25 11:38 ctdb_nodes_0
-rw-rw-r-- 1 root root 14 Aug 25 11:38 ctdb_nodes_0.bak
-rwxr-xr-x 1 root root 14 Oct 8 00:32 ctdb_nodes_0.join-progress
-rwxr-xr-x 1 root root 14 Oct 8 00:32 ctdb_nodes_0.master
-rw-rw-r-- 1 root root 0 Oct 10 10:41 ctdb_public_addresses_0
-rw-rw-r-- 1 root root 0 Aug 25 12:28 ctdb_recovery_0.lck
The logs show that the master is supposed to be Node2 (the passive MDC) with IP 192.168.20.42:
[root@qtm-node1 ctdb]# cat ctdb_nodes_0.master
192.168.20.41
[root@qtm-node1 ctdb]# cat ctdb_nodes_0.join-progress
192.168.20.42
The configuration files state otherwise: Node1 is the master, and the "join-progress" file shows Node2 trying to join the cluster.
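The inconsistency can be seen by comparing the two files' contents, as shown above. A hypothetical sketch of that comparison, with the file layout faked in a temp directory (the real files live under /stornext/XSAN_VOL/.StorNext/.snnas/ctdb/):

```shell
# Illustration only, not a product tool: recreate the two files with the
# values observed in this case and flag the disagreement.
dir="$(mktemp -d)"
echo "192.168.20.41" > "$dir/ctdb_nodes_0.master"
echo "192.168.20.42" > "$dir/ctdb_nodes_0.join-progress"

master="$(cat "$dir/ctdb_nodes_0.master")"
joining="$(cat "$dir/ctdb_nodes_0.join-progress")"
if [ "$master" != "$joining" ]; then
    result="master=$master joining=$joining (mismatch)"
fi
echo "$result"
rm -rf "$dir"
```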
BHM:qtm-node1> nascluster join /stornext/XSAN_VOL 192.168.20.42
NAS cluster operation failed: Not NAS cluster master (E-7014)
BHM:qtm-node1> nascluster enable master 192.168.20.41 /stornext/XSAN_VOL
NAS cluster operation failed: Node 192.168.20.41 already enabled! (E-7014)
ctdb\supportbundle\tmp\snnas_support\net.txt
p1p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.20.42 netmask 255.255.255.0 broadcast 192.168.20.255
p1p1:nas: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.20.40 netmask 255.255.255.255 broadcast 192.168.20.40
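The net.txt excerpt shows the cluster virtual IP (192.168.20.40) bound to the p1p1:nas alias on Node2. A small sketch of extracting that address from ifconfig-style output; the sample text is copied from the support bundle above, and the parsing is an illustration only:

```shell
# Find the inet address on the line following the "nas" alias interface.
sample='p1p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.20.42 netmask 255.255.255.0 broadcast 192.168.20.255
p1p1:nas: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.20.40 netmask 255.255.255.255 broadcast 192.168.20.40'

vip="$(printf '%s\n' "$sample" | awk '/:nas:/{getline; print $2; exit}')"
echo "cluster VIP: $vip"
```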
## 3 ## Resolutions/workarounds/fixes:
Re-join Node2 (192.168.20.42) first, so it takes over as cluster master, then proxy-join Node1:
BHM:qtm-node2> nascluster join /stornext/XSAN_VOL 192.168.20.42
Preparing for NAS cluster join as ID 192.168.20.42
Applying NAS cluster join settings ...
Updating system NAS cluster configuration ...
Check for master takeover ...
Publish master configuration ...
Sending ads auth-config sync to 192.168.20.41 ...
Node 192.168.20.41 not joined, skipping auth-config sync.
Broadcasting share config-sync to NAS cluster ...
Broadcasting data analytic config-sync to NAS cluster ...
Cluster verification for 192.168.20.42 in-progress ...
Node state: pnn:0 192.168.20.42 UNHEALTHY (THIS NODE), waiting ...
Node state: pnn:0 192.168.20.42 OK (THIS NODE)
Cluster verification of 192.168.20.42 successful ...
Join to ERT2000.GR starting ...
Verify now joined to ERT2000.GR ...
Restart SMB services to join with ERT2000.GR ...
Successfully joined NAS cluster
BHM:qtm-node2> nascluster join /stornext/XSAN_VOL 192.168.20.41
Proxy join to 192.168.20.41 ...
[192.168.20.41]:
[192.168.20.41]: Preparing for NAS cluster join as ID 192.168.20.41
[192.168.20.41]: Waiting for NAS cluster join in-progress from ID 192.168.20.41 ...
[192.168.20.41]: Verifying local configuration with master 192.168.20.42 ...
[192.168.20.41]: Synchronization of local configuration with master 192.168.20.42 starting...
[192.168.20.41]: Applying ads auth config sync settings ...
[192.168.20.41]: Applying ads configuration settings ...
[192.168.20.41]: Checking SMB interface list: lo 192.168.20.41
[192.168.20.41]: Checking SMB interface 'p1p1:192.168.20.41' status ...
[192.168.20.41]: Join to ERT2000.GR starting ...
[192.168.20.41]: Verify now joined to ERT2000.GR ...
[192.168.20.41]: Restart SMB services to join with ERT2000.GR ...
[192.168.20.41]: Applying NAS cluster join settings ...
[192.168.20.41]: Updating system NAS cluster configuration ...
[192.168.20.41]: Verifying local configuration with master 192.168.20.42 ...
[192.168.20.41]: Cluster verification for 192.168.20.41 in-progress ...
[192.168.20.41]: Node state: pnn:1 192.168.20.41 UNHEALTHY (THIS NODE), waiting ...
[192.168.20.41]: Node state: pnn:1 192.168.20.41 OK (THIS NODE)
[192.168.20.41]: Cluster verification of 192.168.20.41 successful ...
[192.168.20.41]: Join to ERT2000.GR starting ...
[192.168.20.41]: Verify now joined to ERT2000.GR ...
[192.168.20.41]: Restart SMB services to join with ERT2000.GR ...
Successfully joined NAS cluster
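After both joins succeed, the on-disk state should be consistent again: the master file names the acting master (Node2, per the proxy-join output above) and no join-progress marker remains. A hypothetical verification sketch, demonstrated against a temp directory standing in for /stornext/XSAN_VOL/.StorNext/.snnas/ctdb/:

```shell
# Illustration only: recreate the expected post-recovery state and check it.
dir="$(mktemp -d)"
echo "192.168.20.42" > "$dir/ctdb_nodes_0.master"   # Node2 took over as master

status="ok"
[ "$(cat "$dir/ctdb_nodes_0.master")" = "192.168.20.42" ] || status="wrong master"
[ ! -e "$dir/ctdb_nodes_0.join-progress" ] || status="stale join marker"
echo "verification: $status"
rm -rf "$dir"
```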
What we learn from this case:
This page was generated by the BrainKeeper Enterprise Wiki, © 2018