SR3747616: NAS Cluster broken after upgrade to 1.3.0

SR Information: 3747616 ERT

 

Problem Description: NAS Cluster broken after upgrade to 1.3.0

 

Product / Software Version:

 

XCellis 5.3.2.1

SN-NAS 1.2.4

 

 Overview

After the upgrade to SN-NAS 1.3.0, all nodes un-joined the cluster.

 

SN-NAS Cluster

 

A StorNext NAS cluster provides users the ability to access NAS shares located on any of the NAS cluster's nodes.

Through NAS clusters, you can also take advantage of additional features built into StorNext NAS, such as NAS failover to ensure that users can always access NAS shares, or G300 load-balancing to maintain a desirable level of network-response time.

 

 

Symptoms & Identifying the problem

 

 

## 1 ##  Review:

 

 The customer upgraded SN-NAS from 1.2.4 to 1.2.5, then to 1.3.0.

 

">system show version" confirmed that both XCellis nodes had been updated to NAS 1.3.0.

">nascluster show" revealed that both XCellis nodes had un-joined the NAS cluster.

 

/var/log/snnas_controller

2016-10-08 00:20:34: stdout/INFO:: nas_cluster_plugin.py:2901 master file (/stornext/XSAN_VOL/.StorNext/.snnas/ctdb/ctdb_nodes_0.master) updated with 192.168.20.42

2016-10-08 00:29:16: stdout/INFO:: sml_nascluster_utils.py:2149 NAS post-upgrade complete

2016-10-08 00:32:32: stdout/INFO:: controller_commands.py:79 New controller command: nascluster_join

2016-10-08 00:32:32: stdout/ERROR:: controller_commands.py:1211 Controller command 'nascluster_join' encountered error: NAS cluster version invalid: Master auth config: version 1, expected 2 (E-5064)

2016-10-08 01:05:25: stdout/INFO:: sml_nascluster_utils.py:1764 Previous NAS cluster join in-progress file /stornext/XSAN_VOL/.StorNext/.snnas/ctdb/ctdb_nodes_0.join-progress for 192.168.20.42 found, continuing ...

2016-10-08 01:05:25: stdout/ERROR:: controller_commands.py:1211 Controller command 'nascluster_join' encountered error: NAS cluster version invalid: Master auth config: version 1, expected 2 (E-5064)

 

It seems the join is looping due to the "in-progress" file and continuously failing.
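The loop can be confirmed from the controller log alone: the same E-5064 join error recurs on every retry. A minimal sketch of that check, using the two log lines from this case; the helper function name is ours, not part of SN-NAS:

```shell
# count_join_failures is a hypothetical helper (not an SN-NAS tool): it counts
# how often the 'nascluster_join' command failed in a controller log file.
count_join_failures() {
    grep -c "nascluster_join' encountered error" "$1"
}

# Simulate with the two error lines seen in /var/log/snnas_controller:
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2016-10-08 00:32:32: stdout/ERROR:: controller_commands.py:1211 Controller command 'nascluster_join' encountered error: NAS cluster version invalid: Master auth config: version 1, expected 2 (E-5064)
2016-10-08 01:05:25: stdout/ERROR:: controller_commands.py:1211 Controller command 'nascluster_join' encountered error: NAS cluster version invalid: Master auth config: version 1, expected 2 (E-5064)
EOF
FAILURES=$(count_join_failures "$LOG")
echo "repeated join failures: $FAILURES"
rm -f "$LOG"
```

More than one identical E-5064 failure, roughly half an hour apart, is what points at a retry loop rather than a one-off error.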

 

Note:

    • Upgrade the Master node to 1.3.0, following standard procedure
      • Each non-master 1.2.5 node will automatically leave the cluster
    • Upgrade each non-Master node to 1.3.0
      • Each non-master node will automatically re-join the cluster once it’s at NAS 1.3.0

 

 

So why did the master leave the cluster?

 

 

## 2 ## Troubleshooting:

Let's review where the SN-NAS configuration files are located, using the SN-NAS shell to print the NAS registry:

 

HM:qtm-node1> reg show nas

nas.cluster.snfs_root = /stornext/XSAN_VOL

 

 

/stornext/XSAN_VOL/.StorNext/.snnas/ctdb/

-rw-rw-r-- 1 root root 28 Aug 25 11:38 ctdb_nodes_0

-rw-rw-r-- 1 root root 14 Aug 25 11:38 ctdb_nodes_0.bak

-rwxr-xr-x 1 root root 14 Oct  8 00:32 ctdb_nodes_0.join-progress

-rwxr-xr-x 1 root root 14 Oct  8 00:32 ctdb_nodes_0.master

-rw-rw-r-- 1 root root  0 Oct 10 10:41 ctdb_public_addresses_0

-rw-rw-r-- 1 root root  0 Aug 25 12:28 ctdb_recovery_0.lck
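The path to these state files can be derived directly from the registry value above. A small sketch (directory layout as observed in this case; file roles as inferred from the logs):

```shell
# Derive the ctdb state directory from the snfs_root registry value
# (value taken from the "reg show nas" output above).
SNFS_ROOT="/stornext/XSAN_VOL"
CTDB_DIR="$SNFS_ROOT/.StorNext/.snnas/ctdb"
echo "$CTDB_DIR"

# Files of interest in this directory (roles as seen in this case):
#   ctdb_nodes_0.master        - IP of the node recorded as NAS master
#   ctdb_nodes_0.join-progress - marker left by an in-flight cluster join
```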

 

The logs show that the master is supposed to be Node2 [passive MDC] with IP 192.168.20.42.

 

[root@qtm-node1 ctdb]# cat ctdb_nodes_0.master

192.168.20.41

 

[root@qtm-node1 ctdb]# cat ctdb_nodes_0.join-progress

192.168.20.42

 

The configuration files state otherwise: Node1 is recorded as the master, and the join-in-progress file shows Node2 trying to join the cluster.

 

BHM:qtm-node1> nascluster join /stornext/XSAN_VOL 192.168.20.42

NAS cluster operation failed: Not NAS cluster master (E-7014)

 

BHM:qtm-node1> nascluster enable master 192.168.20.41 /stornext/XSAN_VOL

NAS cluster operation failed: Node 192.168.20.41 already enabled! (E-7014)

 

  • This is contradictory, and we are stuck.

 

 

ctdb\supportbundle\tmp\snnas_support\net.txt

p1p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500

        inet 192.168.20.42  netmask 255.255.255.0  broadcast 192.168.20.255

       

p1p1:nas: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500

        inet 192.168.20.40  netmask 255.255.255.255  broadcast 192.168.20.40

       

  • We can see that the NAS VIP is located on Node2, so this node is supposed to be the master.
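The decisive check is therefore: does the node that owns the NAS VIP match the IP recorded in ctdb_nodes_0.master? A sketch of that comparison, run here against the net.txt output reproduced above (on a live system you would parse the current ifconfig output instead):

```shell
NAS_VIP="192.168.20.40"
MASTER_FILE_IP="192.168.20.41"    # content of ctdb_nodes_0.master in this case

# Interface output as captured in the support bundle's net.txt:
NET_TXT=$(cat <<'EOF'
p1p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.20.42  netmask 255.255.255.0  broadcast 192.168.20.255
p1p1:nas: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.20.40  netmask 255.255.255.255  broadcast 192.168.20.40
EOF
)

# The node's own IP is the first inet entry; the VIP rides on the :nas alias.
NODE_IP=$(echo "$NET_TXT" | awk '/inet /{print $2; exit}')
if echo "$NET_TXT" | grep -q "inet $NAS_VIP "; then
    VIP_OWNER="$NODE_IP"
fi
[ "$VIP_OWNER" = "$MASTER_FILE_IP" ] && VERDICT="consistent" || VERDICT="mismatch"
echo "VIP owner: $VIP_OWNER, master file: $MASTER_FILE_IP -> $VERDICT"
```

Here the VIP owner is 192.168.20.42 (Node2) while the master file says 192.168.20.41 (Node1), which is exactly the contradiction blocking the join.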

 

 

## 3 ##  Resolutions/workarounds/fixes:

 

  • Changed ctdb_nodes_0.master to set Node2, which owns the NAS VIP, as NAS master
  • Removed the stale file ctdb_nodes_0.join-progress
  • Restarted services via the SN-NAS CLI ">system restart services all" on Node2, then Node1
  • Joined Node2 to the cluster, then Node1 [from the master node (Node2)]
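The file-level part of these steps can be sketched as follows (shown against a temporary copy, not the live ctdb directory; the IPs are the ones from this case):

```shell
# Stand-in for /stornext/XSAN_VOL/.StorNext/.snnas/ctdb with the broken state:
CTDB_DIR=$(mktemp -d)
echo "192.168.20.41" > "$CTDB_DIR/ctdb_nodes_0.master"        # wrong master
echo "192.168.20.42" > "$CTDB_DIR/ctdb_nodes_0.join-progress" # stale marker

# Step 1: record the VIP owner (Node2) as NAS master
echo "192.168.20.42" > "$CTDB_DIR/ctdb_nodes_0.master"
# Step 2: remove the stale in-progress file so the join loop cannot resume
rm -f "$CTDB_DIR/ctdb_nodes_0.join-progress"

MASTER=$(cat "$CTDB_DIR/ctdb_nodes_0.master")
STALE_GONE=$([ -e "$CTDB_DIR/ctdb_nodes_0.join-progress" ] && echo no || echo yes)
echo "master=$MASTER stale_removed=$STALE_GONE"
rm -rf "$CTDB_DIR"

# Steps 3-4 then use the SN-NAS CLI as in the transcript that follows:
#   >system restart services all              (on Node2, then Node1)
#   >nascluster join /stornext/XSAN_VOL <ip>  (run from the master node)
```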

 

 

BHM:qtm-node2> nascluster join /stornext/XSAN_VOL 192.168.20.42

 

Preparing for NAS cluster join as ID 192.168.20.42

Applying NAS cluster join settings ...

Updating system NAS cluster configuration ...

Check for master takeover ...

Publish master configuration ...

Sending ads auth-config sync to 192.168.20.41 ...

Node 192.168.20.41 not joined, skipping auth-config sync.

Broadcasting share config-sync to NAS cluster ...

Broadcasting data analytic config-sync to NAS cluster ...

Cluster verification for 192.168.20.42 in-progress ...

Node state: pnn:0 192.168.20.42    UNHEALTHY (THIS NODE), waiting ...

Node state: pnn:0 192.168.20.42    OK (THIS NODE)

Cluster verification of 192.168.20.42 successful ...

Join to ERT2000.GR starting ...

Verify now joined to ERT2000.GR ...

Restart SMB services to join with ERT2000.GR ...

Successfully joined NAS cluster

 

 

BHM:qtm-node2> nascluster join /stornext/XSAN_VOL 192.168.20.41

 

Proxy join to 192.168.20.41 ...

[192.168.20.41]:

[192.168.20.41]: Preparing for NAS cluster join as ID 192.168.20.41

[192.168.20.41]: Waiting for NAS cluster join in-progress from ID 192.168.20.41 ...

[192.168.20.41]: Verifying local configuration with master 192.168.20.42 ...

[192.168.20.41]: Synchronization of local configuration with master 192.168.20.42 starting...

[192.168.20.41]: Applying ads auth config sync settings ...

[192.168.20.41]: Applying ads configuration settings ...

[192.168.20.41]: Checking SMB interface list: lo 192.168.20.41

[192.168.20.41]: Checking SMB interface 'p1p1:192.168.20.41' status ...

[192.168.20.41]: Join to ERT2000.GR starting ...

[192.168.20.41]: Verify now joined to ERT2000.GR ...

[192.168.20.41]: Restart SMB services to join with ERT2000.GR ...

[192.168.20.41]: Applying NAS cluster join settings ...

[192.168.20.41]: Updating system NAS cluster configuration ...

[192.168.20.41]: Verifying local configuration with master 192.168.20.42 ...

[192.168.20.41]: Cluster verification for 192.168.20.41 in-progress ...

[192.168.20.41]: Node state: pnn:1 192.168.20.41    UNHEALTHY (THIS NODE), waiting ...

[192.168.20.41]: Node state: pnn:1 192.168.20.41    OK (THIS NODE)

[192.168.20.41]: Cluster verification of 192.168.20.41 successful ...

[192.168.20.41]: Join to ERT2000.GR starting ...

[192.168.20.41]: Verify now joined to ERT2000.GR ...

[192.168.20.41]: Restart SMB services to join with ERT2000.GR ...

Successfully joined NAS cluster

 

 

What we learned from this case:

  • You must upgrade the master node before upgrading the non-master nodes.
  • Print the NAS registry to find <snfs-root-where-cluster-info-stored>.


This page was generated by the BrainKeeper Enterprise Wiki, © 2018