GUI: Out of Disk Space - "System Busy, try again later or contact tech support" (DRAFT)
SR Information: SR 1648848
Found on Product / Software Version: DXi6802 running 2.2.0.1_68. This may also apply to other versions and platforms if the DXi allows disk usage to reach 100%.
Problem Description: After the customer ran out of disk space, the web GUI began issuing the message "system busy, try again later or contact tech support".
This article covers one specific root cause for this message.
If the description below doesn't match your problem, we encourage you to do further analysis, and if needed, request assistance from your senior engineer or backline engineer.
In this case, when the customer attempted to navigate the replication pages of the web GUI, they received the following message: "system busy, try again later or contact tech support."
In a case like this, to identify the problem:
1. Look at tsunami.log. In this case, it contains messages indicating that share information is missing from the replication configuration file:
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share EP-CV-DXI-02 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: EP-CV-DXI-02
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share R-DXI-CACYIMV-1 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: R-DXI-CACYIMV-1
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share R-DXI-COBAES-1 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: R-DXI-COBAES-1
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share R-DXI-EGBAPS-1 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: R-DXI-EGBAPS-1
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share R-DXI-USMIPS-1 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: R-DXI-USMIPS-1
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share WP-CV-DXI-02 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: WP-CV-DXI-02
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share WP-WGMS does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: WP-WGMS
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share local does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: local
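The missing share names can be pulled out of these errors in one pass. The sketch below is illustrative, not an official DXi tool: the here-doc holds two sample lines from this case, and on a live system you would point LOG at the tsunami.log from the collect instead.

```shell
# Sketch: list the unique share names from the "does not exist" errors.
# The here-doc stands in for the real tsunami.log (path varies by collect).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share EP-CV-DXI-02 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share local does not exist in the replication configuration.
EOF
# Extract the name between "Share " and " does not exist", deduplicated.
missing=$(sed -n 's/.*Share \(.*\) does not exist.*/\1/p' "$LOG" | LC_ALL=C sort -u)
echo "$missing"
```

The resulting list is what you will compare against the replication configuration file in the next step.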
2. Confirm that the replication configuration file does not contain the shares reported in tsunami.log. You may find that the file is empty or has missing data. In this case, the share data was missing:
###08:31:53### -Replication- 'cat /data/hurricane/replication.conf':
[Global_Global]
CvfsMountPoint=/snfs/Q/
DedupWindowDuration=0
DedupWindowScheduledTimesForSystem=
EncryptionEnabled=false
EncryptionMethod=0
IsReplicationForSystemEnabled=false
ProgramReplicationIsPaused=false
QbfsMountPoint=/Q/
ReplicationDestination=
ReplicationRevision=0
ReplicationScheduledTimesForSystem=
ReplicationSourceIP=
SourceHostList=
TargetNASReplicationSupported=false
TargetVTLReplicationSupported=false
UserReplicationIsPaused=false
Note: /data is a symbolic link to /snfs/common/data:
# ls -la /data
lrwxrwxrwx 1 root root 17 Jul 15 14:10 /data -> /snfs/common/data
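A quick way to confirm the damage is to count the [Share_N] sections in the file; a system that has NAS shares but zero sections matches the corruption described above. This is a sketch, not a supported check: the here-doc reproduces the damaged file, and on a live system CONF would be /data/hurricane/replication.conf.

```shell
# Sketch: count [Share_N] sections in replication.conf.
# The here-doc is a trimmed copy of the damaged file shown above.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
[Global_Global]
CvfsMountPoint=/snfs/Q/
ReplicationRevision=0
EOF
# grep -c prints 0 (and exits non-zero) when there are no matches.
count=$(grep -c '^\[Share_' "$CONF" || true)
echo "Share sections: $count"
```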
1. Check whether the system has an old collect that was stored before the problem started. In this case, we found an old collect that contained all of the shares reported above:
###12:18:40### -Replication- 'cat /data/hurricane/replication.conf':
[Global_Global]
CvfsMountPoint=/snfs/Q/
DedupWindowDuration=0
DedupWindowScheduledTimesForSystem=
EncryptionEnabled=false
EncryptionMethod=1
IsReplicationForSystemEnabled=false
ProgramReplicationIsPaused=false
QbfsMountPoint=/Q/
ReplicationDestination=172.24.249.3
ReplicationRevision=5
ReplicationScheduledTimesForSystem=
ReplicationSourceIP=10.50.249.3
SourceHostList=10.0.94.11,10.45.25.33,10.50.249.2,10.65.10.3,172.20.138.17,172.24.249.3
TargetNASReplicationSupported=true
TargetVTLReplicationSupported=true
UserReplicationIsPaused=false
[Share_10]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=WP-CV-DXI-02
ReplicationEnabled=true
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=1
TriggerReplicationId=WP-CV-DXI-02
TriggerReplicationLastSetting=1
[Share_11]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=EP-CV-DXI-02
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=0
TriggerReplicationId=EP-CV-DXI-02
TriggerReplicationLastSetting=0
[Share_12]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=R-DXI-COBAES-1
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=2
TriggerReplicationId=R-DXI-COBAES-1
TriggerReplicationLastSetting=2
[Share_13]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=local
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=0
TriggerReplicationId=
TriggerReplicationLastSetting=0
[Share_5]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=WP-WGMS
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=0
TriggerReplicationId=
TriggerReplicationLastSetting=0
[Share_6]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=R-DXI-CACYIMV-1
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=2
TriggerReplicationId=R-DXI-CACYIMV-1
TriggerReplicationLastSetting=2
[Share_7]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=R-DXI-EGBAPS-1
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=2
TriggerReplicationId=R-DXI-EGBAPS-1
TriggerReplicationLastSetting=2
[Share_9]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=R-DXI-USMIPS-1
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=2
TriggerReplicationId=R-DXI-USMIPS-1
TriggerReplicationLastSetting=2
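To cross-check the recovered collect against the errors, list the NodeName of every share it records and compare that list with the shares tsunami.log reports as missing. Again a sketch: the here-doc carries a trimmed sample of the file above, and on a live system CONF would point at the replication.conf extracted from the old collect.

```shell
# Sketch: list the share names (NodeName values) in a recovered
# replication.conf. The here-doc is a trimmed sample of the file above.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
[Share_10]
NodeName=WP-CV-DXI-02
[Share_11]
NodeName=EP-CV-DXI-02
[Share_13]
NodeName=local
EOF
# Print the value after "NodeName=", sorted for easy diffing.
shares=$(sed -n 's/^NodeName=//p' "$CONF" | LC_ALL=C sort)
echo "$shares"
```

Every share reported as missing in tsunami.log should appear in this list before you treat the old collect as a valid source for the recovered settings.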
2. Determine the date when the collect containing the information above was gathered. Confirm with the customer that the replication configuration has not been changed since then.
3. Move /var/DXi/processwatcher to /var/DXi/processwatcher.old (alternatively, you can remove the file):
# mv /var/DXi/processwatcher /var/DXi/processwatcher.old
4. Stop webguid:
# service webguid stop qnode1
5. Edit replication.conf, making sure that it contains the recovered settings.
6. At this point you can either reboot the DXi (if you do, notify the customer and arrange a maintenance window before rebooting), OR restart webguid and then re-create the processwatcher file:
# service webguid start qnode1
# touch /var/DXi/processwatcher
We encourage you to do further analysis to determine the root cause of the file corruption.
In this case (SR 1648848), the file was damaged because the DXi filled the disk to 100%, which caused replication and other services to start failing.
When this happens, replication tries to pause the service. To do that, it updates the replication.conf file. If the DXi has run out of disk space (100% in use, with other services trying to write to the same disk), replication.conf may become damaged, along with other files. This may lead to a smith or shutdown as services fail. To avoid all of these problems, ask the customer to stop any ingest (via backup or replication).
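As a preventive check, disk usage can be watched so the customer is warned before any filesystem reaches 100%. The sketch below flags filesystems at or above a threshold; the sample input stands in for real `df -P` output, and the mount points shown are illustrative, not taken from this SR.

```shell
# Illustrative sketch: flag filesystems at or above 90% usage.
# The sample variable stands in for `df -P` output on a real system.
sample='Filesystem 1024-blocks Used Available Capacity Mounted-on
/dev/sda1 1000 950 50 95% /snfs
/dev/sdb1 1000 100 900 10% /data2'
# Skip the header, strip the % sign, and print mounts over the threshold.
full=$(echo "$sample" | awk 'NR>1 { use=$5; sub(/%/,"",use); if (use+0 >= 90) print $6 }')
echo "$full"
```

On a live DXi you would pipe `df -P` directly into the awk filter instead of using the sample variable.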
This page was generated by the BrainKeeper Enterprise Wiki, © 2018