GUI: Out of Disk Space - "System Busy, try again later or contact tech support" (DRAFT)
SR Information: SR 1648848
Found on Product / Software Version: DXi6802 running 2.2.0.1_68. This may also apply to other versions and platforms if the DXi allows disk usage to reach 100%.
Problem Description: After the customer ran out of disk space, the web GUI began issuing the message "system busy, try again later or contact tech support".
This article covers one specific root cause for this message.
If the description below doesn't match your problem, we encourage you to do further analysis, and if needed, request assistance from your senior engineer or backline engineer.
In this case, when the customer attempted to navigate the replication pages of the web GUI, they received the following message: "system busy, try again later or contact tech support."
In a case like this, to identify the problem:
1. Look at tsunami.log. In this case, it contains messages indicating that share information is missing from the replication configuration file:
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share EP-CV-DXI-02 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: EP-CV-DXI-02
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share R-DXI-CACYIMV-1 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: R-DXI-CACYIMV-1
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share R-DXI-COBAES-1 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: R-DXI-COBAES-1
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share R-DXI-EGBAPS-1 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: R-DXI-EGBAPS-1
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share R-DXI-USMIPS-1 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: R-DXI-USMIPS-1
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share WP-CV-DXI-02 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: WP-CV-DXI-02
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share WP-WGMS does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: WP-WGMS
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share local does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - SystemReport SysReportUtil.cpp(2874) [webguid] NASGroup() - Failed to get replication enable/disable for share: local
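The missing share names can be pulled out of these errors in one pass. The sketch below is illustrative, not an official DXi tool: the here-doc holds two sample lines from this case, and on a live system you would point LOG at the tsunami.log from the collect instead.

```shell
# Sketch: list the unique share names from the "does not exist" errors.
# The here-doc stands in for the real tsunami.log (path varies by collect).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share EP-CV-DXI-02 does not exist in the replication configuration.
ERROR - 12/09/13-03:39:04 - unknown:8000 ReplicationAPI.cpp(16476) [webguid] validateExists() - Share local does not exist in the replication configuration.
EOF
# Extract the name between "Share " and " does not exist", deduplicated.
missing=$(sed -n 's/.*Share \(.*\) does not exist.*/\1/p' "$LOG" | LC_ALL=C sort -u)
echo "$missing"
```

The resulting list is what you will compare against the replication configuration file in the next step.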
2. Confirm that the replication configuration file does not contain the shares reported in tsunami.log. You may find that the file is empty or has missing data. In this case, the share data was missing:
###08:31:53### -Replication- 'cat /data/hurricane/replication.conf':
[Global_Global]
CvfsMountPoint=/snfs/Q/
DedupWindowDuration=0
DedupWindowScheduledTimesForSystem=
EncryptionEnabled=false
EncryptionMethod=0
IsReplicationForSystemEnabled=false
ProgramReplicationIsPaused=false
QbfsMountPoint=/Q/
ReplicationDestination=
ReplicationRevision=0
ReplicationScheduledTimesForSystem=
ReplicationSourceIP=
SourceHostList=
TargetNASReplicationSupported=false
TargetVTLReplicationSupported=false
UserReplicationIsPaused=false
Note: /data is a symbolic link to /snfs/common/data:
# ls -la /data
lrwxrwxrwx 1 root root 17 Jul 15 14:10 /data -> /snfs/common/data
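A quick way to confirm the damage is to count the [Share_N] sections in the file; a system that has NAS shares but zero sections matches the corruption described above. This is a sketch, not a supported check: the here-doc reproduces the damaged file, and on a live system CONF would be /data/hurricane/replication.conf.

```shell
# Sketch: count [Share_N] sections in replication.conf.
# The here-doc is a trimmed copy of the damaged file shown above.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
[Global_Global]
CvfsMountPoint=/snfs/Q/
ReplicationRevision=0
EOF
# grep -c prints 0 (and exits non-zero) when there are no matches.
count=$(grep -c '^\[Share_' "$CONF" || true)
echo "Share sections: $count"
```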
1. Check whether the system has an old collect that was stored before the problem started. In this case, we found an old collect that contained all of the shares reported above:
###12:18:40### -Replication- 'cat /data/hurricane/replication.conf':
[Global_Global]
CvfsMountPoint=/snfs/Q/
DedupWindowDuration=0
DedupWindowScheduledTimesForSystem=
EncryptionEnabled=false
EncryptionMethod=1
IsReplicationForSystemEnabled=false
ProgramReplicationIsPaused=false
QbfsMountPoint=/Q/
ReplicationDestination=172.24.249.3
ReplicationRevision=5
ReplicationScheduledTimesForSystem=
ReplicationSourceIP=10.50.249.3
SourceHostList=10.0.94.11,10.45.25.33,10.50.249.2,10.65.10.3,172.20.138.17,172.24.249.3
TargetNASReplicationSupported=true
TargetVTLReplicationSupported=true
UserReplicationIsPaused=false
[Share_10]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=WP-CV-DXI-02
ReplicationEnabled=true
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=1
TriggerReplicationId=WP-CV-DXI-02
TriggerReplicationLastSetting=1
[Share_11]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=EP-CV-DXI-02
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=0
TriggerReplicationId=EP-CV-DXI-02
TriggerReplicationLastSetting=0
[Share_12]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=R-DXI-COBAES-1
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=2
TriggerReplicationId=R-DXI-COBAES-1
TriggerReplicationLastSetting=2
[Share_13]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=local
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=0
TriggerReplicationId=
TriggerReplicationLastSetting=0
[Share_5]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=WP-WGMS
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=0
TriggerReplicationId=
TriggerReplicationLastSetting=0
[Share_6]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=R-DXI-CACYIMV-1
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=2
TriggerReplicationId=R-DXI-CACYIMV-1
TriggerReplicationLastSetting=2
[Share_7]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=R-DXI-EGBAPS-1
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=2
TriggerReplicationId=R-DXI-EGBAPS-1
TriggerReplicationLastSetting=2
[Share_9]
DedupAge=60
DedupEnabled=true
DedupWindowDuration=0
DedupWindowScheduledTimes=
NodeId=
NodeName=R-DXI-USMIPS-1
ReplicationEnabled=false
ReplicationInterval=
ReplicationRole=3
ReplicationScheduledTimes=
RetentionAge=120
ShareType=1
TriggerReplication=2
TriggerReplicationId=R-DXI-USMIPS-1
TriggerReplicationLastSetting=2
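To cross-check the recovered collect against the errors, list the NodeName of every share it records and compare that list with the shares tsunami.log reports as missing. Again a sketch: the here-doc carries a trimmed sample of the file above, and on a live system CONF would point at the replication.conf extracted from the old collect.

```shell
# Sketch: list the share names (NodeName values) in a recovered
# replication.conf. The here-doc is a trimmed sample of the file above.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
[Share_10]
NodeName=WP-CV-DXI-02
[Share_11]
NodeName=EP-CV-DXI-02
[Share_13]
NodeName=local
EOF
# Print the value after "NodeName=", sorted for easy diffing.
shares=$(sed -n 's/^NodeName=//p' "$CONF" | LC_ALL=C sort)
echo "$shares"
```

Every share reported as missing in tsunami.log should appear in this list before you treat the old collect as a valid source for the recovered settings.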
2. Determine the date when the collect containing the information above was gathered. Confirm with the customer that the replication configuration has not been changed since then.
3. Move /var/DXi/processwatcher to /var/DXi/processwatcher.old (alternatively, you can remove the file):
# mv /var/DXi/processwatcher /var/DXi/processwatcher.old
4. Stop webguid:
# service webguid stop qnode1
5. Edit replication.conf, making sure that it contains the recovered settings.
6. At this point you can either reboot the DXi (if you do, notify the customer and arrange a maintenance window before rebooting), OR restart webguid and then re-create the processwatcher file:
# service webguid start qnode1
# touch /var/DXi/processwatcher
We encourage you to do further analysis to determine the root cause of the file corruption.
In this case (SR 1648848), the file was damaged because the DXi filled the disk to 100%, which caused replication and other services to start failing.
When this happens, replication tries to pause the service. To do that, it updates the replication.conf file. If the DXi has run out of disk space (100% in use, with other services trying to write to the same disk), replication.conf may become damaged, along with other files. This may lead to a smith or shutdown as services fail. To avoid all of these problems, ask the customer to stop any ingest (via backup or replication).
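As a preventive check, disk usage can be watched so the customer is warned before any filesystem reaches 100%. The sketch below flags filesystems at or above a threshold; the sample input stands in for real `df -P` output, and the mount points shown are illustrative, not taken from this SR.

```shell
# Illustrative sketch: flag filesystems at or above 90% usage.
# The sample variable stands in for `df -P` output on a real system.
sample='Filesystem 1024-blocks Used Available Capacity Mounted-on
/dev/sda1 1000 950 50 95% /snfs
/dev/sdb1 1000 100 900 10% /data2'
# Skip the header, strip the % sign, and print mounts over the threshold.
full=$(echo "$sample" | awk 'NR>1 { use=$5; sub(/%/,"",use); if (use+0 >= 90) print $6 }')
echo "$full"
```

On a live DXi you would pipe `df -P` directly into the awk filter instead of using the sample variable.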
This page was generated by the BrainKeeper Enterprise Wiki, © 2018