Troubleshooting
Known Issues

ISSUE: |
Inconsistent GUI view on scale out system |
Platforms affected: |
All |
Triggering condition: | Scaling out and connecting to the GUI that’s running on one of the scale out nodes. |
Details: |
Some scale-out nodes also host a GUI service. If you compare its view with the view on the base system (the first system that was installed), you will notice differences, for instance (among others):
|
Customer Impact: | Customer will see inconsistent results, which is confusing. |
Rate of occurrence: | Applies to all scale-out deployments. |
Workaround: | Connect only to a GUI service that is running on the base system. It has a complete view of the deployment, including information and metrics for all scale-out machines. |
ISSUE: |
Object transaction rate graphs reset after ActiveScale 6.5.0 Upgrade. |
Platforms affected: |
All |
Triggering condition: |
Upgrade to ActiveScale 6.5.0 |
Details: |
After upgrading to ActiveScale 6.5, the object transaction rate graphs (Transaction Read and Transaction Write) in the GUI will be reset and display new data. |
Customer Impact: |
Low. Historical data for these graphs will not be presented. The graph will only show the new data. |
Rate of occurrence: |
High |
Workaround: |
Contact Support if this is problematic. |

ISSUE: | The system marks a reseated OS drive as NOT_FOUND. If it remains in this state for longer than the backoff interval (30 minutes), the system decommissions it. |
Platforms affected: |
All |
Triggering condition: |
Reseating (removing and reinserting) an OS drive |
Details: |
None |
Customer impact: |
The OS remains active and responsive, so there is no impact there. Its mirror restores itself automatically using one of the two spares; RAID loses one of its two spares, but this is not a major problem. If the disk contained a generic partition, metastore applications might not be running or might be unable to start, which can cause "Application down" and "Metastore down" events. There is no impact on the data path if only one metastore is down. |
Rate of occurrence: |
Low |
Workaround: |
Contact Support |
ISSUE: | GUI shows all Power (PDU) metrics in 1 rack-tab per datacenter |
Platforms affected: |
X100 |
Triggering condition: |
The system has SW version 5.7.X, 6.0, 6.0.1, or 6.0.2 and has been scaled out at least once. The trigger can therefore be either upgrading a scaled-out system to 5.7.X, 6.0, 6.0.1, or 6.0.2, or scaling out a system that is already on one of these SW versions. |
Details: |
Visible when you open the GUI and navigate to the System Overview > Power tab. For each datacenter you can select the rack whose power metrics you want to display, but only one rack-tab per datacenter actually shows metrics data; all other rack-tabs of that datacenter show “No data”. Closer inspection of the legend of a graph that does show metrics reveals that the graph displays the power metrics not only for the selected rack but for all racks in that datacenter: the PDUs of all racks in the datacenter are listed in that legend. |
Customer impact: |
Some power (PDU) metrics are displayed in the wrong graph in the GUI. Some power metrics tabs show “No data”. |
Rate of occurrence: |
Always when the system is on SW version 5.7.X, 6.0, 6.0.1, or 6.0.2 and has been scaled out at least once. |
Workaround: |
Click all rack-tabs and determine which ones display actual metrics. The metrics for all PDUs in the system can be monitored from those graphs. |
ISSUE: | A slow blockstore drive can lead to DSS_STORAGEPOOL_UNVERIFIED_OBJECTS event. |
Platforms affected: |
All |
Triggering condition: |
One or more blockstore drives become very slow. |
Details: |
When one or more blockstore drives become very slow, it is possible that DSS_STORAGEPOOL_UNVERIFIED_OBJECTS events are raised. During the verification process all the data is read in, and this can be slowed down significantly when there is an underperforming drive. |
Customer impact: |
DSS_STORAGEPOOL_UNVERIFIED_OBJECTS events. |
Rate of occurrence: |
Very Low |
Workaround: |
If the events don't disappear automatically after a while, it's best to manually decommission the drive. |
ISSUE: | If a replacement component has older firmware on it, the system doesn’t upgrade its firmware and the replacement instructions don’t include a step for upgrading firmware manually. |
Platforms affected: |
All |
Triggering condition: |
Replacing a component with one that has older firmware on it. |
Details: |
The |
Customer impact: |
Unknown |
Rate of occurrence: |
High |
Workaround: |
You have two choices: contact Support after any hardware replacement, or wait for a system-wide firmware upgrade when you upgrade ActiveScale OS. |

No Known Issues

ISSUE: | NFS Clients may see "Permission Denied" during Virtual IP failover, when using NFS Export with Kerberos Authentication |
Platforms affected: |
All |
Triggering condition: |
Virtual IP Failover |
Details: |
When Kerberos is used as the authentication mechanism for the file interface, the client might see a "Permission Denied" error during VIP failover. After a successful failover, NFS operations return to an operational state automatically. |
Customer impact: |
None |
Rate of occurrence: |
High |
Workaround: |
After a successful failover, NFS operations return to an operational state automatically. No workaround is required. |

ISSUE: | Systems with replication enabled may transfer zero-length files. |
Platforms affected: |
All |
Triggering condition: |
A replicated object is deleted in a versioned bucket. |
Details: |
When versioning is enabled and an object is deleted, that object's DELETE operation may not be stored in the correct order in an AWS destination bucket. You may notice a DELETE marker with a size of 0 bytes. |
Customer impact: |
Risk of retrieving the wrong version of an object from an AWS destination bucket. |
Rate of occurrence: |
Depends on user setup. |
Workaround: |
When retrieving replicated objects from an AWS destination bucket with versioning enabled, always identify the correct version of the object by checking its |
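As an illustrative sketch only (not the official procedure), the versions and delete markers of a replicated key in the AWS destination bucket can be inspected with the AWS CLI; the bucket and key names below are placeholders:

    # List all versions and delete markers for a key (bucket and key are examples)
    aws s3api list-object-versions --bucket my-destination-bucket --prefix path/to/object
    # Retrieve a specific version once the correct VersionId has been identified
    aws s3api get-object --bucket my-destination-bucket --key path/to/object --version-id <VersionId> object.out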

No Known Issues

ISSUE: |
When a KVM is connected, USB errors can be seen on the console and SYSLOG_ERROR events can be generated |
Platforms affected: |
P100 system node (Quanta D52L-1U) |
Triggering condition: |
When a remote console (KVM) is connected to the server |
Details: |
When a remote console (KVM) is connected, USB errors might be shown on the console, and the system might generate SYSLOG_ERROR events for them. These are harmless and can be ignored. |
Customer impact: |
The system will generate a SYSLOG_ERROR which can be ignored |
Rate of occurrence: |
Only when a remote console (KVM) is connected |
Workaround: |
Disconnect the remote console (KVM) |
ISSUE: | The slot of a DIMM with memory errors cannot be located |
Platforms affected: |
X200, P200, P100 D52L Quanta EOL Replacement. |
Triggering condition: |
Memory errors on the RAM DIMMs of the system. |
Details: |
A MEMORY_ERRORS event in the GUI whose details contain the following ambiguous message: Silk Screen 'Unknown', Socket None, Channel None, DIMM None |
Customer impact: |
Ambiguous event. Can’t determine which faulty memory DIMM to replace. |
Rate of occurrence: |
Always when there’s a memory error on a RAM DIMM. |
Workaround: |
Contact support |
ISSUE: | Severely reduced GET performance on systems originally installed with ActiveScale OS 5.0. |
Platforms affected: |
Systems originally installed with ActiveScale OS 5.0 |
Triggering condition: |
Upgrade to ActiveScale OS 5.7. |
Details: | The small object aggregation feature (introduced in ActiveScale OS 5.7) does not interact well with the erasure codes used on these systems, resulting in a severely reduced GET performance of objects written post-upgrade. |
Customer Impact: | Severely reduced GET performance. |
Rate of occurrence: | Always for these systems. |
Workaround: | A support procedure is in place for changing the erasure code in use (to be executed directly post upgrade), which resolves the issue. Please contact Quantum Support when planning to upgrade such a system. |

ISSUE: | COREDUMPS_FOUND events with signature “/var/crash/kdump_lock: empty” during upgrade to AS 6.3. |
Platforms affected: | All |
Triggering condition: | Upgrade to 6.3 |
Details: | It is possible that you get emails for COREDUMPS_FOUND events that occurred during the upgrade to 6.3. These events will also be visible in the ActiveScale GUI. The signature of the event is “/var/crash/kdump_lock: empty”. These events are not harmful and can be ignored when they occur during the upgrade. |
Customer Impact: | Customer receives 1 or more COREDUMPS_FOUND events. |
Rate of occurrence: | Only during upgrade. They might also not occur at all. |
Workaround: | These events gradually auto-resolve as the upgrade progresses and possibly for some time after the upgrade has finished. After a while they should no longer occur. |
ISSUE: | Metrics graphs can show spikes, drops, gaps or missing data during upgrade to 6.1 |
Platforms affected: |
All |
Triggering condition: |
Upgrade to 6.1 |
Details: |
It is expected to see this behavior in metrics graphs in the ActiveScale GUI. These graphs will return to normal at the end of the upgrade. |
Customer impact: |
Distorted or incomplete metric graphs during upgrade |
Rate of occurrence: |
High - Always during upgrade to 6.1 |
Workaround: |
No action required as long as the system is upgrading. Contact Quantum Support if the metrics graphs don’t return to normal after upgrade. |
ISSUE: | External Prometheus and Grafana show incomplete metrics after upgrade to 6.1 |
Platforms affected: |
All |
Triggering condition: |
Upgrade to 6.1 of a scaled out deployment. Customer uses external Prometheus and Grafana to collect and view ActiveScale metrics. |
Details: |
It is expected that upgrading scaled-out deployments to 6.1 causes metrics that are scraped directly from ActiveScale to be incomplete. Visual impact includes graphs suddenly dropping or spiking, and dashboards suddenly displaying fewer metrics. Historical metrics are not impacted, only newly collected metrics. A configuration change in the external Prometheus and/or Grafana is required, otherwise the graphs will not recover. |
Customer impact: |
ActiveScale metrics newly collected by an external Prometheus are incomplete and appear incorrect after the upgrade. |
Rate of occurrence: |
Always if triggering conditions are met. |
Workaround: |
Contact Quantum Support to update the configuration of the external Prometheus and Grafana. |
ISSUE: | After replacement of a failed IOM, disks can become and remain in NOTCONFIGURED status |
Platforms affected: |
X100/X200 or other systems with JBOD only |
Triggering condition: |
IOM failure and replacement |
Details: |
An IOM failure causes all disks in the JBOD to become NOTFOUND. Disk safety can drop due to the high number of missing disks. Once the IOM is replaced, all NOTFOUND disks become NOTCONFIGURED. |
Customer impact: |
During the IOM failure, a DISK_NOTFOUND event is seen, and disk safety can be lower due to the high number of missing disks. After the IOM replacement there is no data path or functional impact, but the customer will see a DISK_NOT_CONFIGURED event for each disk 4 hours after the replacement. |
Rate of occurrence: |
Low |
Workaround: |
Contact Support |
ISSUE: | Services (Arakoon, dss, scaler, and so on) might remain down after the system comes back online after a power failure. |
Platforms affected: |
All |
Triggering condition: |
A power failure happens while the system is running a task that involves machine reconfiguration (such as a disk replacement or network reconfiguration). |
Details: |
|
Customer impact: |
None, because services are still running on some nodes. |
Rate of occurrence: |
Low |
Workaround: |
Contact Support |

ISSUE: | The mount command fails if you have exported the same top level directory twice. |
Platforms affected: |
All |
Triggering condition: |
You created multiple NFS exports of the same top level directory, with different permissions or access types. |
Details: |
None |
Customer impact: |
None |
Rate of occurrence: |
Varies |
Workaround: |
If you export the same top level directory more than once with different permissions, specify a unique --tag value for each export. To mount the export, use its tag without the leading slash rather than its export name. |
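For illustration, a minimal sketch of mounting such an export by its tag, assuming a hypothetical export tagged data_rw on a server named nfs.example.com (the exact mount options, including the NFS version, depend on your environment):

    # The export was created with a unique tag, e.g. --tag data_rw (hypothetical)
    # Mount by the tag (no leading slash) instead of the export path
    mount -t nfs nfs.example.com:data_rw /mnt/data_rw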
ISSUE: | Closing and/or flushing a file may fail. |
Platforms affected: |
All |
Triggering condition: |
A file is closed on the client side. |
Details: |
Most applications do not check for errors when a file is closed. When a file is closed on the client side, the UDA service performs a data flush: any data that has not yet been written to stable storage is written at that time, as a precondition for the file close operation to succeed. This data flush can fail, which in turn makes the file close operation return an error. If the client does not check for that error, which is very common in many applications, it can appear that the file close and the data flush succeeded when they actually did not. |
Customer impact: |
You need to check for errors when you close a file. |
Rate of occurrence: |
High |
Workaround: |
If you have control over how the application behaves in this situation, add an error check. If you don’t have control, it might be a good idea to verify the checksum. |
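If the application cannot be changed, a generic post-write verification along these lines can serve as a checksum check; the paths below are placeholders, and this sketch assumes a reasonably recent coreutils sync that accepts file arguments:

    # Copy the file to the NFS mount and force it to stable storage, failing loudly on error
    cp ./source.dat /mnt/export/dest.dat && sync /mnt/export/dest.dat || echo "write or flush failed"
    # Compare checksums of the local source and the stored copy
    md5sum ./source.dat /mnt/export/dest.dat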
ISSUE: | Deleting a file when the metadata store is full fails. |
Platforms affected: | All |
Triggering condition: | Deleting a file when the metadata store is full. |
Details: | You cannot delete files when the metadata store is full. |
Customer impact: | You will get an error if you try to delete a file when the metadata store is full. |
Rate of occurrence: | Low |
Workaround: | Contact Support |
General Troubleshooting

Problem |
Recommended Action |
---|---|
You cannot access ActiveScale SM. |
|
Upon rebooting or shutting down a node through ActiveScale SM, the connection to the ActiveScale SM is lost. |
First try to reconnect through the browser after reloading the page. If that fails:
|
ActiveScale SM does not display all elements. |
Use any of the supported browsers. ActiveScale SM is compatible with Chrome v67-69 and Firefox v61-63. Desktop versions only. |
When you try to change a DNS server, ActiveScale SM responds with the error "Failure on setting DNS." |
Refresh the ActiveScale SM page. |
ActiveScale SM does not display any PDUs. |
Customer-supplied PDUs may or may not be visible, manageable, or monitorable in ActiveScale SM. ActiveScale SM only displays PDUs it recognizes. If your system has PDUs that ActiveScale OS does not recognize, it monitors PSUs instead and raises alerts whenever it detects a loss of power; in that case you must pay close attention to the PSUs on other components and take action whenever the system raises alerts about loss of power. ActiveScale OS supports single-phase PDUs but does not manage them; in other words, single-phase PDUs are not visible in ActiveScale SM. |
When you shut down a site, the Resources page still displays that site's nodes as ONLINE even though they are stopped. |
Ignore the status ActiveScale SM shows for nodes in a shutdown site. |
When you choose the TCP with TLS protocol for syslog streaming, and upload a CA cert that is not self-signed by the server, ActiveScale SM returns a "Connection refused" error. |
Upload a CA cert that's self-signed by the server if you choose the TCP with TLS protocol for the log stream. |
ActiveScale SM's Resources page doesn't automatically refresh when you add a System Expansion Node. |

Problem |
Recommended Action |
---|---|
A node is halted or hung, or unreachable after a reboot. |
Power the node off and then on again. If the problem still exists, contact Quantum Support. |
A disk is missing. |
ActiveScale OS marks a disk as NOTFOUND in the following cases:
In both cases, the disk is eventually decommissioned automatically. ActiveScale OS checks how many other disks are NOTFOUND or DEGRADED; if more than 2 disks per node are in NOTFOUND or DEGRADED status, ActiveScale OS waits an additional 12 hours before decommissioning the disk. If you replaced a disk before the original disk was fully DECOMMISSIONED, the replacement disk is marked as NOTUSED; contact Quantum Support in this case. If there is a hardware problem, you can try rebooting the node, which might resolve a temporary problem with the disk. If this does not resolve the problem, contact Quantum Support. |
A disk has IO errors. |
Replace the disk after ActiveScale SM shows its status as DECOMMISSIONED. |
The hot-swapped disks are being ignored. |
Check the Events page in ActiveScale SM for errors that indicate:
|
There are ECC memory errors. |
Contact Quantum Support to replace the DIMM in the node that has the error. |
There are fan or temperature warnings. |
Contact Quantum Support to replace the fan in the node that has the error. For more information, see the ActiveScale P100/ActiveScale X100 Support Guide. |

Problem |
Recommended Action |
---|---|
A health check job has errors. |
Evaluate the job's output to determine if there are obvious failures, such as:
Otherwise, contact Quantum Support. |
A health check job is stuck. |
Contact Quantum Support. |

Problem |
Recommended Action |
---|---|
Log collection fails. |
Ensure that the account being used to access the system bucket has been granted read-only access. |
Log collection succeeds but log files are missing. |
|
Running a log collection job manually during an ActiveScale OS upgrade results in logs being put into the system bucket, but the log collection job never finishes. |
When the upgrade is done, retry the log collection job from the Jobs pane. Best practice is not to start a log collection job while an ActiveScale OS upgrade job is running. |
When you try to download system logs, the system reports that the downloadable bundle is empty or is smaller than it really is. |
If you select a huge number of log files, or if there are connectivity issues with the Column, the system might report that the downloadable bundle is empty or is smaller than it really is. If this happens, use the pull-down menus to select a smaller subset of logs and retry the download. |

Problem |
Recommended Action |
---|---|
You cannot communicate with the ActiveScale system. Instead, you see a network error (No route to host) in your client application. |
|
You received a status 403 Forbidden for an S3 API call. |
If you believe all these settings are good, please escalate to Quantum Support. |
The Update Network Configuration job fails. |
Contact Quantum Support to revert the network settings back to their previous values. |

Problem |
Recommended Action |
---|---|
After rebooting the rack through ActiveScale SM, a node is stuck in REBOOTING status. |
Reboot the node through ActiveScale SM:
|

Problem |
Recommended Action |
---|---|
The system raises events indicating that some objects are below the expected disk safety policy. |
This can occur when the system is nearing capacity. The system still allows ingest but writes objects with suboptimal data safety. Objects written with lower safety are repaired to full safety within 24 hours, even if the system is full, because repair is still allowed to write to READONLY blockstores. |

Problem |
Recommended Action |
---|---|
The system failed to upload a tarball. |
Check the Jobs page to see if an upgrade is already in progress. Do not attempt to upload a new tarball while a system upgrade is in progress. |
ActiveScale OS upgrade sometimes fails due to a network connection error. |
Use either of these workarounds:
|
A software upgrade failed. |
This can occur if there are anomalies in the system. Try the following:
|
A software upgrade on a system with offline nodes failed. |
Work with Quantum Support to bring the nodes back online and then retry the upgrade job. |
The system generates WARNING or CRITICAL events related to "Elasticsearch Cluster Health" during upgrades. |
Sample event: "...[CRITICAL][ELASTICSEARCH_CLUSTER_HEALTH_STATUS_CRITICAL]Metrics database: critical health status." You can safely ignore these events when they occur during an upgrade. |

Basic sanity checks before troubleshooting:
- You must have a working Windows 2016 / Windows 2012 R2 Active Directory (AD) based Kerberos setup with a key distribution center (KDC).
- It is recommended to use a secure Active Directory that is configured to use LDAP over SSL/TLS.
- AD, ActiveScale, and the NFS client should ideally be time-synced; the maximum allowed skew is 300 s. Quantum recommends using NTP. This is necessary to prevent Kerberos authentication failures due to time skew (see the sample checks after this list).
- ActiveScale and NFS clients must be properly configured to use DNS for correct name resolution; only NFS clients that are registered with DNS are supported (see below).
- The AD server, ActiveScale, and the NFS clients should be able to reach each other.
- It is recommended to log in to the NFS client using Kerberos authentication.
- Make sure the required encryption types are enabled on the KDC.
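The time-sync and DNS prerequisites above can be spot-checked from a Linux NFS client with commands along the following lines; the hostnames are placeholders and the available tools depend on your distribution:

    # Check that the clock is synchronized (chrony shown; 'timedatectl status' also works on systemd systems)
    chronyc tracking
    # Verify name resolution for the client itself, the NFS server, and the AD server
    host $(hostname -f)
    host nfs.example.com
    host ad.example.com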
Problem |
Recommended Action |
---|---|
On the client console, you see the following message: kinit: Preauthentication failed while getting initial credentials. |
The password is incorrect or the keytab file is incorrect. |
When testing the keytab created for a UNIX server using kinit, you get the error Clock skew too great while getting initial credentials. |
You must keep clocks synchronized when using Kerberos. Use an NTP service to synchronize time between the various systems involved. |
When testing the keytab created for a UNIX server using kinit, you get the error Preauthentication failed while getting initial credentials or Password incorrect while getting initial credentials. |
The key in the keytab file is incorrect. Make sure you generated the keytab file correctly, with the correct principal name, Active Directory user name, and path. |
Note: Pre-authentication failure can happen for a few reasons. Most commonly, either the password for the relevant account in Active Directory has changed since the keytab file was created, or the system clock is off by about 5 minutes or more from that of the Active Directory. A sample keytab check follows this note.
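A keytab can be sanity-checked from the client with standard MIT Kerberos tools; the keytab path and principal below are examples and must match your environment:

    # List the principals, key version numbers, and encryption types stored in the keytab
    klist -k -t -e /etc/krb5.keytab
    # Try to obtain a ticket using a key from the keytab (replace with a principal present in it)
    kinit -kt /etc/krb5.keytab nfs/client01.example.com@EXAMPLE.COM
    # Show the tickets that were obtained
    klist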
UDA Kerberos Encryption Related Issues
Problem |
Recommended action |
---|---|
On the client console, you see the following message: kinit: KDC has no support for encryption type while getting initial credentials. |
Make sure the encryption types in the keytab file that is uploaded to ActiveScale, or in the one that is installed on the NFS client, are compatible with those specified in AD for the corresponding service account. The encryption types used in a keytab file can be listed using ktutil, as shown after this table. |
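For example, the encryption types in a keytab can be listed with ktutil as follows; the keytab path is a placeholder:

    ktutil
    ktutil:  read_kt /etc/krb5.keytab
    ktutil:  list -e
    ktutil:  quit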
UDA Kerberos Permission Denied Error
Problem |
Recommended Action |
---|---|
On the client: the console reports Permission Denied or Access Denied; /var/log/syslog reports an EACCES (13) error. |
Check the mount point/directory permissions to verify that the NFS export grants access to the logged-in user. |
On the client: Console reports Permission denied /var/log/syslog reports rpc.gssd[423]: WARNING: can't create tcp rpc_clnt to server <NFS Server> for user with uid 10000: RPC: Remote system error - Connection timed out |
On the client, increase the rpc.gssd timeout for RPC connection creation with the server using the -T option, and restart the gssd service. |
On the client: Console reports Permission denied /var/log/syslog reports rpc.gssd[30631]: rpcsec_gss:gss_init_sec_context: (major) Unspecified GSS failure. Minor code may provide more information - (minor) Unknown code krb5 7 rpc.gssd[30631]: WARNING: Failed to create krb5 context for user with uid 0 with any credentials cache for server |
This might mean that a fully qualified domain name is not used for the client/server principal. Create and update the keytabs with the FQDN of the client/server, then restart the gssd service on the client and the nfsganesha service on the server. |
On the client: Console reports Permission denied /var/log/syslog reports rpc.gssd[31856]: ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host <nfsserver> rpc.gssd[31856]: ERROR: No credentials found for connection to server <nfsserver> |
Check whether the key is missing from /etc/krb5.keytab, or whether the /etc/krb5.keytab file itself is missing. |
On the client console you have logged on as a user other than root; you can mount successfully (for example, $ sudo mount /mnt/krb5) but cannot access the mount point. The console reports Permission Denied; /var/log/syslog reports rpc.gssd[30712]: ERROR: GSS-API: error in gss_acquire_cred(): Unspecified GSS failure. Minor code may provide more information. Can't find client principal <user1>@<REALM.COM> in cache collection rpc.gssd[30712]: WARNING: Failed to create krb5 context for user with uid 501 for server <NFS Server> |
Check whether kinit has been done for the user (see the example after this table). This issue might not occur when you log in to the client using Kerberos authentication. |
On the client: Console reports Permission denied /var/log/syslog reports rpc.gssd[xxxx] : Warning : failed to create krb5 context for user with uid = 500 for server <NFS Server> |
The user with uid=500 does not have a valid TGT. Run kinit for the user on the client. |
When the ActiveScale NFS server is accessed using a Virtual IP and a Virtual IP failover is initiated, the client might get a Permission denied error. |
Once the Virtual IP fails over to a different node, NFS operations work as expected. |
Unexpected Permission denied errors when running NFS operations. |
Check that the client user's ticket is valid and has not expired. |
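A typical check-and-fix sequence on the client looks like this; the user principal and realm are placeholders:

    # Show the current ticket cache; an expired or missing TGT causes Permission denied
    klist
    # Obtain (or renew) a TGT for the user, then confirm
    kinit user1@EXAMPLE.COM
    klist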
Stale File Handle Error
On the client console, you get Stale File Handle when trying to access the mount point |
Check whether the user has a valid Ticket Granting Ticket. If not, do a kinit to get one from the KDC. |