Troubleshooting
Known Issues

ISSUE: |
Inconsistent GUI view on scale out system |
Platforms affected: |
All |
Triggering condition: | Scaling out and connecting to the GUI that’s running on one of the scale out nodes. |
Details: |
Some scale-out nodes also host a GUI service. If you compare its view with the view on the base system (the first system that was installed), you will notice differences, for instance (among others):
|
Customer Impact: | Customer will see inconsistent results, which is confusing. |
Rate of occurrence: | Applies to all scale-out deployments. |
Workaround: | Connect only to a GUI service that is running on the base system. It has a complete view of the deployment, including information and metrics for all scale-out machines. |
ISSUE: |
Object transaction rate graphs reset after ActiveScale 6.5.0 Upgrade. |
Platforms affected: |
All |
Triggering condition: |
Upgrade to ActiveScale 6.5.0 |
Details: |
After upgrading to ActiveScale 6.5, the object transaction rate graphs (Transaction Read and Transaction Write) in the GUI will be reset and display new data. |
Customer Impact: |
Low. Historical data for these graphs will not be presented. The graph will only show the new data. |
Rate of occurrence: |
High |
Workaround: |
Contact Support if this is problematic. |

ISSUE: | The system marks a reseated OS drive as NOT_FOUND. If it remains in this state for longer than the backoff interval (30 minutes), the system decommissions it. |
Platforms affected: |
All |
Triggering condition: |
Reseating (removing and reinserting) an OS drive |
Details: |
None |
Customer impact: |
The OS remains active and responsive, so there is no impact there. Its mirror restores itself automatically using one of the two spares; RAID loses one of its two spares, but this is not a major problem. If the disk contained a generic partition, metastore applications might not be running or might be unable to start, which can cause "Application down" and "Metastore down" events. There is no impact on the data path if only one metastore is down. |
Rate of occurrence: |
Low |
Workaround: |
Contact Support |
ISSUE: | GUI shows all Power (PDU) metrics in 1 rack-tab per datacenter |
Platforms affected: |
X100 |
Triggering condition: |
The system has SW version 5.7.X, 6.0, 6.0.1, or 6.0.2 and has been scaled out at least once. The trigger can therefore be either upgrading a scaled-out system to 5.7.X, 6.0, 6.0.1, or 6.0.2, or scaling out a system that is already on one of these SW versions. |
Details: |
Visible when you open the GUI and navigate to the System Overview > Power tab. For each datacenter you can select the rack whose power metrics you want to display, but only one rack-tab per datacenter actually shows metrics data; all other rack-tabs of that datacenter show “No data”. Closer inspection of the legend of a graph that does show metrics reveals that the graph displays the power metrics not only for the selected rack but for all racks in that datacenter: the PDUs of all racks in the datacenter are listed in that legend. |
Customer impact: |
Some power (PDU) metrics are displayed in the wrong graph in the GUI. Some power metrics tabs show “No data”. |
Rate of occurrence: |
Always when the system is on SW version 5.7.X, 6.0, 6.0.1, or 6.0.2 and has been scaled out at least once. |
Workaround: |
Click all rack-tabs and determine which ones display actual metrics. The metrics for all PDUs in the system can be monitored from those graphs. |
ISSUE: | A slow blockstore drive can lead to DSS_STORAGEPOOL_UNVERIFIED_OBJECTS event. |
Platforms affected: |
All |
Triggering condition: |
One or more blockstore drives become very slow. |
Details: |
When one or more blockstore drives become very slow, it is possible that DSS_STORAGEPOOL_UNVERIFIED_OBJECTS events are raised. During the verification process all the data is read in, and this can be slowed down significantly when there is an underperforming drive. |
Customer impact: |
DSS_STORAGEPOOL_UNVERIFIED_OBJECTS events. |
Rate of occurrence: |
Very Low |
Workaround: |
If the events don't disappear automatically after a while, it's best to manually decommission the drive. |
ISSUE: | If a replacement component has older firmware on it, the system doesn’t upgrade its firmware and the replacement instructions don’t include a step for upgrading firmware manually. |
Platforms affected: |
All |
Triggering condition: |
Replacing a component with one that has older firmware on it. |
Details: |
The |
Customer impact: |
Unknown |
Rate of occurrence: |
High |
Workaround: |
You have two choices: contact Support after any hardware replacement, or wait for a system-wide firmware upgrade when you upgrade ActiveScale OS. |

No Known Issues

ISSUE: | NFS Clients may see "Permission Denied" during Virtual IP failover, when using NFS Export with Kerberos Authentication |
Platforms affected: |
All |
Triggering condition: |
Virtual IP Failover |
Details: |
When Kerberos is used as the authentication mechanism for the file interface, the client might see a "Permission Denied" error during VIP failover. After a successful failover, NFS operations return to an operational state automatically. |
Customer impact: |
None |
Rate of occurrence: |
High |
Workaround: |
After a successful failover, NFS operations return to an operational state automatically. No workaround is required. |

ISSUE: | Systems with replication enabled may transfer zero-length files. |
Platforms affected: |
All |
Triggering condition: |
A replicated object is deleted in a versioned bucket. |
Details: |
When versioning is enabled and an object is deleted, that object's DELETE operation may not be stored in the correct order in an AWS destination bucket. You may notice a DELETE marker with a size of 0 bytes. |
Customer impact: |
Risk of retrieving the wrong version of an object from an AWS destination bucket. |
Rate of occurrence: |
Depends on user setup. |
Workaround: |
When retrieving replicated objects from an AWS destination bucket with versioning enabled, always identify the correct version of the object by checking its |
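As an illustrative sketch only (not the official procedure), the versions and delete markers of a replicated key in the AWS destination bucket can be inspected with the AWS CLI; the bucket and key names below are placeholders:

    # List all versions and delete markers for a key (bucket and key are examples)
    aws s3api list-object-versions --bucket my-destination-bucket --prefix path/to/object
    # Retrieve a specific version once the correct VersionId has been identified
    aws s3api get-object --bucket my-destination-bucket --key path/to/object --version-id <VersionId> object.out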

No Known Issues

ISSUE: |
When a KVM is connected, USB errors can be seen on the console and SYSLOG_ERROR events can be generated |
Platforms affected: |
P100 system node (Quanta D52L-1U) |
Triggering condition: |
When a remote console (KVM) is connected to the server |
Details: |
When a remote console (KVM) is connected, USB errors might be shown on the console, and the system might generate SYSLOG_ERROR events for them. These are harmless and can be ignored. |
Customer impact: |
The system will generate a SYSLOG_ERROR which can be ignored |
Rate of occurrence: |
Only when a remote console (KVM) is connected |
Workaround: |
Disconnect the remote console (KVM) |
ISSUE: | The slot of a DIMM with memory errors cannot be located |
Platforms affected: |
X200, P200, P100 D52L Quanta EOL Replacement. |
Triggering condition: |
Memory errors on the RAM DIMMs of the system. |
Details: |
A MEMORY_ERRORS event in the GUI whose details contain the following ambiguous message: Silk Screen 'Unknown', Socket None, Channel None, DIMM None |
Customer impact: |
Ambiguous event. Can’t determine which faulty memory DIMM to replace. |
Rate of occurrence: |
Always when there’s a memory error on a RAM DIMM. |
Workaround: |
Contact support |
ISSUE: | Severely reduced GET performance on systems originally installed with ActiveScale OS 5.0. |
Platforms affected: |
Systems originally installed with ActiveScale OS 5.0 |
Triggering condition: |
Upgrade to ActiveScale OS 5.7. |
Details: | The small object aggregation feature (introduced in ActiveScale OS 5.7) does not interact well with the erasure codes used on these systems, resulting in a severely reduced GET performance of objects written post-upgrade. |
Customer Impact: | Severely reduced GET performance. |
Rate of occurrence: | Always for these systems. |
Workaround: | A support procedure is in place for changing the erasure code in use (to be executed directly post upgrade), which resolves the issue. Please contact Quantum Support when planning to upgrade such a system. |

ISSUE: | COREDUMPS_FOUND events with signature “/var/crash/kdump_lock: empty” during upgrade to AS 6.3. |
Platforms affected: | All |
Triggering condition: | Upgrade to 6.3 |
Details: | It is possible that you get emails for COREDUMPS_FOUND events that occurred during the upgrade to 6.3. These events will also be visible in the ActiveScale GUI. The signature of the event is “/var/crash/kdump_lock: empty”. These events are not harmful and can be ignored when they occur during the upgrade. |
Customer Impact: | Customer receives 1 or more COREDUMPS_FOUND events. |
Rate of occurrence: | Only during upgrade. They might also not occur at all. |
Workaround: | These events gradually auto-resolve as the upgrade progresses and possibly for some time after the upgrade has finished. After a while they should no longer occur. |
ISSUE: | Metrics graphs can show spikes, drops, gaps or missing data during upgrade to 6.1 |
Platforms affected: |
All |
Triggering condition: |
Upgrade to 6.1 |
Details: |
It is expected to see this behavior in metrics graphs in the ActiveScale GUI. These graphs will return to normal at the end of the upgrade. |
Customer impact: |
Distorted or incomplete metric graphs during upgrade |
Rate of occurrence: |
High - Always during upgrade to 6.1 |
Workaround: |
No action required as long as the system is upgrading. Contact Quantum Support if the metrics graphs don’t return to normal after upgrade. |
ISSUE: | External Prometheus and Grafana show incomplete metrics after upgrade to 6.1 |
Platforms affected: |
All |
Triggering condition: |
Upgrade to 6.1 of a scaled out deployment. Customer uses external Prometheus and Grafana to collect and view ActiveScale metrics. |
Details: |
It is expected that upgrading scaled-out deployments to 6.1 causes metrics that are scraped directly from ActiveScale to be incomplete. Visual impact includes graphs suddenly dropping or spiking, and dashboards suddenly displaying fewer metrics. Historical metrics are not impacted, only newly collected metrics. A configuration change in the external Prometheus and/or Grafana is required, otherwise the graphs will not recover. |
Customer impact: |
ActiveScale metrics newly collected by an external Prometheus are incomplete and appear incorrect after the upgrade. |
Rate of occurrence: |
Always if triggering conditions are met. |
Workaround: |
Contact Quantum Support to update the configuration of the external Prometheus and Grafana. |
ISSUE: | After replacement of a failed IOM, disks can become and remain in NOTCONFIGURED status |
Platforms affected: |
X100/X200 or other systems with JBOD only |
Triggering condition: |
IOM failure and replacement |
Details: |
An IOM failure causes all disks in the JBOD to become NOTFOUND. Disk safety can drop due to the high number of missing disks. Once the IOM is replaced, all NOTFOUND disks become NOTCONFIGURED. |
Customer impact: |
During the IOM failure, a DISK_NOTFOUND event is seen, and disk safety can be lower due to the high number of missing disks. After the IOM replacement there is no data path or functional impact, but the customer will see a DISK_NOT_CONFIGURED event for each disk 4 hours after the replacement. |
Rate of occurrence: |
Low |
Workaround: |
Contact Support |
ISSUE: | Services (Arakoon, dss, scaler, and so on) might remain down after the system comes back online after a power failure. |
Platforms affected: |
All |
Triggering condition: |
A power failure happens while the system is running a task that involves machine reconfiguration (such as a disk replacement or network reconfiguration). |
Details: |
|
Customer impact: |
None, because services are still running on some nodes. |
Rate of occurrence: |
Low |
Workaround: |
Contact Support |

ISSUE: | The mount command fails if you have exported the same top level directory twice. |
Platforms affected: |
All |
Triggering condition: |
You created multiple NFS exports of the same top level directory, with different permissions or access types. |
Details: |
None |
Customer impact: |
None |
Rate of occurrence: |
Varies |
Workaround: |
If you export the same top level directory more than once with different permissions, specify a unique --tag value for each export. To mount the export, use its tag without the leading slash rather than its export name. |
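For illustration, a minimal sketch of mounting such an export by its tag, assuming a hypothetical export tagged data_rw on a server named nfs.example.com (the exact mount options, including the NFS version, depend on your environment):

    # The export was created with a unique tag, e.g. --tag data_rw (hypothetical)
    # Mount by the tag (no leading slash) instead of the export path
    mount -t nfs nfs.example.com:data_rw /mnt/data_rw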
ISSUE: | Closing and/or flushing a file may fail. |
Platforms affected: |
All |
Triggering condition: |
A file is closed on the client side. |
Details: |
Most applications do not check for errors when a file is closed. When a file is closed on the client side, the UDA service performs a data flush: any data that has not yet been written to stable storage is written at that time, as a precondition for the file close operation to succeed. This data flush can fail, which in turn makes the file close operation return an error. If the client does not check for that error, which is very common in many applications, it can appear that the file close and the data flush succeeded when they actually did not. |
Customer impact: |
You need to check for errors when you close a file. |
Rate of occurrence: |
High |
Workaround: |
If you have control over how the application behaves in this situation, add an error check. If you don’t have control, it might be a good idea to verify the checksum. |
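If the application cannot be changed, a generic post-write verification along these lines can serve as a checksum check; the paths below are placeholders, and this sketch assumes a reasonably recent coreutils sync that accepts file arguments:

    # Copy the file to the NFS mount and force it to stable storage, failing loudly on error
    cp ./source.dat /mnt/export/dest.dat && sync /mnt/export/dest.dat || echo "write or flush failed"
    # Compare checksums of the local source and the stored copy
    md5sum ./source.dat /mnt/export/dest.dat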
ISSUE: | Deleting a file when the metadata store is full fails. |
Platforms affected: | All |
Triggering condition: | Deleting a file when the metadata store is full. |
Details: | You cannot delete files when the metadata store is full. |
Customer impact: | You will get an error if you try to delete a file when the metadata store is full. |
Rate of occurrence: | Low |
Workaround: | Contact Support |
General Troubleshooting

Problem |
Recommended Action |
---|---|
You cannot access ActiveScale SM. |
|
Upon rebooting or shutting down a node through ActiveScale SM, the connection to the ActiveScale SM is lost. |
First try to reconnect through the browser after reloading the page. If that fails:
|
ActiveScale SM does not display all elements. |
Use any of the supported browsers. ActiveScale SM is compatible with Chrome v67-69 and Firefox v61-63. Desktop versions only. |
When you try to change a DNS server, ActiveScale SM responds with the error "Failure on setting DNS." |
Refresh the ActiveScale SM page. |
ActiveScale SM does not display any PDUs. |
Customer-supplied PDUs may or may not be visible, manageable, or monitorable in ActiveScale SM. ActiveScale SM only displays PDUs it recognizes. If your system has PDUs that ActiveScale OS does not recognize, it monitors PSUs instead and raises alerts whenever it detects a loss of power; in that case you must pay close attention to the PSUs on other components and take action whenever the system raises alerts about loss of power. ActiveScale OS supports single-phase PDUs but does not manage them; in other words, single-phase PDUs are not visible in ActiveScale SM. |
When you shut down a site, the Resources page still displays that site's nodes as ONLINE even though they are stopped. |
Ignore the status ActiveScale SM shows for nodes in a shutdown site. |
When you choose the TCP with TLS protocol for syslog streaming, and upload a CA cert that is not self-signed by the server, ActiveScale SM returns a "Connection refused" error. |
Upload a CA cert that's self-signed by the server if you choose the TCP with TLS protocol for the log stream. |
ActiveScale SM's Resources page doesn't automatically refresh when you add a System Expansion Node. |

Problem |
Recommended Action |
---|---|
A node is halted or hung, or unreachable after a reboot. |
Power the node off and then on again. If the problem still exists, contact Quantum Support. |
A disk is missing. |
ActiveScale OS marks a disk as NOTFOUND in the following cases:
In both cases, the disk is eventually decommissioned automatically. ActiveScale OS checks how many other disks are NOTFOUND or DEGRADED; if more than 2 disks per node are in NOTFOUND or DEGRADED status, ActiveScale OS waits an additional 12 hours before decommissioning the disk. If you replaced a disk before the original disk was fully DECOMMISSIONED, the replacement disk is marked as NOTUSED; contact Quantum Support in this case. If there is a hardware problem, you can try rebooting the node, which might resolve a temporary problem with the disk. If this does not resolve the problem, contact Quantum Support. |
A disk has IO errors. |
Replace the disk after ActiveScale SM shows its status as DECOMMISSIONED. |
The hot-swapped disks are being ignored. |
Check the Events page in ActiveScale SM for errors that indicate:
|
There are ECC memory errors. |
Contact Quantum Support to replace the DIMM in the node that has the error. |
There are fan or temperature warnings. |
Contact Quantum Support to replace the fan in the node that has the error. For more information, see the ActiveScale P100/ActiveScale X100 Support Guide. |

Problem |
Recommended Action |
---|---|
A health check job has errors. |
Evaluate the job's output to determine if there are obvious failures, such as:
Otherwise, contact Quantum Support. |
A health check job is stuck. |
Contact Quantum Support. |

Problem |
Recommended Action |
---|---|
Log collection fails. |
Ensure that the account being used to access the system bucket has been granted read-only access. |
Log collection succeeds but log files are missing. |
|
Running a log collection job manually during an ActiveScale OS upgrade results in logs being put into the system bucket, but the log collection job never finishes. |
When the upgrade is done, retry the log collection job from the Jobs pane. Best practice is not to start a log collection job while an ActiveScale OS upgrade job is running. |
When you try to download system logs, the system reports that the downloadable bundle is empty or is smaller than it really is. |
If you select a huge number of log files, or if there are connectivity issues with the Column, the system might report that the downloadable bundle is empty or is smaller than it really is. If this happens, use the pull-down menus to select a smaller subset of logs and retry the download. |

Problem |
Recommended Action |
---|---|
You cannot communicate with the ActiveScale system. Instead, you see a network error (No route to host) in your client application. |
|
You received a status 403 Forbidden for an S3 API call. |
If you believe all these settings are good, please escalate to Quantum Support. |
The Update Network Configuration job fails. |
Contact Quantum Support to revert the network settings back to their previous values. |

Problem |
Recommended Action |
---|---|
After rebooting the rack through ActiveScale SM, a node is stuck in REBOOTING status. |
Reboot the node through ActiveScale SM:
|

Problem |
Recommended Action |
---|---|
The system raises events indicating that some objects are below the expected disk safety policy. |
This can occur when the system is nearing capacity. The system still allows ingest but writes objects with suboptimal data safety. Objects written with lower safety are repaired to full safety within 24 hours, even if the system is full, because repair is still allowed to write to READONLY blockstores. |

Problem |
Recommended Action |
---|---|
The system failed to upload a tarball. |
Check the Jobs page to see if an upgrade is already in progress. Do not attempt to upload a new tarball while a system upgrade is in progress. |
ActiveScale OS upgrade sometimes fails due to a network connection error. |
Use either of these workarounds:
|
A software upgrade failed. |
This can occur if there are anomalies in the system. Try the following:
|
A software upgrade on a system with offline nodes failed. |
Work with Quantum Support to bring the nodes back online and then retry the upgrade job. |
The system generates WARNING or CRITICAL events related to "Elasticsearch Cluster Health" during upgrades. |
Sample event: "...[CRITICAL][ELASTICSEARCH_CLUSTER_HEALTH_STATUS_CRITICAL]Metrics database: critical health status." You can safely ignore these events when they occur during an upgrade. |

Basic sanity checks before troubleshooting:
- You must have a working Windows 2016 / Windows 2012 R2 Active Directory (AD) based Kerberos setup with a key distribution center (KDC).
- It is recommended to use a secure Active Directory that is configured to use LDAP over SSL/TLS.
- AD, ActiveScale, and the NFS client should ideally be time-synced; the maximum allowed skew is 300 s. Quantum recommends using NTP. This is necessary to prevent Kerberos authentication failures due to time skew (see the sample checks after this list).
- ActiveScale and NFS clients must be properly configured to use DNS for correct name resolution; only NFS clients that are registered with DNS are supported (see below).
- The AD server, ActiveScale, and the NFS clients should be able to reach each other.
- It is recommended to log in to the NFS client using Kerberos authentication.
- Make sure the required encryption types are enabled on the KDC.
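The time-sync and DNS prerequisites above can be spot-checked from a Linux NFS client with commands along the following lines; the hostnames are placeholders and the available tools depend on your distribution:

    # Check that the clock is synchronized (chrony shown; 'timedatectl status' also works on systemd systems)
    chronyc tracking
    # Verify name resolution for the client itself, the NFS server, and the AD server
    host $(hostname -f)
    host nfs.example.com
    host ad.example.com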
Problem |
Recommended Action |
---|---|
On the client console, you see the following message: kinit: Preauthentication failed while getting initial credentials. |
The password is incorrect or the keytab file is incorrect. |
When testing the keytab created for a UNIX server using kinit, you get the error Clock skew too great while getting initial credentials. |
You must keep clocks synchronized when using Kerberos. Use an NTP service to synchronize time between the various systems involved. |
When testing the keytab created for a UNIX server using kinit, you get the error Preauthentication failed while getting initial credentials or Password incorrect while getting initial credentials. |
The key in the keytab file is incorrect. Make sure you generated the keytab file correctly, with the correct principal name, Active Directory user name, and path. |
Note: Pre-authentication failure can happen for a few reasons. Most commonly, either the password for the relevant account in Active Directory has changed since the keytab file was created, or the system clock is off by about 5 minutes or more from that of the Active Directory. A sample keytab check follows this note.
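A keytab can be sanity-checked from the client with standard MIT Kerberos tools; the keytab path and principal below are examples and must match your environment:

    # List the principals, key version numbers, and encryption types stored in the keytab
    klist -k -t -e /etc/krb5.keytab
    # Try to obtain a ticket using a key from the keytab (replace with a principal present in it)
    kinit -kt /etc/krb5.keytab nfs/client01.example.com@EXAMPLE.COM
    # Show the tickets that were obtained
    klist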
UDA Kerberos Encryption Related Issues
Problem |
Recommended action |
---|---|
On the client console, you see the following message: kinit: KDC has no support for encryption type while getting initial credentials. |
Make sure the encryption types in the keytab file that is uploaded to ActiveScale, or in the one that is installed on the NFS client, are compatible with those specified in AD for the corresponding service account. The encryption types used in a keytab file can be listed using ktutil, as shown after this table. |
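For example, the encryption types in a keytab can be listed with ktutil as follows; the keytab path is a placeholder:

    ktutil
    ktutil:  read_kt /etc/krb5.keytab
    ktutil:  list -e
    ktutil:  quit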
UDA Kerberos Permission Denied Error
Problem |
Recommended Action |
---|---|
On the client: the console reports Permission Denied or Access Denied; /var/log/syslog reports an EACCES (13) error. |
Check the mount point/directory permissions to verify that the NFS export grants access to the logged-in user. |
On the client: Console reports Permission denied /var/log/syslog reports rpc.gssd[423]: WARNING: can't create tcp rpc_clnt to server <NFS Server> for user with uid 10000: RPC: Remote system error - Connection timed out |
On the client, increase the rpc.gssd timeout for RPC connection creation with the server using the -T option, and restart the gssd service. |
On the client: Console reports Permission denied /var/log/syslog reports rpc.gssd[30631]: rpcsec_gss:gss_init_sec_context: (major) Unspecified GSS failure. Minor code may provide more information - (minor) Unknown code krb5 7 rpc.gssd[30631]: WARNING: Failed to create krb5 context for user with uid 0 with any credentials cache for server |
This might mean that a fully qualified domain name is not used for the client/server principal. Create and update the keytabs with the FQDN of the client/server, then restart the gssd service on the client and the nfsganesha service on the server. |
On the client: Console reports Permission denied /var/log/syslog reports rpc.gssd[31856]: ERROR: gssd_refresh_krb5_machine_credential: no usable keytab entry found in keytab /etc/krb5.keytab for connection with host <nfsserver> rpc.gssd[31856]: ERROR: No credentials found for connection to server <nfsserver> |
Check whether the key is missing from /etc/krb5.keytab, or whether the /etc/krb5.keytab file itself is missing. |
On the client console you have logged on as a user other than root; you can mount successfully (for example, $ sudo mount /mnt/krb5) but cannot access the mount point. The console reports Permission Denied; /var/log/syslog reports rpc.gssd[30712]: ERROR: GSS-API: error in gss_acquire_cred(): Unspecified GSS failure. Minor code may provide more information. Can't find client principal <user1>@<REALM.COM> in cache collection rpc.gssd[30712]: WARNING: Failed to create krb5 context for user with uid 501 for server <NFS Server> |
Check whether kinit has been done for the user (see the example after this table). This issue might not occur when you log in to the client using Kerberos authentication. |
On the client: Console reports Permission denied /var/log/syslog reports rpc.gssd[xxxx] : Warning : failed to create krb5 context for user with uid = 500 for server <NFS Server> |
The user with uid=500 does not have a valid TGT. Run kinit for the user on the client. |
When the ActiveScale NFS server is accessed using a Virtual IP and a Virtual IP failover is initiated, the client might get a Permission denied error. |
Once the Virtual IP fails over to a different node, NFS operations work as expected. |
Unexpected Permission denied errors when running NFS operations. |
Check that the client user's ticket is valid and has not expired. |
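A typical check-and-fix sequence on the client looks like this; the user principal and realm are placeholders:

    # Show the current ticket cache; an expired or missing TGT causes Permission denied
    klist
    # Obtain (or renew) a TGT for the user, then confirm
    kinit user1@EXAMPLE.COM
    klist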
Stale File Handle Error
On the client console, you get Stale File Handle when trying to access the mount point |
Check whether the user has a valid Ticket Granting Ticket. If not, do a kinit to get one from the KDC. |