vmPRO SmartMotion Backup Hangs with NDX as Target Storage

Overview

When a vmPRO SmartMotion backup writes to a Quantum NDX appliance and the virtual disk contains a large gap of zero data, the first write after the gap can hang long enough to cause a CIFS timeout. The backup then hangs indefinitely.

SR Information: N/A (this was found in test)

Product / Software Version: All vmPRO software versions with a Quantum NDX appliance configured as target storage.

Problem Description: When the vmPRO virtual disk has a large gap of zero data, a virtual machine (vm) backup to an NDX appliance can hang indefinitely.

Problem

The messages log below shows that the NDX system (10.30.242.27) is not responding. The CIFS timeout repeats roughly every five minutes.

messages log:
=============
Mar  7 15:57:51 localhost kernel: CIFS VFS: Server 10.30.242.27 has not responded in 300 seconds. Reconnecting...
Mar  7 16:02:52 localhost kernel: CIFS VFS: Server 10.30.242.27 has not responded in 300 seconds. Reconnecting...
Mar  7 16:07:54 localhost kernel: CIFS VFS: Server 10.30.242.27 has not responded in 300 seconds. Reconnecting...
Mar  7 16:12:55 localhost kernel: CIFS VFS: Server 10.30.242.27 has not responded in 300 seconds. Reconnecting...
Mar  7 16:17:56 localhost kernel: CIFS VFS: Server 10.30.242.27 has not responded in 300 seconds. Reconnecting...
Mar  7 16:22:57 localhost kernel: CIFS VFS: Server 10.30.242.27 has not responded in 300 seconds. Reconnecting...
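
A repeating pattern like the one above can be picked out of a messages log with a short script. The following is a minimal Python sketch, not a vmPRO tool: the function name count_cifs_timeouts is hypothetical, and the pattern is based only on the message text shown in this log.

```python
import re
from collections import Counter

# Matches the kernel message shown in the log above.
# \s+ tolerates a message that has been wrapped onto a second line.
CIFS_TIMEOUT = re.compile(
    r"CIFS VFS: Server (\S+) has not\s+responded in (\d+) seconds"
)

def count_cifs_timeouts(log_text):
    """Return a Counter mapping server address to number of timeout messages."""
    return Counter(m.group(1) for m in CIFS_TIMEOUT.finditer(log_text))

sample = (
    "Mar  7 15:57:51 localhost kernel: CIFS VFS: Server 10.30.242.27 has not\n"
    "responded in 300 seconds. Reconnecting...\n"
    "Mar  7 16:02:52 localhost kernel: CIFS VFS: Server 10.30.242.27 has not\n"
    "responded in 300 seconds. Reconnecting...\n"
)
timeouts = count_cifs_timeouts(sample)
print(timeouts)
```

Several timeouts against the same address during a single backup suggest the target is stalled rather than briefly busy.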

When smartcp encounters a large run of zeros, it does not write them. Instead, it advances the output offset and occasionally writes a few zeros to the output file to prevent a CIFS timeout.

When smartcp eventually finds data that needs to be written, it writes to the output file at the current offset. This write usually completes quickly. On an NDX appliance, however, a write after a large seek causes a long delay. The NDX appears to write all of the zeros up to the seek location instead of creating a sparse file and skipping to that location. The occasional zero writes that smartcp performs do not appear to reduce the delay on the first non-zero write.
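
The skip-and-seek behavior described above can be illustrated with a short Python sketch. This is not the smartcp implementation. The function copy_blocks, its sparse_enabled parameter, and the block size are hypothetical names and values chosen for illustration, and the occasional keep-alive zero writes are left out for brevity.

```python
import io
import os
import tempfile

BLOCK_SIZE = 64 * 1024  # illustrative block size, not smartcp's actual value

def copy_blocks(src, dst, sparse_enabled=True, block_size=BLOCK_SIZE):
    """Copy src to dst. With sparse_enabled, seek past all-zero blocks."""
    total = 0
    while True:
        block = src.read(block_size)
        if not block:
            break
        if sparse_enabled and not any(block):
            # All-zero block: advance the output offset without writing.
            dst.seek(len(block), os.SEEK_CUR)
        else:
            # Data block, or sparse handling disabled: write at current offset.
            dst.write(block)
        total += len(block)
    # Set the final length in case the source ends in a skipped zero run.
    dst.truncate(total)
    return total

# A small "disk" with a 1 MiB hole between two pieces of data.
data = b"head" + bytes(1024 * 1024) + b"tail"
with tempfile.TemporaryDirectory() as tmp:
    for flag in (True, False):
        path = os.path.join(tmp, f"out_{flag}.img")
        with open(path, "wb") as out:
            copy_blocks(io.BytesIO(data), out, sparse_enabled=flag)
        with open(path, "rb") as check:
            print(flag, check.read() == data)
```

Both modes produce identical file contents. The difference is in who fills the gap: with sparse handling on, the write of "tail" lands after a large seek and the target storage decides how to represent the hole; with it off, every zero is sent as an ordinary write.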

This issue was discovered on a vm that had an extremely large (45 GB) hole in the middle. When the system got to the non-zero data, the write command would hang long enough to cause a CIFS timeout.
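
A rough calculation shows why a hole of this size can exceed the timeout. The Python sketch below assumes that the NDX fills the entire hole during the single delayed write, that 45 GB means binary gigabytes, and that the 300-second value from the messages log is the limit that applies. None of these assumptions has been confirmed against the NDX itself.

```python
hole_bytes = 45 * 1024**3   # 45 GB hole, taken as binary gigabytes
timeout_seconds = 300       # CIFS timeout reported in the messages log

# Throughput the target would need to zero-fill the hole before the timeout.
required_mib_per_s = hole_bytes / timeout_seconds / 1024**2
print(f"{required_mib_per_s:.1f} MiB/s")  # prints "153.6 MiB/s"
```

Under these assumptions, any target that fills the gap more slowly than about 150 MiB/s would miss the timeout on the first non-zero write.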

Refer to PTR 5357 in Bugzilla for additional details.

Solution

Changing the “nas_storage.sparse_enabled” registry key to 0 (zero) fixes this issue and allows the vm to be backed up to an NDX appliance. With the key set to 0, smartcp writes all of the zeros it encounters instead of skipping them.

To fix the issue:

1. Run the following command to show the current value of the “nas_storage.sparse_enabled” registry key:

   reg show nas_storage.sparse_enabled

   Example output: nas_storage.sparse_enabled = 1

2. Run the following command to change the registry key to 0:

   reg set nas_storage.sparse_enabled 0

   Example output: Registry key 'nas_storage.sparse_enabled' set to '0'

3. Run the following command to confirm that the registry key has been changed:

   reg show nas_storage.sparse_enabled

   Example output: nas_storage.sparse_enabled = 0
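
If the confirmation step is scripted, the output of reg show can be checked programmatically. The Python sketch below assumes the output always has the "key = value" form shown in the example output above. The helper parse_reg_show is a hypothetical name and is not part of vmPRO.

```python
def parse_reg_show(line):
    """Split a 'key = value' line, as in the example output, into (key, value)."""
    key, sep, value = line.partition("=")
    if not sep:
        raise ValueError(f"unexpected reg output: {line!r}")
    return key.strip(), value.strip()

# The line below is the example output from the confirmation step.
key, value = parse_reg_show("nas_storage.sparse_enabled = 0")
sparse_disabled = key == "nas_storage.sparse_enabled" and value == "0"
print(sparse_disabled)
```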