Unexpected full backup of VM(s) exported with CBT enabled.
Problem:
A SmartMotion backup of VM(s) exported with CBT enabled shows a copy type of 'full' when only a 'partial' backup should have occurred. This article specifically covers an unexpected full backup that happens after rebooting the appliance.
In the vm_proxy_fs log, handles were observed being held open for a long period of time (over 10 days in the example snippets below). This prevented snapshot removal; when the appliance was rebooted, vm_proxy_fs cleaned up the snapshots, which allowed the pending CBT reset to run. Because the CBT reset ran on an unscheduled day, the next backup was a full backup out of schedule.
vm_proxy_fs log showing a handle held open for over 10 days:
================================================================
2014-11-13 13:01:58.317151: WARNING: 1 handles are still open to uscvcenter.usc.ad.salvationarmy.org on vm-231, deferring snapshot removal.
2014-11-13 13:01:58.317333: WARNING: 1 handles are still open to uscvcenter.usc.ad.salvationarmy.org on vm-17044, deferring snapshot removal.
...
2014-11-17 12:29:12.170543: VM 'uscsophosthq' on server 'uscvcenter.usc.ad.salvationarmy.org' has a leftover vmPRO snapshot created over 4 days ago. If you are no longer exporting this VM, Please manually delete snapshot 'Quantum: a3f4901696f511e286279b843c839245'. If it isn't automatically removed after the next backup please contact support.
2014-11-17 12:29:12.170679: VM 'USCARCSQLRPT001' on server 'uscvcenter.usc.ad.salvationarmy.org' has a leftover vmPRO snapshot created over 10 days ago. If you are no longer exporting this VM, Please manually delete snapshot 'Quantum: a3f4901696f511e286279b843c839245'. If it isn't automatically removed after the next backup please contact support.
2014-11-17 12:29:12.172518: WARNING: 1 handles are still open to uscvcenter.usc.ad.salvationarmy.org on vm-231, deferring snapshot removal.
2014-11-17 12:29:12.172625: WARNING: 1 handles are still open to uscvcenter.usc.ad.salvationarmy.org on vm-17044, deferring snapshot removal.
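A quick way to check whether an appliance is currently affected is to search the vm_proxy_fs log for the warnings shown above. This is a minimal sketch; the log path shown is an assumption and may differ between vmPRO releases:
# grep "deferring snapshot removal" /var/log/vm_proxy_fs*   # repeated hits across many days indicate a stuck handle
# grep "leftover vmPRO snapshot" /var/log/vm_proxy_fs*      # confirms the snapshot could not be removed automatically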
Reboot via console by user:
===========================
Nov 18 07:54:48 sml/INFO:: sml_system.py:232 Shutdown/reboot from Console
vm_proxy_fs restart due to reboot, cleans up snapshot and does CBT reset:
===========================================================================
2014-11-18 07:54:49.663387: cnt_thr: stopping
2014-11-18 07:54:49.663437: cnt_thr_stop
2014-11-18 07:54:49.831905: Removing snapshot Quantum: a3f4901696f511e286279b843c839245 on USCARCSQLRPT001 uscvcenter.usc.ad.salvationarmy.org
...
2014-11-18 08:08:56.017394: Changed Block Tracking reset/enabled for VM 'USCARCSQLRPT001' on hypervisor 'uscvcenter.usc.ad.salvationarmy.org'
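If a full backup arrives out of schedule right after a reboot, the log can confirm whether a pending CBT reset ran at service start. The message strings match the examples above; the log path is again an assumption:
# grep "Removing snapshot Quantum" /var/log/vm_proxy_fs*      # snapshot cleanup at vm_proxy_fs restart
# grep "Changed Block Tracking reset" /var/log/vm_proxy_fs*   # a CBT reset forces the next export to be a full copy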
Cause:
The VM file handle was held open for a long period of time because removal of the DM and/or loop devices timed out; this typically happens when the vmPRO appliance and/or the ESXi server is overloaded at the time a backup completes. That leaves messages in the vm_proxy_fs log like the following:
2014-11-06 17:20:14.250344: WARNING [thread:140267695933184, where:spawn.c:240, error:62] /sbin/dmsetup remove a5ce43ea.633545.top : Timer expired
vm_proxy_fs[2932]: WARNING [thread:140267695933184, where:spawn.c:240, error:62] /sbin/dmsetup remove a5ce43ea.633545.top : Timer expired
device-mapper: table ioctl on a5ce43ea.633545.top failed: No such device or address
device-mapper: reload ioctl on a5ce43ea.633545.top failed: No such device or address
device-mapper: remove ioctl on a5ce43ea.633545.top failed: No such device or address
Command failed
2014-11-06 17:20:14.730086: WARNING [thread:140267695933184, where:spawn.c:246, error:1] /sbin/dmsetup remove -f a5ce43ea.633545.top
vm_proxy_fs[2932]: WARNING [thread:140267695933184, where:spawn.c:246, error:1] /sbin/dmsetup remove -f a5ce43ea.633545.top
2014-11-06 17:22:17.609581: WARNING [thread:140267695933184, where:spawn.c:240, error:62] /sbin/losetup -d /dev/loop14 : Timer expired
vm_proxy_fs[2932]: WARNING [thread:140267695933184, where:spawn.c:240, error:62] /sbin/losetup -d /dev/loop14 : Timer expired
2014-11-06 17:22:17.642488: snapshot_update: path = /Production_2/USCARCSQLRPT001/USCTHQSQLRPT001-pancbt.vmdk, hvname = uscvcenter.usc.ad.salvationarmy.org, vmname = USCARCSQLRPT001, quiesce = 0, uuid = a3f4901696f511e286279b843c839245/uscthqvmpro, ctime = Thu Nov 6 16:54:05 2014, ref = 0, remove
....
2014-11-06 17:22:17.642896: snapshot 8 : USCARCSQLRPT001 on uscvcenter.usc.ad.salvationarmy.org
2014-11-06 17:22:17.642905: created 2014-11-06 16:54:05
2014-11-06 17:22:17.642914: accessed 2014-11-06 17:22:17 by 0 handles
....
2014-11-06 17:22:47.662397: vm_proxy_do_evict: evicted /Production_2/USCARCSQLRPT001/USCTHQSQLRPT001-pancbt.vmdk
....
The failure to remove the DM/loop devices leaves the handle open, so vm_proxy_fs cannot remove the VM snapshot:
....
2014-11-06 17:27:51.071087: WARNING: 1 handles are still open to uscvcenter.usc.ad.salvationarmy.org on vm-231, deferring snapshot removal.
....
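Before applying a workaround, it can help to confirm that stale DM and loop devices from the failed cleanup are still present. A minimal check with standard Linux tools, using the device naming seen in the log above:
# dmsetup ls    # stale vmPRO devices look like a5ce43ea.633545.top in the example above
# losetup -a    # stale loop devices are backed by /vmfs/volumes/... paths or /tmp/*.panstorage files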
Solution:
If the customer encounters this problem and observes message(s) in the vm_proxy_fs log like the following:
2014-11-06 17:28:06.533097: WARNING: 1 handles are still open to uscvcenter.usc.ad.salvationarmy.org on vm-231, deferring snapshot removal.
and there is an alert like this:
VM 'USCARCSQLRPT001' on server 'uscvcenter.usc.ad.salvationarmy.org' has a leftover vmPRO snapshot created over 3 days ago. If you are no longer exporting this VM, Please manually delete snapshot 'Quantum: 3f4901696f511e286279b843c839245'. If it isn't automatically removed after the next backup please contact support.
The customer can use either of the following workarounds to recover from the situation:
Method 1:
A. Log in to vmPRO as sysadmin and escape the pan-shell (shell-escape).
B. While no backup or recovery jobs are in progress, manually delete the leftover DM and loop devices using commands like the following:
# dmsetup remove_all -f
# losetup -a
/dev/loop14: [0016]:18 (/vmfs/volumes/4d18ce88-73f1f3ba-4631-002655dbb4ec/USCTHQSQLRPT*)
/dev/loop15: [0801]:158691 (/tmp/5b119425.566575.panstorage)
# losetup -d /dev/loop15
# losetup -d /dev/loop14
These commands release the open handle, and shortly afterward the VM snapshot will be removed automatically.
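To verify that the cleanup worked, the same tools can be re-run and the log watched; once the stale devices are gone, a 'Removing snapshot' line should appear instead of the 'deferring snapshot removal' warning. The log path is an assumption:
# dmsetup ls                      # the stale *.top devices should no longer be listed
# losetup -a                      # the loop devices deleted above should no longer be listed
# tail -f /var/log/vm_proxy_fs    # watch for 'Removing snapshot Quantum: ...' on the next cleanup pass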
Method 2: A more straightforward method is to reboot the vmPRO appliance or restart all services on the vmPRO appliance.
That should release the open handle and remove the VM snapshot automatically.
(Note: This information came from vmPRO Bug 6344.)