VMPRO: Virtual Machine backup stuck in "Queued" state

SR Information: 1611512

Product / Software Version: VmPro 3.1.1 - vCenter 5.1

Problem Description: Virtual machine backup will stick in "queued" state

Related PTRs:  

Bug 5678 - Customer URS, SR 1589842: Finished vmPRO jobs displaying as running

 

 

 

Overview

 

The scenario of the issue:

  • Customer executes VM backups periodically.
  • Customer noticed that some of the backup for a specific VM stuck in the "queued" state while other jobs executed with success and finished.
  • Customer reports that the only option he has available is to abort the job and restart again, but then the issue reoccurs.

 

Notes:

  • You may also find the backup stuck in the "running" state. The log regarding the execution of the backup may vary compared to a case where backup stuck in the "queued" state. This article will provide a description of logs when a machine is hung in "queued" state.
  • This article contains a customer's screen shot so the content of this article is for internal use only.

 

 

 

Symptom (how to identify the problem)

 

  1. Under Tasks, you'll find the policies that are running. In the example below, we can see that all the backups for the policy finished and only one VM is left behind. It's been hours/days and the status remains "Queued - Calculating".

 

Machine generated alternative text: e Rwi _________
IT IP’% 41.4.4418 ¡41 4$.4 ieij 5mertNOOfl Task rr-Qeduprhc,r copymg 17 vMs to dm46O r hes been runmng tar S hours end 32 Irwiutes.
v%, .$0ir
y tID..vm,t
rfr r7•  41.4.4 41 I ¡ .4 .41
___ 4.d..
a TT .4.I.,,m.4. .. I 0W .41
ri D___._.I. P.*..ed .1 •a$ .4 0 _ 4 b... b,.
#ì...h.d . II4 .4  Cilio.. r0.l 84 00 0%
a ¡•TO•••l4iV14.l “‘ ‚. rrmi.I o 0%
R...h.d .1 ISO 441 I ______  o.
.1 t4I $41 . I,  o u c
bl.0414 41 021 $0 4$ il 37)0410 ¡2%
A ri. O.44I441.4i ‘I4I’.4 II tetO 2 4$ F-..t.i pi%aJ ¡045.0 4%
A E. Oo.41014114$. $‘%$.4d It ¡0.03 *410*
D.v.lop.’.. PI’0.P.41 41 ¡0.02 41.
Art.o..do,$ne. $w.44 ot 10t02 ‘mo.
AflD..411__. .4md 410.34 P41.45
rT• Do.ilo..’.ett. Iio414 $1033 $0 41’ S
Ft-4....A..,..mo-.a. 4.$4  533 0W 0 ______
4 0.11 . I
a ri D...l.i....41.  1 )3 0W o.
_._l. A...b..4 .1)14 Iii .S
¿ t7-D...l.p.’,.W.  .11. S.l $0 o. *
ti-0__*n_.L ba* tt.tai..
FV-   .5 UI) P41
I FT ilo $.i.4 fi 200 $0 o. a
ii  I’0Wd at 15.24 54140
bw4 ,09la4o.4..a
rt•  .riod 41 )t33 *41 III
Afl0...10.1I41t $biod al ¡24i 0W o.
IrT.O.b_I__.L r.4 413.4. 0Wi A,.
J
.  L7$0 —
.—  Mbaœy
P.%_._. .4 *4.41 4.c.
ð.,.d lb.. * lb I.E . j
So...Ao.d: $ii)Rl,.4 1.017 ¡I.*4.
3 *.*.* oi.d Sep ¡7 11,45tO .4i 17 I1*C
2 .sm sod. lop li ¡i.53t0 4.p ¡7 1t2StL
5  o.od Sep17 1145.0 lop 17 flzSOd
2 .o... .d t Sep 17 11.44.4 lop ¡7 1151,3
) $01db 0*4 • lip ¡7 11:44 lop ¡7 114)4
—
lr$0opØI
¡-III 0%
il 4)3.$o. 1%
p...I ?IA $0 43%
04.441 I .451 0%
p.41 042.) WI 0%
.rI.41 5,1) WI S%
p.ooal R O 51 0%
P$0lI0d taI 172.3.0 V%
7 .‘wldb. ood? lop ¡7 t I .4)4 lop ¡7
8 ...d 4.p ¡P 11.45,4 lo* ¡7 $t4i.C
4 -.4a. o.d? lip 1? ¡1.51:04.1 I?
O .o.*a. 441? Sep 1? 11.47 5 lop ¡7 ¡14..)
2 .o. .aâ t l.plC 11.924 lop ¡7 1111.1
S o.04*40 41il Sep17 11.454 lop 17 11*04
sop..aa..od; Sep ¡?L 40.3 lop ¡7 ¡111,4
2 o.o. sod • l.p ¡7 ¡1.45.0 lop 17 11:10,41
Pi__I 4 - • _II ,;I Ii-m i I

 

 

  1. When you unzip the master log bundle you'll find the file named psqldump.output under the /var directory. This file is the database dump and locate the following section:

 

--

-- Data for Name: alert; Type: TABLE DATA; Schema: public; Owner: root

--

 

COPY alert (entry_id, gen_id, alert_id, object_id, alert_time, hv_host, vhost, severity, message, cleared_by, cleared_time, node_uuid, managed_alert, posted) FROM stdin;

 

  1. On that section you'll see that several registry and sync folder update times out in the queue:

 

   2597 351141  \N      7d38f1c5-3bd4-4b17-b399-bbab2c4833e5    registry_update {"values": {"feature.smartview.enabled": true, "feature.smartmotion.enabled": true}, "name": "feature"} e5298360e0bb11e29c3e0da2d091e093        0       1378095016.862613       -1      -1

   2597       103     -1      timed out stuck in queue        \N      \N      \N      \N

   2598 364769  \N      93e09b44-f398-4095-8c2e-8cd215e4b67c    registry_update {"values": {"feature.smartview.enabled": true, "feature.smartmotion.enabled": true}, "name": "feature"} f2cedc4ae0bb11e283ce21512d2369ee        0       1378883431.632072       -1      -1

   2598       103     -1      timed out stuck in queue        \N      \N      \N      \N

   2599 353517  \N      45158f2d-0aa2-4d22-8a2a-e023d8ef3a52    registry_update {"values": {"feature.smartview.enabled": true, "feature.smartmotion.enabled": true}, "name": "feature"} e5298360e0bb11e29c3e0da2d091e093        0       1378238977.124738       -1      -1

  

2621 350138  \N      47895eba-48a1-46fd-94f0-0e4534e20086    tomato_sync_folder      {"parent_command": "f2ebc5fa-665c-4961-9e43-9d0cecaaf890", "queued_time": 1378053902.792007, "smtask": {"enabled_w": false, "copy_time_m": "use_current", "enabled_y": false, "copy_tim   2621 e_y": "use_current", "enabled_d": true, "enabled_m": false}, "folder_uuid": "d264b444-5b7e-4346-9d41-e4ecfc78f159", "policy": {"schedule_week": "1,0,0,0,0,0,0", "retain_years": 0, "uuid": "2b8837c4-1b86-46e1-a154-0d24e24794ab", "copy_mode": "backup", "schedule":    2621 "1145", "config_backup_enabled": 1, "enabled_month": 0, "retain_weeks": 4, "enabled": 1, "enabled_week": 1, "is_default": 0, "enabled_year": 0, "copy_time_month": "use_current", "schedule_year": "", "retain_months": 0, "email_enabled": 0, "target_id": "dfa74c5a-0   2621 cdb-4081-8d74-b665bd52f8e6", "retain_days": 10, "copy_time_year": "use_current", "schedule_month": "", "name": "IT-Development"}, "policy_error": null, "group_uuid": "73f2182e-38f3-4b5c-b996-a41464edaf4c"}      f2cedc4ae0bb11e283ce21512d2369ee        0       1378   2621 053904.0380311      -1      -1      103     -1      timed out stuck in queue        \N      \N      \N      73f2182e-38f3-4b5c-b996-a41464edaf4c

 

   2911 350859  \N      52901510-edda-423d-b630-cb8386f1df65    node_update_vm  {"uuid": "50239a64-64e4-458c-7006-6faccc43606c"}        f2cedc4ae0bb11e283ce21512d2369ee        0       1378084730.7184 -1      -1      103     -1      timed out stuck in queue        \N

 

 

  1. If you are familiar with VMware log analysis, you can take a look on the virtual machine log.

 

Notes:

 

  • The virtual machine logs are located under the virtual machine folder /vmfs/volumes/<datastore-name>/<virtual-machine>/ and they are named vmware.log (the old logs are renamed to vmware.log.1).

 

  • If you opt to collect the esx host logs in order to analyze the virtual machine log in house, make sure to collect the ESX log bundle that is currently running the virtual machine. ESX will only collect logs of the virtual machines that it has registered and are running.

 

So on the example above we see that all the backups in the policy "IT-Development" started and finished on Sep 17.

 

Looking at the vmware.log for the Virtual machine ICWSPT1, it was found that the last activity was reported on Aug 30. We see that a snapshot consolidation occurred at that time, but the latest events (message to toolbox-dnd time out) indicate a possible issue with the vmtools:

 

# tail vmware.log

2013-08-30T16:56:01.138Z| vcpu-0| I120: SnapshotVMXConsolidateOnlineCB: Destroying thread 10

2013-08-30T16:56:01.138Z| vcpu-0| I120: Turning off snapshot info cache.

2013-08-30T16:56:01.139Z| vcpu-0| I120: Turning off snapshot disk cache.

2013-08-30T16:56:01.139Z| vcpu-0| I120: SnapshotVMXConsolidateOnlineCB: Done with consolidate

2013-08-30T16:56:01.152Z| vcpu-2| I120: HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/51b4dc81-6739d73e-dda5-002219238016/ICWSPT1/ICWSPT1.vmdk'

2013-08-30T16:56:01.152Z| vcpu-2| I120: DDB: "longContentID" = "fc61879d10187fbfba8786f1dfed51bb" (was "4ffb9f780ef4dd6e08bbc103719948ce")

2013-08-30T16:56:01.176Z| vcpu-2| I120: DISKLIB-CHAIN : DiskChainUpdateContentID: old=0x719948ce, new=0xdfed51bb (fc61879d10187fbfba8786f1dfed51bb)

2013-08-30T16:56:53.180Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox-dnd timed out.

2013-08-30T16:56:56.710Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox-dnd timed out.

2013-08-30T16:57:01.146Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox-dnd timed out.

 

If vmware tools under the Virtual Machine you want to backup isn't running, ESX/vCenter won't be able to take snapshots.

 

  1. Looking back to the vmPRO log bundle from the master server, you'll find the file report.txt and you'll observe the following:

 

5.1. The backup environment is composed of a group of 5 vmPRO servers:

 

Group members

=============

Hostname              Role        Last Contact Time

--------------------  ----------  ----------------------

icuvmprop1            master      05 Sep 19:22:14

icuvmprop3            node        03 Sep 09:35:56

icuvmprop4            node        05 Sep 19:22:03

icuvmprop2            node        05 Sep 19:22:11

icuvmprop5            node        30 Aug 19:33:28

--------------------  ----------  ----------------------

 

5.2. The Virtual Machine ICWSPT1 is registered under the node icuvmprop5.

 

Virtual Machines

================

Name                                  Server            Operating System                Discover Time    Exported               CBT Enabled   Quiesce Enabled   Node

------------------------------------  ----------------  ------------------------------  ---------------  ---------------------  ------------  ----------------  ----------------

tcprod-app1                           vCenter           Microsoft Windows Server 2008   05 Sep 19:06:52  yes                    yes           no                icuvmprop4

tcprod-etl1                           vCenter           Microsoft Windows Server 2008   05 Sep 19:06:52  yes                    yes           no                icuvmprop4

Pegasus                               vCenter           Microsoft Windows Server 2008   05 Sep 19:21:14  yes                    yes           no                icuvmprop2

Metis                                 vCenter           Microsoft Windows Server 2008   05 Sep 19:06:51  yes                    yes           no                icuvmprop1

ICATRUCARED1                          vCenter           Microsoft Windows Server 2008   05 Sep 19:06:52  yes                    yes           no                icuvmprop4

ICWSPT1                               vCenter           Microsoft Windows Server 2008   05 Sep 19:06:52  yes                    yes           no                icuvmprop5

ICUPATCHP1                            vCenter           Microsoft Windows Server 2008   05 Sep 19:17:42  yes                    yes           no                icuvmprop2

RDS1                                  vCenter           Microsoft Windows Server 2008   05 Sep 19:12:49  yes                    yes           no                icuvmprop4

 

5.3. You'll also find in the master log that for a long time the master has been unable to communicate to the slave node icuvmprop5:

 

controller.1:2013-09-13 10:11:17: Clearing alert: No contact from a node icuvmprop5 in the last over 1 week.

controller.1:2013-09-13 10:12:13: controller/WARNING:: AlertLib.py:112 Posting alert: No contact from a node icuvmprop5 in the last over 1 week.

controller.1:2013-09-13 10:37:17: No contact from a node icuvmprop5 in the last over 1 week.

controller.2:2013-09-06 19:37:11: controller/WARNING:: AlertLib.py:112 Posting alert: No contact from a node icuvmprop5 in the last 1 week.

controller.2:2013-09-06 20:17:39: No contact from a node icuvmprop5 in the last 1 week.

controller.2:2013-09-07 19:36:06: controller/WARNING:: AlertLib.py:112 Posting alert: No contact from a node icuvmprop5 in the last over 1 week.

 

5.4. And looking at the vmPro master server web GUI under VMs > Resources you'll see that the slave icuvmprop5 is registered but invalid:

 

Machine generated alternative text: -. h I
M ‘-‘.‘
.. iaJb’nV2
I ‘wø’3
..

 

In this case, customer checked his Virtual Machine inventory and confirmed that the node icuvmprop5 no longer exists.

 

Resolution (fix or workaround)

 

  1. Advise the customer to fix the VMware tool issue. They can contact VMware technical support for assistance.

 

  1. Because the node icuvmprop5 no longer exists, advise the customer to delete it from the list.

 

 

Machine generated alternative text: c—I.’,--—.. J
N.d. List
¿3 *
..-.
*
q•  $
j ..‘ m.iae,:.
tiIeiv.Iri L.ntr.,b (ai al NaI Sep
F
•
A
•• CICPi
I —F..3
.. .ß———*
Icwmp.2
ca

Delete the node under VMs > Resources Click Configure and in the Configure Node List window, you can delete the node.

 

  1. Restart the backup policy.

 



This page was generated by the BrainKeeper Enterprise Wiki, © 2018