CBT backup fails due to changeid mismatch (DRAFT)

 

SR Information: 1611512

Product / Software Version: VmPro 3.1.1 - vCenter 5.0

Problem Description: Virtual machine backup will stuck in 'queued' state

Related PTRs:  

Bug 5927 - Customer Quantum IT: Backups fail due to changeid mismatch for VMs with same-named disks on different datastores

 

Overview

 

This article provides clarification and description of workaround for a specific case found in the field where vmPRO will fail backups when the changeid of the virtual disk is different from the current changeid.

 

When you enable cbt, vmPRO will request the changeid of each virtual disk of the virtual machine that is being processed for the backup. For version control, vmPRO will save this changeid followed by a sequence number. For example:

 

52 44 cf 9b 8d 0e 90 30-5e 0b ee 35 76 30 f3 de/1911

 

Where:

52 44 cf 9b 8d 0e 90 30-5e 0b ee 35 76 30 f3 de is the change id

1911 is the sequence/version number

 

Whenever vmPRO is backing up a vm using cbt, vmPRO will have the change id and the sequence number from the prior backup and compare with the change id and sequence number collected during the current backup.

 

This case describes a specific condition where the changeid doesn't match, and describes the symptoms and the workaround.

 

Symptom

 

  1. vmPRO backup fails with the following error message in the vmPRO messages log (you'll also see the message in the vmPRO GUI and vm_proxy.log)

 

2013-10-09 19:58:25.595918: ERROR [thread:140005896836864, where:spawn.c:246, error:8] /usr/local/pancetera/bin/cbt_getblocklist.pl --server ppoesx23.quantum.com --portnumber 443 --username root --password ****** --vmuuid 4  31500 20aa639-4bbd-889c-b122-e9836130794f --vmname PPOWXENAPPV5 --vmdkpath /vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5/PPOWXENAPPV5.vmdk --vmdksize 9663676416 --changeid_only n --resultsfile /tmp/queryHAJZaD --  31500 vendor_id Quantum --appliance_uuid 2a2bea9ee24c11e187f90050568a6300 --debug 1 --changeid_incr 52 44 cf 9b 8d 0e 90 30-5e 0b ee 35 76 30 f3 de/1911

 

  1. In order to go deeper in the analysis, enable the cbt debug on the vmPRO:

 

  1. Login via ssh to the vmPRO as sysadmin

 

  1. Execute the following command:

 

reg set appliance.cbt.debug 1

 

Note: to change the setting back to the default, execute:

 

reg set appliance.cbt.debug 0

 

When you have debug set, you can reproduce the problem and you'll find the following message in the vm_proxy.log (vmPRO appliance log):

 

  27997 get_changeid checking kPV5/PPOWXENAPPV5.vmdk

  27998 vmdk_size: 9663676416

  27999 results_file: /tmp/queryHAJZaD

  28000 get_changeid ds_path:  /vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/

  28001 get_changeid disk_path: PPOWXENAPPV5/PPOWXENAPPV5.vmdk

  28002 get_changeid ds_url: /vmfs/volumes/4b902464-cf573416-7e5f-001f29cefe88/

  28003 get_changeid ds_url: /vmfs/volumes/502530b3-8d820f93-02bc-a4badb125366/

  28004 get_changeid ds_url: /vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/

  28005 get_changeid found ds_url match: /vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/

  28006 get_changeid ds_name: DS_PPO_T1_119_R1

  28007 get_changeid layout_disk: [DS_PPO_T1_119_R1] PPOWXENAPPV5/PPOWXENAPPV5.vmdk

  28008 get_changeid diskFile: [DS_PPO_T1_110_R1] PPOWXENAPPV5/PPOWXENAPPV5.vmdk

  28009 get_changeid diskFile: [DS_PPO_T1_110_R1] PPOWXENAPPV5/PPOWXENAPPV5-000001.vmdk

  28010 get_changeid diskFile: [DS_PPO_T1_119_R1] PPOWXENAPPV5/PPOWXENAPPV5.vmdk

  28011 get_changeid found diskFile match: [DS_PPO_T1_119_R1] PPOWXENAPPV5/PPOWXENAPPV5.vmdk

  28012 get_changeid disk_key: 2001

  28013 get_changeid diskFile: [DS_PPO_T1_114] PPOWXENAPPV5/PPOWXENAPPV5.vmdk

  28014 get_changeid diskFile: [DS_PPO_T1_114] PPOWXENAPPV5/PPOWXENAPPV5-000001.vmdk

  28015 get_changeid checking key: 200

  28016 get_changeid checking key: 201

  28017 get_changeid checking key: 300

  28018 get_changeid checking key: 100

  28019 get_changeid checking key: 400

  28020 get_changeid checking key: 600

  28021 get_changeid checking key: 700

  28022 get_changeid checking key: 500

  28023 get_changeid checking key: 12000

  28024 get_changeid checking key: 1000

  28025 get_changeid checking key: 3000

  28026 get_changeid checking key: 4000

  28027 get_changeid checking key: 2000

  28028 get_changeid checking key: 2001

  28029 get_changeid found key match: [DS_PPO_T1_119_R1] PPOWXENAPPV5/PPOWXENAPPV5.vmdk

  28030 found device key: 2001

  28031 found changeId: 52 26 e2 c1 19 1c 52 de-01 48 54 15 a2 e5 99 61/2201

  28032 received device_key: 2001

  28033 received changeid_cur:    52 26 e2 c1 19 1c 52 de-01 48 54 15 a2 e5 99 61/2201

  28034 modified changeid_base_1: 52 26 e2 c1 19 1c 52 de-01 48 54 15 a2 e5 99 61/1

  28035 compare_changeids cur 0:   52 26 e2 c1 19 1c 52 de-01 48 54 15 a2 e5 99 61

  28036 compare_changeids cur 1:   2201

  28037 compare_changeids incr 0:  52 44 cf 9b 8d 0e 90 30-5e 0b ee 35 76 30 f3 de

  28038 compare_changeids incr 1:  1911

  28039 Changeid_incr does not match changeid_cur for /vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5/PPOWXENAPPV5.vmdk

  28040 getblocklist changeid_incr: 52 44 cf 9b 8d 0e 90 30-5e 0b ee 35 76 30 f3 de/1911

  28041 getblocklist changeid_cur:  52 26 e2 c1 19 1c 52 de-01 48 54 15 a2 e5 99 61/2201

  28042 2013-10-09 19:58:25.904121: open_pancbt: requested auto Changed Block Tracking reset for: PPOWXENAPPV5

  28043 ey: 12000

  28044 get_changeid checking key: 1000

  28045 get_changeid checking key: 3000

  28046 get_changeid checking key: 4000

  28047 get_changeid checking key: 2000

 

Note that in the above example, we have the same virtual disk (in green) and changeid_inc/seq# (in blue and pink) in the message logs event described under item1.

 

Where the does the changeid_cur (in red) came from? For that you'll need to check the virtual machine under the ESX host. Please see the next steps for directions.

 

  1. You can execute the virtual machine analysis directly on the ESX host connected via PuTTy. If you aren't familiar with working on the ESX host, it's highly advised that you collect the ESX logs and analyze the virtual machine from those logs. In the next part of the article we'll be showing how to analyze those logs.

 

Additional notes:

 

 

 

The ESX log will have the format <esx_host_name>-YYYY-MM-DD@HH-MM-SS.tgz as we see in the example bellow:

 

ppoesx23.quantum.com-vmsupport-2013-10-10@13-21-55.tgz

 

 

  1. Uncompress the log and note that a directory is created:

 

# tar -xzvf ppoesx23.quantum.com-vmsupport-2013-10-10\@13-21-55.tgz

 

# ll

drwxr-xr-x 12 eeskenbe domain^users      4096 Oct 15 10:41 esx-ppoesx23.quantum.com-2013-10-10--19.20

-rw-r--r--  1 eeskenbe domain^users 159608097 Oct 10 16:20 ppoesx23.quantum.com-vmsupport-2013-10-10@13-21-55.tgz

 

 

  1. Go to the ESX log bundle directory and look for the virtual machine name (identified on the vmPRO messages) under the path /vmfs/volume:

 

# cd esx-ppoesx23.quantum.com-2013-10-10--19.20

 

# find ./vmfs/volumes -name PPOWXENAPPV5

./vmfs/volumes/502530b3-8d820f93-02bc-a4badb125366/PPOWXENAPPV5

./vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5

./vmfs/volumes/4b902464-cf573416-7e5f-001f29cefe88/PPOWXENAPPV5

 

 

Above we see that the virtual machine has 3 directories in different datastores. This means that this virtual machine has virtual disks spread in different datastores. You can find out where by running a simple find:

 

# find ./vmfs/volumes -name PPOWXENAPPV5* | grep vmdk

./vmfs/volumes/502530b3-8d820f93-02bc-a4badb125366/PPOWXENAPPV5/PPOWXENAPPV5-ctk.vmdk

./vmfs/volumes/502530b3-8d820f93-02bc-a4badb125366/PPOWXENAPPV5/PPOWXENAPPV5.vmdk

./vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5/PPOWXENAPPV5-ctk.vmdk

./vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5/PPOWXENAPPV5.vmdk

./vmfs/volumes/4b902464-cf573416-7e5f-001f29cefe88/PPOWXENAPPV5/PPOWXENAPPV5-ctk.vmdk

./vmfs/volumes/4b902464-cf573416-7e5f-001f29cefe88/PPOWXENAPPV5/PPOWXENAPPV5.vmdk

 

The datastores are:

502530b3-8d820f93-02bc-a4badb125366

4d654ab0-08d768f8-95a0-a4badb1252fe

4b902464-cf573416-7e5f-001f29cefe88

 

Additional notes:

 

A virtual disk is composed by 2 files:

 

*.vmdk and *-flat.vmdk

 

If you have cbt enabled, each virtual disk will have an additional file named *-ctk.vmdk

 

In the example above, because we are looking at the ESX log bundle, we will not have the *-flat.vmdk file included. This is ok; we won't need this file for this analysis.

 

  1. Now we will verify the cbt changeid for each virtual disk found above with a hex editor.  The command is 'hexdump -C -n 64 /path/virtualdisk-ctk.vmdk' as shown:

 

# hexdump -C -n 64 ./vmfs/volumes/502530b3-8d820f93-02bc-a4badb125366/PPOWXENAPPV5/PPOWXENAPPV5-ctk.vmdk

00000000  a2 72 19 f6 01 00 00 00  00 00 00 00 00 02 00 00  |.r..............|

00000010  00 00 40 06 00 00 00 00  af 07 00 00 ab 07 00 00  |..@.............|

00000020  00 00 00 00 80 00 00 00  00 80 0c 00 52 44 cf 9b  |............RD..|

00000030  8d 0e 90 30 5e 0b ee 35  76 30 f3 de 00 00 00 00  |...0^..5v0......|

00000040

 

# hexdump -C -n 64 ./vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5/PPOWXENAPPV5-ctk.vmdk

00000000  a2 72 19 f6 01 00 00 00  00 00 00 00 00 02 00 00  |.r..............|

00000010  00 00 20 01 00 00 00 00  a8 08 00 00 99 08 00 00  |.. .............|

00000020  00 00 00 00 80 00 00 00  00 40 02 00 52 26 e2 c1  |.........@..R&..|

00000030  19 1c 52 de 01 48 54 15  a2 e5 99 61 00 00 00 00  |..R..HT....a....|

00000040

 

# hexdump -C -n 64 ./vmfs/volumes/4b902464-cf573416-7e5f-001f29cefe88/PPOWXENAPPV5/PPOWXENAPPV5-ctk.vmdk

00000000  a2 72 19 f6 01 00 00 00  00 00 00 00 00 02 00 00  |.r..............|

00000010  00 00 40 01 00 00 00 00  78 07 00 00 72 07 00 00  |..@.....x...r...|

00000020  00 00 00 00 80 00 00 00  00 80 02 00 52 1c 4e 25  |............R.N%|

00000030  0f 31 14 73 52 80 e4 d8  3d f1 e4 ee 00 00 00 00  |.1.sR...=.......|

 

Where is the changeid located?

 

Starting from offset '2d' we have the changeid of the virtual disk in the ctk file.

 

Note again under the vm_proxy log (item 2, marked with the same color code)  that vmPRO is comparing the changeid of two different virtual disks:

 

  28039 Changeid_incr does not match changeid_cur for /vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5/PPOWXENAPPV5.vmdk

  28040 getblocklist changeid_incr: 52 44 cf 9b 8d 0e 90 30-5e 0b ee 35 76 30 f3 de/1911

  28041 getblocklist changeid_cur:  52 26 e2 c1 19 1c 52 de-01 48 54 15 a2 e5 99 61/2201

 

What you should expect in a normal condition is something like:

 

  28039 Changeid_incr does not match changeid_cur for /vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5/PPOWXENAPPV5.vmdk

  28040 getblocklist changeid_incr: 52 26 e2 c1 19 1c 52 de-01 48 54 15 a2 e5 99 61/2200

  28041 getblocklist changeid_cur:  52 26 e2 c1 19 1c 52 de-01 48 54 15 a2 e5 99 61/2201

 

 

 

Resolution

 

Monitor PTR 5927 for the fix status. In the interim, the workaround is to rename the virtual disks owned by the virtual machines with unique names.

 

The procedure to rename a virtual disk is complex. We have basic guidelines to get you familiar with the process, but we advise you to explain to the customer about the process and highly recommend that the customer pursue VMware technical support to assist him on this procedure.

 

Steps to rename the virtual disk (requires downtime of the virtual machine):

 

  1. Shutdown the virtual machine (make sure there are no snapshots or delta files)

 

  1. Open a PuTTy session to the ESX and rename the virtual disk using the command vmkfstools like the example below:

 

# vmkfstools -E  /vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5/PPOWXENAPPV5.vmdk /vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5/PPOWXENAPPV5_1.vmdk

 

The command above should change all the 3 files. Information about this command can be found in the VMware knowledgebase article:

 

http://kb.vmware.com/kb/1002491

 

  1. Make sure all 3 files were changed  *-flat.vmdk, *.vmdk and *-ctk.vmdk to the new name

 

  1. Next you need to change the *.vmx file. Find for the line where the disk above is pointed:

 

# find ./ -name PPOWXENAPPV5.vmx

./vmfs/volumes/4b902464-cf573416-7e5f-001f29cefe88/PPOWXENAPPV5/PPOWXENAPPV5.vmx

 

 

Inside the vmx, you'll find the following lines:

 

scsi0:1.present = "true"

scsi0:1.fileName = "/vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5/PPOWXENAPPV5.vmdk"

scsi0:1.deviceType = "scsi-hardDisk"

scsi0:1.ctkEnabled = "true"

sched.scsi0:1.shares = "normal"

 

Change to the new renamed disk:

 

scsi0:1.present = "true"

scsi0:1.fileName = "/vmfs/volumes/4d654ab0-08d768f8-95a0-a4badb1252fe/PPOWXENAPPV5/PPOWXENAPPV5_1.vmdk"

scsi0:1.deviceType = "scsi-hardDisk"

scsi0:1.ctkEnabled = "true"

sched.scsi0:1.shares = "normal"

 

  1. In order to the change take effect, you must to unregister and re-register the virtual machine. Be careful with the GUI as it has an option of 'delete from inventory'. Usually we highly advise you to do this operation via the CLI using the vim-cmd command (PuTTy) or the VIMA (ESX remote console):

 

Clarification and instructions about the commands are found in the following vmware article:

 

http://kb.vmware.com/kb/2012964

 

Here the steps:

 

 

# vim-cmd vmsvc/getallvms

Vmid              Name                                                    File                                               Guest OS           Version                      Annotation

13     DXi0_2.2.0_build47644         [Managed] BLUE_DXi0_2.2.0_build47644/BLUE_DXi0_2.2.0_build47644.vmx              rhel6_64Guest             vmx-07    IP: 10.20.234.82

15     RHEL-v6.2                     [Unmanaged] RHEL-/RHEL-.vmx                                                      rhel6_64Guest             vmx-07

17     Centos 6.3                    [Unmanaged] Centos 6.3/Centos 6.3.vmx                                            centos64Guest             vmx-08

18     Marsden_Windows7              [Unmanaged] Marsden_Windows7/Marsden_Windows7.vmx                                windows7_64Guest          vmx-08

 

 

# vim-cmd vmsvc/unregister <vimid_of_the_virtual_machine>

 

 

 

# vim-cmd solo/registervm /vmfs/volumes/4b902464-cf573416-7e5f-001f29cefe88/PPOWXENAPPV5/PPOWXENAPPV5.vmx

 

 

Important Note: this procedure will re-register the vm, but it will not add it back to the original resource pool, assuming that the machine was previously registered under one. Thus, we advise that the customer seek assistance from VMware. The re-registration may require extra steps to set the machine back to the original resource pool and other settings after re-registration.



This page was generated by the BrainKeeper Enterprise Wiki, © 2018