When Using vmPRO SmartView, NetWorker Hangs When Backing Up Large VMs with Heavy I/O Loads
This article covers diagnosis and resolution of EMC NetWorker backups that hang when Quantum vmPRO SmartView is used to back up virtual guest machines with large disks under heavy I/O loads.
Symptom:
NetWorker backups using vmPRO SmartView can hang if the virtual guest machines are under heavy I/O loads.
Cause:
If a virtual guest machine is under a heavy I/O load, VMware can take hours or days to remove a snapshot of the guest machine. While that removal is in progress, vmPRO cannot create a new snapshot when NetWorker initiates a new backup. Because NetWorker is unaware that the snapshot removal is still in progress, it eventually times out but never stops the backup process. As a result, backup processes stack up on the vmPRO appliance, and NetWorker eventually hangs (core dumps).
See the related Bugzilla entry: PTR 5430
Action:
If NetWorker backups begin to fail, examine the NetWorker logs on the backup server under /nsr/logs/daemon.log for something similar to the example below:
38758 03/29/2013 07:42:27 AM 0 0 0 3607394048 5851 0 svt50.svtbellvue.quantum.com nsrd NSR info savegroup failure alert: small-vm aborted, Total 2 client(s), 0 Clients disabled, 0 Hostname(s) Unresolved, 2 Failed, 0 Succeeded, 0 CPR Failed, 0 CPR Succeeded, 0 BMR Failed, 0 BMR Succeeded.
38758 03/29/2013 07:42:27 AM 0 0 0 3607394048 5851 0 svt50.svtbellvue.quantum.com nsrd NSR info savegroup alert: small-vm aborted, Total 2 client(s), 2 Failed. See group completion details for more information.
38758 03/29/2013 07:42:28 AM 0 0 0 3607394048 5851 0 svt50.svtbellvue.quantum.com nsrd NSR info savegroup failure alert: large-vm aborted, Total 2 client(s), 0 Clients disabled, 0 Hostname(s) Unresolved, 2 Failed, 0 Succeeded, 0 CPR Failed, 0 CPR Succeeded, 0 BMR Failed, 0 BMR Succeeded.
38758 03/29/2013 07:42:28 AM 0 0 0 3607394048 5851 0 svt50.svtbellvue.quantum.com nsrd NSR info savegroup alert: large-vm aborted, Total 2 client(s), 2 Failed. See group completion details for more information.
71193 03/29/2013 07:45:46 AM 0 0 0 3607394048 5851 0 svt50.svtbellvue.quantum.com nsrd NSR info Media Notice: Save set (2387976300) vmpro-skipp.svtbellvue.quantum.com:export/10.25.200.19/large-vm volume svt50.svtbellvue.quantum.com.001 on cletus is being terminated because: inactivity timeout
0 03/29/2013 07:45:46 AM 3 0 0 364308224 5966 0 svt50.svtbellvue.quantum.com nsrmmd NSR error 03/29/13 07:45:46 nsrmmd #2: save set export/10.25.200.19/large-vm for client vmpro-skipp.svtbellvue.quantum.com was aborted and removed from volume svt50.svtbellvue.quantum.com.001
71193 03/29/2013 07:45:46 AM 0 0 0 3607394048 5851 0 svt50.svtbellvue.quantum.com nsrd NSR info Media Info: save set export/10.25.200.19/large-vm for client vmpro-skipp.svtbellvue.quantum.com was aborted and removed from volume svt50.svtbellvue.quantum.com.001
71659 03/29/2013 07:45:46 AM 0 0 2 3607394048 5851 0 svt50.svtbellvue.quantum.com nsrd NSR info vmpro-skipp.svtbellvue.quantum.com:export/10.25.200.19/large-vm done saving to pool 'Default' (svt50.svtbellvue.quantum.com.001)
82327 03/29/2013 07:52:34 AM 1 9 0 1451214592 5869 0 svt50.svtbellvue.quantum.com nsrjobd JOBS notice Starting full purge of jobs database
93514 03/29/2013 07:52:35 AM 1 9 0 1451214592 5869 0 svt50.svtbellvue.quantum.com nsrjobd JOBS notice Completed full database purge in 0 min 1 sec. Records purged: 0
89017 03/29/2013 08:00:00 AM 1 5 0 3607394048 5851 0 svt50.svtbellvue.quantum.com nsrd NSR notice Scheduled resource list: NSR task DefaultReportHomeTask exited with return code 1.
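On a long daemon.log it can be quicker to search for the timeout message than to read the file by eye. The following is a minimal sketch, assuming the log path given above and the exact wording "inactivity timeout" seen in the sample; the function names are hypothetical and are not part of NetWorker or vmPRO.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not NetWorker commands.
# The default path is the one named in this article; pass another path
# as the first argument if the logs live elsewhere.

# Print every log line that reports a save set terminated for inactivity.
find_inactivity_timeouts() {
    log="${1:-/nsr/logs/daemon.log}"
    # -i ignores case, -F treats the pattern as a fixed string
    grep -iF "inactivity timeout" "$log"
}

# Print how many such lines the log contains (0 if none).
count_inactivity_timeouts() {
    log="${1:-/nsr/logs/daemon.log}"
    # grep -c exits non-zero when the count is 0, so mask the status
    grep -ciF "inactivity timeout" "$log" || true
}
```

Running `find_inactivity_timeouts` with no argument searches /nsr/logs/daemon.log; a non-zero result from `count_inactivity_timeouts` suggests the restart procedure described next may be needed.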
If the logs indicate inactivity timeout, any backups running will need to be halted, and services will need to be restarted on both the NetWorker server and the client on the vmPRO appliance.
To restart services on the NetWorker server, run the following commands:
/etc/init.d/networker stop
/etc/init.d/networker start
To restart services on the vmPRO appliance, run the following command:
nw restart
This will flush out any outstanding backups that may have stacked up on the vmPRO appliance.
Prevention:
To avoid this issue with future backups, adjust backup schedules to allow for the time VMware needs to completely remove a virtual guest machine's snapshot before a new backup is initiated.
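As a rough planning aid, the schedule check can be reduced to simple arithmetic: the interval between backup starts should exceed the backup duration plus the observed snapshot removal time. The sketch below assumes all three values are estimated by the administrator, in minutes; the function name and parameters are hypothetical and do not correspond to any NetWorker or vmPRO setting.

```shell
#!/bin/sh
# Sketch only: illustrates the scheduling rule, not a product feature.
# Arguments (all in minutes, all administrator estimates):
#   $1  interval between scheduled backup starts
#   $2  typical duration of one backup
#   $3  observed time for VMware to finish removing the snapshot
# Returns success (0) when the interval leaves room for both.
schedule_has_room() {
    interval_min=$1
    backup_min=$2
    removal_min=$3
    [ "$interval_min" -gt $(( backup_min + removal_min )) ]
}
```

For example, `schedule_has_room 1440 120 600` succeeds (a daily backup with a 2-hour run and a 10-hour snapshot removal), while `schedule_has_room 240 120 600` fails, indicating that a 4-hour interval would start a new backup before the previous snapshot is gone.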