Problems with vzdump, qm suspend/resume lock timeout

e100

Renowned Member
Nov 6, 2010
1,268
47
88
Columbus, Ohio
ulbuilder.wordpress.com
Someone else on the pve-user mailing list is having a similar issue with no resolution:
http://pve.proxmox.com/pipermail/pve-user/2012-October/004897.html

While vzdump is running, qm suspend or qm resume will sometimes fail at random.
I am still running PVE 2.1, but I doubt this is a problem that was fixed in 2.2.

Code:
138: Oct 31 00:22:57 INFO: Starting Backup of VM 138 (qemu)
138: Oct 31 00:22:57 INFO: status = running
138: Oct 31 00:22:57 INFO: backup mode: snapshot
138: Oct 31 00:22:57 INFO: ionice priority: 7
138: Oct 31 00:22:57 INFO: suspend vm to make snapshot
138: Oct 31 00:22:57 INFO: trying to aquire lock... failed
138: Oct 31 00:22:57 INFO: can't lock file '/var/log/pve/tasks/.active.lock' - can't aquire lock - Interrupted system call
138: Oct 31 00:22:58 ERROR: Backup of VM 138 failed - command 'qm suspend 138 --skiplock' failed: exit code 4

Code:
115: Oct 31 00:12:02 INFO: Starting Backup of VM 115 (qemu)
115: Oct 31 00:12:02 INFO: status = running
115: Oct 31 00:12:02 INFO: backup mode: snapshot
115: Oct 31 00:12:02 INFO: ionice priority: 7
115: Oct 31 00:12:02 INFO: suspend vm to make snapshot
115: Oct 31 00:12:03 INFO:   Logical volume "vzsnap-vm1-0" created
115: Oct 31 00:12:03 INFO:   Logical volume "vzsnap-vm1-1" created
115: Oct 31 00:12:05 INFO:   Logical volume "vzsnap-vm1-2" created
115: Oct 31 00:12:05 INFO: resume vm
115: Oct 31 00:12:05 INFO: trying to aquire lock... failed
115: Oct 31 00:12:05 INFO: can't lock file '/var/log/pve/tasks/.active.lock' - can't aquire lock - Interrupted system call
115: Oct 31 00:12:06 ERROR: Backup of VM 115 failed - command 'qm resume 115 --skiplock' failed: exit code 4
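
For what it's worth, my reading of the error is that qm times out waiting for the lock on /var/log/pve/tasks/.active.lock: the blocking flock() is interrupted by an alarm signal while another task still holds the lock, and that interrupted syscall (EINTR) is what "Interrupted system call" means here. Below is a minimal Python sketch of that mechanism, just my own illustration and not PVE's actual Perl code; the stand-in lock path and the 5-second timeout are assumptions.

Code:
#!/usr/bin/env python3
# Minimal sketch of the failure mode above -- my own illustration, not PVE's
# actual code.  A blocking flock() on the task-list lock file is cut short by
# an alarm signal, and the interrupted call surfaces as EINTR
# ("Interrupted system call").
import fcntl
import signal

LOCK_FILE = "/tmp/.active.lock"  # stand-in for /var/log/pve/tasks/.active.lock
TIMEOUT = 5                      # seconds to wait for the lock (assumed value)

def _alarm(signum, frame):
    # SIGALRM interrupts the blocking flock(); raising here reports the
    # failure the same way the backup log does.
    raise InterruptedError("can't acquire lock - Interrupted system call")

def acquire_lock(path, timeout):
    fh = open(path, "a")
    old_handler = signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(timeout)
    try:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks while another task holds it
        return fh                       # keep fh open to hold the lock
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)

if __name__ == "__main__":
    try:
        lock = acquire_lock(LOCK_FILE, TIMEOUT)
        print("lock acquired")
        input("holding lock, press Enter to release... ")
    except InterruptedError as err:
        print("trying to acquire lock... failed:", err)

Running two copies of this side by side shows the same pattern: the second one sits waiting for the lock until the alarm fires, then fails with the EINTR message.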

We have been upgrading CPUs/motherboards in our cluster; today we upgraded the second server.
In both upgrades a 6-core CPU was replaced with an 8-core CPU with hyperthreading, so each of those two servers went from 6 cores to 16 logical cores.
The above errors happened on two different nodes; neither occurred on a node that was upgraded.

There are a total of 17 nodes in this cluster.
Before the upgrades there were 108 cores in total; after the upgrades there are 128.
Could the additional CPU cores cause this problem?

Any suggestions?