[SOLVED] Certain VMs from a cluster cannot be backed up and managed

The error occurs here for me too, after upgrading from Proxmox 5 to 6.
One node of a 3-node cluster with Ceph storage is affected.
Backup, console access, snapshot, and migration for VMs all run into timeouts.
After shutting down a VM and restarting it on the same node, the problem soon occurs again.
After migrating the VM to one of the other nodes, the problem doesn't occur.

Works for me now with pve-qemu-kvm 4.0.0-7 from pvetest
 
I know what the issue was; I started this thread...

I asked the Proxmox staff for details about what caused this issue, since they said they fixed it. Their answer:

a bug in Qemu's monitor implementation that could lead to a permanently suspended (and thus non-responsive) QMP instance.

If you are interested in the nitty-gritty details:
https://git.proxmox.com/?p=pve-qemu...0;hb=3b1986f02ff23f9a6d8aad8708a959bef5ec0b6e
https://lists.gnu.org/archive/html/qemu-devel/2019-10/msg00584.html
 
Just a quick question: How can we remove the trace pattern config?

Just remove the 'args' option from the VM config files where you added it, and once you've done that for all VMs, remove the pattern file as well.

e.g., qm set VMID --delete args
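
If you added the workaround to several VMs, a small loop can take care of the cleanup. A rough sketch; the VMIDs and the pattern file path /etc/pve/qmp-trace.pattern are only examples, use whatever you actually configured:

Code:
# remove the custom 'args' line from each VM config that got the workaround
for vmid in 101 102 103; do
    qm set $vmid --delete args
done
# once no VM references it anymore, delete the trace pattern file
rm /etc/pve/qmp-trace.pattern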
 
Ok, good. I marked this as solved.

Interestingly, after we separated the cluster network from the management network, we did not have any issues like this anymore.
That happened the day I submitted this thread, or a day after that; since then I have seen no problematic VMs, and we have not updated our nodes yet.
 

We have seen this bug off and on for many years. Doing things like separating the cluster network only makes the issue happen less often.

The fact that the cause of the issue seems to have finally been found and a patch implemented is excellent. See
https://lists.gnu.org/archive/html/qemu-devel/2019-10/msg02105.html :

" You made me look at ->suspend_cnt, and now I feel nauseous. ...
It's been ten years "

Now, it is not unusual to have old bugs in software; they make the hike more interesting. We are very lucky to have the Proxmox devs, along with the devs of all the other subsystems, working to make things right.
 
Hi everyone,
we're experiencing the backup hangs here as well, with fully updated packages.
Our pve-qemu-kvm is at 4.1.1-2, so that is well past the expected fix in 4.0.0-7, I suppose.

The backup hangs at

Code:
INFO: Starting Backup of VM 102 (qemu)
INFO: Backup started at 2020-01-08 00:46:33
INFO: status = running
INFO: update VM 102: -lock backup
INFO: VM Name: Datenserver
INFO: include disk 'scsi0' 'VMs:vm-102-disk-0' 200G
INFO: include disk 'scsi1' 'VMs:vm-102-disk-2' 2T
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: snapshots found (not included into backup)
INFO: creating archive '/mnt/pve/Backup-Daily/dump/vzdump-qemu-102-2020_01_08-00_46_33.vma.lzo'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '5a59325b-467a-4c84-a5de-f9931ec24160'
INFO: status: 0% (824246272/2413771620352), sparse 0% (706871296), duration 3, read/write 274/39 MB/s
INFO: status: 1% (24434966528/2413771620352), sparse 0% (22524571648), duration 109, read/write 222/16 MB/s
INFO: status: 2% (48348266496/2413771620352), sparse 1% (45942001664), duration 203, read/write 254/5 MB/s
INFO: status: 3% (72500183040/2413771620352), sparse 2% (68416868352), duration 309, read/write 227/15 MB/s
INFO: status: 4% (96783564800/2413771620352), sparse 3% (90717052928), duration 429, read/write 202/16 MB/s
INFO: status: 5% (120813912064/2413771620352), sparse 4% (114680606720), duration 519, read/write 267/0 MB/s
INFO: status: 6% (145151623168/2413771620352), sparse 5% (134593253376), duration 624, read/write 231/42 MB/s
INFO: status: 7% (169108111360/2413771620352), sparse 6% (158010511360), duration 670, read/write 520/11 MB/s
INFO: status: 8% (193692499968/2413771620352), sparse 7% (181905051648), duration 723, read/write 463/13 MB/s
INFO: status: 9% (217293062144/2413771620352), sparse 8% (202872406016), duration 782, read/write 400/44 MB/s
INFO: status: 10% (241448976384/2413771620352), sparse 8% (204389859328), duration 961, read/write 134/126 MB/s
INFO: status: 11% (265641328640/2413771620352), sparse 8% (206242381824), duration 1156, read/write 124/114 MB/s
INFO: status: 12% (289658175488/2413771620352), sparse 8% (209941843968), duration 1348, read/write 125/105 MB/s
INFO: status: 13% (313911803904/2413771620352), sparse 8% (213902004224), duration 1542, read/write 125/104 MB/s
INFO: status: 14% (337975771136/2413771620352), sparse 8% (216371822592), duration 1740, read/write 121/109 MB/s
INFO: status: 15% (362171334656/2413771620352), sparse 9% (220185518080), duration 1937, read/write 122/103 MB/s

since Jan 8th 01:15.

The respective VM 102 is locked and noVNC is not available for it (all other VMs' consoles work just fine).

The backup target is a separate openmediavault bare-metal machine, successfully connected via CIFS.
Backup storage is connected and accessible.

Later backup jobs obviously fail with
Code:
INFO: trying to get global lock - waiting...
ERROR: can't acquire lock '/var/run/vzdump.lock' - got timeout
TASK ERROR: got unexpected control message:

At this point, I don't really know what I could do and would be very happy about any advice you can give me.
thanks
Sascha
 
Well, I just checked and we are still on qemu 4.0.0-5... so I cannot tell you anything for sure.
In these cases we just shut the machines down and started them up again.
But the issue had pretty much gone away after we moved the cluster traffic to a separate dedicated network, without any package update.
 
Hm, well, I forgot to say that we're experiencing this on a single-node system, no cluster.
But it could well be that using a different NIC for the VMs may help.
Will try this out as soon as possible.
However, if you have more ideas, again, any recommendation is welcome.
cheers
Sascha
 
Well, in this case it will most likely make no difference.

Just shut down the VM and start it up again when possible.

Edit:

I see the issue is lock related, so you could try "qm unlock VMID".
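
For reference, a rough sketch of that approach (using VMID 102 from your log purely as an example):

Code:
# check whether the VM still carries a lock
qm config 102 | grep lock
# remove the stale 'backup' lock so the VM can be managed again
qm unlock 102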
 
@tafkaz that does look like a different bug though; the log output is very different. Can you check the system logs, starting from when the backup started until around 3 minutes after it stopped printing new output?
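
In case it helps, one way to pull exactly that window out of the journal (the timestamps below are placeholders; use your backup start time and roughly three minutes after the last status line):

Code:
# system log from backup start until ~3 minutes after the last output
journalctl --since "2020-01-08 00:46:00" --until "2020-01-08 01:18:00"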
 
Hi,

Code:
Jan  8 00:46:33 pve vzdump[21653]: INFO: Finished Backup of VM 101 (00:21:16)
Jan  8 00:46:33 pve vzdump[21653]: INFO: Starting Backup of VM 102 (qemu)
Jan  8 00:46:34 pve qm[27379]: <root@pam> update VM 102: -lock backup
Jan  8 00:58:07 pve kernel: [95606.157217] perf: interrupt took too long (9964 > 9961), lowering kernel.perf_event_max_sample_rate to 20000

[...]

Jan  8 01:39:01 pve systemd[1]: Started Proxmox VE replication runner.
Jan  8 01:39:25 pve pvestatd[1408]: got timeout
Jan  8 01:39:25 pve kernel: [98084.380919] CIFS VFS: Cancelling wait for mid 6919638 cmd: 5
Jan  8 01:39:25 pve kernel: [98084.382029] CIFS VFS: Cancelling wait for mid 6919639 cmd: 16
Jan  8 01:39:25 pve pvestatd[1408]: unable to activate storage 'Backup-Daily' - directory '/mnt/pve/Backup-Daily' does not exist or is unreachable
Jan  8 01:39:28 pve pvestatd[1408]: got timeout
Jan  8 01:39:28 pve kernel: [98086.804933] CIFS VFS: Cancelling wait for mid 6919642 cmd: 5
Jan  8 01:39:28 pve kernel: [98086.805504] CIFS VFS: Cancelling wait for mid 6919643 cmd: 16
Jan  8 01:39:28 pve pvestatd[1408]: unable to activate storage 'Backup-Weekly' - directory '/mnt/pve/Backup-Weekly' does not exist or is unreachable
Jan  8 01:39:28 pve pvestatd[1408]: status update time (7.655 seconds)
Jan  8 01:39:35 pve pvestatd[1408]: got timeout
Jan  8 01:39:35 pve kernel: [98094.033736] CIFS VFS: Cancelling wait for mid 6919667 cmd: 5
Jan  8 01:39:35 pve kernel: [98094.034313] CIFS VFS: Cancelling wait for mid 6919668 cmd: 16
Jan  8 01:39:35 pve pvestatd[1408]: unable to activate storage 'Backup-Daily' - directory '/mnt/pve/Backup-Daily' does not exist or is unreachable
Jan  8 01:39:37 pve pvestatd[1408]: got timeout
Jan  8 01:39:37 pve kernel: [98096.434956] CIFS VFS: Cancelling wait for mid 6919669 cmd: 5
Jan  8 01:39:37 pve kernel: [98096.435593] CIFS VFS: Cancelling wait for mid 6919670 cmd: 16
Jan  8 01:39:37 pve pvestatd[1408]: unable to activate storage 'Backup-Weekly' - directory '/mnt/pve/Backup-Weekly' does not exist or is unreachable
Jan  8 01:39:37 pve pvestatd[1408]: status update time (7.617 seconds)
Jan  8 01:39:45 pve pvestatd[1408]: got timeout
Jan  8 01:39:45 pve kernel: [98104.648100] CIFS VFS: Cancelling wait for mid 6919675 cmd: 5
Jan  8 01:39:45 pve kernel: [98104.648694] CIFS VFS: Cancelling wait for mid 6919676 cmd: 16
Jan  8 01:39:45 pve pvestatd[1408]: unable to activate storage 'Backup-Daily' - directory '/mnt/pve/Backup-Daily' does not exist or is unreachable
Jan  8 01:39:48 pve pvestatd[1408]: got timeout
Jan  8 01:39:48 pve kernel: [98107.038101] CIFS VFS: Cancelling wait for mid 6919679 cmd: 5
Jan  8 01:39:48 pve kernel: [98107.038700] CIFS VFS: Cancelling wait for mid 6919680 cmd: 16
Jan  8 01:39:48 pve pvestatd[1408]: unable to activate storage 'Backup-Weekly' - directory '/mnt/pve/Backup-Weekly' does not exist or is unreachable
Jan  8 01:39:48 pve pvestatd[1408]: status update time (7.602 seconds)
Jan  8 01:39:55 pve pvestatd[1408]: got timeout
Jan  8 01:39:55 pve kernel: [98114.251640] CIFS VFS: Cancelling wait for mid 6919684 cmd: 5
Jan  8 01:39:55 pve kernel: [98114.252246] CIFS VFS: Cancelling wait for mid 6919685 cmd: 16
Jan  8 01:39:55 pve pvestatd[1408]: unable to activate storage 'Backup-Weekly' - directory '/mnt/pve/Backup-Weekly' does not exist or is unreachable
Jan  8 01:39:57 pve pvestatd[1408]: got timeout
Jan  8 01:39:57 pve kernel: [98116.403458] CIFS VFS: Cancelling wait for mid 6919686 cmd: 5
Jan  8 01:39:57 pve kernel: [98116.404076] CIFS VFS: Cancelling wait for mid 6919687 cmd: 16
Jan  8 01:39:57 pve pvestatd[1408]: unable to activate storage 'Backup-Daily' - directory '/mnt/pve/Backup-Daily' does not exist or is unreachable


The line "Jan 8 00:58:07 pve kernel: [95606.157217] perf: interrupt took too long (9964 > 9961), lowering kernel.perf_event_max_sample_rate to 20000" already looks strange, but nevertheless the system seems to have lost the connection to the CIFS share at some point. I can't really understand that though, as the share is mounted according to mtab:

Code:
//192.168.1.201/Backup-Daily /mnt/pve/Backup-Daily cifs rw,relatime,vers=3.0,cache=strict,username=someuser,uid=0,noforceuid,gid=0,noforcegid,addr=192.168.1.201,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1 0 0
//192.168.1.201/Backup-Weekly /mnt/pve/Backup-Weekly cifs rw,relatime,vers=3.0,cache=strict,username=someuser,uid=0,noforceuid,gid=0,noforcegid,addr=192.168.1.201,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1 0 0

and the files are also browseable via PVE GUI.
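
A quick way to see whether the mount is actually responding, rather than just being listed in mtab, is an access attempt wrapped in a timeout (paths taken from the mtab output above):

Code:
# finishes within 10 seconds even if the CIFS server has gone away
timeout 10 ls /mnt/pve/Backup-Daily && echo "share responsive" || echo "share hung or unreachable"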
 
That just means that it is mounted (again) now, but with network shares like NFS and CIFS an old access attempt can still stay blocked forever.
 
Hm, OK, I see.
So the question should be: why did the connection drop in the first place?
Well, I will reboot the whole node and see whether we get anywhere from there.
thanks
Sascha
 
Is this resolved? It seems like I get this issue with one of my 2 nodes.
The Windows Server 2019 Standard VM seems to have had this issue ever since I tried to back it up to PVE Backup Server.
My other node, with Windows Server 2016 Essentials, doesn't show this behaviour...
 
Since I've been using the new backup server, I've also had the same problem with a VM running Windows Server 2019 Standard. The machine hangs during backups; apart from the previously mentioned log entries, nothing can be seen.

VM xxx qmp command 'backup-cancel' failed - unable to connect to VM xxx qmp socket - timeout after...
 
VM xxx qmp command 'backup-cancel' failed - unable to connect to VM xxx qmp socket - timeout after...

I have the same problem. The VM does not have a QEMU guest agent.

Code:
2020-10-24 01:03:02 INFO: Starting Backup of VM 109 (qemu)
2020-10-24 01:03:02 INFO: status = running
2020-10-24 01:03:02 INFO: VM Name: PROD-XXXX
2020-10-24 01:03:02 INFO: include disk 'scsi0' 'local-zfs:vm-109-disk-0' 8G
2020-10-24 01:03:02 INFO: include disk 'scsi1' 'local-zfs:vm-109-disk-1' 50G
2020-10-24 01:03:02 INFO: include disk 'scsi2' 'local-zfs:vm-109-disk-2' 200G
2020-10-24 01:03:02 INFO: backup mode: snapshot
2020-10-24 01:03:02 INFO: bandwidth limit: 200000 KB/s
2020-10-24 01:03:02 INFO: ionice priority: 7
2020-10-24 01:03:02 INFO: creating vzdump archive '/mnt/pve/backup/dump/vzdump-qemu-109-2020_10_24-01_03_02.vma.zst'
2020-10-24 01:03:02 INFO: started backup task '7462eae1-70d6-4950-988e-8d29bcf9a31a'
2020-10-24 01:03:02 INFO: resuming VM again
...
2020-10-24 01:41:14 ERROR: VM 109 qmp command 'query-backup' failed - got timeout
2020-10-24 01:41:14 INFO: aborting backup job
2020-10-24 01:51:14 ERROR: VM 109 qmp command 'backup-cancel' failed - unable to connect to VM 109 qmp socket - timeout after 5983 retries
2020-10-24 01:51:21 ERROR: Backup of VM 109 failed - VM 109 qmp command 'query-backup' failed - got timeout

std: error 25 : Write error : No space left on device (cannot write compressed block).
Warning: unable to close filehandle GEN3430 properly: No space left on device at /usr/share/perl5/PVE/VZDump/QemuServer.pm line 752.
ERROR: VM 109 qmp command 'query-backup' failed - got timeout
INFO: aborting backup job
ERROR: VM 109 qmp command 'backup-cancel' failed - unable to connect to VM 109 qmp socket - timeout after 5983 retries
ERROR: Backup of VM 109 failed - VM 109 qmp command 'query-backup' failed - got timeout
INFO: Failed at 2020-10-24 01:51:46

OK, I ran out of space on the backup storage, but the backup should be handled by the system in a sensible way that does not hang the virtual machine.

The machine is hanging and I can't connect via VNC. Only stopping and restarting helps :(

The backup is made from Proxmox itself; I don't use Proxmox Backup Server.

Code:
root@pve2:/var/log/vzdump# pveversion
pve-manager/6.2-12/b287dd27 (running kernel: 5.4.65-1-pve)

root@pve2:/var/log/vzdump# qm guest cmd 109 fsfreeze-status
No QEMU guest agent configured
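
For what it's worth, a quick space check on the backup storage before the job runs would at least have made the problem visible up front (the path is taken from my log above):

Code:
# free space on the backup mount
df -h /mnt/pve/backup
# or the usage of all configured PVE storages at once
pvesm status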
 
I am sorry guys, but I can't really contribute.
We haven't seen this behavior for quite some time now.
We actually thought this might have been solved by one of the updates in the meantime.
Hope you get this solved!
Sascha
 
My customer has the same problem. The VM is Windows Server 2016 based and has the QEMU guest agent and balloon drivers installed. The backup store is local.
 
