Possible bug after upgrading to 7.2: VM freeze if backing up large disks

I have switched to VirtIO SCSI single and enabled iothread on each disk. No change.
Log excerpt:

May 11 15:54:31 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - got timeout
May 11 15:54:31 PROX-B1 pvestatd[3548]: status update time (6.223 seconds)
May 11 15:54:41 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - got timeout
May 11 15:54:41 PROX-B1 pve-ha-lrm[140660]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:54:41 PROX-B1 pve-ha-lrm[140660]: VM 121 qmp command 'query-status' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:54:41 PROX-B1 pvestatd[3548]: status update time (6.222 seconds)
May 11 15:54:51 PROX-B1 pve-ha-lrm[140951]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - got timeout
May 11 15:54:51 PROX-B1 pve-ha-lrm[140951]: VM 121 qmp command 'query-status' failed - got timeout
May 11 15:54:51 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:54:52 PROX-B1 pvestatd[3548]: status update time (6.222 seconds)
May 11 15:55:01 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:01 PROX-B1 pvestatd[3548]: status update time (6.231 seconds)
May 11 15:55:01 PROX-B1 pve-ha-lrm[141239]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:01 PROX-B1 pve-ha-lrm[141239]: VM 121 qmp command 'query-status' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:10 PROX-B1 pvestatd[3548]: status update time (5.030 seconds)
May 11 15:55:21 PROX-B1 pve-ha-lrm[141754]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:21 PROX-B1 pve-ha-lrm[141754]: VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:21 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:21 PROX-B1 pvestatd[3548]: status update time (6.230 seconds)
May 11 15:55:25 PROX-B1 pvedaemon[3628]: worker exit
May 11 15:55:25 PROX-B1 pvedaemon[3626]: worker 3628 finished
May 11 15:55:25 PROX-B1 pvedaemon[3626]: starting 1 worker(s)
May 11 15:55:25 PROX-B1 pvedaemon[3626]: worker 142036 started
May 11 15:55:31 PROX-B1 pve-ha-lrm[142069]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:31 PROX-B1 pve-ha-lrm[142069]: VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:31 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:31 PROX-B1 pvestatd[3548]: status update time (6.222 seconds)
May 11 15:55:41 PROX-B1 pve-ha-lrm[142357]: VM 121 qmp command failed - VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:41 PROX-B1 pve-ha-lrm[142357]: VM 121 qmp command 'query-status' failed - got timeout
May 11 15:55:41 PROX-B1 pvestatd[3548]: VM 121 qmp command failed - VM 121 qmp command 'query-proxmox-support' failed - unable to connect to VM 121 qmp socket - timeout after 31 retries
May 11 15:55:42 PROX-B1 pvestatd[3548]: status update time (6.229 seconds)
Node and VM were restarted before the backup attempt. A backup on an old node with VirtIO SCSI single and iothread worked like a charm.
 
Yes, but I don't quite understand what to do now. Do we need to downgrade pve-qemu-kvm on all our nodes? If yes, will it affect the stability of the cluster?

@Fabian_E would be great to hear some recommendations.
Since the workaround with iothread didn't work for either of you, the issue I managed to run into is obviously a different one. Unfortunately, I haven't been able to reproduce the issue with the timed-out QMP commands yet. There were a lot of upstream changes between QEMU 6.1.1-2 and 6.2.0-1, so it's not obvious what started triggering this. The backup code on our side mostly stayed the same between these versions.

If no other workaround helps, I'm afraid keeping pve-qemu-kvm downgraded might be the only way for now. It should not affect the stability, but you won't get any fixes/updates for QEMU of course...
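
For reference, a downgrade plus hold would look roughly like this (the exact version you can install depends on what is still available in your apt repositories or local package cache, and running VMs only use the downgraded binary after a stop/start or live migration):

Code:
apt install pve-qemu-kvm=6.1.1-2    # the last known-good build mentioned in this thread
apt-mark hold pve-qemu-kvm          # keep apt from upgrading it again
# once a fixed version is released:
# apt-mark unhold pve-qemu-kvm && apt full-upgrade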

As for finding other workarounds/trying to track this down more:
Does spacing the backups out more help? Can you trigger the issue reliably or does it depend on system/PBS load? Could you share your storage configuration and the configuration of some affected VMs? The one for the minimal VM with empty disks would be interesting too.
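
To keep it simple, the output of something like the following (once for an affected VM, once for the minimal VM with the empty disks) should already cover most of it:

Code:
qm config 121               # configuration of an affected VM
cat /etc/pve/storage.cfg    # storage configuration
pveversion -v               # exact package versions on the node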
 
As for finding other workarounds/trying to track this down more:
Does spacing the backups out more help? Can you trigger the issue reliably or does it depend on system/PBS load? Could you share your storage configuration and the configuration of some affected VMs? The one for the minimal VM with empty disks would be interesting too.

I have stopped upgrading my cluster after running into this bug, so I can provide you with both a working and a non-working config, plus logs. I have the affected VM running alone on an upgraded host, and it absolutely reliably starts producing the errors some time into the backup, while it runs fine on a not-yet-upgraded node or with pve-qemu-kvm downgraded.
@Fabian_E Please let me know if you need the output of specific commands. In general, the VM resides on a Ceph cluster and the backup goes to a Proxmox Backup Server running locally on dedicated hardware, connected via a 10 Gbit/s link.

VM config:
agent: 1
boot: order=virtio0;net0
cores: 8
cpu: EPYC
machine: pc-i440fx-6.0
memory: 8192
name: VGFILE2
net0: virtio=A2:64:23:42:7E:2A,bridge=vmbr0,firewall=1,tag=110
numa: 0
onboot: 1
ostype: win10
scsihw: virtio-scsi-single
smbios1: uuid=a193ab4a-b863-47f6-95f3-385428458c08
sockets: 1
virtio0: VG-Pool:vm-121-disk-0,discard=on,iothread=1,size=128G
virtio1: VG-Pool:vm-121-disk-1,discard=on,iothread=1,size=5T
virtio2: VG-Pool:vm-121-disk-2,discard=on,iothread=1,size=1T
vmgenid: c56fd5fc-a3f3-480d-9095-8ea11cdecbf8
 
Can you trigger the issue reliably or does it depend on system/PBS load?
Yes, it happens reliably 100% of the time. The PBS load is ~0% (nothing happening), same with the node. The issue happens regardless of whether the VM is on or off.

Code:
agent: 1
balloon: 131072
bios: ovmf
boot: order=scsi0;net0
cores: 35
efidisk0: cpool:vm-110-disk-2,efitype=4m,size=528K
memory: 196608
meta: creation-qemu=6.1.0,ctime=1651062805
name: ****
net0: virtio=56:57:17:5E:60:F7,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: l26
scsi0: cpool:vm-110-disk-0,discard=on,iothread=1,size=300G,ssd=1
scsi1: cpool:vm-110-disk-1,discard=on,iothread=1,size=6300G,ssd=1
scsi2: cpool:vm-110-disk-3,discard=on,iothread=1,size=10G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=ad9152a8-5683-4601-b725-12d134f00e65
sockets: 2
vmgenid: 74ed1dbe-56ca-4cdb-95c2-3188dda5be0a

the "creation-qemu" was apparently different because I had to downgrade everything.
 
Yes, it happens reliably 100% of the time. The PBS load is ~0% (nothing happening), same with the node. The issue happens regardless of whether the VM is on or off.

Code:
agent: 1
balloon: 131072
bios: ovmf
boot: order=scsi0;net0
cores: 35
efidisk0: cpool:vm-110-disk-2,efitype=4m,size=528K
memory: 196608
meta: creation-qemu=6.1.0,ctime=1651062805
name: ****
net0: virtio=56:57:17:5E:60:F7,bridge=vmbr0,firewall=1
numa: 1
onboot: 1
ostype: l26
scsi0: cpool:vm-110-disk-0,discard=on,iothread=1,size=300G,ssd=1
scsi1: cpool:vm-110-disk-1,discard=on,iothread=1,size=6300G,ssd=1
scsi2: cpool:vm-110-disk-3,discard=on,iothread=1,size=10G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=ad9152a8-5683-4601-b725-12d134f00e65
sockets: 2
vmgenid: 74ed1dbe-56ca-4cdb-95c2-3188dda5be0a

the "creation-qemu" was apparently different because I had to downgrade everything.
What kind of storage is cpool, any special configuration for it?
 
I think I was finally able to reproduce the problem, using Ceph without krbd as the underlying storage for the large disks. When krbd is enabled, it doesn't trigger for me, so that should be another workaround. I'll try to find the root cause now with the reproducer.
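
For completeness, krbd is a per-storage option for RBD storages; it can be toggled roughly like this (<storage-id> being a placeholder for your RBD storage's ID, and running VMs only switch to the krbd path after a stop/start or live migration):

Code:
pvesm set <storage-id> --krbd 1
# equivalent to adding "krbd 1" to the matching rbd section in /etc/pve/storage.cfg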
 
Hello, KRBD doesn't help here, it was already enabled.
PVE + Ceph cluster with 13 nodes, all SSD; PBS all SSD; separate 10 Gbit networks for Ceph and for public + backup traffic.
The whole cluster was updated and rebooted just now.
Backups used to max out the 10 Gbit link (~900 MB/s for all nodes together); since the update last Saturday they crawl at 70-110 MB/s (all nodes together). Only a few smaller VMs are backed up successfully; everything above 128 GB disk size runs into timeouts (various QMP timeouts).
 
Hi,
Hello, KRBD doesn't help here, it was already enabled.
PVE + Ceph cluster with 13 nodes, all SSD; PBS all SSD; separate 10 Gbit networks for Ceph and for public + backup traffic.
The whole cluster was updated and rebooted just now.
Backups used to max out the 10 Gbit link (~900 MB/s for all nodes together); since the update last Saturday they crawl at 70-110 MB/s (all nodes together). Only a few smaller VMs are backed up successfully; everything above 128 GB disk size runs into timeouts (various QMP timeouts).
This might be a different issue then. Please open a new thread and link to it here, or ping me there with @Fabian_E, and include the following information:
  • pveversion -v
  • config of an affected VM
  • full backup log and /var/log/syslog during the backup
  • Whether the issue is also present when backing up to a non-PBS storage
Possible workarounds to try:
  • Whether downgrading to pve-qemu-kvm=6.2.0-1 helps
  • Whether downgrading to pve-qemu-kvm=6.1.1-2 helps
  • Whether switching the SCSI controller type to VirtIO SCSI single and enabling iothread on the disks helps (see the CLI sketch below)
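
A rough CLI sketch for that last workaround (<vmid> and <storage> are placeholders; re-specify each disk with its existing options plus iothread=1; disks attached as virtioN would first have to be reassigned to scsiN slots, and the VM needs a full stop/start afterwards):

Code:
qm set <vmid> --scsihw virtio-scsi-single
qm set <vmid> --scsi0 <storage>:vm-<vmid>-disk-0,discard=on,iothread=1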
 
Morning from PST all,

Just a note to perhaps help someone else experiencing this frustrating issue.

We experienced our 1.5-hour multi-VM backup (Ceph and Proxmox's built-in backup, not PBS) suddenly stretching to 12+ hours. On top of that, the VMs with the largest disks (750 GB) would drop in performance by 95%, causing our help desk to blow up with complaints. I checked KRBD and tested; this didn't help on its own. The VMs needed to be shut down and restarted, which solved the issue. I'm guessing this is obvious to some but not others ... not when you're under pressure to fix an issue.

That said, I see new patches for backup-related components awaiting install this morning ... if I recall correctly (I see a lot of posts from Bugzilla in my email), the appropriate components have been rolled back a version to fix this issue.
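
For anyone wanting to double-check the same thing, the installed and pending versions show up with something like:

Code:
pveversion -v | grep pve-qemu                        # currently installed build
apt list --upgradable 2>/dev/null | grep pve-qemu    # pending update, if any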

best,

James
 
the VMs with the largest disks (750 GB) would drop in performance by 95%, causing our help desk to blow up with complaints. I checked KRBD and tested; this didn't help on its own. The VMs needed to be shut down and restarted, which solved the issue.
Yes, this needs a VM stop/start (or a live migration) to pick up the krbd path.
The same will apply once QEMU is updated with the fix for the non-krbd access path (as it's a regression in QEMU).

https://lists.proxmox.com/pipermail/pve-devel/2022-May/053021.html
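
That is, per VM, something along these lines (<vmid> and <target-node> are placeholders), so the new access path / new QEMU binary is actually used:

Code:
qm shutdown <vmid> && qm start <vmid>
# or, without downtime, live-migrate to another node (and back, if desired):
qm migrate <vmid> <target-node> --online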
 
Thank you, Fabian.
Sadly, I can only test the older pve-qemu-kvm this coming weekend at the earliest (service time window), since all VMs need to be rebooted and I lose the dirty bitmaps again.
If I understand it correctly, today's updates contain a fix for the "bdrv_co_block_status" regression; I will try this one first and, if it doesn't help, test the older pve-qemu-kvm versions as you suggested. I will provide all the requested logs and info in a new thread if the issue persists.
Again, thanks!
 
