Possible bug after upgrading to 7.2: VM freeze if backing up large disks

The package pve-qemu-kvm=6.2.0-8, which reverts the problematic change for QEMU's RBD driver, is now available on the pvetest repository.

You can test it by:
  1. Adding/Enabling the pvetest repository (can also be done in the Repositories panel in the UI).
  2. apt update && apt install pve-qemu-kvm
  3. Removing/disabling the pvetest repository again.
  4. VMs need to be stopped/started or migrated to pick up the new version.
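For reference, the CLI equivalent looks roughly like this (a sketch; the sources file name is arbitrary and the repository line assumes a PVE 7 / Bullseye setup):
  echo "deb http://download.proxmox.com/debian/pve bullseye pvetest" > /etc/apt/sources.list.d/pvetest.list
  apt update && apt install pve-qemu-kvm
  # disable the test repository again afterwards
  rm /etc/apt/sources.list.d/pvetest.list && apt update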
Done this yesterday, issue is gone ;)
 
Done this yesterday, issue is gone ;)
great to hear that ... thanks for that information
support told us the same thing yesterday

it came in German; here it is, translated:

After many attempts (e.g. downgrading pve-qemu-kvm), we switched over to the failover cluster via rbd journaling.
Which version did you downgrade to? Did you live-migrate the VMs, or stop and start them, after the downgrade? One of these steps is necessary for the VMs to pick up the new QEMU version.

Does the problem occur during VM backups, or just in general? Even though the disks are not that large, my first impression is that this sounds like the problem reported in our forum [1], which was recently fixed by reverting a commit [2]. The fixed version (pve-qemu-kvm=6.2.0-8) is in our pvetest [3] repository.
If you would like to try this package to see whether it helps ...
 
hi ... my update
we tried today to clean up our second cluster, after we had moved everything to our rbd mirror
we stopped some vms to make a final backup. some of these vms got stuck with a timeout.
after that, we installed the pvetest build pve-qemu-kvm=6.2.0-8, but it did not help.

in the end, after many tears, i had an idea.

- stop the stuck vm
- rbd copy pool/image pool/image_neu
- change the scsi device name in the vm config
- tada ... vm up and running ... sleeping will be possible this night
- later we stopped the vm and moved the image back to the correct name

so, for this complete disaster-night nightmare ... copying the rbd image helps, but only as a workaround (roughly the commands sketched below)
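For reference, the workaround in commands looks roughly like this (a sketch; VMID, pool and image names are placeholders):
  qm stop 100
  rbd copy pool/vm-100-disk-0 pool/vm-100-disk-0_neu
  # point the scsiX entry in /etc/pve/qemu-server/100.conf at the copied image
  qm start 100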

what did we find there ... unbelievable

next we will try to reboot all the vms ... but not tonight
after that, we have to repair the rbd mirror settings and bring our cluster 1 back to the party
 
one more update ... it makes things even stranger

maybe we never stopped a vm since we enabled journal-based replication
so we cleaned up the pool config and removed all the journaling features (roughly as sketched below) ...
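For reference, the per-image part of that cleanup looks roughly like this (a sketch; pool/image names are placeholders, and whether the mirror step is needed depends on your setup):
  rbd mirror image disable pool/vm-100-disk-0
  rbd feature disable pool/vm-100-disk-0 journaling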

what should i say ... our vms are booting

i think we ran into another problem.
but we also had problems with snapshot replication and performance after updates ... will tell you about that later.

thanks to everybody
 
Fabian_E said:


The package pve-qemu-kvm=6.2.0-8, which reverts the problematic change for QEMU's RBD driver, is now available on the pvetest repository.

You can test it by:
  1. Adding/Enabling the pvetest repository (can also be done in the Repositories panel in the UI).
  2. apt update && apt install pve-qemu-kvm
  3. Removing/disabling the pvetest repository again.
  4. VMs need to be stopped/started or migrated to pick up the new version.


Done this yesterday, issue is gone ;)
After deploying this package across my cluster the issue is much improved, but I still had 3 workloads (compared to 10-plus before updating the kvm package) power off overnight during a backup process.

Interestingly, the hosts that powered off weren't actually included in the backup, but all make use of RBD storage for data storage (video archive).

Any further ideas from the community on what I can try to solve this issue once and for all?

I saw someone in an earlier post referring to disabling the krbd flag in the disk config, is that correct?


Thanks in advance.
 
Any further ideas from the community on what I can try to solve this issue once and for all?
Please try to pin kernel Linux 5.13.19-6-pve for now. It has helped with several problems that have been cropping up with 7.2-4 over the last few days.
 
Hey itNGO

Thanks for your input -

My issue is that I already "cleverly" purged all the 5.13.* kernels from my cluster - so there will be some work in getting the 5.13 kernel downloaded and redeployed to the entire cluster (roughly the steps sketched below).
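For reference, getting the old kernel back on a node looks roughly like this (package name taken from the suggestion above; the pin subcommand needs a proxmox-boot-tool version that supports it, otherwise select the kernel from the boot menu instead):
  apt install pve-kernel-5.13.19-6-pve
  proxmox-boot-tool kernel pin 5.13.19-6-pve
  reboot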

Was hoping for a quicker fix. :)
 
Interestingly, the hosts that powered off weren't actually included in the backup, but all make use of RBD storage for data storage (video archive).
That sounds a bit odd. Anything in the kernel log / journalctl? It could even be an OOM kill, which would be rather unrelated to any such bug and instead a sign of memory over-commitment, if it really was an OOM kill.
I saw someone in an earlier post referring to disabling the krbd flag in the disk config, is that correct?
Enabling KRBD, not disabling. What can also help is enabling IO thread (and the VirtIO SCSI single controller, if you're using SCSI for the VM disks), and using the default No cache mode for the disks: other modes, e.g. writeback, can sometimes seem faster, but cause much more erratic IO and make the guest more sensitive to host memory pressure.
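For reference, a rough CLI sketch of those settings, assuming VMID 100 and a storage named rbd-pool (both placeholders); the same options are available in the GUI:
  pvesm set rbd-pool --krbd 1                             # enable KRBD on the RBD storage
  qm set 100 --scsihw virtio-scsi-single                  # switch to the VirtIO SCSI single controller
  qm set 100 --scsi0 rbd-pool:vm-100-disk-0,iothread=1    # re-attach the disk with iothread, default No cache
The VM then needs a stop/start (or a live migration) for the controller change to take effect.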
 
We completed a PVE and Ceph upgrade from 6.4 to 7.2 Friday night. A 9.5 TB NVR VM backup started shortly thereafter. We noticed the operation of the VM ground to a halt and the backup performance was fairly slow (71 Mbps read, 52 Mbps write).
I applied the update to pve-qemu-kvm as described above and restarted just that NVR VM. I also enabled the IO-Thread option for the VM drives.
After a restart of that VM, I started a new backup. VM operation is great even with the backup running. Also of note, the backup performance is substantially better (better than when on 6.4 as well). Backup is now running at ~225 Mbps read, 212 Mbps write!

Is it a concern that I only restarted this one VM to use the newer version of pve-qemu-kvm? Restarting all other VMs is a big project which requires a lot of coordination.
 
Thanks for your feedback.
Is it a concern that I only restarted this one VM to use the newer version of pve-qemu-kvm? Restarting all other VMs is a big project which requires a lot of coordination.
Is this a cluster? If so, you could live-migrate the VMs to already-updated PVE hosts; that way they also get started with the new QEMU version without any interruption in the guest itself.
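For example, roughly (VMID and target node name are placeholders):
  qm migrate 100 targetnode --online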
 
Thanks for your feedback.

Is this a cluster? If so, you could live-migrate the VMs to already-updated PVE hosts; that way they also get started with the new QEMU version without any interruption in the guest itself.
That's a good idea. Thanks for the reply!
 
Sorry for reviving this old thread but...

is this issue still an issue or has it been resolved? Because I am experiencing the same symptoms under PVE 7.3-4...

Only one of my VMs is affected. It has a 1 TB second virtual disk (unlike all the others, which are not affected).

Activating KRBD for both Ceph pools (main disk is on nvme pool and second disk is on hdd pool) and migrating the VM to another node (and back) did not help.

But switching to the VirtIO SCSI single controller seems to have done the trick (I am currently backing up and it is taking a while, but I got beyond the dreaded fs-freeze command). So I am cautiously optimistic the backup is finally going through after a couple of nerve-wracking days...
 
Hi,
Sorry for reviving this old thread but...

is this issue still an issue or has it been resolved? Because I am experiencing the same symptoms under PVE 7.3-4...
yes, it has been resolved since pve-qemu >= 6.2.0-8, with this commit.
Only one of my VMs is affected. It has a 1 TB second virtual disk (unlike all the others, which are not affected).

Activating KRBD for both Ceph pools (main disk is on nvme pool and second disk is on hdd pool) and migrating the VM to another node (and back) did not help.
then it's most likely a different issue.

But switching to the VirtIO SCSI single controller seems to have done the trick (I am currently backing up and it is taking a while, but I got beyond the dreaded fs-freeze command). So I am cautiously optimistic the backup is finally going through after a couple of nerve-wracking days...
Glad you found a workaround. What fs-freeze does depends on what is installed in the VM. What filesystems are there? Do you have the latest QEMU guest agent installed? Maybe it caused enough IO to block the QEMU main thread. With the VirtIO SCSI single controller + iothread, IO handling is done in separate threads.
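If you want to exercise that path manually, a rough sketch (VMID 100 is a placeholder; it requires the guest agent option to be enabled for the VM, and don't leave the filesystems frozen for long):
  qm guest cmd 100 ping              # check the agent responds
  qm guest cmd 100 fsfreeze-freeze
  qm guest cmd 100 fsfreeze-status
  qm guest cmd 100 fsfreeze-thaw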
 
Glad you found a workaround. What fs-freeze does depends on what is installed in the VM. What filesystems are there? Do you have the latest QEMU guest agent installed? Maybe it caused enough IO to block the QEMU main thread. With the VirtIO SCSI single controller + iothread, IO handling is done in separate threads.
Yeah, well. The backup did go through the first time and a couple of times afterwards, but then it crashed the VM again (and now I have turned the backup off again - I know...).

So I will try to also check the iothread box and see what happens.

The VM has two disks, one 15 GB system disk and one 1 TB data disk (currently about half full). Both have ext4 file systems. The QEMU guest agent is installed (latest version from the Debian 11 repo).

In another thread, someone suggested trying a backup with the 1 TB disk disconnected, to see whether the size of the disk was the problem (I have a few other VMs that don't crash when the backup runs; they are pretty similar, just without the data disk). But even with just the system disk connected, the VM would crash (that was before I switched to VirtIO SCSI single). So it does not seem to be related to the disk size. But what then? Other than the data disk, this VM is quite ordinary and does not stand out in any way.
 
But what then? Other than the data disk, this VM is quite ordinary and does not stand out in any way.
Are there services inside the VM that might interfere with the freezing? There are reports about MySQL and cPanel being problematic. Any logs inside the VM? Anything in /var/log/syslog? What's the output of pveversion -v?
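For example, roughly (inside the VM, assuming systemd and the Debian qemu-guest-agent package):
  systemctl status qemu-guest-agent
  journalctl -b | grep -i -e freeze -e thaw
and on the PVE host:
  pveversion -v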
 
