Possible bug after upgrading to 7.2: VM freeze if backing up large disks

ITT

Active Member
Mar 19, 2021
Fabian_E said:

The package pve-qemu-kvm=6.2.0-8, which reverts the problematic change for QEMU's RBD driver, is now available on the pvetest repository.

You can test it by:
  1. Adding/Enabling the pvetest repository (can also be done in the Repositories panel in the UI).
  2. apt update && apt install pve-qemu-kvm
  3. Removing/disabling the pvetest repository again.
  4. VMs need to be stopped/started or migrated to pick up the new version.
Done this yesterday, issue is gone ;)
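For anyone following along, the steps boil down to roughly the following on the shell (just a sketch, assuming PVE 7.x on Debian Bullseye; the repository file name is only an example):

  # 1. enable the pvetest repository (example file name)
  echo "deb http://download.proxmox.com/debian/pve bullseye pvetest" > /etc/apt/sources.list.d/pvetest.list

  # 2. install the fixed package
  apt update && apt install pve-qemu-kvm

  # 3. disable the pvetest repository again
  rm /etc/apt/sources.list.d/pvetest.list
  apt update

  # 4. stop/start (or live-migrate) each VM so it runs the new QEMU binary, e.g. for VMID 100:
  qm stop 100 && qm start 100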
 

spark 5

Member
Sep 19, 2018
ITT said:

Done this yesterday, issue is gone ;)
Great to hear ... thanks for that information.
Support told us the same yesterday.

The reply came in German, but here it is translated:

We had, after many attempts (e.g. downgrading pve-qemu-kvm), switched over to the failover cluster via rbd journaling.
To which version did you downgrade? After the downgrade, did you live-migrate the VMs or stop and start them? One of these steps is necessary so that the VMs can load the new QEMU version.

Does the problem occur with VM backups or just in general? Although the disks are not that large, my first impression is that it sounds like the problem that was reported in our forum [1] and recently fixed by reverting a commit [2]. The fixed version (pve-qemu-kvm=6.2.0-8) is in our pvetest [3] repository.
If you would like to try this package to see whether it helps ...
 

spark 5

Member
Sep 19, 2018
Hi ... my update:
We tried today to clean up our second cluster after we had moved everything to our RBD mirror.
We stopped some VMs to make a final backup; some of these VMs got stuck with a timeout.
After that we installed the pvetest package pve-qemu-kvm=6.2.0-8, but that did not help.

In the end, after many tears, I had an idea.

- stop the stuck VM
- rbd copy pool/image pool/image_neu
- change the SCSI device name in the VM config so it points at the copy
- tada ... VM up and running ... sleep will be possible tonight
- later we stopped the VM again and moved the image back to the correct name

So, for the complete disaster-night nightmare ... copying the RBD image helps, but only as a workaround (rough command sketch below).
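Rough sketch of the workaround, with placeholder pool/image/VMID names (PVE expects its own vm-<vmid>-disk-<n> naming, so a copy with a different name is really only a temporary hack):

  # stop the stuck VM (VMID 100 as an example)
  qm stop 100

  # copy the RBD image to a new name
  rbd cp pool/vm-100-disk-0 pool/vm-100-disk-0_neu

  # edit the VM config (/etc/pve/qemu-server/100.conf) so the SCSI disk
  # points at the copy, e.g. change
  #   scsi0: rbd-storage:vm-100-disk-0,size=...
  # to
  #   scsi0: rbd-storage:vm-100-disk-0_neu,size=...

  # start the VM again from the copied image
  qm start 100

  # later: stop the VM, move the copy back to the original name
  qm stop 100
  rbd rm pool/vm-100-disk-0
  rbd mv pool/vm-100-disk-0_neu pool/vm-100-disk-0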

What did we find there ... unbelievable.

Next we will try to reboot all the VMs ... but not tonight.
After that, we have to repair the rbd-mirror settings and bring our cluster 1 back to the party.
 

spark 5

Member
Sep 19, 2018
One more update ... it makes things even stranger.

Maybe, since we enabled journal-based replication, we had never stopped a VM.
So we cleaned up the pool config and removed all the journaling features ...

What can I say ... our VMs are booting.
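For reference, removing the journaling feature from RBD images generally looks something like the following (a sketch with placeholder names; if rbd-mirror journal replication is configured, mirroring has to be disabled for the image or pool first):

  # disable journal-based mirroring for a single image ...
  rbd mirror image disable pool/vm-100-disk-0

  # ... or for the whole pool
  rbd mirror pool disable pool

  # then drop the journaling feature from the image
  rbd feature disable pool/vm-100-disk-0 journaling

  # check which features are still enabled
  rbd info pool/vm-100-disk-0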

I think we ran into another problem there.
But we also had problems with snapshot replication and performance after the updates ... I will tell you more later.

Thanks to everybody.
 
May 28, 2020
South Africa
Fabian_E said:


The package pve-qemu-kvm=6.2.0-8 which reverts the problematic change for QEMU's RBD driver, is now available on the pvetest repository.

You can test it by:
  1. Adding/Enabling the pvetest repository (can also be done in the Repositories panel in the UI).
  2. apt update && apt install pve-qemu-kvm
  3. Removing/disabling the pvetest repository again.
  4. VMs need to be stopped/started or migrated to pick up the new version.


Done this yesterday, issue is gone ;)
After deploying this package across my cluster the issue is much improved, but I still had 3 workloads shut down overnight (compared to 10-plus before updating the KVM package), which powered off during a backup run.

Interestingly, the hosts that powered off weren't actually included in the backup, but all make use of RBD storage for data (video archive).

Any further ideas from the community on what I can try to solve this issue once and for all?

I saw someone in an earlier post referring to disabling the krbd flag in the disk config, is that correct?


Thanks in advance.
 

itNGO

Well-Known Member
Jun 12, 2020
Germany
it-ngo.com
Fabian_E said:


The package pve-qemu-kvm=6.2.0-8 which reverts the problematic change for QEMU's RBD driver, is now available on the pvetest repository.

You can test it by:
  1. Adding/Enabling the pvetest repository (can also be done in the Repositories panel in the UI).
  2. apt update && apt install pve-qemu-kvm
  3. Removing/disabling the pvetest repository again.
  4. VMs need to be stopped/started or migrated to pick up the new version.



After deploying this package across my cluster the issue is much improved, but I still had 3 workloads shut down overnight (compared to 10-plus before updating the KVM package), which powered off during a backup run.

Interestingly, the hosts that powered off weren't actually included in the backup, but all make use of RBD storage for data (video archive).

Any further ideas from the community on what i can try to solve this issue once and for all?

I saw someone in an earlier post referring to disabling the krbd flag in the disk config, is that correct?


Thanks in advance.
Please try to pin kernel Linux 5.13.19-6-pve for now. That has helped with several of the problems that have come up with 7.2-4 over the last few days.
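A sketch of what the pinning could look like (assuming a pve-kernel-helper whose proxmox-boot-tool already has the kernel pin subcommand; on older versions you would set the GRUB default manually instead):

  # make sure the kernel is (still) installed
  apt install pve-kernel-5.13.19-6-pve

  # list the known kernels and pin the 5.13 one
  proxmox-boot-tool kernel list
  proxmox-boot-tool kernel pin 5.13.19-6-pve

  # reboot the node to boot the pinned kernel
  reboot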
 
May 28, 2020
South Africa
Hey itNGO

Thanks for your input -

My issue is that I already "cleverly" purged all the 5.13.* kernels from my cluster - so there will be some work in getting the 5.13 kernel downloaded and redeployed to the entire cluster.

Was hoping for a quicker fix. :)
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
South Tyrol/Italy
shop.proxmox.com
Interestingly, the hosts that powered off weren't actually included in the backup, but all make use of RBD storage for data (video archive).
That sounds a bit odd... Anything in the kernel log / journalctl? It could even be an OOM kill (which would then be rather unrelated to any such bug, and a sign of memory over-commitment, if it really was an OOM kill).
I saw someone in an earlier post referring to disabling the krbd flag in the disk config, is that correct?
Enabling KRBD, not disabling it. What can also help is enabling IO-Thread (and the VirtIO SCSI single controller, if you're using SCSI for the VM disks), and using the default No cache mode for the disks: other modes like writeback, while sometimes seemingly faster, can cause much more erratic IO and make the guest more sensitive to host memory pressure.
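If it helps, those settings can also be applied on the CLI, roughly like this (a sketch; the storage name, VMID and disk volume are placeholders, and a VM stop/start is needed for the changes to take effect):

  # enable KRBD on the RBD storage definition
  pvesm set my-rbd-storage --krbd 1

  # switch the VM to the VirtIO SCSI single controller
  qm set 100 --scsihw virtio-scsi-single

  # re-attach the disk with IO thread enabled and the default "No cache" mode
  qm set 100 --scsi0 my-rbd-storage:vm-100-disk-0,iothread=1,cache=none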
 

petemcdonnell

New Member
Oct 23, 2021
We completed a PVE and Ceph upgrade from 6.4 to 7.2 on Friday night. A 9.5 TB NVR VM backup started shortly thereafter. We noticed that operation of the VM ground to a halt and backup performance was fairly slow (71 Mbps read, 52 Mbps write).
I applied the update to pve-qemu-kvm as described above and restarted just that NVR VM. I also enabled the IO-Thread option for the VM drives.
After a restart of that VM, I started a new backup. VM operation is great even with the backup running. Also of note, backup performance is substantially better (better than on 6.4 as well). The backup is now running at ~225 Mbps read, 212 Mbps write!

Is it a concern that I only restarted this one VM to use the newer version of pve-qemu-kvm? Restarting all other VMs is a big project which requires a lot of coordination.
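For context, checking which QEMU binary a given VM is currently running should be possible with something like this (a sketch, assuming the verbose VM status exposes a running-qemu field):

  # QEMU version the running VM was started with (VMID 100 as an example)
  qm status 100 --verbose | grep running-qemu

  # version currently installed on the node
  pveversion -v | grep pve-qemu-kvm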
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
South Tyrol/Italy
shop.proxmox.com
Thanks for your feedback.
Is it a concern that I only restarted this one VM to use the newer version of pve-qemu-kvm? Restarting all other VMs is a big project which requires a lot of coordination.
Is this a cluster? If so, you could live-migrate the VMs to already-updated PVE hosts; that way they also get started with the new PVE QEMU version without any interruption in the guest itself.
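A sketch of what that could look like per VM (node name and VMID are placeholders):

  # live-migrate VM 100 to an already-updated node named "node2"
  qm migrate 100 node2 --online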
 

petemcdonnell

New Member
Oct 23, 2021
Thanks for your feedback.

Is this a cluster? If so, you could live-migrate the VMs to already-updated PVE hosts; that way they also get started with the new PVE QEMU version without any interruption in the guest itself.
That's a good idea. Thanks for the reply!
 
