The package pve-qemu-kvm=6.2.0-8, which reverts the problematic change for QEMU's RBD driver, is now available on the pvetest repository.
You can test it by:
- Adding/enabling the pvetest repository (can also be done in the Repositories panel in the UI).
- apt update && apt install pve-qemu-kvm
- Removing/disabling the pvetest repository again.
- VMs need to be stopped/started or migrated to pick up the new version.

Done this yesterday, issue is gone.
Great to hear that ... thanks for that information!
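For anyone following along, here are the steps above as a single shell sequence, as a minimal sketch: it assumes a PVE 7.x host on Debian Bullseye and the standard pvetest repository line, so adjust if your setup differs.

Code:
# enable the pvetest repository (PVE 7.x / Debian Bullseye)
echo "deb http://download.proxmox.com/debian/pve bullseye pvetest" > /etc/apt/sources.list.d/pvetest.list
apt update && apt install pve-qemu-kvm
# disable the test repository again afterwards
rm /etc/apt/sources.list.d/pvetest.list
apt update

Afterwards, stop/start or migrate the VMs so they actually run on the new binary.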
Please try to pin kernel Linux 5.13.19-6-pve for now. It has helped with several of the problems that have been cropping up with 7.2-4 over the last few days.
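A sketch of how such a pin can be done from the CLI, assuming a PVE 7.x host whose pve-kernel-helper is recent enough to provide the kernel pin subcommand (on older versions the boot entry has to be selected manually in GRUB instead):

Code:
# list the installed kernels, then pin the 5.13 one and reboot into it
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 5.13.19-6-pve
reboot
# to return to the default (newest) kernel later:
proxmox-boot-tool kernel unpin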
Fabian_E said:
The package pve-qemu-kvm=6.2.0-8 which reverts the problematic change for QEMU's RBD driver, is now available on the pvetest repository.
You can test it by:
- Adding/Enabling the pvetest repository (can also be done in the Repositories panel in the UI).
- apt update && apt install pve-qemu-kvm
- Removing/disabling the pvetest repository again.
- VMs need to be stopped/started or migrated to pick up the new version.
After deploying this package across my cluster the issue is much improved, but I still had 3 workloads shut down overnight (compared to 10 plus before updating the kvm package); they powered off during a backup process.
Interestingly, the hosts that powered off weren't actually included in the backup, but all make use of RBD storage for data storage (video archive).
Any further ideas from the community on what I can try to solve this issue once and for all?
I saw mentioned in an earlier post someone referring to disabling the krbd flag in the disk config, is that correct?
Thanks in advance.
Interestingly, the hosts that powered off weren't actually included in the backup, but all make use of RBD storage for data storage (video archive).
That sounds a bit odd.. Anything in the kernel log / journalctl? It could even be an OOM kill (which would then be rather unrelated to any such bugs, and a sign of memory overcommitment, if it really was an OOM kill).
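If it helps, a quick way to check for that on the host is to grep the kernel log around the time the guests went down, for example (adjust the time window as needed):

Code:
# look for OOM killer activity in the host kernel log
journalctl -k --since "yesterday" | grep -iE "out of memory|oom|killed process"
# or check the classic ring buffer
dmesg -T | grep -i oom

If the OOM killer did fire, the kvm process of the affected VM should show up in those messages.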
I saw mentioned in an earlier post someone referring to disabling the krbd flag in the disk config, is that correct?
Enabling KRBD, not disabling. What can also help is enabling IO-Thread (and the VirtIO SCSI single controller, if you're using SCSI for the VM disks), and also using the default "No cache" cache mode for disks, as others, like e.g. writeback, while sometimes seemingly faster, can cause much more erratic IO and more sensitivity to host memory pressure for the guest.
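For reference, those settings can also be applied from the CLI; a rough sketch, where the storage ID, VMID and disk volume are placeholders to adapt to your own setup:

Code:
# enable KRBD on the RBD storage (check /etc/pve/storage.cfg for the real storage ID)
pvesm set my-rbd-storage --krbd 1
# switch the VM to the VirtIO SCSI single controller, give the disk an IO thread, default no-cache mode
qm set 123 --scsihw virtio-scsi-single
qm set 123 --scsi0 my-rbd-storage:vm-123-disk-0,iothread=1,cache=none
# the VM needs a stop/start (or a migration) for these changes to take effect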
I bit the bullet and rolled back the firmware across the cluster.

Firmware or kernel package? (Just to be sure I understand what helped you.)

Apologies, it's been a long week!
Is it a concern that I only restarted this one VM to use the newer version of pve-qemu-kvm? Restarting all other VMs is a big project which requires a lot of coordination.

Is this a cluster? If so, you could live-migrate the VMs to already-updated PVE hosts; that way they also get started with the new PVE-QEMU version without any interruption in the guest itself.
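In case it is useful, the CLI equivalent of such a live migration would be something like the following (VMID and node name are examples):

Code:
# live-migrate VM 123 to a node that already has the updated pve-qemu-kvm installed
qm migrate 123 pve-node2 --online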
That's a good idea. Thanks for the reply!

Thanks for your feedback.
Sorry for reviving this old thread but... is this issue still an issue or has it been resolved? Because I am experiencing the same symptoms under PVE 7.3-4...

Yes, it has been resolved since pve-qemu >= 6.2.0-8 with this commit. If you are seeing the same symptoms on PVE 7.3-4, then it's most likely a different issue.

Only one of my VMs is affected. It has a 1TB second virtual disk (unlike all the others, which are not affected).
Activating KRBD for both Ceph pools (main disk is on nvme pool and second disk is on hdd pool) and migrating the VM to another node (and back) did not help.
But switching to VirtIO SCSI single seems to have done the trick. I am currently backing up and it is taking a while, but I got beyond the dreaded fs-freeze command, so I am cautiously optimistic the backup is finally going through after a couple of nerve-wracking days...

Glad you found a workaround. What fs-freeze does depends on what is installed in the VM. What filesystems are there? Do you have the latest QEMU guest agent installed? Maybe it caused enough IO to block out the QEMU main thread. With the VirtIO SCSI single controller + iothread, IO handling gets done in their own threads.
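One way to narrow this down, assuming the guest agent option is enabled for the VM, is to exercise the freeze/thaw cycle by hand outside of a backup (the VMID is an example; the freeze blocks writes inside the guest until it is thawed again, so pick a quiet moment):

Code:
# check that the guest agent responds at all
qm agent 123 ping
# trigger a manual freeze, check its state, then thaw again
qm agent 123 fsfreeze-freeze
qm agent 123 fsfreeze-status
qm agent 123 fsfreeze-thaw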
Yeah, well. The backup did go through the first time and a couple of times afterwards, but then it crashed the VM again (and now I have turned the backup off again - I know...).
So I will try to also check the iothread box and see what happens.
The VM has two disks, one 15GB system disk and one 1TB data disk (currently about half full). Both have ext4 file systems. The QEMU guest agent is installed (latest version from the Debian 11 repo).
In another thread, someone suggested trying a backup with the 1TB disk disconnected, to see whether the size of the disk was the problem (as I have a few other VMs that don't crash when the backup comes; they are pretty much similar, just without the data disk), but even with just the system disk connected the VM would crash (that was before I switched to VirtIO single). So it does not seem to be related to the disk size. But what then? Other than the data disk, this VM is quite ordinary and does not stand out in any way.
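If it helps to double-check the agent side of things inside the Debian guest, a small sketch:

Code:
# confirm which qemu-guest-agent version is installed and that the service is running
dpkg -l qemu-guest-agent
systemctl status qemu-guest-agent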
Are there services that might interfere with the freezing in the VM? There are reports about mysql and cPanel being problematic. Any logs inside the VM? Anything in /var/log/syslog? What's your pveversion -v?
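A rough sketch of how that information could be collected (guest-side log paths may differ depending on the syslog setup):

Code:
# inside the guest: anything logged around the time of the freeze/crash?
grep -iE "freeze|thaw|qemu-ga" /var/log/syslog
# on the PVE host: the full version list for the report
pveversion -v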