Possible bug after upgrading to 7.2: VM freeze if backing up large disks

ITT

Active Member
Mar 19, 2021
Fabian_E said:

The package pve-qemu-kvm=6.2.0-8, which reverts the problematic change for QEMU's RBD driver, is now available on the pvetest repository.

You can test it by:
  1. Adding/Enabling the pvetest repository (can also be done in the Repositories panel in the UI).
  2. apt update && apt install pve-qemu-kvm
  3. Removing/disabling the pvetest repository again.
  4. VMs need to be stopped/started or migrated to pick up the new version.
Done this yesterday, issue is gone ;)
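For anyone following along, the steps boil down to roughly the following on the shell (just a sketch, assuming PVE 7.x on Debian Bullseye; the repository file name is only an example):

  # 1. enable the pvetest repository (example file name)
  echo "deb http://download.proxmox.com/debian/pve bullseye pvetest" > /etc/apt/sources.list.d/pvetest.list

  # 2. install the fixed package
  apt update && apt install pve-qemu-kvm

  # 3. disable the pvetest repository again
  rm /etc/apt/sources.list.d/pvetest.list
  apt update

  # 4. stop/start (or live-migrate) each VM so it runs the new QEMU binary, e.g. for VMID 100:
  qm stop 100 && qm start 100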
 

spark 5

Member
Sep 19, 2018
ITT said:

Done this yesterday, issue is gone ;)
Great to hear ... thanks for that information.
Support told us the same yesterday.

The reply came in German, but here it is translated:

We had, after many attempts (e.g. downgrading pve-qemu-kvm), switched over to the failover cluster via rbd journaling.
To which version did you downgrade? After the downgrade, did you live-migrate the VMs or stop and start them? One of these steps is necessary so that the VMs can load the new QEMU version.

Does the problem occur with VM backups or just in general? Although the disks are not that large, my first impression is that it sounds like the problem that was reported in our forum [1] and recently fixed by reverting a commit [2]. The fixed version (pve-qemu-kvm=6.2.0-8) is in our pvetest [3] repository.
If you would like to try this package to see whether it helps ...
 

spark 5

Member
Sep 19, 2018
Hi ... my update:
We tried today to clean up our second cluster after we had moved everything to our RBD mirror.
We stopped some VMs to make a final backup; some of these VMs got stuck with a timeout.
After that we installed the pvetest package pve-qemu-kvm=6.2.0-8, but that did not help.

In the end, after many tears, I had an idea.

- stop the stuck VM
- rbd copy pool/image pool/image_neu
- change the SCSI device name in the VM config so it points at the copy
- tada ... VM up and running ... sleep will be possible tonight
- later we stopped the VM again and moved the image back to the correct name

So, for the complete disaster-night nightmare ... copying the RBD image helps, but only as a workaround (rough command sketch below).
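Rough sketch of the workaround, with placeholder pool/image/VMID names (PVE expects its own vm-<vmid>-disk-<n> naming, so a copy with a different name is really only a temporary hack):

  # stop the stuck VM (VMID 100 as an example)
  qm stop 100

  # copy the RBD image to a new name
  rbd cp pool/vm-100-disk-0 pool/vm-100-disk-0_neu

  # edit the VM config (/etc/pve/qemu-server/100.conf) so the SCSI disk
  # points at the copy, e.g. change
  #   scsi0: rbd-storage:vm-100-disk-0,size=...
  # to
  #   scsi0: rbd-storage:vm-100-disk-0_neu,size=...

  # start the VM again from the copied image
  qm start 100

  # later: stop the VM, move the copy back to the original name
  qm stop 100
  rbd rm pool/vm-100-disk-0
  rbd mv pool/vm-100-disk-0_neu pool/vm-100-disk-0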

What did we find there ... unbelievable.

Next we will try to reboot all the VMs ... but not tonight.
After that, we have to repair the rbd-mirror settings and bring our cluster 1 back to the party.
 

spark 5

Member
Sep 19, 2018
One more update ... it makes things even stranger.

Maybe, since we enabled journal-based replication, we had never stopped a VM.
So we cleaned up the pool config and removed all the journaling features ...

What can I say ... our VMs are booting.
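For reference, removing the journaling feature from RBD images generally looks something like the following (a sketch with placeholder names; if rbd-mirror journal replication is configured, mirroring has to be disabled for the image or pool first):

  # disable journal-based mirroring for a single image ...
  rbd mirror image disable pool/vm-100-disk-0

  # ... or for the whole pool
  rbd mirror pool disable pool

  # then drop the journaling feature from the image
  rbd feature disable pool/vm-100-disk-0 journaling

  # check which features are still enabled
  rbd info pool/vm-100-disk-0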

I think we ran into another problem there.
But we also had problems with snapshot replication and performance after the updates ... I will tell you more later.

Thanks to everybody.
 
May 28, 2020
South Africa
Fabian_E said:


The package pve-qemu-kvm=6.2.0-8 which reverts the problematic change for QEMU's RBD driver, is now available on the pvetest repository.

You can test it by:
  1. Adding/Enabling the pvetest repository (can also be done in the Repositories panel in the UI).
  2. apt update && apt install pve-qemu-kvm
  3. Removing/disabling the pvetest repository again.
  4. VMs need to be stopped/started or migrated to pick up the new version.


Done this yesterday, issue is gone ;)
After deploying this package across my cluster the issue is much improved, but I still had 3 workloads shut down overnight (compared to 10-plus before updating the KVM package), which powered off during a backup run.

Interestingly, the hosts that powered off weren't actually included in the backup, but all make use of RBD storage for data (video archive).

Any further ideas from the community on what I can try to solve this issue once and for all?

I saw someone in an earlier post referring to disabling the krbd flag in the disk config, is that correct?


Thanks in advance.
 

itNGO

Well-Known Member
Jun 12, 2020
Germany
it-ngo.com
Fabian_E said:


The package pve-qemu-kvm=6.2.0-8 which reverts the problematic change for QEMU's RBD driver, is now available on the pvetest repository.

You can test it by:
  1. Adding/Enabling the pvetest repository (can also be done in the Repositories panel in the UI).
  2. apt update && apt install pve-qemu-kvm
  3. Removing/disabling the pvetest repository again.
  4. VMs need to be stopped/started or migrated to pick up the new version.



After deploying this package across my cluster the issue is much improved, but I still had 3 workloads shut down overnight (compared to 10-plus before updating the KVM package), which powered off during a backup run.

Interestingly, the hosts that powered off weren't actually included in the backup, but all make use of RBD storage for data (video archive).

Any further ideas from the community on what i can try to solve this issue once and for all?

I saw someone in an earlier post referring to disabling the krbd flag in the disk config, is that correct?


Thanks in advance.
Please try to pin kernel Linux 5.13.19-6-pve for now. That has helped with several of the problems that have come up with 7.2-4 over the last few days.
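A sketch of what the pinning could look like (assuming a pve-kernel-helper whose proxmox-boot-tool already has the kernel pin subcommand; on older versions you would set the GRUB default manually instead):

  # make sure the kernel is (still) installed
  apt install pve-kernel-5.13.19-6-pve

  # list the known kernels and pin the 5.13 one
  proxmox-boot-tool kernel list
  proxmox-boot-tool kernel pin 5.13.19-6-pve

  # reboot the node to boot the pinned kernel
  reboot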
 
May 28, 2020
South Africa
Hey itNGO

Thanks for your input -

My issue is that I already "cleverly" purged all the 5.13.* kernels from my cluster - so there will be some work in getting the 5.13 kernel downloaded and redeployed to the entire cluster.

Was hoping for a quicker fix. :)
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
South Tyrol/Italy
shop.proxmox.com
Interestingly, the hosts that powered off weren't actually included in the backup, but all make use of RBD storage for data (video archive).
That sounds a bit odd... Anything in the kernel log / journalctl? It could even be an OOM kill (which would then be rather unrelated to any such bug, and a sign of memory over-commitment, if it really was an OOM kill).
I saw someone in an earlier post referring to disabling the krbd flag in the disk config, is that correct?
Enabling KRBD, not disabling it. What can also help is enabling IO-Thread (and the VirtIO SCSI single controller, if you're using SCSI for the VM disks), and using the default No cache mode for the disks: other modes like writeback, while sometimes seemingly faster, can cause much more erratic IO and make the guest more sensitive to host memory pressure.
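If it helps, those settings can also be applied on the CLI, roughly like this (a sketch; the storage name, VMID and disk volume are placeholders, and a VM stop/start is needed for the changes to take effect):

  # enable KRBD on the RBD storage definition
  pvesm set my-rbd-storage --krbd 1

  # switch the VM to the VirtIO SCSI single controller
  qm set 100 --scsihw virtio-scsi-single

  # re-attach the disk with IO thread enabled and the default "No cache" mode
  qm set 100 --scsi0 my-rbd-storage:vm-100-disk-0,iothread=1,cache=none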
 

petemcdonnell

New Member
Oct 23, 2021
We completed a PVE and Ceph upgrade from 6.4 to 7.2 on Friday night. A 9.5 TB NVR VM backup started shortly thereafter. We noticed that operation of the VM ground to a halt and backup performance was fairly slow (71 Mbps read, 52 Mbps write).
I applied the update to pve-qemu-kvm as described above and restarted just that NVR VM. I also enabled the IO-Thread option for the VM drives.
After a restart of that VM, I started a new backup. VM operation is great even with the backup running. Also of note, backup performance is substantially better (better than on 6.4 as well). The backup is now running at ~225 Mbps read, 212 Mbps write!

Is it a concern that I only restarted this one VM to use the newer version of pve-qemu-kvm? Restarting all other VMs is a big project which requires a lot of coordination.
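For context, checking which QEMU binary a given VM is currently running should be possible with something like this (a sketch, assuming the verbose VM status exposes a running-qemu field):

  # QEMU version the running VM was started with (VMID 100 as an example)
  qm status 100 --verbose | grep running-qemu

  # version currently installed on the node
  pveversion -v | grep pve-qemu-kvm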
 

t.lamprecht

Proxmox Staff Member
Staff member
Jul 28, 2015
South Tyrol/Italy
shop.proxmox.com
Thanks for your feedback.
Is it a concern that I only restarted this one VM to use the newer version of pve-qemu-kvm? Restarting all other VMs is a big project which requires a lot of coordination.
Is this a cluster? If so, you could live-migrate the VMs to already-updated PVE hosts; that way they also get started with the new PVE QEMU version without any interruption in the guest itself.
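A sketch of what that could look like per VM (node name and VMID are placeholders):

  # live-migrate VM 100 to an already-updated node named "node2"
  qm migrate 100 node2 --online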
 

petemcdonnell

New Member
Oct 23, 2021
Thanks for your feedback.

Is this a cluster? If so, you could live-migrate the VMs to already-updated PVE hosts; that way they also get started with the new PVE QEMU version without any interruption in the guest itself.
That's a good idea. Thanks for the reply!
 
