Problems with Proxmox 8.1.4 and creating snapshots

Jan 26, 2024
Hello everyone,
since yesterday morning (when I upgraded several systems from 8.1.3 to 8.1.4 from the enterprise repository) I have had serious problems creating snapshots.
3 clusters with 3 nodes each, with a CEPH datastore
5 servers with a ZFS datastore

The behavior is the same.

When I create a snapshot (I tried without RAM), if the machine is big enough, the snapshot window stays active and the procedure never finishes. During several attempts I had no problems with small VMs, but with VMs from roughly 300 GB of storage upward it happens every time.
The VM loses connectivity on the AGENT (the IP of the VM is no longer displayed in the Proxmox panel); the agent seems to be working, it is running, but the only way to get back to full functionality is to restart the VM. Verified with Debian, Ubuntu, and Windows Server 2019 VMs.
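For anyone who wants to reproduce the same check from the CLI, a rough sketch (VMID 326 and the snapshot name test-snap are just examples taken from this thread):

# take a snapshot without RAM state, as in the GUI with "Include RAM" unticked
qm snapshot 326 test-snap --vmstate 0

# in another shell, check whether the guest agent still answers
qm agent 326 ping
qm guest cmd 326 network-get-interfaces

If the bug is triggered, the snapshot task never returns and the agent calls time out even though the agent process is still running inside the guest.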


I attach a few screenshots.

[attached screenshots: 1706258849222.png, 1706258885508.png]
 
The "no more storage space" message is a false error (during the snapshot stop process).
The datastore has several Tbytes free and the same problem happens to me with both ZFS and CEPH on different servers.
In fact, many machines use IOTHREAD, but... backups happen correctly without the slightest problem, it's just the snapshots that don't work.
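Just to show how the free space was double-checked on the storage layers themselves (plain status commands, nothing specific to this setup):

# CEPH: pool and cluster usage
ceph df

# ZFS: dataset usage
zfs list -o name,used,avail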
 
This one:

root@iml-host-px03:/etc/pve/qemu-server# cat 326.conf
#SONDA YOROI TYPE 2 NODE 3
#
#- MANAGEMENT
#- DMZ
#- DMZ-2
agent: 1
balloon: 0
boot: order=scsi0;ide2
cores: 4
cpu: Skylake-Server-noTSX-IBRS
ide2: none,media=cdrom
memory: 16384
meta: creation-qemu=7.2.0,ctime=1682428407
name: SRVSNDGENKUT2N3
net0: virtio=72:7C:2D:AD:C1:48,bridge=vmbr8
net1: virtio=AE:03:95:04:B1:F8,bridge=vmbr1
net2: virtio=CE:A4:EA:86:A6:0B,bridge=vmbr2
numa: 0
onboot: 1
ostype: l26
protection: 1
scsi0: CEPH-RBD:vm-326-disk-0,discard=on,iothread=1,size=300G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=86a1e7d9-b8f8-42ef-9a07-d43ad43ebac3
sockets: 2
tablet: 0
tags: sonda;sonda_genku_06
vmgenid: 78d8c65e-2909-412f-bc7c-445b9f353926

*********************************

... but it's just an example
I might try disabling IOTHREAD...
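For reference, a minimal sketch of how IOTHREAD can be disabled on that disk from the CLI (VMID 326 and the scsi0 line are copied from the config above; the VM needs a full stop/start afterwards for the change to take effect):

qm set 326 --scsi0 CEPH-RBD:vm-326-disk-0,discard=on,iothread=0,size=300G,ssd=1
qm stop 326 && qm start 326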
 
Thank you! I see that you have protection activated. Does each of the affected VMs have this? Theoretically it shouldn't matter, but perhaps an error has crept in somewhere in combination with it. Have you tried it without protection, or with a newly created VM in general?
 
Hmmmm, I tried disabling the IOTHREAD and... now on this VM the snapshots are instantaneous.
Regarding storage with IOTHREAD, I have a long-standing open issue.
I currently use pve-qemu-kvm at patchlevel 5 and not the latest release, 6, because 6, like 4, gave me stuck I/O on datastores with IOTHREAD active.
Patchlevel 5, on the other hand, solves that problem but introduces excessive CPU load. I know, but the VMs run "better".

I know Fiona is working on a final patch for this; it was already posted on the devel list yesterday.

Yes, all the VMs I use have protection enabled. But I don't think the problem is related to that; in fact, on some VMs the snapshot works and on others it doesn't, even on different clusters.

It could really be a problem related to the IOTHREAD...
 
Interesting

A VM that was methodically giving me errors yesterday when activating the snapshot now, restarted with IOTHREAD disabled, activates the snapshot in a few moments.

I don't know if it's size related; I've seen VMs with 50 GB of storage activate the snapshot with no problems, while this one, with 2 disks, one of which is 6 TB, always gave problems.

Now it works great.

I guess I'll have to wait for the patchlevel 7 that Fiona is working on to finally solve the problem.

I will still do more testing, but at the moment the cause seems to be the active IOTHREAD...

PS "sb-jw", ... I didn't mean to be rude in the way I answered you earlier.
 
Hi,
A VM that was methodically giving me errors yesterday when activating the snapshot now, restarted with IOTHREAD disabled, activates the snapshot in a few moments.
so this was also with pve-qemu-kvm=8.1.2-5?

Is the krbd setting active for the CEPH-RBD storage in /etc/pve/storage.cfg? If not, then RBD snapshots will be taken via QEMU -> librbd rather than via the RBD storage plugin in Proxmox VE. That is also something you could test.
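As a sketch of what that test could look like, the storage definition in /etc/pve/storage.cfg would change to the following (names taken from this thread; running VMs have to be stopped and started again to switch from librbd to the kernel client):

rbd: CEPH-RBD
        content rootdir,images
        krbd 1
        pool ceph_pool

The same toggle should also be possible from the CLI with: pvesm set CEPH-RBD --krbd 1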
 
Hi Fiona,

yes, on all the clusters I have (CEPH datastore) and on all the individual servers (ZFS datastore) I use pve-qemu-kvm version 8.1.2-5.
I know it raises CPU load abnormally, but I experienced stuck storage on the -4 version (and, if I understand correctly, -6 is a rollback to -4), so I opted for the most convenient solution for me (until the problem is resolved).
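In case it is useful, a rough sketch of how that patchlevel can be kept pinned with plain apt commands until a fixed build is published:

apt install pve-qemu-kvm=8.1.2-5
apt-mark hold pve-qemu-kvm
# once the fixed build is out: apt-mark unhold pve-qemu-kvm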

In the meantime, I can tell you that I've done more tests on other VMs that were methodically hitting the "snapshot freeze", and with IOTHREAD disabled on their storage they now run smoothly.

And no, KRBD is disabled; this is the configuration of my datastore:

rbd: CEPH-RBD
        content rootdir,images
        krbd 0
        pool ceph_pool
 
After dozens more trials and tests, I can confirm it.
Where yesterday the VM snapshot would freeze, with IOTHREAD disabled it now works perfectly.
Not a single problem after dozens and dozens of tests.
Please note that I use pve-qemu-kvm patchlevel -5; I don't know if -4 would behave differently.

Backups (to PBS), on the other hand, give no problems at all, with or without IOTHREAD (OK, as is well known there is the additional CPU load with patchlevel 5, but other than that...).
 
If only the patches for the rebase onto 8.1.5 are applied, the fixes for the IOthread issue won't be included. But they are not in conflict (except for patch numbering which is easily fixed up), so both series can be applied.
 
OK, so we will have to wait for a subsequent revision of QEMU 8.1.5 for the IOTHREAD-related patches to be included, perhaps after a stabilization process...
 
That depends on when the patch for the iothread issue is applied. If it is applied before the next version bump of the package, it would be in the first revision.
 
Hi @fiona,

I saw that the 8.1.5 release of QEMU was moved to testing; reading the list of patches it introduces, I don't seem to see the one related to IOTHREAD, correct? In that case I will stay on 8.1.2-5, the only one that works for me without problems (apart from the additional CPU load).
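As a side note, the list of patches shipped in a given pve-qemu-kvm build can also be read from its Debian changelog on an installed node; a minimal sketch:

zless /usr/share/doc/pve-qemu-kvm/changelog.Debian.gz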
 
Yes; while the patch has now been applied in git, the version with that fix has not yet been moved to the public repositories. It should be there soonish if no problems pop up during internal testing.
 
FYI, the version with the fix pve-qemu-kvm=8.1.5-2 is now available on the no-subscription repository.
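For anyone upgrading from the no-subscription repository, roughly (repository line for Proxmox VE 8 on Debian bookworm; adjust release names to your own setup):

# /etc/apt/sources.list.d/pve-no-subscription.list
deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription

apt update
apt install pve-qemu-kvm
# if the package was held earlier: apt-mark unhold pve-qemu-kvm first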
 
