The other day, I had to do some updates on several VMs across our network and wanted to take snapshots including RAM beforehand. The machines run on different PVE clusters, each consisting of three nodes with PVE 7.2-4 and hyperconverged ceph 16.2.9 underneath. All VMs have 32GB of RAM. When the snapshot tasks of the machines with high RAM utilization didn't complete after some time, I took a look at the logs and saw that they were still trying to save their RAM, with little to no progress over several minutes. A look at the ceph logs revealed that at some point during this process, IOPS went up to about 3000 while write speed dropped to ~3MB/s. When I stopped the snapshot tasks, the VMs crashed, leaving them stopped and locked.
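For reference, the CLI equivalent of such a snapshot including RAM should be something like this (VM ID and snapshot name are just placeholders):

qm snapshot <vmid> pre-update --vmstate 1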
Although the snapshot processes never finished, the snapshots would still be listed, but trying to remove them through the GUI didn't work. The only way to remove the faulty snapshots was through the CLI with
qm delsnapshot <vmid> <snapshotname> --force
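Since the crashed VMs are also left locked, the lock presumably has to be cleared as well before they can be started again, e.g.:

qm unlock <vmid>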
So, I'm facing two major issues at the moment:
1. Stopping a snapshot process under PVE 7.2-4 ends up crashing and locking the VM.
2. While snapshotting the RAM of a VM, ceph starts to underperform heavily after some time.
I did some testing with another PVE cluster in our network, still running PVE 6.4 and ceph 14.2.22, and didn't encounter either of those issues. Furthermore, I was unable to replicate issue #2 with any storage backend other than ceph.
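In case someone wants to check issue #2 on their own setup, a minimal repro attempt could look roughly like this (tool, VM ID and sizes are placeholders, not our exact setup): keep a large part of the guest RAM busy, take a RAM snapshot, and watch the ceph client IO while the vmstate is being written.

# inside the guest: keep ~24G of RAM allocated and constantly re-dirtied
stress-ng --vm 4 --vm-bytes 24G --vm-keep

# on the PVE node, in a second shell: snapshot including RAM, same call as above
qm snapshot <vmid> ramtest --vmstate 1

# on any ceph node: watch client IO / write throughput during the snapshot
watch -n 2 ceph -s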