Problem snapshotting VMs with current PVE version (2024-06-27) using pve-qemu-kvm >= 8.2

meyergru

I have just updated some PVE instances to the newest kernel 6.8.8-2-pve and observe a strange new behaviour which I cannot fully pin down:

When I snapshot my VMs, the snapshot is created, but afterwards some of the VMs stay shut down. Some VMs are affected, some are not, and I cannot see any notable difference between them.

This problem was seen on both LVM- and ZFS-based PVE instances just after I rebooted them into kernel 6.8.8-2-pve yesterday. My first guess was that the kernel was the culprit, but neither pinning 6.8.4-3-pve nor 6.8.8-1-pve helped.
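
For reference, the kernel pinning mentioned above can be done with proxmox-boot-tool, roughly like this (a sketch; the version string is just one of the kernels I tried):

Code:
proxmox-boot-tool kernel list            # show installed kernels
proxmox-boot-tool kernel pin 6.8.4-3-pve
proxmox-boot-tool kernel unpin           # remove the pin again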

In the case of LVM, there was this task log:

Code:
snapshotting 'drive-virtio0' (local-lvm:vm-601-disk-1)
  WARNING: You have not turned on protection against thin pools running out of space.
  WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
  Consider pruning pve VG archive with more then 4500 MiB in 25953 files (check archiving is needed in lvm.conf).
  Consider pruning pve VG archive with more then 4501 MiB in 25954 files (check archiving is needed in lvm.conf).
  Logical volume "snap_vm-601-disk-1_xxx" created.
  WARNING: Sum of all thin volume sizes (<18.53 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (<931.01 GiB).
snapshotting 'drive-efidisk0' (local-lvm:vm-601-disk-0)
  WARNING: You have not turned on protection against thin pools running out of space.
  WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
  Consider pruning pve VG archive with more then 4501 MiB in 25955 files (check archiving is needed in lvm.conf).
  Consider pruning pve VG archive with more then 4501 MiB in 25956 files (check archiving is needed in lvm.conf).
  Logical volume "snap_vm-601-disk-0_xxx" created.
  WARNING: Sum of all thin volume sizes (<18.53 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (<931.01 GiB).
VM 601 qmp command 'savevm-end' failed - client closed connection
TASK OK

The warnings can most likely be ignored; there is enough space for these snapshots.
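
To double-check that, thin pool usage can be inspected with lvs, e.g. (a sketch; pve/data is the default thin pool):

Code:
lvs -o lv_name,lv_size,data_percent,metadata_percent pve/data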

The task log for the same problem on a ZFS-based PVE instance:

Code:
snapshotting 'drive-virtio0' (local-zfs:vm-601-disk-1)
snapshotting 'drive-virtio1' (local-zfs:vm-601-disk-2)
snapshotting 'drive-efidisk0' (local-zfs:vm-601-disk-0)
snapshotting 'drive-tpmstate0' (local-zfs:vm-601-disk-3)
VM 601 qmp command 'savevm-end' failed - client closed connection
guest-fsfreeze-thaw problems - VM 601 not running
TASK OK

Because of this, I suspected that the problem was with freeze/thaw and the QEMU guest agent, but it persists even with the guest agent disabled. I found no other obvious correlation, such as machine type, that could explain why some VMs do not restart while others do.
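
For the record, I disabled the agent per VM roughly like this (a sketch; VMID 601 as in the logs below, and the change only takes effect after a full stop/start of the VM):

Code:
qm set 601 --agent 0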

I found this in the PVE logs:

Code:
Jun 28 01:51:44 ironside pvedaemon[20518]: <root@pam> snapshot VM 601: test
Jun 28 01:51:44 ironside pvedaemon[14091]: <root@pam> starting task UPID:ironside:00005026:0004E403:667DFB10:qmsnapshot:601:root@pam:
Jun 28 01:51:44 ironside dmeventd[537]: No longer monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside dmeventd[537]: Monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside dmeventd[537]: No longer monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside dmeventd[537]: Monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside QEMU[16755]: kvm: ../block/graph-lock.c:260: bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed.
Jun 28 01:51:44 ironside pvedaemon[20518]: VM 601 qmp command failed - VM 601 qmp command 'savevm-end' failed - client closed connection
Jun 28 01:51:44 ironside pvedaemon[20518]: VM 601 qmp command 'savevm-end' failed - client closed connection
Jun 28 01:51:44 ironside kernel: fwbr601i0: port 2(tap601i0) entered disabled state
Jun 28 01:51:44 ironside kernel: tap601i0 (unregistering): left allmulticast mode
Jun 28 01:51:44 ironside kernel: fwbr601i0: port 2(tap601i0) entered disabled state
Jun 28 01:51:45 ironside systemd[1]: 601.scope: Deactivated successfully.
Jun 28 01:51:45 ironside systemd[1]: 601.scope: Consumed 1min 13.894s CPU time.
Jun 28 01:51:45 ironside pvedaemon[14091]: <root@pam> end task UPID:ironside:00005026:0004E403:667DFB10:qmsnapshot:601:root@pam: OK
Jun 28 01:51:45 ironside qmeventd[20548]: Starting cleanup for 601
Jun 28 01:51:45 ironside kernel: fwbr601i0: port 1(fwln601i0) entered disabled state
Jun 28 01:51:45 ironside kernel: vmbr0: port 2(fwpr601p0) entered disabled state
Jun 28 01:51:45 ironside kernel: fwln601i0 (unregistering): left allmulticast mode
Jun 28 01:51:45 ironside kernel: fwln601i0 (unregistering): left promiscuous mode
Jun 28 01:51:45 ironside kernel: fwbr601i0: port 1(fwln601i0) entered disabled state
Jun 28 01:51:45 ironside kernel: fwpr601p0 (unregistering): left allmulticast mode
Jun 28 01:51:45 ironside kernel: fwpr601p0 (unregistering): left promiscuous mode
Jun 28 01:51:45 ironside kernel: vmbr0: port 2(fwpr601p0) entered disabled state
Jun 28 01:51:45 ironside qmeventd[20548]: Finished cleanup for 601

This line could be the culprit:
Jun 28 01:51:44 ironside QEMU[16755]: kvm: ../block/graph-lock.c:260: bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed.
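
To check whether other hosts or VMs ran into the same assertion, grepping the journal should work (a generic sketch):

Code:
journalctl -b | grep -iE 'graph-lock|savevm-end'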


I think that some of the other packages that were updated may be responsible; my apt history shows:

Code:
Start-Date: 2024-06-19  07:59:21
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade --with-new-pkgs --auto-remove
Install: proxmox-kernel-6.8.8-1-pve-signed:amd64 (6.8.8-1, automatic)
Upgrade: libpve-rs-perl:amd64 (0.8.8, 0.8.9), pve-firmware:amd64 (3.11-1, 3.12-1), zfs-zed:amd64 (2.2.3-pve2, 2.2.4-pve1), zfs-initramfs:amd64 (2.2.3-pve2, 2.2.4-pve1), spl:amd64 (2.2.3-pve2, 2.2.4-pve1), libnvpair3linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-cluster-api-perl:amd64 (8.0.6, 8.0.7), pve-ha-manager:amd64 (4.0.4, 4.0.5), libuutil3linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-storage-perl:amd64 (8.2.1, 8.2.2), libzpool5linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-guest-common-perl:amd64 (5.1.2, 5.1.3), proxmox-kernel-6.8:amd64 (6.8.4-3, 6.8.8-1), pve-cluster:amd64 (8.0.6, 8.0.7), proxmox-backup-file-restore:amd64 (3.2.3-1, 3.2.4-1), pve-esxi-import-tools:amd64 (0.7.0, 0.7.1), pve-container:amd64 (5.1.10, 5.1.12), proxmox-backup-client:amd64 (3.2.3-1, 3.2.4-1), pve-manager:amd64 (8.2.2, 8.2.4), libpve-notify-perl:amd64 (8.0.6, 8.0.7), libzfs4linux:amd64 (2.2.3-pve2, 2.2.4-pve1), zfsutils-linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-cluster-perl:amd64 (8.0.6, 8.0.7)
End-Date: 2024-06-19  08:05:31

Start-Date: 2024-06-20  06:08:14
Commandline: /usr/bin/unattended-upgrade
Remove: proxmox-kernel-6.8.4-2-pve-signed:amd64 (6.8.4-2)
End-Date: 2024-06-20  06:08:19

Start-Date: 2024-06-26  13:55:03
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade --with-new-pkgs --auto-remove
Upgrade: libpve-storage-perl:amd64 (8.2.2, 8.2.3)
End-Date: 2024-06-26  13:55:24

Start-Date: 2024-06-27  22:05:15
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade --with-new-pkgs --auto-remove
Install: proxmox-kernel-6.8.8-2-pve-signed:amd64 (6.8.8-2, automatic)
Upgrade: pve-qemu-kvm:amd64 (8.1.5-6, 9.0.0-3), proxmox-kernel-6.8:amd64 (6.8.8-1, 6.8.8-2)
End-Date: 2024-06-27  22:08:27

Most notably, there was a huge jump in pve-qemu-kvm:amd64 (8.1.5-6, 9.0.0-3), so I guess the problem lies there, or possibly in the updated storage drivers (libpve-storage-perl:amd64 (8.2.2, 8.2.3)).
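
To see which versions are actually installed, something like this should do (a sketch):

Code:
pveversion -v | grep -E 'qemu|kernel'
dpkg -l pve-qemu-kvm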


This struck me hard because I use cv4pve-snapshot, which snapshots every hour; I had to disable that for now. However, the problem can be reproduced via the GUI as well.
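
The GUI reproduction should be equivalent to a plain CLI snapshot without RAM, roughly (a sketch; the snapshot name is arbitrary):

Code:
qm snapshot 601 test          # no --vmstate, i.e. "Include RAM" unchecked
qm delsnapshot 601 test       # clean up afterwards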
 
Maybe roll back to the previous pve-qemu-kvm release and see if that works better?

If it helps, this approach works for staying on the v8 series of QEMU.
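
One way to do that is an apt pin along these lines (a sketch; the file name under /etc/apt/preferences.d/ is arbitrary):

Code:
Explanation: keep pve-qemu-kvm on the 8.1 series
Package: pve-qemu-kvm
Pin: version 8.1.*
Pin-Priority: 1001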
 
Yes, this sounds related to the upgrade from QEMU 8.1 to 9.0, as that introduced graph locking and the assertion you hit.
But since snapshots work in general for us, I'd think this needs some extra trigger that we do not have in our test lab or production loads.
So can you please also post the configuration of an affected VM, e.g. with qm config VMID?
 
I can confirm: rolling back via apt-get install pve-qemu-kvm:amd64=8.1.5-6 plus a reboot made the problem go away.
But after that, upgrading to 8.2.2-1 and rebooting introduced the problem again. So it is not only 9.x, but already 8.2.x. In my case, the original jump was straight from 8.1.5-6 to 9.0.0-3 without an intermediate step, so I did not notice this earlier.

FWIW, here is a config of one of the affected VMs:

Code:
agent: 1,freeze-fs-on-backup=0
balloon: 0
bios: ovmf
boot: order=virtio0
cores: 4
cpu: custom-mine
description: # Docker VM
efidisk0: local-zfs:vm-601-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: local-zfs:vm-601-cloudinit,media=cdrom
ide2: none,media=cdrom
machine: q35
memory: 10240
meta: creation-qemu=7.1.0,ctime=1673135587
name: docker
net0: virtio=C9:A9:9C:99:99:99,bridge=vmbr0,firewall=1,tag=10
numa: 0
onboot: 1
ostype: l26
parent: xxxx
protection: 1
scsihw: virtio-scsi-single
smbios1: uuid=13da4be1-0166-4ab7-a392-d555c7d07711
sockets: 1
tags: lan
tpmstate0: local-zfs:vm-601-disk-3,size=4M,version=v2.0
vga: virtio
virtio0: local-zfs:vm-601-disk-1,discard=on,iothread=1,size=112G
virtio1: local-zfs:vm-601-disk-2,discard=on,iothread=1,size=4G
vmgenid: 7ec935a5-847b-4949-8df2-dd58f2fe6cce

P.S. @justinclift: Thanks for the pinning tip. That keeps things working for the time being if you pin 8.1.*.
 
Do you use VirtIO Block for the disks of all affected VMs? I.e., does this also happen with a VM that uses (VirtIO) SCSI?

P.S.: You should be able to use apt-mark hold pve-qemu-kvm as a slightly simpler way to hold back a package from upgrading. apt-mark unhold <pkgs> reverses this, and apt-mark showhold shows all currently held-back packages.
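
For example:

Code:
apt-mark hold pve-qemu-kvm       # keep the package from being upgraded
apt-mark showhold                # list held packages
apt-mark unhold pve-qemu-kvm     # release the hold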
 
Right. As soon as I detached the disk and reattached it via SCSI instead of VirtIO, the problem went away. I have most of my VMs on VirtIO Block disks because of reportedly higher performance. I did not notice this earlier, because the "controller" still shows as VirtIO SCSI (single) in the VM view; you can only see the difference in the hardware section.
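
For reference, the detach/reattach can also be done on the CLI, roughly like this (a sketch with the disk and VMID from my config above; the detached disk first shows up as unused0, and the guest may need adjustments if it references /dev/vda directly):

Code:
# with the VM shut down
qm set 601 --delete virtio0
qm set 601 --scsi0 local-zfs:vm-601-disk-1,discard=on,iothread=1
qm set 601 --boot order=scsi0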

And yes, apt-mark hold works fine, too. Plus, you can see which version it would be updated to if it were not held back.
 
Thanks for your feedback. It seems like VirtIO Block has some issues with (non-live) snapshots of VMs using the q35 machine type and OVMF (this part was a red herring; it's mostly snapshots without VM state), at least that's how closely I could narrow it down after I could not reproduce this on some existing VMs. We'll take a look.

meyergru said:
I have most of my VMs with virtio disks because of reportedly higher performance. I did not see this earlier, because the "controller" is still Virtio SCSI (single) in the VM view. You can only see the difference in the hardware section.
Yes, VirtIO provides better performance, but VirtIO Block (the older approach) and SCSI on a VirtIO SCSI controller both provide very good performance, with the SCSI variant being the slightly better choice most of the time. Nonetheless, failing snapshots are clearly a bug, but using SCSI can be a good workaround, at least for Linux-based VMs, which are relatively flexible when it comes to hardware changes.
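
The difference is also visible in the VM config itself: the bus is determined by the disk key, not by the scsihw line. Roughly (lines based on the config posted above):

Code:
scsihw: virtio-scsi-single                                        # controller shown in the GUI in both cases
virtio0: local-zfs:vm-601-disk-1,discard=on,iothread=1,size=112G  # VirtIO Block (affected)
scsi0: local-zfs:vm-601-disk-1,discard=on,iothread=1,size=112G    # same disk attached via SCSI instead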
 
I am pretty sure that the problem is not limited to q35, as I have another affected machine with this config:

Code:
agent: 0
bios: ovmf
boot: order=virtio0
cores: 4
cpu: host
description: # Docker VM
efidisk0: local-lvm:vm-601-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
ide2: none,media=cdrom
memory: 4096
meta: creation-qemu=7.1.0,ctime=1673135587
name: docker
net0: virtio=C9:A9:9C:99:99:97,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
parent: autohourly240628010115
protection: 1
scsihw: virtio-scsi-single
smbios1: uuid=c5d8417e-db17-4720-a879-8e385c163663
sockets: 1
tags: lan
vga: virtio
virtio0: local-lvm:vm-601-disk-1,discard=on,iothread=1,size=64G
vmgenid: 6a287cab-6e6d-4565-a6e4-517347eb43e5
 
Yes, that was a red herring from my initial testing; I already edited my previous reply (probably at the same moment you posted this ^^).

FWICT, it's now mostly important to have VirtIO Block as the disk bus and to skip saving the VM state (i.e. uncheck the "Include RAM" checkbox) when taking the snapshot to trigger this.
 
