Problem snapshotting VMs with current PVE version (2024-06-27) using pve-qemu-kvm >= 8.2

meyergru

I have just updated some PVE instances to the newest kernel 6.8.8-2-pve and observe a strange new behaviour which I cannot fully pin down:

When I snapshot my VMs, the snapshot is created, but afterwards some of the VMs stay shut down. Some VMs are affected, some are not, and I cannot see any notable difference between them.

This problem was seen on both LVM- and ZFS-based PVE instances just after I rebooted them into kernel 6.8.8-2-pve yesterday. My first guess was that the kernel was the culprit, but neither pinning 6.8.4-3-pve nor 6.8.8-1-pve helped.
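
For reference, the kernel pinning mentioned above can be done with proxmox-boot-tool, roughly like this (a sketch; the version string is just one of the kernels I tried):

Code:
proxmox-boot-tool kernel list            # show installed kernels
proxmox-boot-tool kernel pin 6.8.4-3-pve
proxmox-boot-tool kernel unpin           # remove the pin again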

In the case of LVM, there was this task log:

Code:
snapshotting 'drive-virtio0' (local-lvm:vm-601-disk-1)
  WARNING: You have not turned on protection against thin pools running out of space.
  WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
  Consider pruning pve VG archive with more then 4500 MiB in 25953 files (check archiving is needed in lvm.conf).
  Consider pruning pve VG archive with more then 4501 MiB in 25954 files (check archiving is needed in lvm.conf).
  Logical volume "snap_vm-601-disk-1_xxx" created.
  WARNING: Sum of all thin volume sizes (<18.53 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (<931.01 GiB).
snapshotting 'drive-efidisk0' (local-lvm:vm-601-disk-0)
  WARNING: You have not turned on protection against thin pools running out of space.
  WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
  Consider pruning pve VG archive with more then 4501 MiB in 25955 files (check archiving is needed in lvm.conf).
  Consider pruning pve VG archive with more then 4501 MiB in 25956 files (check archiving is needed in lvm.conf).
  Logical volume "snap_vm-601-disk-0_xxx" created.
  WARNING: Sum of all thin volume sizes (<18.53 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (<931.01 GiB).
VM 601 qmp command 'savevm-end' failed - client closed connection
TASK OK

The warnings can most likely be ignored; there is enough space for these snapshots.
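
To double-check that, thin pool usage can be inspected with lvs, e.g. (a sketch; pve/data is the default thin pool):

Code:
lvs -o lv_name,lv_size,data_percent,metadata_percent pve/data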

The task log for the same problem on a ZFS-based PVE instance:

Code:
snapshotting 'drive-virtio0' (local-zfs:vm-601-disk-1)
snapshotting 'drive-virtio1' (local-zfs:vm-601-disk-2)
snapshotting 'drive-efidisk0' (local-zfs:vm-601-disk-0)
snapshotting 'drive-tpmstate0' (local-zfs:vm-601-disk-3)
VM 601 qmp command 'savevm-end' failed - client closed connection
guest-fsfreeze-thaw problems - VM 601 not running
TASK OK

Because of this, I suspected that the problem was with freeze/thaw and the QEMU guest agent, but it persists even with the guest agent disabled. I found no other obvious correlation, such as machine type, that could explain why some VMs do not restart while others do.
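
For the record, I disabled the agent per VM roughly like this (a sketch; VMID 601 as in the logs below, and the change only takes effect after a full stop/start of the VM):

Code:
qm set 601 --agent 0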

I found this in the PVE logs:

Code:
Jun 28 01:51:44 ironside pvedaemon[20518]: <root@pam> snapshot VM 601: test
Jun 28 01:51:44 ironside pvedaemon[14091]: <root@pam> starting task UPID:ironside:00005026:0004E403:667DFB10:qmsnapshot:601:root@pam:
Jun 28 01:51:44 ironside dmeventd[537]: No longer monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside dmeventd[537]: Monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside dmeventd[537]: No longer monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside dmeventd[537]: Monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside QEMU[16755]: kvm: ../block/graph-lock.c:260: bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed.
Jun 28 01:51:44 ironside pvedaemon[20518]: VM 601 qmp command failed - VM 601 qmp command 'savevm-end' failed - client closed connection
Jun 28 01:51:44 ironside pvedaemon[20518]: VM 601 qmp command 'savevm-end' failed - client closed connection
Jun 28 01:51:44 ironside kernel: fwbr601i0: port 2(tap601i0) entered disabled state
Jun 28 01:51:44 ironside kernel: tap601i0 (unregistering): left allmulticast mode
Jun 28 01:51:44 ironside kernel: fwbr601i0: port 2(tap601i0) entered disabled state
Jun 28 01:51:45 ironside systemd[1]: 601.scope: Deactivated successfully.
Jun 28 01:51:45 ironside systemd[1]: 601.scope: Consumed 1min 13.894s CPU time.
Jun 28 01:51:45 ironside pvedaemon[14091]: <root@pam> end task UPID:ironside:00005026:0004E403:667DFB10:qmsnapshot:601:root@pam: OK
Jun 28 01:51:45 ironside qmeventd[20548]: Starting cleanup for 601
Jun 28 01:51:45 ironside kernel: fwbr601i0: port 1(fwln601i0) entered disabled state
Jun 28 01:51:45 ironside kernel: vmbr0: port 2(fwpr601p0) entered disabled state
Jun 28 01:51:45 ironside kernel: fwln601i0 (unregistering): left allmulticast mode
Jun 28 01:51:45 ironside kernel: fwln601i0 (unregistering): left promiscuous mode
Jun 28 01:51:45 ironside kernel: fwbr601i0: port 1(fwln601i0) entered disabled state
Jun 28 01:51:45 ironside kernel: fwpr601p0 (unregistering): left allmulticast mode
Jun 28 01:51:45 ironside kernel: fwpr601p0 (unregistering): left promiscuous mode
Jun 28 01:51:45 ironside kernel: vmbr0: port 2(fwpr601p0) entered disabled state
Jun 28 01:51:45 ironside qmeventd[20548]: Finished cleanup for 601

This line could be the culprit:
Jun 28 01:51:44 ironside QEMU[16755]: kvm: ../block/graph-lock.c:260: bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed.
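
To check whether other hosts or VMs ran into the same assertion, grepping the journal should work (a generic sketch):

Code:
journalctl -b | grep -iE 'graph-lock|savevm-end'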


I think that some of the other packages that were updated may be responsible; my apt history shows:

Code:
Start-Date: 2024-06-19  07:59:21
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade --with-new-pkgs --auto-remove
Install: proxmox-kernel-6.8.8-1-pve-signed:amd64 (6.8.8-1, automatic)
Upgrade: libpve-rs-perl:amd64 (0.8.8, 0.8.9), pve-firmware:amd64 (3.11-1, 3.12-1), zfs-zed:amd64 (2.2.3-pve2, 2.2.4-pve1), zfs-initramfs:amd64 (2.2.3-pve2, 2.2.4-pve1), spl:amd64 (2.2.3-pve2, 2.2.4-pve1), libnvpair3linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-cluster-api-perl:amd64 (8.0.6, 8.0.7), pve-ha-manager:amd64 (4.0.4, 4.0.5), libuutil3linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-storage-perl:amd64 (8.2.1, 8.2.2), libzpool5linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-guest-common-perl:amd64 (5.1.2, 5.1.3), proxmox-kernel-6.8:amd64 (6.8.4-3, 6.8.8-1), pve-cluster:amd64 (8.0.6, 8.0.7), proxmox-backup-file-restore:amd64 (3.2.3-1, 3.2.4-1), pve-esxi-import-tools:amd64 (0.7.0, 0.7.1), pve-container:amd64 (5.1.10, 5.1.12), proxmox-backup-client:amd64 (3.2.3-1, 3.2.4-1), pve-manager:amd64 (8.2.2, 8.2.4), libpve-notify-perl:amd64 (8.0.6, 8.0.7), libzfs4linux:amd64 (2.2.3-pve2, 2.2.4-pve1), zfsutils-linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-cluster-perl:amd64 (8.0.6, 8.0.7)
End-Date: 2024-06-19  08:05:31

Start-Date: 2024-06-20  06:08:14
Commandline: /usr/bin/unattended-upgrade
Remove: proxmox-kernel-6.8.4-2-pve-signed:amd64 (6.8.4-2)
End-Date: 2024-06-20  06:08:19

Start-Date: 2024-06-26  13:55:03
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade --with-new-pkgs --auto-remove
Upgrade: libpve-storage-perl:amd64 (8.2.2, 8.2.3)
End-Date: 2024-06-26  13:55:24

Start-Date: 2024-06-27  22:05:15
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade --with-new-pkgs --auto-remove
Install: proxmox-kernel-6.8.8-2-pve-signed:amd64 (6.8.8-2, automatic)
Upgrade: pve-qemu-kvm:amd64 (8.1.5-6, 9.0.0-3), proxmox-kernel-6.8:amd64 (6.8.8-1, 6.8.8-2)
End-Date: 2024-06-27  22:08:27

Most notably, there was a huge jump in pve-qemu-kvm:amd64 (8.1.5-6, 9.0.0-3), so I guess the problem lies there, or possibly in the updated storage drivers (libpve-storage-perl:amd64 (8.2.2, 8.2.3)).
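
To see which versions are actually installed, something like this should do (a sketch):

Code:
pveversion -v | grep -E 'qemu|kernel'
dpkg -l pve-qemu-kvm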


This struck me hard because I use cv4pve-snapshot, which snapshots every hour; I had to disable that for now. However, the problem can be reproduced via the GUI as well.
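
The GUI reproduction should be equivalent to a plain CLI snapshot without RAM, roughly (a sketch; the snapshot name is arbitrary):

Code:
qm snapshot 601 test          # no --vmstate, i.e. "Include RAM" unchecked
qm delsnapshot 601 test       # clean up afterwards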
 
Maybe roll back to the previous pve-qemu-kvm release and see if that works better?

If it helps, this approach works for staying on the v8 series of QEMU.
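
One way to do that is an apt pin along these lines (a sketch; the file name under /etc/apt/preferences.d/ is arbitrary):

Code:
Explanation: keep pve-qemu-kvm on the 8.1 series
Package: pve-qemu-kvm
Pin: version 8.1.*
Pin-Priority: 1001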
 
Yes, this sounds related to the upgrade from QEMU 8.1 to 9.0, as that introduced graph locking and the assertion you hit.
But since snapshots work in general for us, I'd think this needs some extra trigger that we do not have in our test lab or production loads.
So can you please also post the configuration of an affected VM, e.g. with qm config VMID?
 
I can confirm: rolling back via apt-get install pve-qemu-kvm:amd64=8.1.5-6 plus a reboot made the problem go away.
But after that, upgrading to 8.2.2-1 and rebooting introduced the problem again. So it is not only 9.x, but already 8.2.x. In my case, the original jump was straight from 8.1.5-6 to 9.0.0-3 without an intermediate step, so I did not notice this earlier.

FWIW, here is a config of one of the affected VMs:

Code:
agent: 1,freeze-fs-on-backup=0
balloon: 0
bios: ovmf
boot: order=virtio0
cores: 4
cpu: custom-mine
description: # Docker VM
efidisk0: local-zfs:vm-601-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: local-zfs:vm-601-cloudinit,media=cdrom
ide2: none,media=cdrom
machine: q35
memory: 10240
meta: creation-qemu=7.1.0,ctime=1673135587
name: docker
net0: virtio=C9:A9:9C:99:99:99,bridge=vmbr0,firewall=1,tag=10
numa: 0
onboot: 1
ostype: l26
parent: xxxx
protection: 1
scsihw: virtio-scsi-single
smbios1: uuid=13da4be1-0166-4ab7-a392-d555c7d07711
sockets: 1
tags: lan
tpmstate0: local-zfs:vm-601-disk-3,size=4M,version=v2.0
vga: virtio
virtio0: local-zfs:vm-601-disk-1,discard=on,iothread=1,size=112G
virtio1: local-zfs:vm-601-disk-2,discard=on,iothread=1,size=4G
vmgenid: 7ec935a5-847b-4949-8df2-dd58f2fe6cce

P.S. @justinclift: Thanks for the pinning tip. That keeps things working for the time being if you pin 8.1.*.
 
Do you use VirtIO Block for the disks of all affected VMs? I.e., does this also happen with a VM that uses (VirtIO) SCSI?

P.S.: You should be able to use apt-mark hold pve-qemu-kvm as a slightly simpler way to hold back a package from upgrading. apt-mark unhold <pkgs> reverses this, and apt-mark showhold shows all currently held-back packages.
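
For example:

Code:
apt-mark hold pve-qemu-kvm       # keep the package from being upgraded
apt-mark showhold                # list held packages
apt-mark unhold pve-qemu-kvm     # release the hold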
 
Right. As soon as I detached the disk and reattached it via SCSI instead of VirtIO, the problem went away. I have most of my VMs on VirtIO Block disks because of reportedly higher performance. I did not notice this earlier, because the "controller" still shows as VirtIO SCSI (single) in the VM view; you can only see the difference in the hardware section.
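
For reference, the detach/reattach can also be done on the CLI, roughly like this (a sketch with the disk and VMID from my config above; the detached disk first shows up as unused0, and the guest may need adjustments if it references /dev/vda directly):

Code:
# with the VM shut down
qm set 601 --delete virtio0
qm set 601 --scsi0 local-zfs:vm-601-disk-1,discard=on,iothread=1
qm set 601 --boot order=scsi0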

And yes, apt-mark hold works fine, too. Plus, you can see which version it would be updated to if it were not held back.
 
Thanks for your feedback. It seems like VirtIO Block has some issues with (non-live) snapshots of VMs using the q35 machine type and OVMF (this part was a red herring; it's mostly snapshots without VM state), at least that's how closely I could narrow it down after I could not reproduce this on some existing VMs. We'll take a look.

meyergru said:
I have most of my VMs with virtio disks because of reportedly higher performance. I did not see this earlier, because the "controller" is still Virtio SCSI (single) in the VM view. You can only see the difference in the hardware section.
Yes, VirtIO provides better performance, but VirtIO Block (the older approach) and SCSI on a VirtIO SCSI controller both provide very good performance, with the SCSI variant being the slightly better choice most of the time. Nonetheless, failing snapshots are clearly a bug, but using SCSI can be a good workaround, at least for Linux-based VMs, which are relatively flexible when it comes to hardware changes.
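
The difference is also visible in the VM config itself: the bus is determined by the disk key, not by the scsihw line. Roughly (lines based on the config posted above):

Code:
scsihw: virtio-scsi-single                                        # controller shown in the GUI in both cases
virtio0: local-zfs:vm-601-disk-1,discard=on,iothread=1,size=112G  # VirtIO Block (affected)
scsi0: local-zfs:vm-601-disk-1,discard=on,iothread=1,size=112G    # same disk attached via SCSI instead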
 
I am pretty sure that the problem is not limited to q35, as I have another affected machine with this config:

Code:
agent: 0
bios: ovmf
boot: order=virtio0
cores: 4
cpu: host
description: # Docker VM
efidisk0: local-lvm:vm-601-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
ide2: none,media=cdrom
memory: 4096
meta: creation-qemu=7.1.0,ctime=1673135587
name: docker
net0: virtio=C9:A9:9C:99:99:97,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
parent: autohourly240628010115
protection: 1
scsihw: virtio-scsi-single
smbios1: uuid=c5d8417e-db17-4720-a879-8e385c163663
sockets: 1
tags: lan
vga: virtio
virtio0: local-lvm:vm-601-disk-1,discard=on,iothread=1,size=64G
vmgenid: 6a287cab-6e6d-4565-a6e4-517347eb43e5
 
Yes, that was a red herring from my initial testing; I already edited my previous reply (probably at the same moment you posted this ^^).

FWICT, it's now mostly important to have VirtIO Block as the disk bus and to skip saving the VM state (i.e. uncheck the "Include RAM" checkbox) when taking the snapshot to trigger this.
 
