I have just updated some PVE instances to the newest kernel 6.8.8-2-pve and observe strange new behaviour which I cannot quite put my finger on:
When I snapshot my VMs, the snapshot is created, but afterwards some of them stay shut down. Some VMs do this and some do not, and I cannot see any notable difference between them.
The problem appeared on both LVM- and ZFS-based PVE instances right after I rebooted them into kernel 6.8.8-2-pve yesterday. My first guess was that the kernel was the culprit, but pinning neither 6.8.4-3-pve nor 6.8.8-1-pve helped.
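In case anyone wants to reproduce the pinning, I did it with proxmox-boot-tool (the version string is just an example; proxmox-boot-tool kernel list shows what is installed):
Code:
proxmox-boot-tool kernel pin 6.8.4-3-pve
reboot
# to undo the pin later:
proxmox-boot-tool kernel unpin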
In the case of LVM, there was this task log:
Code:
snapshotting 'drive-virtio0' (local-lvm:vm-601-disk-1)
WARNING: You have not turned on protection against thin pools running out of space.
WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
Consider pruning pve VG archive with more then 4500 MiB in 25953 files (check archiving is needed in lvm.conf).
Consider pruning pve VG archive with more then 4501 MiB in 25954 files (check archiving is needed in lvm.conf).
Logical volume "snap_vm-601-disk-1_xxx" created.
WARNING: Sum of all thin volume sizes (<18.53 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (<931.01 GiB).
snapshotting 'drive-efidisk0' (local-lvm:vm-601-disk-0)
WARNING: You have not turned on protection against thin pools running out of space.
WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
Consider pruning pve VG archive with more then 4501 MiB in 25955 files (check archiving is needed in lvm.conf).
Consider pruning pve VG archive with more then 4501 MiB in 25956 files (check archiving is needed in lvm.conf).
Logical volume "snap_vm-601-disk-0_xxx" created.
WARNING: Sum of all thin volume sizes (<18.53 TiB) exceeds the size of thin pool pve/data and the size of whole volume group (<931.01 GiB).
VM 601 qmp command 'savevm-end' failed - client closed connection
TASK OK
The warnings can most likely be ignored; there is enough free space for these snapshots.
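For anyone who wants to double-check their own pool, something like this shows the actual usage (pool name pve/data as in the output above; data_percent and metadata_percent should be well below 100):
Code:
lvs -o lv_name,lv_size,data_percent,metadata_percent pve/data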
For the same problem on a ZFS-based PVE:
Code:
snapshotting 'drive-virtio0' (local-zfs:vm-601-disk-1)
snapshotting 'drive-virtio1' (local-zfs:vm-601-disk-2)
snapshotting 'drive-efidisk0' (local-zfs:vm-601-disk-0)
snapshotting 'drive-tpmstate0' (local-zfs:vm-601-disk-3)
VM 601 qmp command 'savevm-end' failed - client closed connection
guest-fsfreeze-thaw problems - VM 601 not running
TASK OK
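The ZFS snapshots themselves do get created, by the way; they can be listed like this (VMID 601 is from my setup, adjust as needed):
Code:
zfs list -t snapshot | grep vm-601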
Because of the guest-fsfreeze-thaw line, I suspected the problem was with freeze/thaw and the QEMU guest agent, but it persists even if I disable the guest agent. I found no other obvious correlation, such as machine type, that could explain why some VMs do not restart while others do.
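For reference, this is roughly how I disabled the agent for a test (601 is just my example VMID); as far as I know, the change only fully takes effect after a stop/start cycle:
Code:
qm set 601 --agent enabled=0
qm stop 601
qm start 601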
I found this in the PVE logs:
Code:
Jun 28 01:51:44 ironside pvedaemon[20518]: <root@pam> snapshot VM 601: test
Jun 28 01:51:44 ironside pvedaemon[14091]: <root@pam> starting task UPID:ironside:00005026:0004E403:667DFB10:qmsnapshot:601:root@pam:
Jun 28 01:51:44 ironside dmeventd[537]: No longer monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside dmeventd[537]: Monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside dmeventd[537]: No longer monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside dmeventd[537]: Monitoring thin pool pve-data-tpool.
Jun 28 01:51:44 ironside QEMU[16755]: kvm: ../block/graph-lock.c:260: bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed.
Jun 28 01:51:44 ironside pvedaemon[20518]: VM 601 qmp command failed - VM 601 qmp command 'savevm-end' failed - client closed connection
Jun 28 01:51:44 ironside pvedaemon[20518]: VM 601 qmp command 'savevm-end' failed - client closed connection
Jun 28 01:51:44 ironside kernel: fwbr601i0: port 2(tap601i0) entered disabled state
Jun 28 01:51:44 ironside kernel: tap601i0 (unregistering): left allmulticast mode
Jun 28 01:51:44 ironside kernel: fwbr601i0: port 2(tap601i0) entered disabled state
Jun 28 01:51:45 ironside systemd[1]: 601.scope: Deactivated successfully.
Jun 28 01:51:45 ironside systemd[1]: 601.scope: Consumed 1min 13.894s CPU time.
Jun 28 01:51:45 ironside pvedaemon[14091]: <root@pam> end task UPID:ironside:00005026:0004E403:667DFB10:qmsnapshot:601:root@pam: OK
Jun 28 01:51:45 ironside qmeventd[20548]: Starting cleanup for 601
Jun 28 01:51:45 ironside kernel: fwbr601i0: port 1(fwln601i0) entered disabled state
Jun 28 01:51:45 ironside kernel: vmbr0: port 2(fwpr601p0) entered disabled state
Jun 28 01:51:45 ironside kernel: fwln601i0 (unregistering): left allmulticast mode
Jun 28 01:51:45 ironside kernel: fwln601i0 (unregistering): left promiscuous mode
Jun 28 01:51:45 ironside kernel: fwbr601i0: port 1(fwln601i0) entered disabled state
Jun 28 01:51:45 ironside kernel: fwpr601p0 (unregistering): left allmulticast mode
Jun 28 01:51:45 ironside kernel: fwpr601p0 (unregistering): left promiscuous mode
Jun 28 01:51:45 ironside kernel: vmbr0: port 2(fwpr601p0) entered disabled state
Jun 28 01:51:45 ironside qmeventd[20548]: Finished cleanup for 601
This line could be the culprit; a failed assertion aborts the QEMU process, which would explain both the 'client closed connection' errors and the VM ending up shut down:
Jun 28 01:51:44 ironside QEMU[16755]: kvm: ../block/graph-lock.c:260: bdrv_graph_rdlock_main_loop: Assertion `!qemu_in_coroutine()' failed.
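One more data point: as far as I understand, a running VM keeps the QEMU binary it was started with until it is stopped and started again, so you can check what a still-running VM is actually on via the monitor (601 again as example VMID):
Code:
qm monitor 601
qm> info version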
I think some of the other packages that were updated may be responsible; my apt history shows:
Code:
Start-Date: 2024-06-19 07:59:21
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade --with-new-pkgs --auto-remove
Install: proxmox-kernel-6.8.8-1-pve-signed:amd64 (6.8.8-1, automatic)
Upgrade: libpve-rs-perl:amd64 (0.8.8, 0.8.9), pve-firmware:amd64 (3.11-1, 3.12-1), zfs-zed:amd64 (2.2.3-pve2, 2.2.4-pve1), zfs-initramfs:amd64 (2.2.3-pve2, 2.2.4-pve1), spl:amd64 (2.2.3-pve2, 2.2.4-pve1), libnvpair3linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-cluster-api-perl:amd64 (8.0.6, 8.0.7), pve-ha-manager:amd64 (4.0.4, 4.0.5), libuutil3linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-storage-perl:amd64 (8.2.1, 8.2.2), libzpool5linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-guest-common-perl:amd64 (5.1.2, 5.1.3), proxmox-kernel-6.8:amd64 (6.8.4-3, 6.8.8-1), pve-cluster:amd64 (8.0.6, 8.0.7), proxmox-backup-file-restore:amd64 (3.2.3-1, 3.2.4-1), pve-esxi-import-tools:amd64 (0.7.0, 0.7.1), pve-container:amd64 (5.1.10, 5.1.12), proxmox-backup-client:amd64 (3.2.3-1, 3.2.4-1), pve-manager:amd64 (8.2.2, 8.2.4), libpve-notify-perl:amd64 (8.0.6, 8.0.7), libzfs4linux:amd64 (2.2.3-pve2, 2.2.4-pve1), zfsutils-linux:amd64 (2.2.3-pve2, 2.2.4-pve1), libpve-cluster-perl:amd64 (8.0.6, 8.0.7)
End-Date: 2024-06-19 08:05:31
Start-Date: 2024-06-20 06:08:14
Commandline: /usr/bin/unattended-upgrade
Remove: proxmox-kernel-6.8.4-2-pve-signed:amd64 (6.8.4-2)
End-Date: 2024-06-20 06:08:19
Start-Date: 2024-06-26 13:55:03
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade --with-new-pkgs --auto-remove
Upgrade: libpve-storage-perl:amd64 (8.2.2, 8.2.3)
End-Date: 2024-06-26 13:55:24
Start-Date: 2024-06-27 22:05:15
Commandline: /usr/bin/apt-get -y -o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold upgrade --with-new-pkgs --auto-remove
Install: proxmox-kernel-6.8.8-2-pve-signed:amd64 (6.8.8-2, automatic)
Upgrade: pve-qemu-kvm:amd64 (8.1.5-6, 9.0.0-3), proxmox-kernel-6.8:amd64 (6.8.8-1, 6.8.8-2)
End-Date: 2024-06-27 22:08:27
Most notably, there was a huge jump in pve-qemu-kvm:amd64 (8.1.5-6, 9.0.0-3), so I guess the problem lies there, or possibly in the updated storage layer (libpve-storage-perl:amd64 (8.2.2, 8.2.3)).
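If pve-qemu-kvm really is the culprit, a temporary downgrade might serve as a workaround; I have not verified this, and it only works if the old version is still available in your repository or apt cache:
Code:
apt install pve-qemu-kvm=8.1.5-6
apt-mark hold pve-qemu-kvm
# running VMs only pick up the downgraded binary after a stop/start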
This hit me hard because I use cv4pve-snapshot, which snapshots every hour; I had to disable it for now. However, the problem can also be reproduced via the GUI, or directly from the CLI as shown below.
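A minimal reproduction on an affected VM (601 in my case):
Code:
qm snapshot 601 test
qm status 601
# affected VMs report: status: stopped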