Proxmox VE 9.0 BETA released!

First of all, thanks for this awesome beta release — great work!

We've noticed one major issue with this version: VM boot times (using OVMF) have become extremely slow. It now takes about 1 minute and 30 seconds from pressing "Start" to seeing any console output or UEFI POST completion.

This behavior did not occur on PVE 8.x — VM startup was almost instant in comparison.

During this delay, CPU usage is consistently high (around 75–80%, regardless of how many cores are assigned), and RAM usage hovers at about 100 MB.

Below are two screenshots showing the VM state during this stall:

2025-07-25_22-25_1.png

2025-07-25_22-25_3.png

And here's the config of one of the affected VMs — though we’ve confirmed the issue occurs on all VMs across all nodes in the cluster:

Code:
agent: 1
bios: ovmf
boot: order=virtio0
cores: 4
cpu: x86-64-v3
efidisk0: ceph-nvme01:vm-117-disk-1,efitype=4m,pre-enrolled-keys=1,size=528K
hotplug: disk,network
machine: q35
memory: 4096
name: <redacted>
net0: virtio=F2:AC:38:7B:9A:A6,bridge=vmbr0,tag=100
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=8b8a56a6-0684-4b2b-a50a-69b23b789235
sockets: 1
tablet: 0
tags: <redacted>
vga: virtio
virtio0: ceph-nvme01:vm-117-disk-0,discard=on,iothread=1,size=20G
vmgenid: 3890b812-a1fc-4331-a4c0-49d88c7632a6

Some more information:

Bash:
root@node02 ~ # pveversion
pve-manager/9.0.0~11/c474e5a0b4bd391d (running kernel: 6.14.8-2-pve)
root@node02 ~ # qemu-system-x86_64 --version
QEMU emulator version 10.0.2 (pve-qemu-kvm_10.0.2-4)
Copyright (c) 2003-2025 Fabrice Bellard and the QEMU Project developers

Has anyone else seen this behavior? Could this be a known issue in the beta?
 
Support for snapshots as volume chains on Directory/NFS/CIFS storages (technology preview).
Am I correct in assuming that this works like the LVM implementation, i.e. that each snapshot taken via QCOW2's external snapshot mechanism creates a new QCOW2 file and builds up a chain?
I took a snapshot of a VM with a QCOW2 disk on NFS storage that was freshly mounted with "Allow Snapshot as volume chain" enabled, but the snapshot still appears to be created as an internal snapshot, just like in the previous version.
Is this not available for NFS in 9.0 Beta1?

Code:
nfs: NFS
    export /NFS
    path /mnt/pve/NFS
    server 192.168.1.183
    content images
    prune-backups keep-all=1
    snapshot-as-volume-chain 1


Code:
# qemu-img info /mnt/pve/NFS/images/101/vm-101-disk-0.qcow2
image: /mnt/pve/NFS/images/101/vm-101-disk-0.qcow2
file format: qcow2
virtual size: 32 GiB (34359738368 bytes)
disk size: 5.73 MiB
cluster_size: 65536
Snapshot list:
ID TAG VM_SIZE DATE VM_CLOCK ICOUNT
1 TEST 0 B 2025-07-26 10:34:06 0010:40:02.217 --
Format specific information:
compat: 1.1
compression type: zlib
lazy refcounts: false
refcount bits: 16
corrupt: false
extended l2: false
Child node '/file':
filename: /mnt/pve/NFS/images/101/vm-101-disk-0.qcow2
protocol type: file
file length: 32 GiB (3436530944
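
For comparison, my understanding is that with a working volume chain the snapshot would show up as a separate QCOW2 overlay referencing its parent, rather than as an entry in the internal snapshot list. That should be visible with something like the following (file name taken from above, expected output is my assumption):

Code:
# inspect the whole backing chain of the active volume
qemu-img info --backing-chain /mnt/pve/NFS/images/101/vm-101-disk-0.qcow2
# with external/chained snapshots I would expect more than one image in the
# output, each overlay listing a "backing file:" line pointing at its parent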
 
Hi,
We've noticed one major issue with this version: VM boot times (using OVMF) have become extremely slow. It now takes about 1 minute and 30 seconds from pressing "Start" to seeing any console output or UEFI POST completion.

This behavior did not occur on PVE 8.x — VM startup was almost instant in comparison.
It is a known issue when the proper cache mode is not used, but Proxmox VE 9 should already set the correct cache mode for the EFI disk: https://git.proxmox.com/?p=pve-stor...3cb0c3398c9fc19d305d9c36a74a4797715d009e#l564

efidisk0: ceph-nvme01:vm-117-disk-1,efitype=4m,pre-enrolled-keys=1,size=528K
What do you get when you query the cache mode for the image?
Code:
rbd config image get vm-117-disk-1 rbd_cache_policy

Is this a PVE-managed or an external cluster? Do you maybe override the rbd_cache_policy in your Ceph client configuration or somewhere?
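
You can also check which cache mode QEMU is actually started with by looking at the generated command line, for example:

Bash:
qm showcmd 117 --pretty | grep -i cache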
 
What do you get when you query the cache mode for the image?
Code:
rbd config image get vm-117-disk-1 rbd_cache_policy

Is this a PVE-managed or an external cluster? Do you maybe override the rbd_cache_policy in your Ceph client configuration or somewhere?

Thanks for the quick follow-up!

Here's the output I get:

Bash:
root@node02 ~ # rbd --pool ceph-nvme01 config image get vm-117-disk-1 rbd_cache_policy
rbd: rbd_cache_policy is not set

I also checked the Ceph client configuration:

Bash:
root@node02 ~ # grep rbd_cache_policy /etc/ceph/ceph.conf /etc/pve/ceph.conf
root@node02 ~ #

So there's no global override set either.

Just to clarify: this Ceph cluster was created and is fully managed by Proxmox VE (PVE-managed), no external cluster or manual Ceph deployment.

Could using KRBD be the issue?

1753574478400.png

Would you recommend manually setting the rbd_cache_policy to writeback for the EFI disk, or is there a more appropriate fix in PVE 9 for this scenario?
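
In case manually setting it turns out to be the recommended workaround, I assume it would be something along these lines (mirroring the get command above), but I'd rather wait for your confirmation:

Bash:
rbd --pool ceph-nvme01 config image set vm-117-disk-1 rbd_cache_policy writeback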
 
I have a few small issues with PVE 9: when I place a VMDK (descriptor plus flat VMDK, moved over from VMware for import) on an NFS datastore, I can no longer create additional disks on that same NFS datastore; other datastores keep working. Converting the VMDK (the two files: descriptor and flat VMDK) to qcow2 on the same datastore fails with an error, while moving it to a different datastore works. However, even after deleting the original VMDK, only the descriptor file is removed and the flat VMDK is left lying around, which in turn causes another error when trying to move the qcow2 back to the original NFS datastore. Once I delete the leftover flat VMDK manually, I can move the qcow2 back to the original datastore and create additional disks there again.

The LVM snapshot-as-volume-chain works well. However, IDE TRIM / SCSI UNMAP / NVMe blkdiscard does not work within the guest, and Proxmox apparently does not issue the discards itself before deleting a snapshot LV. I tried this on iSCSI and NVMe/TCP. As a result, when I create a snapshot on LVM, write some data, and delete the snapshot again, the physical space used on the underlying storage (in my case a NetApp AFF A150) blows up to twice the data written: once for the snapshot LV and once for the merge back into the original parent volume when the snapshot LV is deleted. It would be nice if Proxmox ran blkdiscard on the snapshot LV before deleting it, or if there were a way on LVM to blkdiscard all unused space (I searched for one but could not find it). Many enterprise customers are looking for a replacement for VMFS on shared block storage (with the ability to do snapshots and thin provisioning). I have a few customers who have run OCFS2 without issues for many years (apart from the recent incompatibility with liburing); maybe Proxmox should officially support OCFS2 or another cluster filesystem to fill that gap.
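
To illustrate the discard point above, this is roughly what I'd like Proxmox to do internally before removing a snapshot LV (hypothetical VG/LV names):

Code:
# discard the snapshot LV's blocks on the underlying storage first...
blkdiscard /dev/vg_shared/snap_vm-100-disk-0_test
# ...then remove the LV as usual
lvremove -y vg_shared/snap_vm-100-disk-0_test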

I also found an OVF from the ONTAP simulator that has four disks; when I import it into Proxmox, only two disks are imported. I tried to reproduce the issue with an OVF with four disks that I built myself, but was unable to. If you want access to that OVF in order to reproduce it, please drop me an email at thomas at glanzmann dot de. This last issue is also present on PVE 8.
 
The LVM snapshot-as-volume-chain works well. However, IDE TRIM / SCSI UNMAP / NVMe blkdiscard does not work within the guest, and Proxmox apparently does not issue the discards itself before deleting a snapshot LV. I tried this on iSCSI and NVMe/TCP. As a result, when I create a snapshot on LVM, write some data, and delete the snapshot again, the physical space used on the underlying storage (in my case a NetApp AFF A150) blows up to twice the data written: once for the snapshot LV and once for the merge back into the original parent volume when the snapshot LV is deleted. It would be nice if Proxmox ran blkdiscard on the snapshot LV before deleting it, or if there were a way on LVM to blkdiscard all unused space (I searched for one but could not find it).
Can you try adding issue_discards=1 to lvm.conf? (I'm sure it takes effect on lvremove; I don't know whether it's also needed for discards sent by the guest OS.)
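
That is, something like this in /etc/lvm/lvm.conf:

Code:
devices {
    # pass discards down to the PVs when LV space is released (e.g. on lvremove)
    issue_discards = 1
}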

Many enterprise customers are looking for a replacement for VMFS on shared block storage (with the ability to do snapshots and thin provisioning). I have a few customers who have run OCFS2 without issues for many years (apart from the recent incompatibility with liburing); maybe Proxmox should officially support OCFS2 or another cluster filesystem to fill that gap.
Do you get good performance with ocfs2? Each time I have tried it, performance was pretty bad without full preallocation, and internal snapshots hurt performance as well. (The new external snapshots should help here too, since we can now keep the metadata preallocation.)
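
For reference, by "full preallocation" I mean creating the qcow2 like this (path is just an example):

Code:
qemu-img create -f qcow2 -o preallocation=full /mnt/pve/ocfs2/images/100/vm-100-disk-0.qcow2 32G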
 
Another issue I've encountered since upgrading to the PVE 9 Beta is that live migrations frequently, in fact almost always, fail with the following error:

ERROR: online migrate failure - unable to parse migration status 'device' - aborting

This problem did not occur under PVE 8 on the same cluster configuration. Repeated attempts (sometimes up to a dozen retries) eventually succeed, but the behavior is clearly unstable and inconsistent compared to the previous version.

For reference, here is a relevant migration log excerpt:

Code:
task started by HA resource agent
2025-07-27 20:12:23 use dedicated network address for sending migration traffic (<redacted>)
2025-07-27 20:12:23 starting migration of VM 119 to node 'node01' (<redacted>)
2025-07-27 20:12:23 starting VM 119 on remote node 'node01'
2025-07-27 20:12:26 start remote tunnel
2025-07-27 20:12:27 ssh tunnel ver 1
2025-07-27 20:12:27 starting online/live migration on unix:/run/qemu-server/119.migrate
2025-07-27 20:12:27 set migration capabilities
2025-07-27 20:12:27 migration downtime limit: 100 ms
2025-07-27 20:12:27 migration cachesize: 1.0 GiB
2025-07-27 20:12:27 set migration parameters
2025-07-27 20:12:27 start migrate command to unix:/run/qemu-server/119.migrate
2025-07-27 20:12:28 migration active, transferred 683.0 MiB of 6.0 GiB VM-state, 1.3 GiB/s
2025-07-27 20:12:28 xbzrle: send updates to 17 pages in 0.0 B encoded memory, cache-miss 1.36%
2025-07-27 20:12:29 migration active, transferred 1.4 GiB of 6.0 GiB VM-state, 958.8 MiB/s
2025-07-27 20:12:29 xbzrle: send updates to 17 pages in 0.0 B encoded memory, cache-miss 1.36%
2025-07-27 20:12:30 migration active, transferred 2.2 GiB of 6.0 GiB VM-state, 871.4 MiB/s
2025-07-27 20:12:30 xbzrle: send updates to 17 pages in 0.0 B encoded memory, cache-miss 1.36%
2025-07-27 20:12:31 ERROR: online migrate failure - unable to parse migration status 'device' - aborting
2025-07-27 20:12:31 aborting phase 2 - cleanup resources
2025-07-27 20:12:31 migrate_cancel
2025-07-27 20:12:33 ERROR: migration finished with problems (duration 00:00:10)
TASK ERROR: migration problems

Here is the journalctl log of another try (note: frr-vrf is not in use):

Code:
Jul 27 20:17:40 node01 pmxcfs[1648]: [status] notice: received log
Jul 27 20:17:41 node01 pmxcfs[1648]: [status] notice: received log
Jul 27 20:17:52 node01 pmxcfs[1648]: [status] notice: received log
Jul 27 20:17:52 node01 sshd-session[26070]: Accepted publickey for root from <redacted> port 60732 ssh2: RSA SHA256:<redacted>
Jul 27 20:17:52 node01 sshd-session[26070]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 27 20:17:52 node01 systemd-logind[1259]: New session 114 of user root.
Jul 27 20:17:52 node01 systemd[1]: Started session-114.scope - Session 114 of User root.
Jul 27 20:17:52 node01 sshd-session[26078]: Received disconnect from <redacted> port 60732:11: disconnected by user
Jul 27 20:17:52 node01 sshd-session[26078]: Disconnected from user root <redacted> port 60732
Jul 27 20:17:52 node01 sshd-session[26070]: pam_unix(sshd:session): session closed for user root
Jul 27 20:17:52 node01 systemd-logind[1259]: Session 114 logged out. Waiting for processes to exit.
Jul 27 20:17:52 node01 systemd[1]: session-114.scope: Deactivated successfully.
Jul 27 20:17:52 node01 systemd-logind[1259]: Removed session 114.
Jul 27 20:17:52 node01 sshd-session[26082]: Accepted publickey for root from <redacted> port 60748 ssh2: RSA SHA256:<redacted>
Jul 27 20:17:52 node01 sshd-session[26082]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 27 20:17:52 node01 systemd-logind[1259]: New session 115 of user root.
Jul 27 20:17:52 node01 systemd[1]: Started session-115.scope - Session 115 of User root.
Jul 27 20:17:52 node01 sshd-session[26090]: Received disconnect from <redacted> port 60748:11: disconnected by user
Jul 27 20:17:52 node01 sshd-session[26090]: Disconnected from user root <redacted> port 60748
Jul 27 20:17:52 node01 sshd-session[26082]: pam_unix(sshd:session): session closed for user root
Jul 27 20:17:52 node01 systemd[1]: session-115.scope: Deactivated successfully.
Jul 27 20:17:52 node01 systemd-logind[1259]: Session 115 logged out. Waiting for processes to exit.
Jul 27 20:17:52 node01 systemd-logind[1259]: Removed session 115.
Jul 27 20:17:52 node01 sshd-session[26093]: Accepted publickey for root from <redacted> port 60764 ssh2: RSA SHA256:<redacted>
Jul 27 20:17:52 node01 sshd-session[26093]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 27 20:17:52 node01 systemd-logind[1259]: New session 116 of user root.
Jul 27 20:17:52 node01 systemd[1]: Started session-116.scope - Session 116 of User root.
Jul 27 20:17:52 node01 sshd-session[26106]: Received disconnect from <redacted> port 60764:11: disconnected by user
Jul 27 20:17:52 node01 sshd-session[26106]: Disconnected from user root <redacted> port 60764
Jul 27 20:17:52 node01 sshd-session[26093]: pam_unix(sshd:session): session closed for user root
Jul 27 20:17:52 node01 systemd-logind[1259]: Session 116 logged out. Waiting for processes to exit.
Jul 27 20:17:52 node01 systemd[1]: session-116.scope: Deactivated successfully.
Jul 27 20:17:52 node01 systemd-logind[1259]: Removed session 116.
Jul 27 20:17:52 node01 sshd-session[26109]: Accepted publickey for root from <redacted> port 60768 ssh2: RSA SHA256:<redacted>
Jul 27 20:17:52 node01 sshd-session[26109]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 27 20:17:52 node01 systemd-logind[1259]: New session 117 of user root.
Jul 27 20:17:52 node01 systemd[1]: Started session-117.scope - Session 117 of User root.
Jul 27 20:17:53 node01 qm[26120]: start VM 119: UPID:node01:00006608:00013F20:68866D51:qmstart:119:root@pam:
Jul 27 20:17:53 node01 qm[26118]: <root@pam> starting task UPID:node01:00006608:00013F20:68866D51:qmstart:119:root@pam:
Jul 27 20:17:53 node01 kernel:  rbd0: p1 p2 p3
Jul 27 20:17:53 node01 kernel: rbd: rbd0: capacity 21474836480 features 0x3d
Jul 27 20:17:53 node01 kernel: rbd: rbd1: capacity 134217728000 features 0x3d
Jul 27 20:17:54 node01 kernel: rbd: rbd2: capacity 193273528320 features 0x3d
Jul 27 20:17:54 node01 kernel: rbd: rbd7: capacity 1048576 features 0x3d
Jul 27 20:17:54 node01 systemd[1]: Started 119.scope.
Jul 27 20:17:54 node01 zebra[1392]: libyang Invalid boolean value "". (/frr-vrf:lib/vrf/state/active)
Jul 27 20:17:54 node01 zebra[1392]: libyang Invalid type uint32 empty value. (/frr-vrf:lib/vrf/state/id)
Jul 27 20:17:54 node01 kernel: tap119i0: entered promiscuous mode
Jul 27 20:17:54 node01 ovs-vsctl[26441]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port tap119i0
Jul 27 20:17:54 node01 ovs-vsctl[26441]: ovs|00002|db_ctl_base|ERR|no port named tap119i0
Jul 27 20:17:54 node01 ovs-vsctl[26443]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port fwln119i0
Jul 27 20:17:54 node01 ovs-vsctl[26443]: ovs|00002|db_ctl_base|ERR|no port named fwln119i0
Jul 27 20:17:54 node01 ovs-vsctl[26444]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl -- add-port vmbr0 tap119i0 tag=100 -- set Interface tap119i0 mtu_request=1500
Jul 27 20:17:55 node01 qm[26120]: VM 119 started with PID 26424.
Jul 27 20:17:55 node01 qm[26118]: <root@pam> end task UPID:node01:00006608:00013F20:68866D51:qmstart:119:root@pam: OK
Jul 27 20:17:55 node01 sshd-session[26117]: Received disconnect from <redacted> port 60768:11: disconnected by user
Jul 27 20:17:55 node01 sshd-session[26117]: Disconnected from user root <redacted> port 60768
Jul 27 20:17:55 node01 sshd-session[26109]: pam_unix(sshd:session): session closed for user root
Jul 27 20:17:55 node01 systemd[1]: session-117.scope: Deactivated successfully.
Jul 27 20:17:55 node01 systemd[1]: session-117.scope: Consumed 1.113s CPU time, 134.3M memory peak.
Jul 27 20:17:55 node01 systemd-logind[1259]: Session 117 logged out. Waiting for processes to exit.
Jul 27 20:17:55 node01 systemd-logind[1259]: Removed session 117.
Jul 27 20:17:55 node01 sshd-session[26472]: Accepted publickey for root from <redacted> port 60782 ssh2: RSA SHA256:<redacted>
Jul 27 20:17:55 node01 sshd-session[26472]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 27 20:17:55 node01 systemd-logind[1259]: New session 118 of user root.
Jul 27 20:17:55 node01 systemd[1]: Started session-118.scope - Session 118 of User root.
Jul 27 20:18:00 node01 sshd-session[26557]: Accepted publickey for root from <redacted> port 45590 ssh2: RSA SHA256:<redacted>
Jul 27 20:18:00 node01 sshd-session[26557]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 27 20:18:00 node01 systemd-logind[1259]: New session 119 of user root.
Jul 27 20:18:00 node01 systemd[1]: Started session-119.scope - Session 119 of User root.
Jul 27 20:18:01 node01 qm[26569]: stop VM 119: UPID:node01:000067C9:00014218:68866D59:qmstop:119:root@pam:
Jul 27 20:18:01 node01 qm[26566]: <root@pam> starting task UPID:node01:000067C9:00014218:68866D59:qmstop:119:root@pam:
Jul 27 20:18:01 node01 QEMU[26424]: kvm: terminating on signal 15 from pid 26569 (task UPID:node01:000067C9:00014218:68866D59:qmstop:119:root@pam:)
Jul 27 20:18:01 node01 kernel:  rbd0: p1 p2 p3
Jul 27 20:18:01 node01 ovs-vsctl[26776]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port fwln119i0
Jul 27 20:18:01 node01 ovs-vsctl[26776]: ovs|00002|db_ctl_base|ERR|no port named fwln119i0
Jul 27 20:18:01 node01 ovs-vsctl[26777]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port tap119i0
Jul 27 20:18:01 node01 qmeventd[1255]: read: Connection reset by peer
Jul 27 20:18:01 node01 systemd[1]: 119.scope: Deactivated successfully.
Jul 27 20:18:01 node01 systemd[1]: 119.scope: Consumed 2.288s CPU time, 3.6G memory peak.
Jul 27 20:18:01 node01 qm[26566]: <root@pam> end task UPID:node01:000067C9:00014218:68866D59:qmstop:119:root@pam: OK
Jul 27 20:18:01 node01 sshd-session[26565]: Received disconnect from <redacted> port 45590:11: disconnected by user
Jul 27 20:18:01 node01 sshd-session[26565]: Disconnected from user root <redacted> port 45590
Jul 27 20:18:01 node01 sshd-session[26557]: pam_unix(sshd:session): session closed for user root
Jul 27 20:18:01 node01 systemd[1]: session-119.scope: Deactivated successfully.
Jul 27 20:18:01 node01 systemd[1]: session-119.scope: Consumed 735ms CPU time, 125.7M memory peak.
Jul 27 20:18:01 node01 systemd-logind[1259]: Session 119 logged out. Waiting for processes to exit.
Jul 27 20:18:01 node01 systemd-logind[1259]: Removed session 119.
Jul 27 20:18:01 node01 sshd-session[26480]: Received disconnect from <redacted> port 60782:11: disconnected by user
Jul 27 20:18:01 node01 sshd-session[26480]: Disconnected from user root <redacted> port 60782
Jul 27 20:18:01 node01 sshd-session[26472]: pam_unix(sshd:session): session closed for user root
Jul 27 20:18:01 node01 systemd-logind[1259]: Session 118 logged out. Waiting for processes to exit.
Jul 27 20:18:01 node01 systemd[1]: session-118.scope: Deactivated successfully.
Jul 27 20:18:01 node01 systemd[1]: session-118.scope: Consumed 4.839s CPU time, 120M memory peak.
Jul 27 20:18:01 node01 systemd-logind[1259]: Removed session 118.
Jul 27 20:18:02 node01 sshd-session[26882]: Accepted publickey for root from <redacted> port 45594 ssh2: RSA SHA256:<redacted>
Jul 27 20:18:02 node01 sshd-session[26882]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 27 20:18:02 node01 systemd-logind[1259]: New session 120 of user root.
Jul 27 20:18:02 node01 systemd[1]: Started session-120.scope - Session 120 of User root.
Jul 27 20:18:02 node01 sshd-session[26890]: Received disconnect from <redacted> port 45594:11: disconnected by user
Jul 27 20:18:02 node01 sshd-session[26890]: Disconnected from user root <redacted> port 45594
Jul 27 20:18:02 node01 sshd-session[26882]: pam_unix(sshd:session): session closed for user root
Jul 27 20:18:02 node01 systemd[1]: session-120.scope: Deactivated successfully.
Jul 27 20:18:02 node01 systemd-logind[1259]: Session 120 logged out. Waiting for processes to exit.
Jul 27 20:18:02 node01 systemd-logind[1259]: Removed session 120.
Jul 27 20:18:02 node01 pmxcfs[1648]: [status] notice: received log
 
However, IDE TRIM / SCSI UNMAP / NVMe blkdiscard does not work within the guest, and Proxmox apparently does not issue the discards itself before deleting a snapshot LV.
FWIW, we only pushed out a fix for the combination of discard and the new QEMU blockdev API (which we use for newly started VMs on PVE 9) fairly recently (Friday late afternoon CEST), with qemu-server version 9.0.8. So you might want to ensure you already have that version and then recheck after starting the guests freshly.
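
You can check the currently installed version with, for example:

Code:
pveversion -v | grep qemu-server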
 
Thanks for the quick follow-up!

Here's the output I get:

Bash:
root@node02 ~ # rbd --pool ceph-nvme01 config image get vm-117-disk-1 rbd_cache_policy
rbd: rbd_cache_policy is not set

I also checked the Ceph client configuration:

Bash:
root@node02 ~ # grep rbd_cache_policy /etc/ceph/ceph.conf /etc/pve/ceph.conf
root@node02 ~ #

So there's no global override set either.

Just to clarify: this Ceph cluster was created and is fully managed by Proxmox VE (PVE-managed), no external cluster or manual Ceph deployment.

Could using KRBD be the issue?
Hi, thanks for testing! I also see a slight increase in boot time on PVE 9 with KRBD, but not (yet) as extreme as 1min30secs. Could you please
  1. open a new thread and mention me (@fweber) there
  2. check whether setting the machine version to 9.2+pve1 (Hardware->Machine, check the Advanced Settings) helps and include your observations in the new thread?
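
Regarding point 2, pinning the machine version should presumably also work from the CLI, e.g. for the VM above:

Code:
qm set 117 --machine pc-q35-9.2+pve1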
 
Hi, thanks for testing! I also see a slight increase in boot time on PVE 9 with KRBD, but not (yet) as extreme as 1min30secs. Could you please
  1. open a new thread and mention me (@fweber) there
  2. check whether setting the machine version to 9.2+pve1 (Hardware->Machine, check the Advanced Settings) helps and include your observations in the new thread?
@Bent Please also mention @fiona in the new thread.

Proposed patch that should fix the issue even with newer machine type: https://lore.proxmox.com/pve-devel/20250728084155.14151-1-f.ebner@proxmox.com/T/#u

Testing with older machine type would still be appreciated too, to verify it's the same issue.
 
Another issue I've encountered since upgrading to the PVE 9 Beta is that live migrations frequently, in fact almost always, fail with the following error:

ERROR: online migrate failure - unable to parse migration status 'device' - aborting

This problem did not occur under PVE 8 on the same cluster configuration. Repeated attempts (sometimes up to a dozen retries) eventually succeed, but the behavior is clearly unstable and inconsistent compared to the previous version.

Please share the VM configuration, i.e. qm config 119. Is there anything noticeable in the system logs of the source node?
 
Please share the VM configuration, i.e. qm config 119. Is there anything noticeable in the system logs of the source node?


@fiona Nothing of interest in the system logs on the source node either. I reproduced the issue once again; here are the logs from node01 (source) and node02 (destination). After dozens of tries it eventually works (potential race condition?):

*This affects multiple VMs, not just VMID 119.

Code:
# vm 119 configuration
agent: 1
bios: ovmf
boot: order=virtio0
cores: 8
cpu: x86-64-v3
efidisk0: ceph-nvme01:vm-119-disk-2,efitype=4m,pre-enrolled-keys=1,size=528K
hotplug: disk,network
machine: q35
memory: 6144
name: <redacted>
net0: virtio=BA:7D:9C:74:69:90,bridge=vmbr0,queues=4,tag=100
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=05c12855-29af-4886-86ca-1dd07a6159ca
sockets: 1
tablet: 0
tags: ha
vga: virtio
virtio0: ceph-nvme01:vm-119-disk-1,cache=writeback,discard=on,iothread=1,size=20G
virtio1: ceph-nvme01:vm-119-disk-3,cache=writeback,discard=on,iothread=1,size=125G
virtio2: ceph-nvme01:vm-119-disk-0,cache=writeback,discard=on,iothread=1,size=180G
vmgenid: cde10473-eeaf-48ef-9aaa-fba728be9cc2

Code:
# node01 = source
Jul 28 14:00:00 node01 pvedaemon[2660]: <<redacted>@pve> starting task UPID:node01:0008A144:00627C76:68876640:hamigrate:119:<redacted>@pve:
Jul 28 14:00:01 node01 pvedaemon[2660]: <<redacted>@pve> end task UPID:node01:0008A144:00627C76:68876640:hamigrate:119:<redacted>@pve: OK
Jul 28 14:00:10 node01 pve-ha-lrm[565657]: <root@pam> starting task UPID:node01:0008A19B:00628060:6887664A:qmigrate:119:root@pam:
Jul 28 14:00:12 node01 pmxcfs[1648]: [status] notice: received log
Jul 28 14:00:13 node01 pmxcfs[1648]: [status] notice: received log
Jul 28 14:00:15 node01 pve-ha-lrm[565657]: Task 'UPID:node01:0008A19B:00628060:6887664A:qmigrate:119:root@pam:' still active, waiting
Jul 28 14:00:20 node01 pve-ha-lrm[565657]: Task 'UPID:node01:0008A19B:00628060:6887664A:qmigrate:119:root@pam:' still active, waiting
Jul 28 14:00:22 node01 pmxcfs[1648]: [status] notice: received log
Jul 28 14:00:23 node01 pmxcfs[1648]: [status] notice: received log
Jul 28 14:00:24 node01 pve-ha-lrm[565659]: migration problems
Jul 28 14:00:24 node01 pve-ha-lrm[565657]: <root@pam> end task UPID:node01:0008A19B:00628060:6887664A:qmigrate:119:root@pam: migration problems
Jul 28 14:00:24 node01 pve-ha-lrm[565657]: service vm:119 not moved (migration error)

Code:
# node02 = target
Jul 28 14:00:00 node02 pmxcfs[1652]: [status] notice: received log
Jul 28 14:00:01 node02 pmxcfs[1652]: [status] notice: received log
Jul 28 14:00:01 node02 CRON[589430]: pam_unix(cron:session): session opened for user root(uid=0) by root(uid=0)
Jul 28 14:00:01 node02 CRON[589432]: (root) CMD ( /usr/bin/ipmiutil wdt -r >/dev/null 2>&1)
Jul 28 14:00:01 node02 CRON[589430]: pam_unix(cron:session): session closed for user root
Jul 28 14:00:10 node02 pmxcfs[1652]: [status] notice: received log
Jul 28 14:00:10 node02 sshd-session[589519]: Accepted publickey for root from <redacted> port 49562 ssh2: RSA SHA256:<redacted>
Jul 28 14:00:10 node02 sshd-session[589519]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 28 14:00:10 node02 systemd-logind[1264]: New session 1353 of user root.
Jul 28 14:00:10 node02 systemd[1]: Started session-1353.scope - Session 1353 of User root.
Jul 28 14:00:10 node02 sshd-session[589527]: Received disconnect from <redacted> port 49562:11: disconnected by user
Jul 28 14:00:10 node02 sshd-session[589527]: Disconnected from user root <redacted> port 49562
Jul 28 14:00:10 node02 sshd-session[589519]: pam_unix(sshd:session): session closed for user root
Jul 28 14:00:10 node02 systemd[1]: session-1353.scope: Deactivated successfully.
Jul 28 14:00:10 node02 systemd-logind[1264]: Session 1353 logged out. Waiting for processes to exit.
Jul 28 14:00:10 node02 systemd-logind[1264]: Removed session 1353.
Jul 28 14:00:11 node02 sshd-session[589531]: Accepted publickey for root from <redacted> port 49568 ssh2: RSA SHA256:<redacted>
Jul 28 14:00:11 node02 sshd-session[589531]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 28 14:00:11 node02 systemd-logind[1264]: New session 1354 of user root.
Jul 28 14:00:11 node02 systemd[1]: Started session-1354.scope - Session 1354 of User root.
Jul 28 14:00:11 node02 sshd-session[589539]: Received disconnect from <redacted> port 49568:11: disconnected by user
Jul 28 14:00:11 node02 sshd-session[589539]: Disconnected from user root <redacted> port 49568
Jul 28 14:00:11 node02 sshd-session[589531]: pam_unix(sshd:session): session closed for user root
Jul 28 14:00:11 node02 systemd[1]: session-1354.scope: Deactivated successfully.
Jul 28 14:00:11 node02 systemd-logind[1264]: Session 1354 logged out. Waiting for processes to exit.
Jul 28 14:00:11 node02 systemd-logind[1264]: Removed session 1354.
Jul 28 14:00:11 node02 sshd-session[589542]: Accepted publickey for root from <redacted> port 49576 ssh2: RSA SHA256:<redacted>
Jul 28 14:00:11 node02 sshd-session[589542]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 28 14:00:11 node02 systemd-logind[1264]: New session 1355 of user root.
Jul 28 14:00:11 node02 systemd[1]: Started session-1355.scope - Session 1355 of User root.
Jul 28 14:00:11 node02 sshd-session[589551]: Received disconnect from <redacted> port 49576:11: disconnected by user
Jul 28 14:00:11 node02 sshd-session[589551]: Disconnected from user root <redacted> port 49576
Jul 28 14:00:11 node02 sshd-session[589542]: pam_unix(sshd:session): session closed for user root
Jul 28 14:00:11 node02 systemd[1]: session-1355.scope: Deactivated successfully.
Jul 28 14:00:11 node02 systemd-logind[1264]: Session 1355 logged out. Waiting for processes to exit.
Jul 28 14:00:11 node02 systemd-logind[1264]: Removed session 1355.
Jul 28 14:00:11 node02 sshd-session[589554]: Accepted publickey for root from <redacted> port 49590 ssh2: RSA SHA256:<redacted>
Jul 28 14:00:11 node02 sshd-session[589554]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 28 14:00:11 node02 systemd-logind[1264]: New session 1356 of user root.
Jul 28 14:00:11 node02 systemd[1]: Started session-1356.scope - Session 1356 of User root.
Jul 28 14:00:12 node02 qm[589564]: start VM 119: UPID:node02:0008FEFC:00652C8B:6887664C:qmstart:119:root@pam:
Jul 28 14:00:12 node02 qm[589563]: <root@pam> starting task UPID:node02:0008FEFC:00652C8B:6887664C:qmstart:119:root@pam:
Jul 28 14:00:12 node02 kernel:  rbd0: p1 p2 p3
Jul 28 14:00:12 node02 kernel: rbd: rbd0: capacity 21474836480 features 0x3d
Jul 28 14:00:12 node02 kernel: rbd: rbd1: capacity 134217728000 features 0x3d
Jul 28 14:00:12 node02 kernel: rbd: rbd2: capacity 193273528320 features 0x3d
Jul 28 14:00:12 node02 kernel: rbd: rbd6: capacity 1048576 features 0x3d
Jul 28 14:00:12 node02 systemd[1]: Started 119.scope.
Jul 28 14:00:12 node02 zebra[1393]: libyang Invalid boolean value "". (/frr-vrf:lib/vrf/state/active)
Jul 28 14:00:12 node02 zebra[1393]: libyang Invalid type uint32 empty value. (/frr-vrf:lib/vrf/state/id)
Jul 28 14:00:13 node02 kernel: tap119i0: entered promiscuous mode
Jul 28 14:00:13 node02 ovs-vsctl[589891]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port tap119i0
Jul 28 14:00:13 node02 ovs-vsctl[589891]: ovs|00002|db_ctl_base|ERR|no port named tap119i0
Jul 28 14:00:13 node02 ovs-vsctl[589893]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port fwln119i0
Jul 28 14:00:13 node02 ovs-vsctl[589893]: ovs|00002|db_ctl_base|ERR|no port named fwln119i0
Jul 28 14:00:13 node02 ovs-vsctl[589894]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl -- add-port vmbr0 tap119i0 tag=100 -- set Interface tap119i0 mtu_request=1500
Jul 28 14:00:13 node02 qm[589564]: VM 119 started with PID 589868.
Jul 28 14:00:13 node02 qm[589563]: <root@pam> end task UPID:node02:0008FEFC:00652C8B:6887664C:qmstart:119:root@pam: OK
Jul 28 14:00:13 node02 sshd-session[589562]: Received disconnect from <redacted> port 49590:11: disconnected by user
Jul 28 14:00:13 node02 sshd-session[589562]: Disconnected from user root <redacted> port 49590
Jul 28 14:00:13 node02 sshd-session[589554]: pam_unix(sshd:session): session closed for user root
Jul 28 14:00:13 node02 systemd-logind[1264]: Session 1356 logged out. Waiting for processes to exit.
Jul 28 14:00:13 node02 systemd[1]: session-1356.scope: Deactivated successfully.
Jul 28 14:00:13 node02 systemd[1]: session-1356.scope: Consumed 1.118s CPU time, 135.3M memory peak.
Jul 28 14:00:13 node02 systemd-logind[1264]: Removed session 1356.
Jul 28 14:00:13 node02 sshd-session[589922]: Accepted publickey for root from <redacted> port 49596 ssh2: RSA SHA256:<redacted>
Jul 28 14:00:13 node02 sshd-session[589922]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 28 14:00:13 node02 systemd-logind[1264]: New session 1357 of user root.
Jul 28 14:00:13 node02 systemd[1]: Started session-1357.scope - Session 1357 of User root.
Jul 28 14:00:22 node02 sshd-session[590014]: Accepted publickey for root from <redacted> port 60584 ssh2: RSA SHA256:<redacted>
Jul 28 14:00:22 node02 sshd-session[590014]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 28 14:00:22 node02 systemd-logind[1264]: New session 1358 of user root.
Jul 28 14:00:22 node02 systemd[1]: Started session-1358.scope - Session 1358 of User root.
Jul 28 14:00:22 node02 qm[590023]: <root@pam> starting task UPID:node02:000900C8:006530B0:68876656:qmstop:119:root@pam:
Jul 28 14:00:22 node02 qm[590024]: stop VM 119: UPID:node02:000900C8:006530B0:68876656:qmstop:119:root@pam:
Jul 28 14:00:22 node02 QEMU[589868]: kvm: terminating on signal 15 from pid 590024 (task UPID:node02:000900C8:006530B0:68876656:qmstop:119:root@pam:)
Jul 28 14:00:22 node02 kernel:  rbd0: p1 p2 p3
Jul 28 14:00:22 node02 ovs-vsctl[590029]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port fwln119i0
Jul 28 14:00:22 node02 ovs-vsctl[590029]: ovs|00002|db_ctl_base|ERR|no port named fwln119i0
Jul 28 14:00:22 node02 ovs-vsctl[590030]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port tap119i0
Jul 28 14:00:22 node02 qmeventd[1262]: read: Connection reset by peer
Jul 28 14:00:22 node02 systemd[1]: 119.scope: Deactivated successfully.
Jul 28 14:00:22 node02 systemd[1]: 119.scope: Consumed 3.417s CPU time, 5.5G memory peak.
Jul 28 14:00:23 node02 qm[590023]: <root@pam> end task UPID:node02:000900C8:006530B0:68876656:qmstop:119:root@pam: OK
Jul 28 14:00:23 node02 sshd-session[590022]: Received disconnect from <redacted> port 60584:11: disconnected by user
Jul 28 14:00:23 node02 sshd-session[590022]: Disconnected from user root <redacted> port 60584
Jul 28 14:00:23 node02 sshd-session[590014]: pam_unix(sshd:session): session closed for user root
Jul 28 14:00:23 node02 systemd[1]: session-1358.scope: Deactivated successfully.
Jul 28 14:00:23 node02 systemd[1]: session-1358.scope: Consumed 739ms CPU time, 126M memory peak.
Jul 28 14:00:23 node02 systemd-logind[1264]: Session 1358 logged out. Waiting for processes to exit.
Jul 28 14:00:23 node02 systemd-logind[1264]: Removed session 1358.
Jul 28 14:00:23 node02 sshd-session[589930]: Received disconnect from <redacted> port 49596:11: disconnected by user
Jul 28 14:00:23 node02 sshd-session[589930]: Disconnected from user root <redacted> port 49596
Jul 28 14:00:23 node02 sshd-session[589922]: pam_unix(sshd:session): session closed for user root
Jul 28 14:00:23 node02 systemd-logind[1264]: Session 1357 logged out. Waiting for processes to exit.
Jul 28 14:00:23 node02 systemd[1]: session-1357.scope: Deactivated successfully.
Jul 28 14:00:23 node02 systemd[1]: session-1357.scope: Consumed 7.580s CPU time, 120.5M memory peak.
Jul 28 14:00:23 node02 systemd-logind[1264]: Removed session 1357.
Jul 28 14:00:24 node02 sshd-session[590343]: Accepted publickey for root from <redacted> port 60600 ssh2: RSA SHA256:<redacted>
Jul 28 14:00:24 node02 sshd-session[590343]: pam_unix(sshd:session): session opened for user root(uid=0) by root(uid=0)
Jul 28 14:00:24 node02 systemd-logind[1264]: New session 1359 of user root.
Jul 28 14:00:24 node02 systemd[1]: Started session-1359.scope - Session 1359 of User root.
Jul 28 14:00:24 node02 sshd-session[590351]: Received disconnect from <redacted> port 60600:11: disconnected by user
Jul 28 14:00:24 node02 sshd-session[590351]: Disconnected from user root <redacted> port 60600
Jul 28 14:00:24 node02 sshd-session[590343]: pam_unix(sshd:session): session closed for user root
Jul 28 14:00:24 node02 systemd[1]: session-1359.scope: Deactivated successfully.
Jul 28 14:00:24 node02 systemd-logind[1264]: Session 1359 logged out. Waiting for processes to exit.
Jul 28 14:00:24 node02 systemd-logind[1264]: Removed session 1359.
Jul 28 14:00:24 node02 pmxcfs[1652]: [status] notice: received log
 
Awesome, thank you very much! The proposed patch fixes the migration issues for me <3

I had to change the patch a bit, as it could not be applied as-is to a clean PVE 9.0.0~11 host. Here is my adjusted version of the patch:

Code:
diff --git a/src/PVE/QemuMigrate.pm b/src/PVE/QemuMigrate.pm
index edaf2f25..5b854292 100644
--- a/src/PVE/QemuMigrate.pm
+++ b/src/PVE/QemuMigrate.pm
@@ -1354,7 +1354,7 @@ sub phase2 {
             next;
         }
 
-        if (!defined($status) || $status !~ m/^(active|completed|failed|cancelled)$/im) {
+        if (!defined($status) || $status !~ m/^(active|cancelled|completed|device|failed)$/im) {
             die $merr if $merr;
             die "unable to parse migration status '$status' - aborting\n";
         }
@@ -1394,7 +1394,7 @@ sub phase2 {
             die "aborting\n";
         }
 
-        if ($status ne 'active') {
+        if ($status ne 'active' && $status ne 'device') {
             $self->log('info', "migration status: $status");
             last;
         }

To apply the patch I used:

Code:
patch /usr/share/perl5/PVE/QemuMigrate.pm < ./QemuMigrate_pm.patch
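
Note that the long-running daemons keep their Perl modules loaded, so I assume the relevant services need to be restarted for the patched module to be picked up, e.g.:

Code:
systemctl restart pvedaemon pveproxy pve-ha-lrm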
 
Hello,

after updating one of my PVE hosts from 8.4.1 to 9.0-beta, it seems I'm hitting that AppArmor bug in an unprivileged LXC container (which was running fine before):


- unprivileged podman container installed using the community scripts
- a non-root user running in that container (/etc/subuid & /etc/subgid properly setup for that user)
- any podman command in the container results in:

```
$ podman ps
cannot clone: Permission denied
Error: cannot re-exec process
```

the host's `journalctl -k` shows

```
juil. 29 18:49:54 pve1 kernel: audit: type=1400 audit(1753807794.198:312): apparmor="DENIED" operation="userns_create" class="namespace" profile="lxc-603_</var/lib/lxc>" pid=1389361 comm="podman>
```

I tried playing with these sysctls, but it didn't change anything:

```
# sysctl -a | grep apparmor
kernel.apparmor_display_secid_mode = 0
kernel.apparmor_restrict_unprivileged_io_uring = 0
kernel.apparmor_restrict_unprivileged_unconfined = 0
kernel.apparmor_restrict_unprivileged_userns = 0
kernel.apparmor_restrict_unprivileged_userns_complain = 0
kernel.apparmor_restrict_unprivileged_userns_force = 0
kernel.unprivileged_userns_apparmor_policy = 1
```

I'll try booting from `6.8.12-13-pve` and report whether it changes anything (expecting that a lot of other things will break, but anyway...).
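
If it really is the container's AppArmor profile blocking user namespace creation, I guess a possible workaround (untested, syntax not verified) would be to extend or replace the generated profile via raw LXC config in /etc/pve/lxc/603.conf, something like:

```
# untested assumption: append a userns rule to the generated AppArmor profile
lxc.apparmor.raw: allow userns,
# or, as a blunt and less secure test only, drop AppArmor confinement entirely:
# lxc.apparmor.profile: unconfined
```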
 
after updating one of my PVE hosts from 8.4.1 to 9.0-beta, it seems I'm hitting that AppArmor bug in an unprivileged LXC container (which was running fine before) [...]
Do you have the nesting and keyctl features enabled in the CT's options? If not, please try that.
 
Do you have the nesting and keyctl features enabled in the CT's options? If not, please try that.
Sorry, I forgot the container's config file:

Code:
arch: amd64
cores: 4
features: keyctl=1,mknod=1,nesting=1
hostname: podman1
memory: 8192
net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:FB:00:77,ip=dhcp,type=veth
onboot: 1
ostype: debian
rootfs: ssd:vm-603-disk-0,size=100G
swap: 512
tags: community-script;container
unprivileged: 1
lxc.cgroup2.devices.allow: c 10:200 rwm
lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file
lxc.idmap: u 0 100000 165536
lxc.idmap: g 0 100000 165536

So yes, both are enabled. The tun entries are there so the container can join a tailnet (that works fine), and the idmap matches /etc/subuid and /etc/subgid inside the container, in case that helps.
 
So yes, both are enabled. The tun entries are there so the container can join a tailnet (that works fine), and the idmap matches /etc/subuid and /etc/subgid inside the container, in case that helps.
I also tried kernel 6.8.12-13-pve with the same results, so I guess this is caused by some userland change...
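
For whoever wants to compare against 8.4.1, the relevant userspace package versions should be visible with something like:

Code:
dpkg -l | grep -E 'apparmor|lxc-pve|lxcfs'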