Note: all of these things worked fine before upgrading to PVE 9.
I've upgraded our production cluster to PVE 9, and most things are good, but there are clearly some problems with virtio networking. Two things stand out (and there may be more).
First, our cluster has compute, storage, and hybrid nodes. The storage nodes are all R740XDs, the compute nodes are a mix of R740s and R750s, and the hybrid node is an R750. The hybrid node is essentially the same as the other R750s in the stack, just with a newer BIOS, and guests with virtio network drivers cannot be successfully migrated to or from this host. When migrating to this host from any other node in the stack, the following error is observed:
Code:
Aug 10 18:16:57 hybr-01-prod systemd[1]: Started 101.scope.
Aug 10 18:17:02 hybr-01-prod QEMU[1473513]: kvm: Features 0x130afffaf unsupported. Allowed features: 0x1c0010179bfffe7
Aug 10 18:17:02 hybr-01-prod QEMU[1473513]: kvm: Failed to load virtio-net:virtio
Aug 10 18:17:02 hybr-01-prod QEMU[1473513]: kvm: error while loading state for instance 0x0 of device '0000:00:12.0/virtio-net'
Aug 10 18:17:02 hybr-01-prod QEMU[1473513]: kvm: Putting registers after init: Failed to set XCRs: Invalid argument
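Decoding that first message a bit: a minimal sketch (Python, with the two values copied from the log above) that diffs the guest's feature mask against what the destination allows, to see which virtio feature bits are being refused. For these values it reports a single differing bit.
Code:
# Diff the migrating guest's virtio-net feature mask against the mask the
# destination host accepts (both values copied from the log above).
# Bit numbers correspond to VIRTIO_F_* / VIRTIO_NET_F_* bits in the virtio spec.
guest_features   = 0x130afffaf          # "Features ... unsupported" value
allowed_features = 0x1c0010179bfffe7    # "Allowed features" value

rejected = guest_features & ~allowed_features
print([bit for bit in range(rejected.bit_length()) if (rejected >> bit) & 1])
# prints [3] for the values above, i.e. only one feature bit differs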
When migrating from this node to any other node in the stack, the receiving node generates the following error:
Code:
Aug 10 18:51:00 comp-06-prod QEMU[1578269]: kvm: get_pci_config_device: Bad config data: i=0x10 read: a1 device: 1 cmask: ff wmask: c0 w1cmask:0
Aug 10 18:51:00 comp-06-prod QEMU[1578269]: kvm: Failed to load PCIDevice:config
Aug 10 18:51:00 comp-06-prod QEMU[1578269]: kvm: Failed to load virtio-net:virtio
Aug 10 18:51:00 comp-06-prod QEMU[1578269]: kvm: error while loading state for instance 0x0 of device '0000:00:12.0/virtio-net'
Aug 10 18:51:00 comp-06-prod QEMU[1578269]: kvm: load of migration failed: Invalid argument
Guests without the virtio network driver migrate fine in both directions.
The second issue took me almost 24 hours to identify because it is so odd. We have a load balancer that talks to three VMs on a different network segment. The VMs all seem to work fine and are on the network without issue. From certain network segments an SSL session can be established, but when the VM starts sending data, the data disappears and never makes it back to the requester. This does not happen across all network segments. We moved the VMs to the same network as the load balancer and it then worked. We initially thought it was a firewall issue and spent a lot of time on that, but as a test I changed the interface to e1000 on one of the VMs and everything worked again on its old network segment. There is clearly some weird problem here.
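For reference, the e1000 test above was just a NIC model swap in the VM config; something like the following, where the VMID, MAC, and bridge are placeholders for our actual values:
Code:
# swap the VM's NIC from virtio to e1000 (VMID/MAC/bridge are placeholders)
qm set 105 --net0 e1000=BC:24:11:00:00:01,bridge=vmbr0
# equivalent to changing the net0 line in /etc/pve/qemu-server/105.conf from
#   net0: virtio=BC:24:11:00:00:01,bridge=vmbr0
# to
#   net0: e1000=BC:24:11:00:00:01,bridge=vmbr0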
It seems there is some issue with virtio-net in the latest QEMU. We are running the following across all hosts:
Code:
QEMU emulator version 10.0.2 (pve-qemu-kvm_10.0.2-4)
Copyright (c) 2003-2025 Fabrice Bellard and the QEMU Project developers
pve-manager/9.0.4/39d8a4de7dfb2c40 (running kernel: 6.14.8-2-pve)
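(For anyone comparing, that output is just the result of these two commands on the hosts.)
Code:
kvm --version
pveversion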