Kubernetes overlay networking breaks when upgrading from PVE 9.1 to PVE 9.2.3

joelvdvoort · May 30, 2026

I have an odd problem since upgrading my Proxmox cluster from PVE 9.1 to PVE 9.2.3. The cluster consists of three physical Proxmox nodes. On it I host three Talos Linux control plane nodes and three workers. Since the update I can no longer reach anything over the pod and service CIDRs when the workload runs on a Talos node that sits on a different physical Proxmox node than the one I'm testing from. Once I migrate the Talos nodes onto the same physical Proxmox node, traffic on the overlay network starts working again. To rule out Kubernetes or the CNIs, I've installed different Talos clusters with Cilium, both with and without kube-proxy, and even reinstalled once with just the standard Flannel CNI. To further rule out Talos I've installed a k3s cluster with the built-in Flannel CNI. Still the same: no traffic over the pod and service CIDRs unless the nodes are on the same physical Proxmox node.

So today I decided to reinstall Proxmox from the 9.1 release ISO. I reinstalled the Talos nodes, ran my tests, and traffic in the overlay network just works. To test even further I upgraded again to 9.2.3, and after that overlay networking is broken again.

Any help would be greatly appreciated.

gaetanc · May 31, 2026

I had the same issue after an update from PVE 8.x to 9.x (ended up being 9.2.3). I tried to roll back to 9.1 by forcing the packages but it didn't solve it.

I narrowed the problem down to network communication between the kubelets, but I couldn't figure out what exactly. I spent a bunch of time disabling all the firewall checkboxes but nothing worked.

After some prompting with Gemini, I got this:

VirtIO Checksum Offloading Bug (The Sneaky Hypervisor Issue)
This is a notorious issue with Proxmox, virtio network drivers, and Kubernetes VXLANs. Sometimes, the hardware checksum offloading gets confused by the encapsulated VXLAN packets, corrupts the checksum, and the receiving node silently drops the packet because it thinks it's malformed.

The Fix: SSH into your CoreOS nodes and disable TX offloading on your primary network interface (usually eth0 or ens18) to see if it fixes the issue.

Bash:

sudo ethtool -K ens18 tx off

And it worked for me. I've no clue why it happened after the upgrade, everything was fine before and I didn't change anything in the VMs...

To make it permanent (don't forget to adapt your network interface):

Bash:

nmcli connection show
sudo nmcli connection modify "ens18" ethtool.feature-tx off
sudo nmcli connection up "ens18"

joelvdvoort · May 31, 2026

@gaetanc

Excellent find! I didn't have such luck when asking Copilot and resorted to doing a lot of packet captures. Interestingly the traffic still reaches the virtio NICs and the cilium_vxlan interface (where the VXLAN encapsulation is stripped and the actual UDP packet with pod IPs passes through). I had gotten as far as confirming, using a netcat listener, that the traffic does not reach the pod itself.

To confirm your ethtool fix works, I had to work around the fact that there's no SSH or shell on Talos Linux:

Bash:

kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -n kube-system --overrides='{"spec": {"hostNetwork": true, "nodeName": "talos-w1", "containers": [{"name": "tmp-shell1", "image": "nicolaka/netshoot", "stdin": true, "tty": true, "securityContext": {"privileged": true}}]}}'

This runs a privileged network troubleshooting container, and from there I could run ethtool against the host NIC ens18. After that, it works like you said; my netcat listener received whatever keystrokes I sent from the pod on the other node.

joelvdvoort · May 31, 2026

I've also done a test using two Talos nodes with E1000 adapters and apparently that works too!

EDIT: there’s a bug registered: https://bugzilla.proxmox.com/show_bug.cgi?id=7627

fiona · Jun 1, 2026

Hi @joelvdvoort and @gaetanc,
did you already try with kernel 7.0.6-2 to see if the backport there fixes the issue?

If it does not, could you test with pve-qemu-kvm=10.2.1-2 and pve-qemu-kvm=10.1.2-7? A VM needs to be started fresh after installing to run with the newly installed package version.
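A minimal sketch of how such a test round could look on the host. It assumes that `qm status <vmid> --verbose` prints a `running-qemu` line for a running VM; the helper names `running_qemu_version` and `restart_and_report` are made up for illustration and are not part of any Proxmox tooling.

```shell
# Install one of the versions to test first, for example:
#   apt install pve-qemu-kvm=10.1.2-7

# Extracts the QEMU version a running VM was started with from the output of
# `qm status <vmid> --verbose` (read on stdin). Assumption: that output has a
# "running-qemu: <version>" line. Prints nothing if the line is absent.
running_qemu_version() {
    awk -F': *' '$1 == "running-qemu" { print $2 }'
}

# Hypothetical helper: fully stop and start a VM so that it runs on the
# currently installed QEMU package, then print the version it reports.
restart_and_report() {
    vmid="$1"
    qm shutdown "$vmid" && qm start "$vmid" &&
        qm status "$vmid" --verbose | running_qemu_version
}
```

Calling `restart_and_report 101` after each package change would show whether the VM really picked up the version under test; a reboot from inside the guest keeps the old QEMU process and is not enough.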

What physical NIC do you have?

joelvdvoort · Jun 1, 2026

Hi @fiona,

I've tested it on the latest 7.0.6-2 kernel as well as on the older 6.x kernel that is still present, selected using the proxmox boot tool; neither worked.

lshw gives me the following info on the physical nics:

Code:

 *-network
       description: Ethernet interface
       product: RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller
       vendor: Realtek Semiconductor Co., Ltd.
       physical id: 0
       bus info: pci@0000:02:00.0
       logical name: nic0
       version: 15
       serial: [redacted]
       size: 1Gbit/s
       capacity: 1Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress msix bus_master cap_list ethernet physical tp mii 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=r8169 driverversion=7.0.6-2-pve duplex=full firmware=rtl8168h-2_0.0.2 02/26/15 latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
       resources: irq:16 ioport:3000(size=256) memory:80804000-80804fff memory:80800000-80803fff

I downgraded the pve-qemu-kvm packages and tested everything in order. pve-qemu-kvm=10.2.1-2 did not work for me. After installing pve-qemu-kvm=10.1.2-7 and rebooting the hosts and VMs, it started working correctly: with pve-qemu-kvm=10.1.2-7, VXLAN overlay networking works.

fiona · Jun 2, 2026

Thank you for testing! Between QEMU 10.1 and 10.2 support for negotiating extended VirtIO features was added, so that is likely related to the regression. Unfortunately, we were not yet able to reproduce the issue.

Could you open the VM monitor with qm monitor ID (where ID is the numerical ID of the VM) and run the following commands:

Code:

qm> info virtio
/machine/peripheral/net1/virtio-backend [virtio-net]
/machine/peripheral/virtio0/virtio-backend [virtio-blk]
/machine/peripheral/balloon0/virtio-backend [virtio-balloon]
qm> info virtio-status /machine/peripheral/net1/virtio-backend
/machine/peripheral/net1/virtio-backend:
  device_name:             virtio-net (vhost)
  device_id:               1
...

Once with QEMU 11 or 10.2 and once with QEMU 10.1 so that the outputs can be compared.

fiona · Jun 2, 2026

If you see the VIRTIO_NET_F_HOST_UDP_TUNNEL_GSO_CSUM with QEMU 11.0, see also the following post:

Post in thread 'QEMU 11.0 available on pve-test and pve-no-subscription as of now'

Jun 2, 2026

@shanreich was finally able to reproduce the issue. When the physical NIC does not support the

Code:

tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]

features, the issue can happen. When the physical NIC does support them, it works. The issue seems to be that the feature negotiation between QEMU/host/guest (AFAICT wrongly) decides to turn on these features for the vNIC even if the underlying physical NIC does not support them. Although they are also turned on for the bridge and tap interfaces. We'll investigate further where exactly the negotiation/setting of...
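To see whether a given host falls into that category, the two feature lines can be read from `ethtool -k` on the physical NIC. The following is only a sketch: the function name `nic_tunnel_offload_state` is made up for illustration, and it relies on nothing more than the `name: on|off` line format shown above.

```shell
# Reads `ethtool -k <nic>` output on stdin and classifies UDP tunnel offload
# support. Prints "unknown" if neither feature line appears, "missing" if
# either feature is reported off, and "present" otherwise.
nic_tunnel_offload_state() {
    awk '
        $1 == "tx-udp_tnl-segmentation:"      { seg = $2 }
        $1 == "tx-udp_tnl-csum-segmentation:" { csum = $2 }
        END {
            if (seg == "" && csum == "") { print "unknown" }
            else if (seg == "off" || csum == "off") { print "missing" }
            else { print "present" }
        }'
}
```

On the Proxmox host this would be used as `ethtool -k nic0 | nic_tunnel_offload_state`, with `nic0` replaced by the name of the physical interface. A result of `missing` only means the host matches the condition described above, not that a VM on it is necessarily affected.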

joelvdvoort · Jun 2, 2026

fiona said:
Thank you for testing! Between QEMU 10.1 and 10.2 support for negotiating extended VirtIO features was added, so that is likely related to the regression. Unfortunately, we were not yet able to reproduce the issue.

Could you go to the VM monitor qm monitor ID with the numerical ID of the VM and run the following commands

Code:

qm> info virtio
/machine/peripheral/net1/virtio-backend [virtio-net]
/machine/peripheral/virtio0/virtio-backend [virtio-blk]
/machine/peripheral/balloon0/virtio-backend [virtio-balloon]
qm> info virtio-status /machine/peripheral/net1/virtio-backend
/machine/peripheral/net1/virtio-backend:
  device_name:             virtio-net (vhost)
  device_id:               1
...

Once with QEMU 11 or 10.2 and once with QEMU 10.1 so that the outputs can be compared.

Please see the attached files for the output. Before I ran the virtio commands I did my end-to-end tests as well.

joelvdvoort · Jun 2, 2026

fiona said:
If you see the VIRTIO_NET_F_HOST_UDP_TUNNEL_GSO_CSUM with QEMU 11.0, see also the following post:

Post in thread 'QEMU 11.0 available on pve-test and pve-no-subscription as of now'

Jun 2, 2026

@shanreich was finally able to reproduce the issue. When the physical NIC does not support the

Code:

tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]

features, the issue can happen. When the physical NIC does support them, it works. The issue seems to be that the feature negotiation between QEMU/host/guest (AFAICT wrongly) decides to turn on these features for the vNIC even if the underlying physical NIC does not support them. Although they are also turned on for the bridge and tap interfaces. We'll investigate further where exactly the negotiation/setting of...

fiona

Please see the attached file for virtio output. I also ran my end to end tests; even with the patch it's still broken.

Additionally see the output below:

Bash:

root@pve1:~# qm showcmd 101 --pretty | grep net
  -netdev 'type=tap,id=net0,ifname=tap101i0,script=/usr/libexec/qemu-server/pve-bridge,downscript=/usr/libexec/qemu-server/pve-bridgedown,vhost=on' \
  -device 'virtio-net-pci,mac=BC:24:11:D1:ED:48,netdev=net0,bus=pci.0,addr=0x12,id=net0,rx_queue_size=1024,tx_queue_size=256,bootindex=102,host_mtu=1500,guest_tunnel_csum=off,host_tunnel_csum=off' \

EDIT: I saw in the other thread that the cause has been found. So the problem gets triggered when the underlying physical NIC does not support the UDP tunnel offload features (tx-udp_tnl-segmentation and tx-udp_tnl-csum-segmentation).

eugene-bg · Jun 2, 2026

I have the same issue with my Proxmox 9.2.3.

I also run Talos 1.13.3 on Proxmox 9.2.3 (the issue persists with older versions of Talos, e.g. 1.12.8).
I have 2 identical environments: Proxmox 9.2.3 (broken) and Proxmox 9.1.6 (working just fine). 3 host nodes, hardware Dell R740/R750/R340.
Both run 6 Talos 1.13.3 nodes with k8s 1.36.1, distributed across the hosts: 1 control plane and 1 worker per host node.

On Proxmox 9.2.3, K8s pods cannot communicate with each other or with their services (e.g. CoreDNS) when they are placed on different nodes. If I migrate all K8s nodes to the same Proxmox host, everything starts working.
The environment running the older Proxmox 9.1.6 has no issues. Everything is working. Pods can talk to each other and to their services.

I also tried both Flannel and Cilium to make sure it is not a CNI configuration issue, with the same results: both CNIs work on Proxmox 9.1.6 and stopped working on Proxmox 9.2.3.

Both K8s clusters are created using the same IaC Terraform code. Both Proxmox environments have been configured manually with very basic settings (NTP, bond). The Proxmox 9.2.3 environment was known to be working before the upgrade.

Attached are the requested `qm` debug outputs for both Proxmox 9.2.3 (qemu11) and 9.1.6 (qemu10).

Please let me know if you need more debug info

fiona · Jun 3, 2026

We are still investigating the root cause. Another workaround should be using machine version 10.1 for the affected VMs.

earzur · Jun 4, 2026

Thanks all for the fix ! ran into this on monday when i tried upgrading our cluster...

i did this (i run only q35 VMs, take care not to break i440fx VMs!)

Code:

qm list | grep -v VMID | awk '{ print $1; }' | xargs -I '{}' qm set '{}' --machine pc-q35-10.1

and then proceeded to reboot the VMs (qm shutdown xxx; qm start xxx) one by one (same context: talos v1.13.3 nodes losing cross-pve-host EVPN networking, biggest pain was longhorn volumes broken)
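For clusters that mix q35 and i440fx VMs, a more careful variant could check each VM's config before pinning the machine version. This is only a sketch under a few assumptions: that `qm config <vmid>` prints a `machine:` line whenever q35 was selected, and that a VM without such a line uses the default i440fx type. The names `vm_machine_family` and `pin_q35_vms` are made up for illustration.

```shell
# Reads `qm config <vmid>` output on stdin. Prints "q35" if the machine line
# mentions q35 (e.g. "q35", "pc-q35-10.1", "type=q35,..."), otherwise "other".
# Assumption: no machine line means the default i440fx type.
vm_machine_family() {
    awk -F': *' '
        $1 == "machine" { m = $2 }
        END { if (m ~ /q35/) { print "q35" } else { print "other" } }'
}

# Hypothetical wrapper: pin only the q35 VMs to machine version 10.1 and
# leave everything else untouched.
pin_q35_vms() {
    for vmid in $(qm list | awk 'NR > 1 { print $1 }'); do
        if [ "$(qm config "$vmid" | vm_machine_family)" = "q35" ]; then
            qm set "$vmid" --machine pc-q35-10.1
        else
            echo "skipping VM $vmid (not q35)" >&2
        fi
    done
}
```

Running `pin_q35_vms` on each host changes the config only; as with the one-liner above, every VM still needs a full shutdown and start before the new machine version takes effect.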

Thanks again

xrobau · Jun 7, 2026

I did just notice that running 'ifup eth0' (or whatever your physical interfaces are) fixes this. Is there something to do with vxlans binding somehow to the wrong interface?

Disabling tx offload may just be a red herring. It certainly works, but maybe that is triggering the fix, whatever it may be.

fiona · Jun 8, 2026

Hi @xrobau,

xrobau said:
I did just notice that running 'ifup eth0' (or whatever your physical interfaces are) fixes this. Is there something to do with vxlans binding somehow to the wrong interface?

This does not seem to work on our local reproducer. Are you sure that you have the same issue and that it was not just the interface being down for you? Or maybe an earlier command you tried was still having an effect? When it is not working (after a fresh boot), what is the output of the more verbose command ifup -vd eth0?

gjrodenburg · Jun 9, 2026

I managed to fix my talos cluster by running the following (or adding the args line manually to the config):

qm set 195 --args '-global virtio-net-pci.host_tunnel_csum=off -global virtio-net-pci.csum=off'
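One caveat when scripting this: `qm set --args` replaces the VM's whole args line, so a VM that already carries custom arguments would lose them. A small guard could look like the sketch below; the function names `current_vm_args` and `apply_csum_workaround` are made up for illustration, and the check assumes custom arguments show up as an `args:` line in `qm config` output.

```shell
# Reads `qm config <vmid>` output on stdin and prints the value of the
# args line. Prints nothing when no args line is set.
current_vm_args() {
    sed -n 's/^args: *//p'
}

# Hypothetical wrapper: apply the workaround only when the VM has no custom
# args yet, so that existing arguments are never overwritten silently.
apply_csum_workaround() {
    vmid="$1"
    existing="$(qm config "$vmid" | current_vm_args)"
    if [ -n "$existing" ]; then
        echo "VM $vmid already has args: $existing (merge manually)" >&2
        return 1
    fi
    qm set "$vmid" --args '-global virtio-net-pci.host_tunnel_csum=off -global virtio-net-pci.csum=off'
}
```

With this, `apply_csum_workaround 195` behaves like the command above for a VM without custom arguments and refuses to touch one that has them.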

eugene-bg · Jun 9, 2026

gjrodenburg said:
I managed to fix my talos cluster by running the following (or adding the args line manually to the config):

qm set 195 --args '-global virtio-net-pci.host_tunnel_csum=off -global virtio-net-pci.csum=off'

This worked for me!
Thanks a lot!

alucard1238 · Jun 25, 2026

Hi guys,

just a quick note to say thanks to everyone here. We hit the exact same issue on an OKD cluster (cross-host pod/service traffic breaking on PVE 9.2.3 / pve-qemu-kvm 11.0.0-4), and switching the affected VMs to machine type pc-q35-10.1 fixed it for us too.

Thanks a lot!

wintix · Jul 1, 2026

same here, okd broke with paravirtualized network, switching to machine 10.1 works

fiona · Jul 1, 2026

See:

Post in thread 'QEMU 11.0 available on pve-test and pve-no-subscription as of now'

Jun 26, 2026

Seems like I forgot to follow-up regarding the VirtIO networking issue here before my vacation (only did it in another thread and the bug tracker). The issue was caused by a bug in the virtio-net kernel driver. The guest kernel is outside of Proxmox VE's control, so we are thinking about disabling the host_tunnel feature by default for now. Proposed patches with all the details:
https://lore.proxmox.com/pve-devel/20260626120701.116793-1-f.ebner@proxmox.com/T/
