Proxmox 5.4 to 6.0 : Strange network issues

Sébastien Riccio · Jul 18, 2019

Hello!

I've recently updated 3 of the 4 nodes in our Proxmox 5.4 cluster to v6.0.

While preparing to update the last node, I found that the VMs I migrated to the updated nodes lose network access after a few minutes.

If I live migrate the VM to the node still on 5.4, the network comes back in the VM and stays working.

As soon as I migrate it back to a Proxmox 6 node, the network inside the VM works for a few minutes and then goes bozo again.

I noticed this in the dmesg, on the PVE 6.x nodes:

Code:

[  214.767305] netlink: 'ovs-vswitchd': attribute type 5 has an invalid length.
[  493.020271] device tap108i0 entered promiscuous mode
[  493.066847] fwbr108i0: port 1(tap108i0) entered blocking state
[  493.066851] fwbr108i0: port 1(tap108i0) entered disabled state
[  493.067128] fwbr108i0: port 1(tap108i0) entered blocking state
[  493.067134] fwbr108i0: port 1(tap108i0) entered forwarding state
[  493.088534] netlink: 'ovs-vswitchd': attribute type 5 has an invalid length.
[  493.088859] device fwln108o0 entered promiscuous mode
[  493.110525] fwbr108i0: port 2(fwln108o0) entered blocking state
[  493.110528] fwbr108i0: port 2(fwln108o0) entered disabled state
[  493.110749] fwbr108i0: port 2(fwln108o0) entered blocking state
[  493.110752] fwbr108i0: port 2(fwln108o0) entered forwarding state
[  552.054034] device tap111i0 entered promiscuous mode
[  552.098076] fwbr111i0: port 1(tap111i0) entered blocking state
[  552.098080] fwbr111i0: port 1(tap111i0) entered disabled state
[  552.098358] fwbr111i0: port 1(tap111i0) entered blocking state
[  552.098361] fwbr111i0: port 1(tap111i0) entered forwarding state
[  552.119806] netlink: 'ovs-vswitchd': attribute type 5 has an invalid length.
[  552.120138] device fwln111o0 entered promiscuous mode
[  552.140908] fwbr111i0: port 2(fwln111o0) entered blocking state
[  552.140912] fwbr111i0: port 2(fwln111o0) entered disabled state
[  552.141179] fwbr111i0: port 2(fwln111o0) entered blocking state
[  552.141182] fwbr111i0: port 2(fwln111o0) entered forwarding state
[  684.551647] device tap110i0 entered promiscuous mode
[  684.596609] netlink: 'ovs-vswitchd': attribute type 5 has an invalid length.
[ 1509.438479] device tap109i0 entered promiscuous mode
[ 1509.485649] netlink: 'ovs-vswitchd': attribute type 5 has an invalid length.

(the attribute type 5 thingy seems suspicious to me, it's only on the pve 6 nodes)

Network config is 2 x 2 x 10G NIC bonds. One bond is the trunk port for VM traffic, the other bond is dedicated to network storage.

Interfaces:

Code:

auto lo
iface lo inet loopback
iface eno1 inet manual
iface eno2 inet manual
iface eno3 inet manual
iface eno4 inet manual
iface enp4s0f0 inet manual
iface enp4s0f1 inet manual
iface enp4s0f2 inet manual
iface enp4s0f3 inet manual

#Management + Trunk bond
allow-vmbr0 bond0
iface bond0 inet manual
        ovs_bonds eno1 eno2
        ovs_type OVSBond
        ovs_bridge vmbr0
        ovs_options bond_mode=active-backup

#Storage bond
allow-vmbr1 bond1
iface bond1 inet manual
        ovs_bonds eno3 eno4
        ovs_type OVSBond
        ovs_bridge vmbr1
        ovs_options bond_mode=active-backup

# Management + Trunk bridge
auto vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports bond0 mgmt

# Storage bridge
auto vmbr1
iface vmbr1 inet manual
        ovs_type OVSBridge
        ovs_ports bond1 stor

#Management local interface
allow-vmbr0 mgmt
iface mgmt inet static
        address  10.111.11.104
        netmask  255.255.0.0
        gateway  10.111.0.6
        ovs_type OVSIntPort
        ovs_bridge vmbr0

#Storage local interface
allow-vmbr1 stor
iface stor inet static
        address  10.50.50.104
        netmask  255.255.255.0
        ovs_type OVSIntPort
        ovs_bridge vmbr1

Any idea what could be wrong? Are there some openvswitch commands I could try to get more information?
I'm an OVS newbie.

Thank you!
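
For readers looking for a starting point, here is a minimal sketch of a read-only OVS state snapshot. It only wraps standard Open vSwitch tools (ovs-vsctl, ovs-ofctl, ovs-appctl, ovs-dpctl); the function name ovs_snapshot and the /tmp file names are made up for illustration, and vmbr0/bond0 are taken from the config above.

```shell
#!/bin/sh
# Print a read-only snapshot of Open vSwitch state for one bridge and one bond.
# ovs_snapshot is a hypothetical helper name, not part of OVS or Proxmox.
ovs_snapshot() {
    bridge="$1"
    bond="$2"
    if ! command -v ovs-vsctl >/dev/null 2>&1; then
        echo "ovs-vsctl not found; is Open vSwitch installed?" >&2
        return 1
    fi
    echo "== ovs-vsctl show =="
    ovs-vsctl show
    echo "== ports on $bridge (OpenFlow view) =="
    ovs-ofctl show "$bridge"
    echo "== bond state for $bond =="
    ovs-appctl bond/show "$bond"
    echo "== MAC learning table for $bridge =="
    ovs-appctl fdb/show "$bridge"
    echo "== kernel datapath ports =="
    ovs-dpctl show
}

# Usage (as root on an affected node):
#   ovs_snapshot vmbr0 bond0 > /tmp/ovs-before.txt
#   ... wait until a VM loses network ...
#   ovs_snapshot vmbr0 bond0 > /tmp/ovs-after.txt
#   diff -u /tmp/ovs-before.txt /tmp/ovs-after.txt
```

Comparing a snapshot taken while the VM still has network with one taken after it drops may show whether a port, the active bond member, or a learned MAC entry changed in between.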

Sébastien Riccio · Jul 18, 2019

Actually, when it happens, this shows up in the openvswitch log:

Code:

2019-07-18T11:07:11.502Z|00010|ofproto_dpif_xlate(handler108)|WARN|received packet on unknown port 3 while processing icmp6,in_port=3,vlan_tci=0x0000,dl_src=22:0a:0d:fc:13:85,dl_dst=33:33:00:00:00:02,ipv6_src=fe80::200a:dff:fefc:1385,ipv6_dst=ff02::2,ipv6_label=0x00000,nw_tos=0,nw_ecn=0,nw_ttl=255,icmp_type=133,icmp_code=0 on bridge vmbr1

Trying to figure out why this happens
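
A small sketch for pulling the bridge name and port number out of warnings like the one above, so they can be compared with the ports OVS actually knows about. The function name ovs_unknown_port is hypothetical, and the log path in the usage comment is the usual Debian location, assumed here rather than confirmed for this setup.

```shell
#!/bin/sh
# Read ovs-vswitchd log lines on stdin and print "<bridge> <port>" for every
# "received packet on unknown port" warning. Other lines are ignored.
ovs_unknown_port() {
    sed -n 's/.*received packet on unknown port \([0-9][0-9]*\) .* on bridge \([^ ]*\).*/\2 \1/p'
}

# Usage (log path is an assumption):
#   ovs_unknown_port < /var/log/openvswitch/ovs-vswitchd.log | sort | uniq -c
#   ovs-ofctl show vmbr1    # lists the port numbers currently known on the bridge
```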

RokaKen · Jul 18, 2019

Sébastien Riccio said:
Hello!

<snip>
[ 1509.485649] netlink: 'ovs-vswitchd': attribute type 5 has an invalid length.

(the attribute type 5 thingy seems suspicious to me, it's only on the pve 6 nodes)

That particular message is a known bug that will hopefully go away when the OVS package gets the fix. Also, the other warning you posted may coincide with the problem, but isn't the cause -- they are just IPv6 ICMP messages that OVS can't process. I have those messages in PVE 5.4 without any connectivity problems.

I'm no expert on OVS either, but this tutorial shows how to increase the logging level of OVS, which may help identify the root cause.
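
As a rough sketch of what raising the log level can look like: ovs-appctl accepts vlog/set with a module:destination:level spec, and vlog/list prints the available modules. The helper names vlog_spec and ovs_set_file_loglevel are invented for this example, and the modules in the usage comment are only examples; check vlog/list for the ones present on your version.

```shell
#!/bin/sh
# Build a vlog spec of the form module:file:level (destination fixed to the log file).
vlog_spec() {
    # $1 = module, $2 = level (off, emer, err, warn, info, dbg)
    printf '%s:file:%s\n' "$1" "$2"
}

# Apply one log level to several ovs-vswitchd modules.
ovs_set_file_loglevel() {
    level="$1"
    shift
    if ! command -v ovs-appctl >/dev/null 2>&1; then
        echo "ovs-appctl not found" >&2
        return 1
    fi
    for module in "$@"; do
        ovs-appctl vlog/set "$(vlog_spec "$module" "$level")"
    done
}

# Usage (as root):
#   ovs-appctl vlog/list                                   # show modules and current levels
#   ovs_set_file_loglevel dbg bond ofproto_dpif_xlate      # more detail while reproducing
#   ovs_set_file_loglevel info bond ofproto_dpif_xlate     # back to the usual file level
```

Debug logging can be very verbose, so it is probably best enabled only for the few minutes needed to reproduce the outage.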

Sébastien Riccio · Jul 19, 2019

Hi, thank you for your reply.

Yes, that's right. After posting I found out about the patch for the "attribute type 5 has an invalid length" message and understood that it's more a logging problem than a real problem.

Same for the IPv6 messages: I found out they are not related to my issue; it looks like a coincidence.

What is really strange is that the problem seems to be related to two specific VMs.

I was able to move 6 other VMs to the 6.x nodes without any network issues with them for a whole night.

But these two remaining VMs, after a couple of minutes on a 6.x node, bam, their network connectivity is down.

I haven't yet been able to identify a common configuration that they don't share with the other VMs and that could be the source of the problem. They also run fine on the remaining 5.4 host, so there is a lot of head scratching going on here.

Sébastien Riccio · Jul 19, 2019

A little update here about my issue:

So I thought that only some VMs had this issue, but it is all VMs. Some just take much longer to lose connectivity, while for others it's a matter of 4-5 minutes.
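
Since the time until a VM drops off the network varies this much, it may help to measure it. Below is a minimal sketch, assuming the VM answers ping while healthy; watch_until_down and the PROBE override are hypothetical names, and the address in the usage comment is a placeholder.

```shell
#!/bin/sh
# Probe a target repeatedly and report how long it stayed reachable.
# PROBE can override the reachability check; the default is a single ping
# with a one-second timeout (Linux iputils syntax).
watch_until_down() {
    target="$1"
    interval="${2:-5}"
    start=$(date +%s)
    while ${PROBE:-ping -c 1 -W 1} "$target" >/dev/null 2>&1; do
        sleep "$interval"
    done
    end=$(date +%s)
    echo "$target unreachable after $((end - start)) s"
}

# Usage, started right after migrating the VM (placeholder address):
#   watch_until_down 192.0.2.10 5
```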

I continued investigating and tried to:

Remove openvswitch 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12 and replace it with older 2.7.0-3 from proxmox 5.4.
No change so I reverted it back.

Now I booted the v6 nodes with the previous kernel
Linux pm04 4.15.18-18-pve #1 SMP PVE 4.15.18-44 (Wed, 03 Jul 2019 11:19:13 +0200) x86_64 GNU/Linux

instead of the new PVE 6.x kernel
Linux pm04 5.0.15-1-pve #1 SMP PVE 5.0.15-1 (Wed, 03 Jul 2019 10:51:57 +0200) x86_64 GNU/Linux

I live migrated all the VMs to the 6.x node running the 4.15 kernel, and so far no VM has lost connectivity for an hour.
Not saying that it's 100% sure it fixed it but for now it seems better.

I will wait for a few hours to confirm if it's stable.
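
A small sketch for checking which kernel series a node is running, useful when some nodes were booted into the older kernel and others were not. The function names are made up, and "affected" here only means "a series reported as problematic in this thread" (5.0.x), not a confirmed diagnosis.

```shell
#!/bin/sh
# Reduce a kernel release string to its major.minor series,
# e.g. "5.0.15-1-pve" becomes "5.0".
kernel_series() {
    echo "$1" | cut -d. -f1,2
}

# Succeeds if the release belongs to the series reported as problematic
# in this thread. The list is an assumption based only on these reports.
is_reported_affected() {
    case "$(kernel_series "$1")" in
        5.0) return 0 ;;
        *)   return 1 ;;
    esac
}

# Usage:
#   if is_reported_affected "$(uname -r)"; then
#       echo "running $(uname -r): series reported with OVS network loss"
#   fi
#   ls /boot/vmlinuz-*    # kernels installed and selectable at boot
```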

czk · Aug 2, 2019

I have exactly the same problem with a similar network config. I already tried a few things like upgrading (from sid) and downgrading openvswitch. For now, booting an older kernel (4.19) seems to be a stable workaround.

Sébastien Riccio · Aug 2, 2019

Hello, I am somehow happy that I'm not the only one in this boat. I can also confirm that since we fell back to a 4.x kernel, the situation is stable.
I made an attempt at opening a ticket on the Bugzilla, but there has been no feedback so far:

https://bugzilla.proxmox.com/show_bug.cgi?id=2296

jermudgeon · Aug 3, 2019

I'm having the attribute type 5 error as well. For me the temporary fallback was to remove openvswitch from the interfaces using the Mellanox 4 driver:

https://forum.proxmox.com/threads/pve-6-and-mellanox-4-x-drivers.56553/#post-260909

dragon2611 · Aug 18, 2019

I believe I've just run into this on a machine where the uplink is 2x 1G in a LACP.

Something not playing nice in openvswitch?

Edit: Interestingly one of my machines has the problem, the other one doesn't seem to, trying Linux Bridge instead of OVS on the one with the issue.

jermudgeon · Aug 18, 2019

I had to revert to linux bridge, and also removed LACP. Interestingly OVS+LACP works just fine with different NICs. I had three nearly identical servers that experienced the problem, but intermittently -- average working uptime per machine was about 36 hours, which meant that on any given day one could expect one or two lockups of this nature.

Would be great to narrow down a bit more to the root cause and get an actual fix. This is a major regression for me, and will clearly screw over anyone affected who is using LACP for VM-facing interfaces.

What model NICs are you having the problem with?

dragon2611 · Aug 18, 2019

Intel I350's I think,

08:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
08:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
0a:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

2: enp8s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
3: enp8s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP group default qlen 1000

iankun · Aug 22, 2019

I had the same issues with OVS and migration to proxmox 6, which I solved by moving to linux bond and bridge. It would seem that the way Proxmox implements "tapping" into OVS bridge is not supported and that is why it does not work anymore: http://docs.openvswitch.org/en/latest/faq/issues/

From the link: The short answer is that this is a misuse of a “tap” device. Use an “internal” device implemented by Open vSwitch, which works differently and is designed for this use.
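
For anyone considering the same move, here is a sketch that prints what a Linux bond plus bridge replacement for an OVS bond/bridge pair might look like in /etc/network/interfaces. It is not the configuration used in this post: the interface names come from the first post in this thread, the helper names are invented, and options such as bond-miimon 100 and bridge-vlan-aware yes are assumptions to adapt. It only prints to stdout and does not touch the system; the host address stanza is left out.

```shell
#!/bin/sh
# Print an ifupdown stanza for a Linux bond.
linux_bond_stanza() {
    # $1 = bond name, $2 = space-separated member NICs, $3 = bonding mode
    printf 'auto %s\n' "$1"
    printf 'iface %s inet manual\n' "$1"
    printf '        bond-slaves %s\n' "$2"
    printf '        bond-miimon 100\n'
    printf '        bond-mode %s\n' "$3"
}

# Print an ifupdown stanza for a Linux bridge on top of the bond.
linux_bridge_stanza() {
    # $1 = bridge name, $2 = uplink port (the bond)
    printf 'auto %s\n' "$1"
    printf 'iface %s inet manual\n' "$1"
    printf '        bridge-ports %s\n' "$2"
    printf '        bridge-stp off\n'
    printf '        bridge-fd 0\n'
    printf '        bridge-vlan-aware yes\n'
}

# Usage: review the output before copying anything into /etc/network/interfaces.
#   linux_bond_stanza bond0 "eno1 eno2" active-backup
#   echo
#   linux_bridge_stanza vmbr0 bond0
```

Note that Linux bonding has no direct equivalent of the OVS balance-slb mode, so a different mode has to be chosen when converting such a bond.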

t.lamprecht · Aug 23, 2019

iankun said:
I had the same issues with OVS and migration to proxmox 6, which I solved by moving to linux bond and bridge. It would seem that the way Proxmox implements "tapping" into OVS bridge is not supported and that is why it does not work anymore: http://docs.openvswitch.org/en/latest/faq/issues/

From the link: The short answer is that this is a misuse of a “tap” device. Use an “internal” device implemented by Open vSwitch, which works differently and is designed for this use.

We already create internal typed devices... https://git.proxmox.com/?p=pve-comm...9613eff9525e17ffc7fa9aa4905a7097;hb=HEAD#l257

We're trying to see if we can reproduce this here.

As others already mentioned, the "attribute type 5 has an invalid length." message does not come from an error or the like; it comes from the logging stack, which got a bit stricter. FYI, we currently use the Debian packages for OVS directly, as it was a recent version and seemed to work OK here in testing.
That also means that you could try to open a bug report there. We will naturally do that too, and investigate ourselves, if we can reproduce this.

iankun · Aug 23, 2019

If that helps, here is my minimal config that has the issue with the latest updates as of today:

Code:

allow-vmbr0 bond0
iface bond0 inet manual
        ovs_bonds eno8 eno7
        ovs_type OVSBond
        ovs_bridge vmbr0
        mtu 9216
        ovs_options bond_mode=balance-slb
        pre-up ( ifconfig eno7 mtu 9216 && ifconfig eno8 mtu 9216 )

auto lo
iface lo inet loopback

auto eno6
iface eno6 inet static
        address  [REDUCTED]
        netmask  255.255.255.0
        gateway  [REDUCTED]
#Proxmox Net1

auto eno5
iface eno5 inet static
        address  [REDUCTED]
        netmask  255.255.255.0
#Proxmox Net2

iface eno1 inet manual

iface eno2 inet manual

iface eno3 inet manual

iface eno4 inet manual

iface eno7 inet manual

iface eno8 inet manual

allow-vmbr0 vlan3000
iface vlan3000 inet static
        address  [REDUCTED]
        netmask  255.255.255.0
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        mtu 9000
        ovs_options tag=3000
        ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif lldp:enable=true
#Storage 1

allow-vmbr0 vlan3001
iface vlan3001 inet static
        address  [REDUCTED]
        netmask  255.255.255.0
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        mtu 9000
        ovs_options tag=3001
        ovs_extra set interface ${IFACE} external-ids:iface-id=$(hostname -s)-${IFACE}-vif lldp:enable=true
#Storage 2

auto vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports bond0 vlan3000 vlan3001
        mtu 9216

And here is the hardware (NICs):

Code:

lspci -vv | grep Eth
04:00.0 Ethernet controller: Intel Corporation Ethernet Connection X552 10 GbE SFP+
        Subsystem: Super Micro Computer Inc Ethernet Connection X552 10 GbE SFP+
04:00.1 Ethernet controller: Intel Corporation Ethernet Connection X552 10 GbE SFP+
        Subsystem: Super Micro Computer Inc Ethernet Connection X552 10 GbE SFP+
08:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
09:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
0c:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
0c:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
0c:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
0c:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

Note that all of the other functionality worked, but VMs had no traffic to or from them. This applies to both IPv6 and IPv4.

Richard · Aug 29, 2019

iankun said:

If that helps, here is my minimal config that has the issue with the latest updates as of today:

<snip>

You should set lacp=active. Apart from this, we tested something corresponding to your configuration for a couple of days on both PVE 5 and PVE 6 and did not encounter any problems.

Note: if the switch is not configured correctly, it may work for a while, but sporadic blocking is possible.

iankun · Aug 29, 2019

Richard said:
You should set lacp=active. Apart from this, we tested something corresponding to your configuration for a couple of days on both PVE 5 and PVE 6 and did not encounter any problems.

Note: if the switch is not configured correctly, it may work for a while, but sporadic blocking is possible.

This is not an LACP link, and the cards are connected to two separate switches that are not “stacked”. The config works just fine on Proxmox 5 with VMs, CTs and other traffic. On Proxmox 6 it fails for VM and CT traffic, while other traffic is fine with no errors. My other configurations that use LACP and balancing have the same issues with Proxmox 6. Now I am using strictly Linux bridging and bonding, which works just fine for everything. I would suggest looking into kernel compatibility, as noted by other users; downgrading the kernel on Proxmox 6 seems to fix the issues.

Edit: forgot to add important bit, VLANs. I use VLANs and suspect that is where the problem might be.

Sébastien Riccio · Aug 30, 2019

Hello,

We are also suffering from networking problems with kernel 5.x, like this original poster, and as described in another thread and in the Bugzilla entry I've opened:
https://bugzilla.proxmox.com/show_bug.cgi?id=2296

Booting with the 4.x kernel from Proxmox 5 resolves the problem. But now we wanted to create a Windows VM, and it seems it needs kernel 5.x to work:

Code:

Hyper-V TLB flush support (requested by 'hv-tlbflush' cpu flag)  is not supported by kernel
kvm: kvm_init_vcpu failed: Function not implemented
TASK ERROR: start failed: command '/usr/bin/kvm -id 106 -name win01 -chardev 'socket,id=qmp,path=/var/run/qemu-server/106.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/106.pid -daemonize -smbios 'type=1,uuid=c5b3866a-1a57-4547-a158-e9e247281436' -smp '4,sockets=1,cores=4,maxcpus=4' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc unix:/var/run/qemu-server/106.vnc,password -no-hpet -cpu 'SandyBridge,+pcid,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer,hv_tlbflush,hv_ipi,enforce,vendor=GenuineIntel' -m 16384 -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'vmgenid,guid=a1721343-0c29-46e4-8a28-e01a901c46c1' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'qxl-vga,id=vga,bus=pci.0,addr=0x2' -chardev 'socket,path=/var/run/qemu-server/106.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -spice 'tls-port=61007,addr=127.0.0.1,tls-ciphers=HIGH,seamless-migration=on' -device 'virtio-serial,id=spice,bus=pci.0,addr=0x9' -chardev 'spicevmc,id=vdagent,name=vdagent' -device 'virtserialport,chardev=vdagent,name=com.redhat.spice.0' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:342a67af728c' -drive 'if=none,id=drive-ide0,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=100' -drive 'file=/mnt/pve/NetApp_SSD/images/106/vm-106-disk-0.qcow2,if=none,id=drive-virtio0,cache=writeback,format=qcow2,aio=threads,detect-zeroes=on' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=200' -netdev 
'type=tap,id=net0,ifname=tap106i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on,queues=2' -device 'virtio-net-pci,mac=62:4F:A2:93:31:65,netdev=net0,bus=pci.0,addr=0x12,id=net0,vectors=6,mq=on,bootindex=300' -rtc 'driftfix=slew,base=localtime' -machine 'type=pc' -global 'kvm-pit.lost_tick_policy=discard'' failed: exit code 1

So we're kinda stuck here. We can't switch to 5.x until these network issues with 5.x are looked at.
Is there a way to temporarily disable this hv-tlbflush flag so we can start the VM in the meantime?

Thank you for your help

Note: The issue exists with LACP bonding and normal bonding (active-passive), using openvswitch and with different NIC brands.
I personally haven't tried Linux bridging/bonding, as we need features only available with openvswitch.

Note 2: I tried booting a node again with the latest kernel available now (5.0.21-1-pve) and moved a VM onto it. Network in the VM is lost after 2-3 minutes.

t.lamprecht · Aug 30, 2019

Sébastien Riccio said:
Booting with the 4.x kernel from Proxmox 5 resolves the problem. But now we wanted to create a Windows VM, and it seems it needs kernel 5.x to work:

FYI: As a workaround, you can select an older Windows OS version for the VM than it really is. Vista should have tlbflush disabled, maybe even Windows 7 (not sure off the top of my head). That should make it work with the older kernel.
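
A sketch of how the OS type could be changed from the command line: qm set with --ostype is the standard Proxmox CLI for this, while the helper name ostype_workaround_cmd is invented here. It only prints the command so it can be reviewed first. VM ID 106 is taken from the error output above, and whether wvista actually avoids the hv_tlbflush flag on a given version is an assumption based on the suggestion in this post.

```shell
#!/bin/sh
# Print (not run) the qm command that switches a VM to an older Windows OS type.
ostype_workaround_cmd() {
    vmid="$1"
    ostype="${2:-wvista}"
    case "$vmid" in
        ''|*[!0-9]*)
            echo "VM ID must be numeric" >&2
            return 1
            ;;
    esac
    echo "qm set $vmid --ostype $ostype"
}

# Usage:
#   qm config 106 | grep ostype      # check the current setting first
#   ostype_workaround_cmd 106        # prints: qm set 106 --ostype wvista
#   ostype_workaround_cmd 106 win7   # alternative to try
```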

t.lamprecht · Aug 30, 2019

iankun said:
Edit: forgot to add important bit, VLANs. I use VLANs and suspect that is where the problem might be.

Do you use "ifupdown2" (needs to be manually installed)?

Sébastien Riccio · Aug 30, 2019

Hello @t.lamprecht,

Thank you for the info. I will try to change the OS version and see where it goes.

By the way, the node I had restarted with the 5.x kernel also lost the connection on the host itself for Ceph and the management interface, and went into fenced mode. So the network issue also affects the host itself, not only the VMs.
