Very funny / strange / annoying issue I discovered today. Any ideas, suggestions, requests for additional information are very much welcome.
Note: read to the bottom of the thread for details and a possible workaround, and check the risks that workaround entails in this post (more detailed here).
Short description: when the PVE host is using balance-rr bonding mode and both interfaces are connected, the network on the VMs is either not working at all, or not working reliably. Environment description at the bottom of the post.
Working scenarios/details:
- balance-rr bonding works fine for inter-node communication (tested with iperf and VM migrations) and for communication with other nodes (I have some multi-gig adapters on other boxes, connected to the same/different switches; in all scenarios, iperf transfer is in the range of 1.5 - 1.7 Gbps)
- when the PVE host is only connected via a single network interface (either onboard or USB, even as part of a balance-rr bond), everything works
- when the PVE host is using any bonding mode other than balance-rr, everything works (tested with balance-alb, balance-tlb)
- VM guests (regardless of the VLAN) work fine in the two scenarios above (single NIC connected, or not using balance-rr): they get a DHCP address in the right network, and they are consistently reachable
Not working scenarios/details:
- when the PVE host is using balance-rr bonding mode and both interfaces are connected, the network access to the VMs is either unreliable (e.g. ssh takes 30 seconds to get in, if it does) or doesn't work at all
- to preempt an obvious question ("why not use balance-alb across the cluster?") - some of the USB adapters don't support it. Bad cheap stuff, I know.
- the VM guests can't get a DHCP address after a reboot or when the lease expires; the DHCP server does receive the DHCPDISCOVER and replies with a DHCPOFFER, but the VM never completes the exchange (no DHCPREQUEST/DHCPACK shows up in the server log). The two lines below repeat for ~5 minutes; if I take down one of the network interfaces on the PVE host, it suddenly works:
Mar 16 16:09:33 dnsmasq-dhcp[20342]: DHCPDISCOVER(eth0.3) 10.10.3.7 e2:a2:f4:75:a2:a1
Mar 16 16:09:33 dnsmasq-dhcp[20342]: DHCPOFFER(eth0.3) 10.10.3.7 e2:a2:f4:75:a2:a1
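To check a longer log for the same pattern, a small script can group the dnsmasq messages by MAC address and list the clients that were offered an address but never got as far as DHCPREQUEST or DHCPACK. This is only a rough sketch: the function name `stalled_clients` is made up for illustration, and the regular expression assumes the log format of the two lines above (message type, interface in parentheses, optional IP address, MAC address).

```python
import re
from collections import defaultdict

# Assumed line format (taken from the two sample lines in the post):
#   Mar 16 16:09:33 dnsmasq-dhcp[20342]: DHCPOFFER(eth0.3) 10.10.3.7 e2:a2:f4:75:a2:a1
LINE_RE = re.compile(
    r"dnsmasq-dhcp\[\d+\]: (DHCP[A-Z]+)\(([^)]+)\)\s+"
    r"(?:(\d+\.\d+\.\d+\.\d+)\s+)?"          # the IP address is optional
    r"((?:[0-9a-f]{2}:){5}[0-9a-f]{2})",     # client MAC address
    re.IGNORECASE,
)


def stalled_clients(log_text):
    """Return MACs that were sent a DHCPOFFER but have no DHCPREQUEST/DHCPACK."""
    seen = defaultdict(set)
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            msg_type, _iface, _ip, mac = match.groups()
            seen[mac.lower()].add(msg_type.upper())
    return sorted(
        mac
        for mac, types in seen.items()
        if "DHCPOFFER" in types and not types & {"DHCPREQUEST", "DHCPACK"}
    )


SAMPLE = """\
Mar 16 16:09:33 dnsmasq-dhcp[20342]: DHCPDISCOVER(eth0.3) 10.10.3.7 e2:a2:f4:75:a2:a1
Mar 16 16:09:33 dnsmasq-dhcp[20342]: DHCPOFFER(eth0.3) 10.10.3.7 e2:a2:f4:75:a2:a1
"""

print(stalled_clients(SAMPLE))
```

Run against the two sample lines, the VM's MAC is reported as stalled; a client whose log also contains a DHCPREQUEST or DHCPACK would not be listed.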
PVE environment:
- 3-node PVE cluster (node1/node2/node3), running latest version (apt full-upgrade ran yesterday, non-enterprise license)
- each node has two physical network interfaces (one onboard, one USB), 1 Gbps each
- all network interfaces are connected to the same physical unmanaged switch (this is a homelab setup, not enterprise)
- all network interfaces are detected and connected (lights on etc)
- each node has a bond0 made up of the two interfaces, using balance-rr mode; there are no VLANs or other manually-defined interfaces at PVE level
root@node1:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback
iface eno1 inet manual
iface usb1 inet manual
auto bond0
iface bond0 inet manual
        bond-slaves eno1 usb1
        bond-miimon 100
        bond-mode balance-rr
auto vmbr0
iface vmbr0 inet static
        address 10.100.100.100/24
        gateway 10.100.100.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
root@node1:~#
root@node1:~# pveversion
pve-manager/7.3-6/723bb6ec (running kernel: 5.15.102-1-pve)
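For comparing the three nodes, it may help to pull the bond and bridge settings out of `/etc/network/interfaces` with a script rather than by eye. The sketch below is an assumption-laden illustration, not a full ifupdown parser: `parse_interfaces` is a hypothetical helper that only understands the simple stanza layout shown above (`iface` lines followed by option lines).

```python
def parse_interfaces(text):
    """Map each 'iface NAME ...' stanza to a dict of its option lines."""
    stanzas = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        words = line.split()
        if words[0] == "iface":
            # Start (or reopen) the stanza for this interface name.
            current = stanzas.setdefault(words[1], {})
        elif words[0] in ("auto", "allow-hotplug", "source", "source-directory"):
            # These keywords end the previous stanza and carry no options.
            current = None
        elif current is not None:
            current[words[0]] = " ".join(words[1:])
    return stanzas


# Configuration copied from node1 as posted above.
SAMPLE = """\
auto lo
iface lo inet loopback
iface eno1 inet manual
iface usb1 inet manual
auto bond0
iface bond0 inet manual
        bond-slaves eno1 usb1
        bond-miimon 100
        bond-mode balance-rr
auto vmbr0
iface vmbr0 inet static
        address 10.100.100.100/24
        gateway 10.100.100.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
"""

cfg = parse_interfaces(SAMPLE)
print("bond mode:   ", cfg["bond0"]["bond-mode"])
print("bond slaves: ", cfg["bond0"]["bond-slaves"].split())
print("bridge ports:", cfg["vmbr0"]["bridge-ports"])
```

Feeding each node's file through the same function makes it easy to spot a node whose bond mode or slave list differs from the others.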
Guest VM environment (call it vm-357):
- VM based on Ubuntu 22.04 cloud-init image with some tweaks; none that I recall (at least) in the networking space, apart from disabling ipv6 (fully disabled across the network)
- single virtual network adapter on each VM, using virtio (for some reason I haven't yet investigated, all other adapter types aren't even detected by the image)
- network adapter tagged with VLAN tag 3 and using vmbr0 as bridge
root@node1:~# grep net0 /etc/pve/qemu-server/357.conf
net0: virtio=E2:A2:F4:75:A2:A1,bridge=vmbr0,tag=3
root@node1:~#
root@vm-357:~# ifconfig ens18
ens18: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 1500
        ether e2:a2:f4:75:a2:a1  txqueuelen 1000  (Ethernet)
        RX packets 954  bytes 132240 (132.2 KB)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 24 bytes 8208 (8.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
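The same MAC address shows up in the VM config (upper case) and in the guest's `ifconfig` output and the dnsmasq log (lower case). A quick way to confirm they really are the same adapter, and to read the bridge and VLAN tag from the config line, is sketched below. `parse_net` is a hypothetical helper; it assumes the `netN: model=MAC,key=value,...` layout of the line shown above and does not handle options without an `=` sign.

```python
def parse_net(line):
    """Split a 'netN: model=MAC,key=value,...' line into a dict."""
    name, _, spec = line.partition(":")          # split at the first colon only
    fields = [field.split("=", 1) for field in spec.strip().split(",")]
    model, mac = fields[0]                       # first field is model=MAC
    info = {"name": name.strip(), "model": model, "mac": mac.lower()}
    info.update({key: value for key, value in fields[1:]})
    return info


# Line copied from /etc/pve/qemu-server/357.conf as posted above.
net0 = parse_net("net0: virtio=E2:A2:F4:75:A2:A1,bridge=vmbr0,tag=3")

# MAC as it appears in the guest's ifconfig output and in the dnsmasq log.
guest_mac = "e2:a2:f4:75:a2:a1"

print(net0)
print("MAC matches guest/DHCP log:", net0["mac"] == guest_mac)
```

Normalising the MAC to lower case before comparing avoids a false mismatch between the Proxmox config and the other two sources.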
Network environment:
- unmanaged switch plugged into managed switch, where VLANs etc. are defined
			
 ). Time permitting, I will also try to understand why this fixes the thing, unless someone has an easy explanation.