Odd bridge/VLAN/bond behaviour having upgraded two of three cluster machines from 4.4 to 5.0

Jul 13, 2017
With 4.4, I could assign IP addresses to two bridges, each on a different VLAN interface on the same bond. Something, possibly the bridge handling in Debian Stretch, is not happy with my setup.

All machines use bonded NICs with VLANs. In both cases the last bridge to be configured is used for corosync cluster communication; it is on VLAN v1910. The first bridge interface carries the default gateway, 172.17.83.1/24.

However, with Proxmox 5.0 only the most recently created bridge passes traffic (as far as I can see). So pvemanager (and the GUI running on the remaining 4.4 machine), which uses the bridge on v1901, cannot see the 5.0 machines, but they do form part of the quorum via v1910. That is also how I have SSH access.

If I move the "iface vmbr01 inet static" stanza in /etc/network/interfaces below the stanza for vmbr10, I get the opposite behaviour: pvemanager yes, corosync no. I can't figure it out. It's driving me crazy. :confused:

root@lab2:~# brctl show
bridge name     bridge id           STP enabled   interfaces
vmbr0           8000.d8d385b7a3c1   no            bond1
vmbr01          8000.d8d385b7a3c0   no            bond0.1901
vmbr10          8000.d8d385b7a3c0   no            bond0.1910

root@lab2:~# ip link show bond0.1901
11: bond0.1901@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr01 state UP mode DEFAULT group default qlen 1000
link/ether d8:d3:85:b7:a3:c0 brd ff:ff:ff:ff:ff:ff
root@lab2:~# ip link show bond0.1910
26: bond0.1910@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr10 state UP mode DEFAULT group default qlen 1000
link/ether d8:d3:85:b7:a3:c0 brd ff:ff:ff:ff:ff:ff

root@lab2:~# ip -4 address show vmbr01
12: vmbr01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
inet 172.17.83.14/24 brd 172.17.83.255 scope global vmbr01
valid_lft forever preferred_lft forever
root@lab2:~# ip -4 address show vmbr10
27: vmbr10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
inet 172.20.33.100/29 brd 172.20.33.103 scope global vmbr10
valid_lft forever preferred_lft forever

root@lab2:~# ip route show
default via 172.17.83.1 dev vmbr01 onlink
172.17.83.0/24 dev vmbr01 proto kernel scope link src 172.17.83.14
172.20.33.96/29 dev vmbr10 proto kernel scope link src 172.20.33.100

root@lab2:~# ip neigh
172.20.33.97 dev vmbr10 lladdr 00:04:96:97:b9:04 STALE
172.17.83.251 dev vmbr01 FAILED
172.20.33.98 dev vmbr10 lladdr 78:e3:b5:f6:22:20 REACHABLE
172.20.33.99 dev vmbr10 lladdr d8:d3:85:bb:40:c0 REACHABLE
172.17.83.1 dev vmbr01 FAILED

root@lab2:~# ping 172.17.83.1
PING 172.17.83.1 (172.17.83.1) 56(84) bytes of data.
From 172.17.83.14 icmp_seq=1 Destination Host Unreachable


root@lab1:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

iface eth2 inet manual

iface eth3 inet manual

auto bond0
iface bond0 inet manual
    slaves eth0 eth1
    bond_miimon 100
    bond_mode 802.3ad
    bond_xmit_hash_policy layer3+4
    bond_lacp_rate 1
    bond_downdelay 200
    bond_updelay 200

auto bond1
iface bond1 inet manual
    slaves eth2 eth3
    bond_miimon 100
    bond_mode 802.3ad
    bond_xmit_hash_policy layer3+4
    bond_lacp_rate 1
    bond_downdelay 200
    bond_updelay 200

auto bond0.1901
iface bond0.1901 inet manual
    vlan-raw-device bond0

auto bond0.1910
iface bond0.1910 inet manual
    vlan-raw-device bond0

auto vmbr0
iface vmbr0 inet manual
    bridge_ports bond1
    bridge_stp off
    bridge_fd 0
    bridge_vlan_aware yes

auto vmbr01
iface vmbr01 inet static
    address 172.17.83.13
    netmask 255.255.255.0
    gateway 172.17.83.1
    bridge_ports bond0.1901
    bridge_stp off
    bridge_fd 0

auto vmbr10
iface vmbr10 inet static
    address 172.20.33.99
    netmask 255.255.255.248
    bridge_ports bond0.1910
    bridge_stp off
    bridge_fd 0
 
> However, with Proxmox 5.0 only the most recently created bridge passes traffic (as far as I can see). So pvemanager (and the GUI running on the remaining 4.4 machine), which uses the bridge on v1901, cannot see the 5.0 machines, but they do form part of the quorum via v1910. That is also how I have SSH access.

If you are using this configuration, can you confirm that all VMs which have a NIC in v1901 can communicate with each other?

Two more remarks regarding your config:
* if you're using the VLAN bond0.1910 only for corosync communication, you do not need to put this interface in a bridge; you can assign an IP directly to bond0.1910 (see the sketch after this list)
* the only tested and supported bonding mode for a corosync network is active/passive (bonding mode 1)
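
For example, a minimal stanza for that could look like the following (a sketch only, reusing the address you currently have on vmbr10 on lab1; adapt the address per node):

auto bond0.1910
iface bond0.1910 inet static
    vlan-raw-device bond0
    address 172.20.33.99
    netmask 255.255.255.248
    # no bridge_ports: the IP lives directly on the tagged VLAN interface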

Finally I would advise you to run tcpdump on the bond interface and have a look at the ARP traffic.

With the command
tcpdump -nnn -i BRIDGE_WITH_VM_NIC -e "arp and (host IP_OF_VM_IN_BRIDGE or host IP_OF_MACHINE_IN_LAN)"

you can have a look at the who-has / reply traffic.
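
For example, with the values from your lab2 output above (vmbr01, its address 172.17.83.14 and the gateway 172.17.83.1), that would be something like:

# watch ARP between the host's vmbr01 address and the default gateway
tcpdump -nnn -e -i vmbr01 "arp and (host 172.17.83.14 or host 172.17.83.1)"

Given the FAILED neighbour entries in your ip neigh output, the interesting part is whether any reply from 172.17.83.1 ever comes back.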
 
Ah... I found a misconfigured port in my switching fabric. Proxmox 4.4 must have tolerated it, albeit with many a dropped packet, no doubt.
One of the VLANs, on one of the physical interface mappings (out of the pair used for bonding) within HPE VirtualConnect, was incorrectly tagged. So, I suppose, as long as the vmbr01 (v1901) interface was the last one in /etc/network/interfaces, and therefore the last to be addressed, the LACP bonding code happened to keep its traffic working.
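
For anyone chasing something similar: these are generic Linux commands (not specific to this thread) for inspecting the bond and LACP state on the host side:

cat /proc/net/bonding/bond0   # per-slave status, aggregator ID and LACP partner info
ip -d link show bond0         # bond mode and details as seen by the kernel

A mis-tagged VLAN on one leg won't necessarily show up there, since the LAG itself can look healthy, but it quickly confirms whether both slaves are up and in the same aggregator.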

All good now.
Also, I have taken your advice and moved the corosync traffic off a bridge: the L3 assignment now lives on the VLAN interface itself, and vmbr10 has been removed.
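
For anyone who finds this later, the usual sanity checks after a change like this (generic commands, using the addresses from my config above) are:

pvecm status           # quorum and membership as seen by Proxmox
ping -c3 172.17.83.1   # gateway reachability on the v1901 network

Both now behave as expected on the upgraded nodes.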

Thank you very much for your assistance.
 
