[SOLVED] Proxmox cluster + LACP + VLANs + virtual gateway (pfSense) = no ping to gateway from 2 of 3 hosts

skierka

New Member
Feb 6, 2022
Hello,

I have set up a small homelab with three identical Proxmox nodes running as a cluster. All went well until I decided to run the gateway/firewall (pfSense) as a VM.

Problem:
2 out of 3 nodes cannot access the WAN (internet). When I ping the gateway IP (the pfSense VM running on one of the nodes) from a host, I can see the request go out and the reply come back, but the reply is somehow discarded.
On one node, however, everything works fine, and I cannot figure out why.
Please note that migrating the pfSense VM from one node to another or restarting the nodes does not change the situation; the behavior is stable.

Set-up description:
Each node is connected to an L2 managed switch with 2 x NICs using LACP (802.3ad).
Each port group has tagged ports for VLAN 5 (WAN), VLAN 15 (LAN), VLAN 20 (IoT)

The pfSense VM has three network adapters: one on vmbr1 (LAN), one on vmbr2 (WAN) and one on vmbr0 with VLAN tag = 20.

Network configuration (just IP of vmbr1 is different for each node):
Code:
auto lo
iface lo inet loopback

auto enp1s0
iface enp1s0 inet manual

auto enp2s0
iface enp2s0 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves enp1s0 enp2s0
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2
#LAN1+2 link aggregation

auto bond0.15
iface bond0.15 inet manual
#Bond over VLAN 15 (LAN)

auto bond0.5
iface bond0.5 inet manual
#Bond over VLAN 5 (WAN)

auto vmbr1
iface vmbr1 inet static
        address 192.168.15.100/24
        gateway 192.168.15.1
        bridge-ports bond0.15
        bridge-stp off
        bridge-fd 0
#On top of VLAN 15 (LAN)

auto vmbr0
iface vmbr0 inet manual
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 20-4094
#VLAN 20+

auto vmbr2
iface vmbr2 inet manual
        bridge-ports bond0.5
        bridge-stp off
        bridge-fd 0
#On top of VLAN 5 (WAN)

Problem symptoms:
On node1 (192.168.15.100) I run ping and all packets are lost:
Code:
root@pve1:~# ping -c3 192.168.15.1
PING 192.168.15.1 (192.168.15.1) 56(84) bytes of data.

--- 192.168.15.1 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2029ms

but tcpdump on the host (and on the pfSense VM) shows the traffic going out and coming back correctly:
Code:
root@pve1:~# tcpdump -envi vmbr1 icmp
tcpdump: listening on vmbr1, link-type EN10MB (Ethernet), snapshot length 262144 bytes

20:22:37.102368 32:fe:cb:a6:cd:36 > 00:0e:c4:d0:4e:92, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 37319, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.15.100 > 192.168.15.1: ICMP echo request, id 42394, seq 1, length 64
20:22:37.102668 00:0e:c4:d0:4e:92 > 50:21:08:80:05:92, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 60139, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.15.1 > 192.168.15.100: ICMP echo reply, id 42394, seq 1, length 64
20:22:38.109750 32:fe:cb:a6:cd:36 > 00:0e:c4:d0:4e:92, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 37438, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.15.100 > 192.168.15.1: ICMP echo request, id 42394, seq 2, length 64
20:22:38.110046 00:0e:c4:d0:4e:92 > 50:21:08:80:05:92, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 56162, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.15.1 > 192.168.15.100: ICMP echo reply, id 42394, seq 2, length 64
20:22:39.133743 32:fe:cb:a6:cd:36 > 00:0e:c4:d0:4e:92, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 37497, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.15.100 > 192.168.15.1: ICMP echo request, id 42394, seq 3, length 64
20:22:39.134029 00:0e:c4:d0:4e:92 > 50:21:08:80:05:92, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 57476, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.15.1 > 192.168.15.100: ICMP echo reply, id 42394, seq 3, length 64
^C
6 packets captured

On node 2 (192.168.15.99) I run ping and all is fine
Code:
ping -c3 192.168.15.1
PING 192.168.15.1 (192.168.15.1) 56(84) bytes of data.
64 bytes from 192.168.15.1: icmp_seq=1 ttl=64 time=0.336 ms
64 bytes from 192.168.15.1: icmp_seq=2 ttl=64 time=0.505 ms
64 bytes from 192.168.15.1: icmp_seq=3 ttl=64 time=0.458 ms

--- 192.168.15.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2039ms
rtt min/avg/max/mdev = 0.336/0.433/0.505/0.071 ms

The ip route output is equivalent on both nodes (as is the network config); here it is from node1 and node2:
Code:
root@pve1:~# ip route
default via 192.168.15.1 dev vmbr1 proto kernel onlink 
192.168.15.0/24 dev vmbr1 proto kernel scope link src 192.168.15.100 

root@pve2:~# ip route
default via 192.168.15.1 dev vmbr1 proto kernel onlink 
192.168.15.0/24 dev vmbr1 proto kernel scope link src 192.168.15.101

Any suggestions?
 
Update: looking at the tcpdump output again, I can see that the reply is sent back to the correct IP, but the destination MAC address is wrong.
The reply is sent to 50:21:08:80:05:92, but it should go to 32:fe:cb:a6:cd:36.
The wrong MAC comes from a static ARP mapping in the DHCP settings on pfSense... my mistake. Still, it is good to post and read through a problem; sometimes you find the solution yourself ;-). I have fixed the set-up and all works like a charm.
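
For anyone hitting the same symptom, the comparison I did by eye can be sketched as a small shell helper. This is only a rough sketch: check_reply_mac is a made-up name, and it assumes the "tcpdump -e" output format shown above (MAC addresses on the header line, "ICMP echo request/reply" on the same or the following line).

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): reads `tcpdump -e` output on
# stdin and compares the source MAC of ICMP echo requests with the
# destination MAC of ICMP echo replies.
check_reply_mac() {
    awk '
        # Header lines look like: "<time> <src-mac> > <dst-mac>, ethertype ..."
        $3 == ">" && /ethertype/ { src = $2; dst = $4; sub(/,$/, "", dst) }
        /ICMP echo request/ { req_src = src }   # MAC the host sends from
        /ICMP echo reply/   { rep_dst = dst }   # MAC the gateway replies to
        END {
            if (req_src == "" || rep_dst == "") {
                print "incomplete capture"
                exit 2
            }
            if (req_src == rep_dst) {
                print "OK: replies go to " rep_dst
                exit 0
            }
            print "MISMATCH: request from " req_src ", reply sent to " rep_dst
            exit 1
        }'
}
```

Something like "tcpdump -c 6 -eni vmbr1 icmp | check_reply_mac" while a ping is running should then flag the mismatch (exit status 1) on the affected nodes.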

P.S. I had changed the bond definition, and the MAC address is now taken from the first NIC, so on 2 of the 3 nodes it no longer matched the static ARP entry on the pfSense side.
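
To catch this kind of drift after touching the bond definition, the interface's current MAC can be compared with the one recorded in the static mapping. A minimal sketch, assuming the MAC from the pfSense mapping is copied in by hand; mac_matches and the SYSFS_NET override are made-up names for illustration, while /sys/class/net/<iface>/address is the standard Linux sysfs path.

```shell
#!/bin/sh
# Hypothetical helper: compare an interface's current MAC address with the
# MAC stored in the gateway's static mapping (passed as the 2nd argument).
# SYSFS_NET is an assumed override, only there to make the function testable.
mac_matches() {
    mm_iface="$1"
    mm_expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')   # normalise to lowercase
    mm_file="${SYSFS_NET:-/sys/class/net}/$mm_iface/address"
    [ -r "$mm_file" ] || { echo "$mm_iface: cannot read $mm_file"; return 2; }
    mm_actual=$(tr 'A-F' 'a-f' < "$mm_file")
    if [ "$mm_actual" = "$mm_expected" ]; then
        echo "$mm_iface: $mm_actual matches static mapping"
    else
        echo "$mm_iface: $mm_actual differs from static mapping $mm_expected"
        return 1
    fi
}
```

On an affected node, "mac_matches vmbr1 50:21:08:80:05:92" would report the difference, after which either the static mapping on pfSense or the bond's MAC can be corrected.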
 