[BUG] Network is only working selectively, can't see why

lifeboy

Renowned Member
I have four nodes with LAN addresses 192.168.131.1, .2, .3 and .4. They are all configured identically except for the last octet of the IP address: NodeA has .1, NodeB has .2, and so on.

NodeB for example:

[screenshot: NodeB's network configuration]

vmbr0 is the LAN, and it is also the LAN interface of a pfSense guest. vmbr1 is the internet gateway. This allows the pfSense guest to run on any node and still provide both internet access and access to whatever else pfSense exposes. I also reach the pfSense guest via OpenVPN for remote access.

Since Sunday night, however, seemingly out of the blue, NodeB is not able to reach pfSense anymore, and neither can pfSense reach NodeB. Yet nothing of the sort is wrong on NodeA, C or D. I can also reach NodeB from NodeA, C and D.

pfSense can reach NodeA, C and D and all the guests running on them, but nothing running on NodeB (which is sort of obvious, since it can't reach NodeB's vmbr0 bridge). However, pfSense only "knows" about the LAN bridge to the 192.168.131.0/24 network via vmbr0. So with no specific knowledge of NodeB, how is it possible that NodeB is offline while A, C and D are fine?

I have a second pfSense guest (a CARP failover), which displays exactly the same behaviour. The two have LAN addresses 192.168.131.252 (primary) and 192.168.131.253 (backup). They can ping each other on those addresses, but neither can ping 192.168.131.2 (NodeB's LAN bridge). It is almost as if there's an IP address blocker somewhere. If only one firewall showed this behaviour I would have suspected that something went wrong in pfSense, but both do the same thing. This leads me to conclude that the problem is most likely on NodeB.

Is there anything in Proxmox that could be blocking traffic? No, the Proxmox firewall is not in use.

Any suggestions on how I can find the cause of this?

PS. I also paged through the syslog. One moment all seems fine, and the next the Proxmox Backup Server cannot be reached (timeout). There is no other indication of a problem.
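A few generic checks that can help narrow down where traffic dies on a node like this (a sketch; vmbr0 is the LAN bridge from my config, adjust names to your own setup):

```shell
# Confirm which physical ports are actually enslaved to the bridge
bridge link show
ip -br link show

# Confirm the Proxmox firewall really is inactive on this node
pve-firewall status
```

If the wrong physical port is enslaved to vmbr0, the bridge can look healthy while being cabled to the wrong network segment.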
 
I would suspect a corrupted ARP table. If there's a next time, try clearing the ARP table first; arp -d -a from the pfSense shell should do it.
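For reference, pfSense is FreeBSD-based, so clearing the table differs slightly between the firewall and the Proxmox node (a sketch; both need root):

```shell
# pfSense (FreeBSD) shell: delete all ARP entries
arp -d -a

# Linux equivalent on the Proxmox node: flush the neighbour cache
ip neigh flush all
```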
 
Unmarked this thread as 'solved', since I was never able to figure out why this happened in the first instance...

Had a DC power test failure two days ago, and now the problem with NodeB is suddenly back.

NodeB can communicate on the "LAN" via vmbr0 with other hosts on the 192.168.131.0/24 network, and it can reach guests on NodeB itself, but it cannot reach guests on the other nodes. It cannot reach either of the pfSense instances, and pfSense cannot ping NodeB.

The ARP table on pfSense1 doesn't have an entry for 192.168.131.2, but it does for the other nodes. pfSense2 (the CARP standby) doesn't have ARP entries for any of the nodes at this stage. So it doesn't seem to be an ARP cache issue (I have cleared the cache).
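To see whether the problem is at the ARP level at all, one can watch for pfSense's ARP requests arriving on NodeB's bridge (a sketch using the addresses above):

```shell
# On NodeB: capture ARP traffic involving pfSense1 on the LAN bridge
tcpdump -eni vmbr0 arp and host 192.168.131.252

# Meanwhile, from the pfSense shell, force an ARP lookup:
#   ping -c 3 192.168.131.2
# If no who-has requests show up in the capture, the frames are
# being lost before they ever reach NodeB.
```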

I have inspected syslog and compared configs, but cannot figure out what is causing this.
 

Here's a strange discovery.

Code:
root@FT1-NodeA:~# udevadm info /sys/class/net/eth0 | grep ID_PATH
E: ID_PATH=pci-0000:18:00.0
E: ID_PATH_TAG=pci-0000_18_00_0
root@FT1-NodeA:~# udevadm info /sys/class/net/eth1 | grep ID_PATH
E: ID_PATH=pci-0000:18:00.1
E: ID_PATH_TAG=pci-0000_18_00_1
root@FT1-NodeA:~# udevadm info /sys/class/net/eth2 | grep ID_PATH
E: ID_PATH=pci-0000:19:00.0
E: ID_PATH_TAG=pci-0000_19_00_0

Code:
root@FT1-NodeB:~# udevadm info /sys/class/net/eth0 | grep ID_PATH
E: ID_PATH=pci-0000:18:00.0
E: ID_PATH_TAG=pci-0000_18_00_0
root@FT1-NodeB:~# udevadm info /sys/class/net/eth1 | grep ID_PATH
root@FT1-NodeB:~# udevadm info /sys/class/net/eth2 | grep ID_PATH

Almost as if eth1 and eth2 don't exist on NodeB. But...

Code:
root@FT1-NodeA:~# ls -la /sys/class/net/eth*
lrwxrwxrwx 1 root root 0 Aug 19 03:40 /sys/class/net/eth0 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Aug 19 03:40 /sys/class/net/eth1 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.1/net/eth1
lrwxrwxrwx 1 root root 0 Aug 19 03:40 /sys/class/net/eth2 -> ../../devices/pci0000:17/0000:17:02.0/0000:19:00.0/net/eth2

Code:
root@FT1-NodeB:~# ls -la /sys/class/net/eth*
lrwxrwxrwx 1 root root 0 Aug 20 13:35 /sys/class/net/eth0 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Aug 20 13:35 /sys/class/net/eth1 -> ../../devices/pci0000:17/0000:17:02.0/0000:19:00.0/net/eth1
lrwxrwxrwx 1 root root 0 Aug 20 13:35 /sys/class/net/eth2 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.1/net/eth2

eth1 and eth2 seem swapped on NodeB.

But I have renamed ports in the same way on both nodes.

Code:
FT1-NodeA:~# cat /etc/systemd/network/10-rename-enp24s0f1.link
[Match]
Path=pci-0000:18:00.1
[Link]
Name=eth1

FT1-NodeA:~# cat /etc/systemd/network/10-rename-enp25s0f0np0.link 
[Match]
Path=pci-0000:19:00.0
[Link]
Name=eth2

Code:
FT1-NodeB:~# cat /etc/systemd/network/10-rename-enp24s0f1.link 
[Match]
Path=pci-0000:18:00.1
[Link]
Name=eth1

FT1-NodeB:~# cat /etc/systemd/network/10-rename-enp25s0f0np0.link
[Match]
Path=pci-0000:19:00.0
[Link]
Name=eth2

So why does the system swap eth1 and eth2? The rename files clearly say otherwise. Any explanation?
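One way to check which .link file udev actually matched for a given interface is a dry run of the net_setup_link builtin (a sketch; run on the affected node):

```shell
# Show which .link config udev applies to eth1, without changing anything
udevadm test-builtin net_setup_link /sys/class/net/eth1 2>&1 | grep -i '\.link'
```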
 

I have now actually tested swapping the config files around, naming 0000:18:00.1 eth1 and 0000:19:00.0 eth2, but the result is unchanged.
Code:
ls -la /sys/class/net/eth*
lrwxrwxrwx 1 root root 0 Aug 20 15:07 /sys/class/net/eth0 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Aug 20 15:07 /sys/class/net/eth1 -> ../../devices/pci0000:17/0000:17:02.0/0000:19:00.0/net/eth1
lrwxrwxrwx 1 root root 0 Aug 20 15:07 /sys/class/net/eth2 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.1/net/eth2
 

I've been working through https://wiki.debian.org/NetworkInterfaceNames to try to find a solution.
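One approach from that page that sidesteps PCI-path ordering entirely is to match on the NIC's MAC address instead of the path. A sketch of such a .link file (the MAC below is a placeholder; note that MAC matching can misfire where bonds or bridges clone MAC addresses):

```
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=eth1
```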
 
This morning, a restart of a node that had not been restarted for quite some time produced the same symptoms as those reported here. It dawned on me that this swapping of ports might occur on the first boot of a newly installed kernel. On further investigation, here's what I found.

NodeA had been running kernel 5.15.35-1-pve; the latest kernel was installed, but the node was not rebooted.
NodeA is now running kernel 5.15.83-1-pve, and the physical ports of eth1 and eth2 are now swapped.

Since I did not install or run any of the kernel versions in between, I'm not able to tell at which point this bug was introduced.

This is a bug. Where do I report this please?
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!