[BUG] Network is only working selectively, can't see why

lifeboy

Renowned Member
I have four nodes with LAN addresses 192.168.131.1, .2, .3 and .4. They are all configured identically except for the last octet of the IP address: NodeA has .1, NodeB has .2, and so on.

NodeB for example:

[screenshot: NodeB's network configuration]

vmbr0 is the LAN, and it is also the LAN interface of a pfSense guest. vmbr1 is the internet gateway. This allows the pfSense guest to run on any node and still provide both internet access and access to whatever else pfSense exposes. I also reach the pfSense guest via OpenVPN for remote access.

Since Sunday night, however, seemingly out of the blue, NodeB is not able to reach pfSense anymore, and neither can pfSense reach NodeB. Yet nothing of the sort is wrong on NodeA, C or D. I can also reach NodeB from NodeA, C and D.

pfSense can reach NodeA, C and D and all the guests running on them, but nothing running on NodeB (which is sort of obvious, since it can't reach NodeB's vmbr0 bridge). However, pfSense only "knows" about the LAN bridge to the 192.168.131.0/24 network via vmbr0. So with no specific knowledge of NodeB, how is it possible that NodeB is offline while A, C and D are fine?

I have a second pfSense guest (a CARP failover), which displays exactly the same behaviour. The two have LAN addresses 192.168.131.252 (primary) and 192.168.131.253 (backup). They can ping each other on those addresses, but neither can ping 192.168.131.2 (NodeB's LAN bridge). It is almost as if there's an IP address blocker somewhere. If only one firewall showed this behaviour I would have suspected that something went wrong in pfSense, but both do the same thing. This leads me to conclude that the problem is most likely on NodeB.

Is there anything in Proxmox that could be blocking traffic? No, the Proxmox firewall is not in use.

Any suggestions on how I can find the cause of this?

PS. I also paged through the syslog. One moment all seems fine, and the next the Proxmox Backup Server cannot be reached (timeout). There is no other indication of a problem.
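A few generic checks that can help narrow down where traffic dies on a node like this (a sketch; vmbr0 is the LAN bridge from my config, adjust names to your own setup):

```shell
# Confirm which physical ports are actually enslaved to the bridge
bridge link show
ip -br link show

# Confirm the Proxmox firewall really is inactive on this node
pve-firewall status
```

If the wrong physical port is enslaved to vmbr0, the bridge can look healthy while being cabled to the wrong network segment.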
 
I would suspect a corrupted ARP table. If there's a next time, try clearing the ARP table first; arp -d -a from the pfSense shell should do it.
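For reference, pfSense is FreeBSD-based, so clearing the table differs slightly between the firewall and the Proxmox node (a sketch; both need root):

```shell
# pfSense (FreeBSD) shell: delete all ARP entries
arp -d -a

# Linux equivalent on the Proxmox node: flush the neighbour cache
ip neigh flush all
```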
 
Unmarked this thread as 'solved', since I was never able to figure out why this happened in the first instance...

Had a DC power test failure two days ago, and now the problem with NodeB is suddenly back.

NodeB can communicate on the "LAN" via vmbr0 with other hosts on the 192.168.131.0/24 network, and it can reach guests on NodeB itself, but it cannot reach guests on the other nodes. It cannot reach either of the pfSense instances, and pfSense cannot ping NodeB.

The ARP table on pfSense1 doesn't have an entry for 192.168.131.2, but it does for the other nodes. pfSense2 (the CARP standby) doesn't have ARP entries for any of the nodes at this stage. So it doesn't seem to be an ARP cache issue (I have cleared the cache).
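To see whether the problem is at the ARP level at all, one can watch for pfSense's ARP requests arriving on NodeB's bridge (a sketch using the addresses above):

```shell
# On NodeB: capture ARP traffic involving pfSense1 on the LAN bridge
tcpdump -eni vmbr0 arp and host 192.168.131.252

# Meanwhile, from the pfSense shell, force an ARP lookup:
#   ping -c 3 192.168.131.2
# If no who-has requests show up in the capture, the frames are
# being lost before they ever reach NodeB.
```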

I have inspected syslog and compared configs, but cannot figure out what is causing this.
 

Here's a strange discovery.

Code:
root@FT1-NodeA:~# udevadm info /sys/class/net/eth0 | grep ID_PATH
E: ID_PATH=pci-0000:18:00.0
E: ID_PATH_TAG=pci-0000_18_00_0
root@FT1-NodeA:~# udevadm info /sys/class/net/eth1 | grep ID_PATH
E: ID_PATH=pci-0000:18:00.1
E: ID_PATH_TAG=pci-0000_18_00_1
root@FT1-NodeA:~# udevadm info /sys/class/net/eth2 | grep ID_PATH
E: ID_PATH=pci-0000:19:00.0
E: ID_PATH_TAG=pci-0000_19_00_0

Code:
root@FT1-NodeB:~# udevadm info /sys/class/net/eth0 | grep ID_PATH
E: ID_PATH=pci-0000:18:00.0
E: ID_PATH_TAG=pci-0000_18_00_0
root@FT1-NodeB:~# udevadm info /sys/class/net/eth1 | grep ID_PATH
root@FT1-NodeB:~# udevadm info /sys/class/net/eth2 | grep ID_PATH

Almost as if eth1 and eth2 don't exist on NodeB. But...

Code:
root@FT1-NodeA:~# ls -la /sys/class/net/eth*
lrwxrwxrwx 1 root root 0 Aug 19 03:40 /sys/class/net/eth0 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Aug 19 03:40 /sys/class/net/eth1 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.1/net/eth1
lrwxrwxrwx 1 root root 0 Aug 19 03:40 /sys/class/net/eth2 -> ../../devices/pci0000:17/0000:17:02.0/0000:19:00.0/net/eth2

Code:
root@FT1-NodeB:~# ls -la /sys/class/net/eth*
lrwxrwxrwx 1 root root 0 Aug 20 13:35 /sys/class/net/eth0 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Aug 20 13:35 /sys/class/net/eth1 -> ../../devices/pci0000:17/0000:17:02.0/0000:19:00.0/net/eth1
lrwxrwxrwx 1 root root 0 Aug 20 13:35 /sys/class/net/eth2 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.1/net/eth2

eth1 and eth2 seem swapped on NodeB.

But I have renamed ports in the same way on both nodes.

Code:
FT1-NodeA:~# cat /etc/systemd/network/10-rename-enp24s0f1.link
[Match]
Path=pci-0000:18:00.1
[Link]
Name=eth1

FT1-NodeA:~# cat /etc/systemd/network/10-rename-enp25s0f0np0.link 
[Match]
Path=pci-0000:19:00.0
[Link]
Name=eth2

Code:
FT1-NodeB:~# cat /etc/systemd/network/10-rename-enp24s0f1.link 
[Match]
Path=pci-0000:18:00.1
[Link]
Name=eth1

FT1-NodeB:~# cat /etc/systemd/network/10-rename-enp25s0f0np0.link
[Match]
Path=pci-0000:19:00.0
[Link]
Name=eth2

So why does the system swap eth1 and eth2? The rename files clearly say otherwise. Any explanation?
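One way to check which .link file udev actually matched for a given interface is a dry run of the net_setup_link builtin (a sketch; run on the affected node):

```shell
# Show which .link config udev applies to eth1, without changing anything
udevadm test-builtin net_setup_link /sys/class/net/eth1 2>&1 | grep -i '\.link'
```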
 

I have now actually tested swapping the config files around, naming 0000:18:00.1 eth1 and 0000:19:00.0 eth2, but the result is unchanged.
Code:
ls -la /sys/class/net/eth*
lrwxrwxrwx 1 root root 0 Aug 20 15:07 /sys/class/net/eth0 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/net/eth0
lrwxrwxrwx 1 root root 0 Aug 20 15:07 /sys/class/net/eth1 -> ../../devices/pci0000:17/0000:17:02.0/0000:19:00.0/net/eth1
lrwxrwxrwx 1 root root 0 Aug 20 15:07 /sys/class/net/eth2 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.1/net/eth2
 

I've been working through https://wiki.debian.org/NetworkInterfaceNames to try to find a solution.
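One approach from that page that sidesteps PCI-path ordering entirely is to match on the NIC's MAC address instead of the path. A sketch of such a .link file (the MAC below is a placeholder; note that MAC matching can misfire where bonds or bridges clone MAC addresses):

```
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=eth1
```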
 
This morning, a restart of a node that had not been restarted for quite some time produced the same symptoms as those reported here. It dawned on me that this swapping of ports might occur on the first boot of a newly installed kernel. On further investigation, here's what I found.

NodeA had been running kernel 5.15.35-1-pve; the latest kernel was installed, but the node was not rebooted.
NodeA is now running kernel 5.15.83-1-pve, and the physical ports of eth1 and eth2 are now swapped.

Since I did not install or run any of the kernel versions in between, I'm not able to tell at which point this bug was introduced.

This is a bug. Where do I report this please?
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!