We ran into a very nasty issue a few days ago.
Background:
Systemd generates ridiculously long interface names (see https://manpages.debian.org/bookworm/udev/systemd.link.5.en.html, also referenced at https://wiki.debian.org/NetworkInterfaceNames#CUSTOM_SCHEMES_USING_.LINK_FILES), like enp25s0f1np1. Combined with a VLAN that becomes enp25s0f1np1.100, which at 16 characters exceeds the 15-character limit on Linux interface names and generates complaints.
So we opted to rename the interfaces to what they were called before (there are quite a few references for doing this), and in our eagerness we missed that they should not be called eth0, eth1, etc., since the kernel itself hands out those names and renaming into that namespace can conflict with it.
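For reference, the renaming itself was done with systemd .link files along these lines, matching each card by MAC address (the file name, MAC address and final lan1 name below are placeholders, not our actual values):
Code:
# /etc/systemd/network/10-lan1.link
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=lan1
If I remember correctly, the Debian wiki page above also notes that the initramfs has to be refreshed (update-initramfs -u) for the rename to apply at early boot.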
When I first ran into this, the issue I raised was not clear enough about that point, so it was never resolved. We simply checked the renamed interfaces' MAC addresses and used the correct one for each service (Corosync, Ceph, applications) regardless of the "wrong" name.
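Checking that mapping is nothing fancier than listing interface names against their MAC addresses, e.g.:
Code:
# brief listing of interface name, state and MAC address
ip -br link show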
We also don't restart nodes willy-nilly, so everything was stable.
The problem:
Recently we had to restart a node, so we migrated all VMs and containers away from that node and restarted it.
It didn't come back up.
While investigating, one of the other nodes crashed (not sure exactly why, but it seems there was too little RAM left, probably because Ceph needed a lot more RAM to deal with the node that was down).
(Note to self: in future, tell the cluster not to rebalance before taking down a node.)
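For Ceph that presumably comes down to something like the following before the maintenance window, and the reverse afterwards (just a sketch, we have not re-tested this procedure yet):
Code:
# stop Ceph from marking OSDs out and rebalancing while the node is down
ceph osd set noout
# ...reboot the node, wait for it to rejoin...
ceph osd unset noout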
Either way, this put the remaining two nodes under even more stress, and eventually only one node was still functional. All VMs and LXCs were down because Ceph was not in quorum.
After inspecting the nodes via their BMC consoles, it was clear that they were actually booting up, but networking was not functioning: Ceph was not getting quorum, the nodes couldn't even ping each other, etc.
Needless to say, after many hours we finally traced the problem to the renamed NICs and changed the names to lan1, lan2, etc.
This allowed the cluster to start again.
Because the network device names had changed, we had to update /etc/network/interfaces to the new names. For example, here is nodeA:
Code:
auto lo
iface lo inet loopback

auto lan0
iface lan0 inet manual
#internet - 1Gb/s max 10Gb/s

auto lan1
iface lan1 inet static
        address 172.16.10.1/24
#corosync - 1Gb/s max 10Gb/s

auto lan1:1
iface lan1:1 inet static
        address 172.16.5.201/24
#ILOM path

auto lan3
iface lan3 inet static
        address 10.10.10.1/24
#ceph - 25Gb/s

auto lan2
iface lan2 inet manual
#LAN - 25Gb/s

auto lan2.25
iface lan2.25 inet manual
#client 1

auto lan2.35
iface lan2.35 inet manual
#client 2

auto vmbr0
iface vmbr0 inet static
        address 192.168.131.1/24
        gateway 192.168.131.254
        bridge-ports lan2
        bridge-stp off
        bridge-fd 0

auto vmbr1
iface vmbr1 inet manual
        bridge-ports lan0
        bridge-stp off
        bridge-fd 0

auto vmbr2
iface vmbr2 inet manual
        bridge-ports lan2.25
        bridge-stp off
        bridge-fd 0
# client 1 VLAN

auto vmbr4
iface vmbr4 inet static
        address 192.168.151.1/24
        bridge-ports lan2.35
        bridge-stp off
        bridge-fd 0
# client 2 VLAN

source /etc/network/interfaces.d/*
We then needed to touch the network config of each VM and LXC before traffic would flow to/from it, which I assume is because the bridge assignment is stored in the guest config and not picked up dynamically.
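In practice that meant re-applying the NIC definition on each guest; the CLI equivalent is roughly the following (the VM/CT IDs, MAC address and bridge/tag values are placeholders, not our real settings):
Code:
# re-apply a VM's NIC so it reattaches to the (renamed) bridge
qm set 101 -net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,tag=12
# same idea for a container
pct set 201 -net0 name=eth0,bridge=vmbr0,ip=dhcp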
Everything is working fine now, except the SDNs. The SDN VLANs cannot communicate over the actual network ports/cables; they only work on the same node, between the machines there and the virtual pfSense firewalls we are using.
So, for example, I have VLAN12, which is one I use for testing. The IP range for that is 192.168.161.0/24, with .253 as the gateway address on pfSense1A and .252 on pfSense1B, plus a virtual address .254 managed by CARP between the two.
The traffic is allowed by the switch (in fact the switch was untouched in this whole debacle), and all traffic is allowed between the two firewalls themselves, but on the VLANs defined by SDN nothing is getting through.
For clarity: I have created an interface on each firewall for each VLAN.
So in pfSense I named each of these appropriately, assigned addresses and created rules appropriate for each VLAN. None of this has changed, and it was all working perfectly before the restart and the resulting renaming of the underlying NICs.
Of the pfSense VM's network devices in the list above (net0 through net8), net0, net1, net2 and net4 are all working as expected. Note: net2 and net4 are attached to bridges with manual VLANs defined in /etc/network/interfaces (as shown above), a setup we used before SDN became available.
net3, 5, 6, 7 and 8 are not relaying any traffic.
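To make "not relaying" concrete, this is the kind of check that shows whether tagged frames for an affected VLAN ever reach the physical NIC, and which VLANs the bridge ports actually carry (the interface name and tag 12 are examples from our setup):
Code:
# watch for VLAN 12 tagged frames on the underlying NIC
tcpdump -e -n -i lan2 vlan 12
# show the VLAN membership of each bridge port
bridge vlan show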
As a test, I deleted one of the VLANs from both the SDN and pfSense and recreated it, with no difference in behaviour. I also touched each of these interfaces on the pfSense VMs, like I did with the application VMs and LXCs, but that didn't fix it either.
Question:
What do I need to do to make the SDNs work again? Do I have to remove them all and recreate them with different names? What could be the underlying cause of this behaviour?