Nodes lose network connectivity when I reboot a switch and do not get it back after the switch reboot is complete

xelar · New Member · Nov 26, 2023
I am relatively inexperienced with Proxmox and Linux so please bear with me for not using proper terms or possibly asking a very basic question...

I have 3 Proxmox 8.1.3 nodes configured in a cluster. All 3 nodes are Lenovo Tiny PCs (2 x m920q and 1 x P360) with a 1GbE OEM port and a Dell-branded Mellanox CX322A dual-port 10GbE SFP+ NIC installed. Everything works well until I reboot my UniFi switches: when I do, I lose connectivity to my headless nodes, which requires me to soft-reboot them (short press of their power button). Connectivity of other 1GbE and 10GbE clients is restored automatically once the switch reboot completes, so it is just the Proxmox nodes having the issue.

The 1GbE port of all 3 nodes is connected to a 24-port "core" switch connected to the UDMP router, and the 10GbE ports are connected via DAC (the 2 nodes in the same rack) or fiber (the 1 node in another room) to UniFi 8-port SFP+ aggregation switches 1 or 2 levels downstream from the core switch. In all 3 cases the management of the Proxmox node is done via the OEM 1GbE port. When I reboot all switches involved, the 1GbE port stays down, while it appears that the 10GbE ports come back up as of my latest test (but I am not 100% certain this is so in all cases, as I think it happened to the 10GbE NIC as well in the recent past). In this latest test, 1 node survived the switch reboot, while the other 2 did not.

My research on "Autostart" led me to posts that say it should be on for the bridges and off for the interfaces, so that is what I have on all 3 nodes. As for "Active", I just noticed that I have a mix... on all 3 nodes, the 1GbE port and one 10GbE port are used, so the second 10GbE port should likely be set to No... anyhow, it doesn't seem to be causing issues either way. Surprisingly, both 10GbE ports on the Maximus node are not set to active, yet the one that is connected works fine regardless.

Commodus survived; here is its config:
[screenshot of Commodus's network interface configuration]

Maximus did not survive:
[screenshot of Maximus's network interface configuration]

Spartacus did not survive:
[screenshot of Spartacus's network interface configuration]

Is there a way to get the node to attempt to bring the network connection back up when the switch is back online?
 
So in short: the 1GbE NIC stays offline? Not always, but sometimes? Is this the problem? If these systems are all the same, you should not have different port names given by udev. It might be a problem that the NICs get a new name after a reboot. If you have a running system that is no longer working, you should check via the CLI with ip a whether the onboard NIC got a new name.

If so, you can change the udev settings:

Code:
cp /usr/lib/systemd/network/99-default.link /etc/systemd/network/99-default.link
sed -i 's/NamePolicy=keep kernel database onboard slot path/NamePolicy=path/' /etc/systemd/network/99-default.link
update-initramfs -u

This changes how udev names the NICs and fixes them so they don't get new names.
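As a quick check afterwards (a minimal sketch, assuming the override file path from above), you can confirm the override is in place and watch the interface names across reboots:

Code:
# confirm the NamePolicy override was applied
grep NamePolicy /etc/systemd/network/99-default.link
# list the current interface names and their link state in brief form
ip -br link show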

Edit: you also don't need to use VM bridges to put an IP on this. VM bridges should only be used to connect to resources (VMs/containers). So for the management of the Proxmox VE UI you can put the IP directly on eno1/eno2, after you have made sure that these names are persistent.
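For illustration, a minimal /etc/network/interfaces sketch with the management IP directly on the NIC rather than on a bridge (the addresses here are placeholders, and the NIC must first be removed from the bridge-ports of the bridge):

Code:
auto eno1
iface eno1 inet static
        address 192.0.2.42/24    # placeholder management IP
        gateway 192.0.2.1        # placeholder gateway

After editing, ifreload -a (or a reboot) applies the change.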
 
@jsterr - The 3 nodes are nearly the same. The node where the OEM NIC is called eno1 instead of eno2 does not have wifi, while the other two do. I am guessing that explains why 2 nodes use eno2 for their OEM NIC and the other eno1.

This morning I had to reboot the switch that two of these nodes are connected to and they went offline. To get them back up, I gracefully powered the two nodes down (short press of power button) and rebooted them. This behavior is fairly consistent but I don't know whether it happens 100% of the time or not (anyhow my guess is it does).

Since I do not have this issue with any other device connected to any of my 10+ switches, and it consistently happens with these 3 Lenovo TinyPC (m920q & p360) running Proxmox, I must assume it is either a Lenovo OEM NIC issue, or a configuration issue in Proxmox. My guess is that I misconfigured something in Proxmox.

How do I check whether Proxmox is configured to attempt bringing up its ethernet connection once it goes down?

One of your suggestions above mentioned that the IP config should be on eno1/2 and not on the bridge. When I try to reconfigure it, I get the error message below. I'll do more tests on another node that has a monitor attached so that I can hopefully recover from the CLI if I mess things up.

Edit: According to this post I cannot set the IP on the interface if bridged. I thought we were onto something :(

[screenshot of the error message]

EDIT: Just a clarification after some testing... I rebooted the switch that eno1 on Spartacus is connected to and I lost access to the Proxmox UI (via another node in the cluster), as expected. However, I can still access a VM (OpenSpeedTest) that is connected through a plug-in 10GbE NIC served by another switch.

`ip link show` shows eno1 and vmbr0 as UP; however, the node shows as offline in the cluster and does not respond to pings either.
 
That's because you first need to remove it from the bridge and then set it on eno2 afterwards. Please run "ip a" after you lose access to the node and post the output here.
 
@jsterr - I removed the settings from vmbr0 but did not commit them, as I figured I'd lose connectivity. I then tried to add those IP settings to eno1 and got the error. Since the node I am doing this on has a monitor and keyboard, I will try to commit, and if I lose the connection I can make the edits via the CLI (once I figure out how - I've done it before but can't recall 100% how).

This is the output of "ip a" after rebooting the switch.

Code:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast master vmbr0 state UP group default qlen 1000
    link/ether f8:75:a4:cd:ce:81 brd ff:ff:ff:ff:ff:ff
    altname enp0s31f6
3: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master vmbr1 state UP group default qlen 1000
    link/ether 7c:fe:90:9f:93:60 brd ff:ff:ff:ff:ff:ff
4: enp1s0d1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 7c:fe:90:9f:93:61 brd ff:ff:ff:ff:ff:ff
5: wlp4s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether ac:74:b1:35:d4:c3 brd ff:ff:ff:ff:ff:ff
6: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether f8:75:a4:cd:ce:81 brd ff:ff:ff:ff:ff:ff
    inet 10.1.0.42/23 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::fa75:a4ff:fecd:ce81/64 scope link
       valid_lft forever preferred_lft forever
7: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 7c:fe:90:9f:93:60 brd ff:ff:ff:ff:ff:ff
    inet6 fdd2:2bca:fe8a:f43d:7efe:90ff:fe9f:9360/64 scope global dynamic mngtmpaddr
       valid_lft 1787sec preferred_lft 1787sec
    inet6 fe80::7efe:90ff:fe9f:9360/64 scope link
       valid_lft forever preferred_lft forever
8: tap100i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast master fwbr100i0 state UNKNOWN group default qlen 1000
    link/ether 12:12:74:43:8f:c9 brd ff:ff:ff:ff:ff:ff
9: fwbr100i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether ae:dd:91:c4:3a:7e brd ff:ff:ff:ff:ff:ff
10: fwpr100p0@fwln100i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue master vmbr1 state UP group default qlen 1000
    link/ether 0e:af:d8:f6:88:fe brd ff:ff:ff:ff:ff:ff
11: fwln100i0@fwpr100p0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue master fwbr100i0 state UP group default qlen 1000
    link/ether ae:dd:91:c4:3a:7e brd ff:ff:ff:ff:ff:ff
12: veth150i0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue master fwbr150i0 state UP group default qlen 1000
    link/ether fe:8c:a8:b6:dd:d9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
13: fwbr150i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 7a:b4:c2:22:f6:21 brd ff:ff:ff:ff:ff:ff
14: fwpr150p0@fwln150i0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue master vmbr1 state UP group default qlen 1000
    link/ether be:cb:9d:35:d4:16 brd ff:ff:ff:ff:ff:ff
15: fwln150i0@fwpr150p0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue master fwbr150i0 state UP group default qlen 1000
    link/ether 7a:b4:c2:22:f6:21 brd ff:ff:ff:ff:ff:ff
16: tap199i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast master vmbr1 state UNKNOWN group default qlen 1000
    link/ether 62:08:c4:0a:01:ee brd ff:ff:ff:ff:ff:ff
 
@jsterr - I moved the IP config to eno1 and rebooted the switch. The same issue occurred... I had to reboot the computer to restore the connection. It just seems like Proxmox doesn't try to restore the link once it goes down.
 
Did you wait at least 60 seconds after the switch reboot completed, in case spanning tree or something is blocking the port initially after the reboot?

Instead of a reboot, can you run ifreload -a from the server to restore connectivity?
 
@jlauro Other devices regain network connectivity on the same switch quite quickly after a reboot. I watched the port while the switch was rebooting and it never showed "STP Blocked" - it went directly to a 1Gbit link as shown below:

[screenshot of the switch port status showing an established 1Gbit link]

I then power cycled the switch, and when it came back the UI showed exactly the same as above (so port 2 was up even though Proxmox had no connectivity at this point). I tried executing `ifreload -a` via the CLI and nothing happened, so I then tried `ip a` and eno1 showed as UP and with the correct IP. Since I have a 3-node cluster and am logged into another node, I tried logging into the node having issues, but the client was not reachable.

Then, I unplugged the ethernet cable from the client and plugged it back in, and now the port shows as "STP Blocked":
[screenshot of the switch port status showing "STP Blocked"]

I executed `ifreload -a` and `ip a` again, but there was no change, except that "STP Blocked" disappeared after a bit... I am not sure whether that was related to the two commands or just the passage of time; anyhow, Proxmox is still dead...

[screenshot of the switch port status]

I powered down the client and now the port shows active at its lower speed and "STP Blocked".

[screenshot of the switch port showing a lower-speed link and "STP Blocked"]

I used WOL from the Proxmox UI to turn the client back on, and everything works again... but the problem is still present on all 3 clients:

[screenshot]
 
I'll need to test this on my test clusters. I am configured for bonding across redundant switches, so I should be fine with a single switch power cycle. However, the more common case is firmware upgrades with rolling cycles; if both switches go down, it will not handle the second switch power cycling...

One item I had to add to my bridge interfaces is bridge-ageing 0.
Without that, DHCP wouldn't work for VMs on that interface, but it was fine after that. The symptoms aren't exactly the same, but they are similar in that some traffic was being blocked despite everything being up, and I can't think of anything else to try...
i.e.:

Code:
auto vmbr1002
iface vmbr1002 inet manual
        bridge-ports bond10.1002
        bridge-stp off
        bridge-fd 0
        bridge-ageing 0
#net10.0.8-10.0.15

(You can't add the option via the GUI, but the GUI should preserve the change if you modify /etc/network/interfaces directly.)
A reboot or ifreload -a should make the change take effect.
 
@jlauro I'll look into those options to see which ones I should try.

I just tried to execute "ifdown eno1" and then "ifup eno1", which restored connectivity. Is it possible that eno1 gets "stuck" in some limbo where it shows as up but no data flows, and simply bringing it down and back up fixes it? I had not tried this before simply because I did not know the commands :(
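(A minimal sketch of how that workaround could be automated from cron while the root cause is unknown; the interface name and gateway address below are placeholders, not taken from this setup.)

Code:
#!/bin/bash
# bounce the management NIC if the default gateway stops answering (workaround sketch)
IFACE="eno1"          # placeholder: management interface
GATEWAY="192.0.2.1"   # placeholder: default gateway address

if ! ping -c 3 -W 2 "$GATEWAY" > /dev/null 2>&1; then
    ifdown "$IFACE" && ifup "$IFACE"
fi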
 
I have seen switches get confused on power cycles - more often on fiber than twisted pair, and also more often switch-to-switch than switch-to-device. Sometimes forcing speed and duplex on both devices instead of leaving them at auto helps. If you set one end to a specific speed/duplex instead of auto, you almost always have to set the other end of the connection too.
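For reference, a sketch of how forcing speed/duplex looks on the Linux side with ethtool (the interface name is an assumption, and the matching setting must also be applied on the switch port; note that 1000BASE-T normally still requires autonegotiation, so hard-forcing is mostly applicable at 10/100 Mb/s):

Code:
# pin the port to 100 Mb/s full duplex with autonegotiation off (diagnostic sketch)
ethtool -s eno1 speed 100 duplex full autoneg off
# check the resulting link state and negotiated settings
ethtool eno1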
 
@jlauro When I lose connectivity to the Proxmox UI due to a switch reboot (or simply unplugging and replugging the ethernet cable), the switch shows an active GbE link. Below is the `dmesg` output from when it happened and from when I ran ifdown and ifup to restore access. The switch has jumbo frames on and flow control on as well. The last log entry reflects that, but in the batch of previous log entries, where the link initially went down, it appears it settled on full duplex but with flow control off. Could this have anything to do with the issue? I have yet to test static settings... that is my next step.

Code:
[   28.999124] eth0: renamed from vethbf592e6
[   29.039228] docker0: port 4(veth4518170) entered blocking state
[   29.039233] docker0: port 4(veth4518170) entered forwarding state
[   37.839039] usb 2-4: reset SuperSpeed USB device number 2 using xhci_hcd
[   38.098961] usb 1-7: reset full-speed USB device number 3 using xhci_hcd
[  271.045125] e1000e 0000:00:1f.6 eno1: NIC Link is Down
[  275.239669] e1000e 0000:00:1f.6 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[  440.528467] e1000e 0000:00:1f.6 eno1: NIC Link is Down
[  449.209768] e1000e 0000:00:1f.6 eno1: NIC Link is Up 1000 Mbps Half Duplex, Flow Control: Rx/Tx
[  449.213687] e1000e 0000:00:1f.6 eno1: NIC Link is Down
[  452.896263] e1000e 0000:00:1f.6 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[  619.885401] CIFS: VFS: \\10.1.0.45 has not responded in 180 seconds. Reconnecting...
[ 1322.864941] e1000e 0000:00:1f.6 eno1: NIC Link is Down
[ 1322.868116] e1000e 0000:00:1f.6: Interrupt Throttle Rate on
[ 1344.091568] e1000e 0000:00:1f.6: Interrupt Throttle Rate off
[ 1344.246665] e1000e 0000:00:1f.6: Some CPU C-states have been disabled in order to enable jumbo frames
[ 1347.785957] e1000e 0000:00:1f.6 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Edit: I disabled flow control globally (all switches) and replicated the issue. It did the dance as above and settled correctly on:

Code:
NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

but the UI is still unreachable.
 
What does "corosync-cfgtool -n" look like when it's operating normally, and how about after you reboot a switch and it's messed up?

The cluster I am testing on has:
root@ti-otn-proxmox-1:/etc# corosync-cfgtool -n
Local node ID 1, transport knet
nodeid: 2 reachable
LINK: 0 udp (10.9.1.21->10.9.1.22) enabled connected mtu: 8885
LINK: 1 udp (10.9.2.21->10.9.2.22) enabled connected mtu: 8885
LINK: 2 udp (10.9.3.21->10.9.3.22) enabled connected mtu: 8885
LINK: 3 udp (10.9.4.21->10.9.4.22) enabled connected mtu: 8885
LINK: 4 udp (10.9.5.21->10.9.5.22) enabled connected mtu: 8885

nodeid: 3 reachable
LINK: 0 udp (10.9.1.21->10.9.1.23) enabled connected mtu: 8885
LINK: 1 udp (10.9.2.21->10.9.2.23) enabled connected mtu: 8885
LINK: 2 udp (10.9.3.21->10.9.3.23) enabled connected mtu: 8885
LINK: 3 udp (10.9.4.21->10.9.4.23) enabled connected mtu: 8885
LINK: 4 udp (10.9.5.21->10.9.5.23) enabled connected mtu: 8885
 
Uhm, why so many bridges on different interfaces? Where is your corosync interface?

You should have at the very least one interface (probably eno) - active, not assigned to a bridge - as a corosync interface.

As for UniFi: CAREFUL, this symbol is NOT STP blocking, the UI is misleading. UniFi always shows it during negotiation and it may or may not stay visible longer. It is not meaningful and doesn't really tell you anything.

To properly troubleshoot from the switch side, log in via the CLI and get the port status there. The UI lags behind for way too long and is a bit misleading.
 
Uhm, why so many bridges on different interfaces? Where is your corosync interface?
I'm not having any issues (at least not the ones xelar is), but I haven't actually tested rebooting switches yet. That's an important case, but it is rare except during firmware upgrades, and those are also rare... My corosync links are not on bridges; they are on VLANs on top of bonded interfaces. Different subnets and VLANs are on bridges. Not exactly best practice, but as to why: I am physically limited on this equipment to 4 x 10Gb NICs on the blade and can't dedicate interfaces to corosync traffic. (Technically I could do 2 x and restrict VMs and iSCSI to 2 x, but this setup lets traffic from one zone borrow bandwidth from others and can double usable bandwidth.) The links are set up for iSCSI, and I could have done up to 8 instead of the 5 already set up, but that seemed to be overkill. Doing XOR balancing on the bond and having a prime number of interfaces helps ensure a good spread so that something will stay up during a switch issue, as switches tend to have a window where they claim the links are up but they are not. I am also doing round robin on the corosync links instead of passive, just to keep them balanced so there is less likely to be a hot spot on one of the physical NICs.
 
@jlauro The output you requested is:

Code:
root@spartacus:~# corosync-cfgtool -n
Local node ID 2, transport knet
nodeid: 1 reachable
   LINK: 0 udp (10.1.0.42->10.1.0.40) enabled connected mtu: 8885

nodeid: 3 reachable
   LINK: 0 udp (10.1.0.42->10.1.0.41) enabled connected mtu: 8885

Last night I had a power issue where a breaker kept tripping, and my UPS was depleted by the time I noticed the issue. Therefore, before I corrected the breaker issue all my equipment was power cycled a few times. This caused a lot of connectivity issues with the 2 Proxmox nodes that are identically configured (the one I am testing with is in another room). I did not have time to do proper, orderly testing; however, my 10GbE links were also having the same issue. To recover, I had to power down everything (switches and nodes) and then power up the switches and then the nodes, otherwise the 10GbE (SFP+ DAC cable) link was not available to the VMs. I say it this way as the only thing I tested was whether the VMs were reachable, and they were not.

In other words, I could restate my issue as: any network interface (onboard or add-in NIC) that loses its connection will not re-establish it. In my testing on 'spartacus' (the node with keyboard and monitor in my office) I was rebooting only the switch that serves the OEM GbE port, and not the other switch (USW-Aggregation) that serves the 10GbE NIC. The two other nodes that caused me so much grief last night are connected in the same way but to different switches (UniFi USW Pro 24 PoE and USW-Aggregation). I am starting to wonder whether the issue lies with the switches, but I can't explain why the only nodes that don't restore connectivity are these 3 Proxmox nodes.

Edit: I just unplugged the fiber from spartacus, and when I reconnected it the link came back up, so I am baffled. That did not seem to be the case with the other 2 nodes using DACs; however, testing that is a bit more disruptive as it brings down everything. I'll keep testing...

Edit2: I tested a few times and the 10GbE link comes back up after I plug the cable back in. I rebooted the USW-Aggregation serving the 10GbE port and that came back up too. The last 2 tests are shown below:

Code:
Feb 21 16:04:42 spartacus kernel: mlx4_en: enp1s0d1: Link Up
Feb 21 16:04:44 spartacus kernel: mlx4_en: enp1s0d1: Link Down
Feb 21 16:04:44 spartacus kernel: vmbr1: port 1(enp1s0d1) entered disabled state
Feb 21 16:04:54 spartacus kernel: mlx4_en: enp1s0d1: Link Up
Feb 21 16:04:54 spartacus kernel: vmbr1: port 1(enp1s0d1) entered blocking state
Feb 21 16:04:54 spartacus kernel: vmbr1: port 1(enp1s0d1) entered forwarding state
Feb 21 16:07:04 spartacus kernel: mlx4_en: enp1s0d1: Link Down
Feb 21 16:07:04 spartacus kernel: vmbr1: port 1(enp1s0d1) entered disabled state
Feb 21 16:07:12 spartacus kernel: mlx4_en: enp1s0d1: Link Up
Feb 21 16:07:12 spartacus kernel: vmbr1: port 1(enp1s0d1) entered blocking state
Feb 21 16:07:12 spartacus kernel: vmbr1: port 1(enp1s0d1) entered forwarding state
Feb 21 16:07:20 spartacus kernel: mlx4_en: enp1s0d1: Link Down
Feb 21 16:07:20 spartacus kernel: vmbr1: port 1(enp1s0d1) entered disabled state
Feb 21 16:07:46 spartacus kernel: mlx4_en: enp1s0d1: Link Up
Feb 21 16:07:46 spartacus kernel: vmbr1: port 1(enp1s0d1) entered blocking state
Feb 21 16:07:46 spartacus kernel: vmbr1: port 1(enp1s0d1) entered forwarding state
Feb 21 16:08:33 spartacus pmxcfs[1025]: [status] notice: received log
 
I captured the syslog from the moment I unplugged the ethernet cable from the GbE OEM port to when I ran ifdown/ifup to bring the port back up. I am hoping the logs reveal the issue: syslog (Too long to paste in here)

Edit: 2nd attempt to catch a shorter version of the syslog >>

Code:
Feb 21 16:43:27 spartacus kernel: e1000e 0000:00:1f.6 eno1: NIC Link is Down
Feb 21 16:43:28 spartacus corosync[1123]:   [KNET  ] link: host: 3 link: 0 is down
Feb 21 16:43:28 spartacus corosync[1123]:   [KNET  ] link: host: 1 link: 0 is down
Feb 21 16:43:28 spartacus corosync[1123]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 21 16:43:28 spartacus corosync[1123]:   [KNET  ] host: host: 3 has no active links
Feb 21 16:43:28 spartacus corosync[1123]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 21 16:43:28 spartacus corosync[1123]:   [KNET  ] host: host: 1 has no active links
Feb 21 16:43:29 spartacus corosync[1123]:   [TOTEM ] Token has not been received in 2737 ms
Feb 21 16:43:30 spartacus corosync[1123]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Feb 21 16:43:34 spartacus corosync[1123]:   [QUORUM] Sync members[1]: 2
Feb 21 16:43:34 spartacus corosync[1123]:   [QUORUM] Sync left[2]: 1 3
Feb 21 16:43:34 spartacus corosync[1123]:   [TOTEM ] A new membership (2.7733) was formed. Members left: 1 3
Feb 21 16:43:34 spartacus corosync[1123]:   [TOTEM ] Failed to receive the leave message. failed: 1 3
Feb 21 16:43:34 spartacus pmxcfs[1025]: [dcdb] notice: members: 2/1025
Feb 21 16:43:34 spartacus pmxcfs[1025]: [status] notice: members: 2/1025
Feb 21 16:43:34 spartacus corosync[1123]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 21 16:43:34 spartacus corosync[1123]:   [QUORUM] Members[1]: 2
Feb 21 16:43:34 spartacus pmxcfs[1025]: [status] notice: node lost quorum
Feb 21 16:43:34 spartacus corosync[1123]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 21 16:43:34 spartacus pmxcfs[1025]: [dcdb] crit: received write while not quorate - trigger resync
Feb 21 16:43:34 spartacus pmxcfs[1025]: [dcdb] crit: leaving CPG group
Feb 21 16:43:34 spartacus pve-ha-lrm[1192]: unable to write lrm status file - unable to open file '/etc/pve/nodes/spartacus/lrm_status.tmp.1192' - Permission denied
Feb 21 16:43:34 spartacus pmxcfs[1025]: [dcdb] notice: start cluster connection
Feb 21 16:43:34 spartacus pmxcfs[1025]: [dcdb] crit: cpg_join failed: 14
Feb 21 16:43:34 spartacus pmxcfs[1025]: [dcdb] crit: can't initialize service
Feb 21 16:43:36 spartacus pvestatd[1144]: storage 'nas' is not online
Feb 21 16:43:38 spartacus kernel: e1000e 0000:00:1f.6 eno1: NIC Link is Up 1000 Mbps Half Duplex, Flow Control: None
Feb 21 16:43:38 spartacus kernel: e1000e 0000:00:1f.6 eno1: NIC Link is Down
Feb 21 16:43:40 spartacus kernel: e1000e 0000:00:1f.6 eno1: NIC Link is Down
Feb 21 16:43:40 spartacus kernel: e1000e 0000:00:1f.6: Interrupt Throttle Rate on
Feb 21 16:43:40 spartacus systemd[1]: Reloading postfix@-.service - Postfix Mail Transport Agent (instance -)...
Feb 21 16:43:40 spartacus postfix[16713]: Postfix is using backwards-compatible default settings
Feb 21 16:43:40 spartacus postfix[16713]: See http://www.postfix.org/COMPATIBILITY_README.html for details
Feb 21 16:43:40 spartacus postfix[16713]: To disable backwards compatibility use "postconf compatibility_level=3.6" and "postfix reload"
Feb 21 16:43:40 spartacus postfix/postfix-script[16719]: refreshing the Postfix mail system
Feb 21 16:43:40 spartacus postfix/master[1116]: reload -- version 3.7.10, configuration /etc/postfix
Feb 21 16:43:40 spartacus systemd[1]: Reloaded postfix@-.service - Postfix Mail Transport Agent (instance -).
Feb 21 16:43:40 spartacus systemd[1]: Reloading postfix.service - Postfix Mail Transport Agent...
Feb 21 16:43:40 spartacus systemd[1]: Reloaded postfix.service - Postfix Mail Transport Agent.
Feb 21 16:43:40 spartacus chronyd[963]: Source 74.6.168.72 offline
Feb 21 16:43:40 spartacus chronyd[963]: Source 216.240.36.24 offline
Feb 21 16:43:40 spartacus chronyd[963]: Source 216.31.17.12 offline
Feb 21 16:43:40 spartacus chronyd[963]: Can't synchronise: no selectable sources
Feb 21 16:43:40 spartacus chronyd[963]: Source 217.180.209.214 offline
Feb 21 16:43:40 spartacus pmxcfs[1025]: [dcdb] notice: members: 2/1025
Feb 21 16:43:40 spartacus pmxcfs[1025]: [dcdb] notice: all data is up to date
Feb 21 16:43:43 spartacus pvestatd[1144]: storage 'nas' is not online
Feb 21 16:43:45 spartacus kernel: e1000e 0000:00:1f.6: Interrupt Throttle Rate off
Feb 21 16:43:45 spartacus kernel: e1000e 0000:00:1f.6: Some CPU C-states have been disabled in order to enable jumbo frames
Feb 21 16:43:45 spartacus systemd[1]: Reloading postfix@-.service - Postfix Mail Transport Agent (instance -)...
Feb 21 16:43:45 spartacus postfix[16784]: Postfix is using backwards-compatible default settings
Feb 21 16:43:45 spartacus postfix[16784]: See http://www.postfix.org/COMPATIBILITY_README.html for details
Feb 21 16:43:45 spartacus postfix[16784]: To disable backwards compatibility use "postconf compatibility_level=3.6" and "postfix reload"
Feb 21 16:43:45 spartacus postfix/postfix-script[16790]: refreshing the Postfix mail system
Feb 21 16:43:45 spartacus postfix/master[1116]: reload -- version 3.7.10, configuration /etc/postfix
Feb 21 16:43:45 spartacus systemd[1]: Reloaded postfix@-.service - Postfix Mail Transport Agent (instance -).
Feb 21 16:43:45 spartacus systemd[1]: Reloading postfix.service - Postfix Mail Transport Agent...
Feb 21 16:43:45 spartacus systemd[1]: Reloaded postfix.service - Postfix Mail Transport Agent.
Feb 21 16:43:45 spartacus postfix/qmgr[16795]: 4EF7C120589: from=<root@spartacus.local>, size=4833, nrcpt=1 (queue active)
Feb 21 16:43:45 spartacus chronyd[963]: Source 74.6.168.72 online
Feb 21 16:43:45 spartacus chronyd[963]: Source 216.240.36.24 online
Feb 21 16:43:45 spartacus chronyd[963]: Source 216.31.17.12 online
Feb 21 16:43:45 spartacus chronyd[963]: Source 217.180.209.214 online
Feb 21 16:43:49 spartacus kernel: e1000e 0000:00:1f.6 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Feb 21 16:43:53 spartacus postfix/smtp[16803]: connect to gmail-smtp-in.l.google.com[2607:f8b0:4023:1009::1b]:25: Network is unreachable
Feb 21 16:43:54 spartacus corosync[1123]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
Feb 21 16:43:54 spartacus corosync[1123]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 21 16:43:54 spartacus corosync[1123]:   [KNET  ] pmtud: Global data MTU changed to: 8885
Feb 21 16:43:55 spartacus corosync[1123]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 21 16:43:55 spartacus corosync[1123]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
Feb 21 16:43:55 spartacus corosync[1123]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 21 16:43:55 spartacus corosync[1123]:   [KNET  ] pmtud: Global data MTU changed to: 8885
Feb 21 16:43:55 spartacus corosync[1123]:   [QUORUM] Sync members[3]: 1 2 3
Feb 21 16:43:55 spartacus corosync[1123]:   [QUORUM] Sync joined[2]: 1 3
Feb 21 16:43:55 spartacus corosync[1123]:   [TOTEM ] A new membership (1.7737) was formed. Members joined: 1 3
Feb 21 16:43:55 spartacus pmxcfs[1025]: [dcdb] notice: members: 1/1143, 2/1025, 3/1205
Feb 21 16:43:55 spartacus pmxcfs[1025]: [dcdb] notice: starting data syncronisation
Feb 21 16:43:55 spartacus pmxcfs[1025]: [status] notice: members: 1/1143, 2/1025, 3/1205
Feb 21 16:43:55 spartacus pmxcfs[1025]: [status] notice: starting data syncronisation
Feb 21 16:43:55 spartacus corosync[1123]:   [QUORUM] This node is within the primary component and will provide service.
Feb 21 16:43:55 spartacus corosync[1123]:   [QUORUM] Members[3]: 1 2 3
Feb 21 16:43:55 spartacus corosync[1123]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 21 16:43:55 spartacus pmxcfs[1025]: [status] notice: node has quorum
Feb 21 16:43:55 spartacus pmxcfs[1025]: [dcdb] notice: received sync request (epoch 1/1143/0000000A)
Feb 21 16:43:55 spartacus pmxcfs[1025]: [status] notice: received sync request (epoch 1/1143/0000000A)
Feb 21 16:43:55 spartacus pmxcfs[1025]: [dcdb] notice: received all states
Feb 21 16:43:55 spartacus pmxcfs[1025]: [dcdb] notice: leader is 1/1143
Feb 21 16:43:55 spartacus pmxcfs[1025]: [dcdb] notice: synced members: 1/1143, 3/1205
Feb 21 16:43:55 spartacus pmxcfs[1025]: [dcdb] notice: waiting for updates from leader
Feb 21 16:43:55 spartacus pmxcfs[1025]: [dcdb] notice: dfsm_deliver_queue: queue length 2
Feb 21 16:43:55 spartacus pmxcfs[1025]: [status] notice: received all states
Feb 21 16:43:55 spartacus pmxcfs[1025]: [status] notice: all data is up to date
Feb 21 16:43:55 spartacus pmxcfs[1025]: [dcdb] notice: update complete - trying to commit (got 5 inode updates)
Feb 21 16:43:55 spartacus pmxcfs[1025]: [dcdb] notice: all data is up to date
Feb 21 16:43:55 spartacus pmxcfs[1025]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 2
 
You have no redundant links, so 10.1.0.42, which looks like it is only on eno1, is a single point of failure. If that switch goes down for more than a few seconds, I don't think the cluster will be able to operate (maybe it would come back after the connection is restored, not sure, but definitely no manual or automatic cluster operations with that single interface down). That said, it doesn't really explain the ports not coming back up automatically and having to do an ifdown/ifup. That sounds most likely like a hardware/firmware issue with your NIC. See if there is a firmware upgrade for your NIC. What type (vendor/model) of NIC is eno1 on 10.1.0.42?
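For context, a redundant corosync link is added by giving each node a second ring address in the nodelist of /etc/pve/corosync.conf (a sketch only; the second subnet below is a placeholder, and config_version must be bumped before saving - see the Proxmox cluster documentation for the full procedure):

Code:
nodelist {
  node {
    name: spartacus
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.1.0.42
    # placeholder: address on a second, independent network
    ring1_addr: 10.2.0.42
  }
  # the other nodes get a matching ring1_addr as well
}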
 
When one of the nodes goes down, I can access the cluster as long as I am not doing so via that node. Everything appears to be working fine even with one node down. I did read something about quorum issues and giving one node 2 votes to help with that. If that is what you are referring to, I will look into it again.

The newest of the nodes is a Lenovo Tiny PC P360 and the baked-in NIC is an Intel I219-LM. I did see some "hardware hang" errors, which can be fixed by disabling GSO and TSO; I only did that on the node I am testing the issue on. The fix did not seem to change my current issue in any way.
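For reference, disabling those offloads is typically done with ethtool (a sketch; the interface name is an assumption, and a post-up line makes it persistent):

Code:
# disable generic segmentation offload and TCP segmentation offload on the onboard NIC
ethtool -K eno1 gso off tso off
# to persist across reboots, a post-up hook can be added to the eno1 stanza
# in /etc/network/interfaces, e.g.:
#   post-up /usr/sbin/ethtool -K eno1 gso off tso off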

As for firmware, I did not see anything specific to the OEM NIC. I thought that firmware comes with the Intel drivers, so it is not something I can install given that my only instance of Windows is a VM, and I don't think I'd be able to do anything in Debian without causing more trouble when it comes to drivers.

Edit: The other 2 nodes are older Lenovo Tiny PC m920q units, which also use the same Intel NIC; however, I think I saw that it is an older revision (but the same chip name).
 
