Nodes lose network connectivity when I reboot the switch and do not regain it after the switch reboot is complete

Are you certain that's correct? Lenovo doesn't show the p360 as one of the machines with that specific card (but lists a Broadcom and a different Intel card). That said, vendors' online resources aren't always accurate/complete...

Can you confirm with the following or something similar:
lspci | grep -i ether

It does sound more like a hardware (or driver) issue than a Proxmox issue to me.

If my guess is right, the best option is to get a better NIC. Your other, hackier option besides keeping your switch up 24/7 would be to put in a watchdog that pings a few hosts on the same network and does an ifdown/ifup if they all fail.
 
Lenovo had the bright idea to name both a Tiny PC and a tower with the same name...

The output on the p360 is:

Code:
root@maximus:~# lspci | grep -i ether
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (17) I219-LM (rev 11)
01:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]

while on the other two m920q nodes it is:

Code:
root@spartacus:~# lspci | grep -i ether
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]

and

Code:
root@commodus:~# lspci | grep -i ether
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
01:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]

I had not thought about the script... I don't know how to write one, but I recall following instructions in the past to do something similar, so I'll figure it out. It just sucks that I can't get such a basic function to work properly... Anyhow, another option I might try is to use the second port on the Mellanox card as my GbE management LAN port and just not use the OEM GbE port. I also have to play a bit with the UEFI/BIOS settings related to the NIC, WOL, etc. Either way, I must be able to start the computer remotely, so I'll have to see what options are available in the BIOS. I also recall a setting related to C-states and networking... another thing to look into, given that the issue seems restricted to just these 3 nodes.
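If it helps: a quick way to check and toggle WOL from within Proxmox is ethtool. This is only a hedged sketch, and eno1 is a placeholder for whatever name the onboard I219 gets on your node:

Code:
# show whether the NIC supports Wake-on-LAN and whether it is enabled ("g" = magic packet)
ethtool eno1 | grep -i wake
# enable magic-packet WOL for the current boot
ethtool -s eno1 wol g

The setting often does not survive a reboot, so it usually has to be reapplied at boot time (for example via a post-up hook in /etc/network/interfaces).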

I REALLY appreciate all your help!!!
 
Something like the following in the crontab:

Code:
* * * * * ( ping 10.1.0.1 -c 1 -w 1 || ping 10.1.0.41 -c 1 -w 1 || ping 10.1.0.42 -c 1 -w 1 || ping 10.1.0.43 -c 1 -w 1 ) >/dev/null 2>/dev/null || ( /usr/sbin/ifdown -a ; /usr/sbin/ifup -a )

That will skip the down/up if any ping succeeds, so make sure not to include a target (such as the node itself) that still answers when it's in the problem state...
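If you prefer a shorter crontab line and a trace in syslog whenever it fires, the same idea can be wrapped in a small script. This is only a sketch; the path /usr/local/bin/net-watchdog.sh and the target IPs are placeholders to adjust:

Code:
#!/bin/sh
# /usr/local/bin/net-watchdog.sh - restart all interfaces if no ping target answers
TARGETS="10.1.0.1 10.1.0.41 10.1.0.42 10.1.0.43"
for ip in $TARGETS; do
    # one ping with a one-second deadline; any success means the network is fine
    if ping -c 1 -w 1 "$ip" >/dev/null 2>&1; then
        exit 0
    fi
done
logger -t net-watchdog "no ping target reachable, running ifdown/ifup"
/usr/sbin/ifdown -a
/usr/sbin/ifup -a

Make it executable and the crontab entry becomes just * * * * * /usr/local/bin/net-watchdog.sh.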
 
You are awesome, thank you! I will try this workaround on the mission critical nodes and keep testing options on the non critical node in hopes of finding the root cause.
 
I would have given you the CLI commands if I remembered them lol, the last time I needed them was in November.
You can use the CLI guide from their EdgeSwitch series; a lot of the commands also work on the UniFi switches.

Also, they have an extensive log in the UniFi console CLI that's not exposed to the GUI, and every switch gets a fairly comprehensive log of its own.

WARNING: no matter what timezone you set in the UI, the timestamps in the switch log will be different. Maybe run a tail on that switch log while trying to recreate the issue.


Ah, corosync... yeah, I've been in corosync hell once (it was an issue with a new upstream update back in the day, killed quite a few clusters lol).
That's why I was asking about the dedicated corosync network.


Anyhow, you know what, why not simply do everything as VLANs on the Mellanox? You can either run your management untagged on one of the ports and all VMs tagged on port 1, then corosync untagged and all storage traffic on port 2. Or run it all tagged, who cares.

In that case I would do 2 corosync rings, tagged, one on each interface, to have redundancy.

Personally I would simply do everything tagged, set up the VMs directly with their VLANs on their vNIC (GUI option), and run 2x corosync and 1x management on those 2 ports, also tagged (just to keep it as uniform and simple as possible).
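For the "everything tagged" variant, here is a minimal sketch of what the bridge on the first Mellanox port could look like in /etc/network/interfaces, assuming a VLAN-aware vmbr1 on enp1s0 (the bridge name and VLAN range are placeholders):

auto enp1s0
iface enp1s0 inet manual

auto vmbr1
iface vmbr1 inet manual
    bridge-ports enp1s0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

The VMs then just get their VLAN tag set on the vNIC in the GUI, as described above.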



BTW, I had some trouble with a UniFi Aggregation (the SFP-only big boy) getting DHCP on its management VLAN when DHCP is pfSense on an Intel 25 Gbit card... not resolved yet; I had to work around it, since I couldn't keep playing with it while it was killing the 60-ish switches behind it every 1-2 days lol.

Is it ideal to run corosync as just another VLAN? No, but with it split across 2 different physical networks as redundant rings I don't see a high chance of issues there. Just keep in mind not to saturate both at the same time or the cluster will fail.
 
I found this:

https://help.ui.com/hc/en-us/articles/204959834-UniFi-Advanced-Logging-Information

SSH into the switch, then run:

Code:
telnet localhost
enable
show tech-support

Indeed there is a huge amount of information, so I am still digging around. Ideally there would be something that shows live logs for the whole switch or, even better, for a specific port. For now I have only found tons of diagnostic and configuration info. Thanks for the suggestion! I had forgotten about the on-device logs.
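If it is a live view you are after, one thing that may work (assuming the switch's SSH shell is the usual Linux/busybox environment UniFi devices expose, which is an assumption on my part) is tailing the syslog file directly instead of going through the telnet CLI:

Code:
# from the switch's SSH shell, not from inside the telnet CLI
tail -f /var/log/messages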

As for all the corosync config, I have to look into it to better understand your suggestions. I really need to keep this as simple as possible, as I have already overcomplicated my home setup. It all started with having too many RPis running simple services/applications (Pi-hole, Home Assistant, Scrypted, NUT, chrony, etc.), which I then converted to LXC containers / VMs. The hardware I picked is consumer PC grade but is way more powerful than the RPis and is quite good in terms of performance, size, heat and power draw... so I am very happy with it, except for the hell that breaks loose when I reboot a switch after one of the many EA UniFi firmware releases.
 
My corosync suggestion is really not that complicated; you can set it up within 5 minutes.
Just define 2 new tagged VLANs and add one of each to each of the Mellanox ports. So let's say
VLAN 100 and 101: you add VLAN 100 to all switch ports using Mellanox port 1, and 101 to all switch ports using Mellanox port 2 (I assume 1 is the VM network and 2 is the storage network).
After this you simply make a virtual (VLAN) interface on each (you can do it in the GUI too),

like enp1s0.100 and another one enp1s0d1.101

auto enp1s0.100
iface enp1s0.100 inet static
    address 10.1.100.1
    netmask 255.255.255.0

auto enp1s0d1.101
iface enp1s0d1.101 inet static
    address 10.1.101.1
    netmask 255.255.255.0

So now you have 2 more subnets just for corosync, cleanly separated from the rest.
You replicate that on all 3 servers, then try to ping each node from the others. If that works, you simply go into corosync.conf
and add/change the addresses for ring0 and ring1, like:
node {
  name: due
  nodeid: 2
  quorum_votes: 1
  ring0_addr: 10.1.100.2
  ring1_addr: 10.1.101.2
}

node {
  name: tre
  nodeid: 3
  quorum_votes: 1
  ring0_addr: 10.1.100.3
  ring1_addr: 10.1.101.3
}

node {
  name: uno
  nodeid: 1
  quorum_votes: 1
  ring0_addr: 10.1.100.1
  ring1_addr: 10.1.101.1
}
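One extra step worth flagging, assuming the stock Proxmox layout (this comes from the Proxmox documentation rather than from this thread): the cluster-wide file lives at /etc/pve/corosync.conf, and its totem section carries a config_version that you need to increment by one on every manual edit, otherwise the change will not propagate cleanly. The cluster name and version number below are placeholders:

totem {
  cluster_name: yourcluster
  config_version: 4
  ...
}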

Then you do a
journalctl -b -u corosync
and a
pvecm status
to confirm.
Voila, you now have a redundant corosync setup across 2 different ports (and, I assume, 2 different switches).


Even if you only run one switch, still do 2 VLANs, one per interface. The reason being: since you use that Mellanox for other stuff too, you might want to be able to reload/restart it at will, and if that were your only corosync interface the cluster would puke. With 2 rings you can shut one down without an issue.
But of course, extra brownie points if both ports go to separate switches :)
 
@bofh I really appreciate the detailed information! The 2 most critical nodes are already connected to 2 different switches each, as the GbE port used only for management goes to one switch and one of the two 10 GbE ports goes to another, so I might be able to implement this without major changes... but I first have to study up on the topic.

On another unrelated note... are you Italian? I noticed "uno", "due" and "tre" in the configuration. I grew up in Rome but now live in the USA... Ciao and thanks!
 
No, sorry, I am not Italian, but I did mostly understand it... :)
I stole the config from the official Proxmox documentation lol.
 