Node in other building keeps showing up as offline to other 2 nodes, possible corosync issues?

shactheorb

I have a total of 3 nodes, 2 in one building and 1 in another building. Every couple of weeks, the one in the other building shows up as offline to the other two, and vice versa. I'm able to ping between nodes, but they still aren't showing up as connected. Previously, I could get them to re-sync by rebooting the problem node, but the last couple of times it has been more stubborn.

Sorry for the lack of information, not an advanced user, but can get any info you might need to help. Thanks a ton in advance.

When I run systemctl status corosync.service I get this:

root@pve1:~# systemctl status corosync.service
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; preset: enabled)
Active: active (running) since Mon 2026-05-04 10:49:34 MDT; 8min ago
Invocation: 200524b3d2e148de8098f23ec61b3ba6
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 1665 (corosync)
Tasks: 9 (limit: 70911)
Memory: 153.1M (peak: 153.8M)
CPU: 3.321s
CGroup: /system.slice/corosync.service
└─1665 /usr/sbin/corosync -f

May 04 10:57:33 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):5405
May 04 10:57:33 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):26930
May 04 10:57:34 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):26930
May 04 10:57:34 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):5405
May 04 10:57:34 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):5405
May 04 10:57:34 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):26930
May 04 10:57:35 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):5405
May 04 10:57:35 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):26930
May 04 10:57:35 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):5405
May 04 10:57:35 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):26930
 
I also noticed that any pings that use an MTU above 1350 fail between the nodes in separate buildings, but not sure if that's related
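For anyone wanting to pin down the exact path MTU between the buildings, here is a rough sketch. It assumes IPv4 and the iputils version of ping, where `-s` sets the ICMP payload size and `-M do` sets the don't-fragment bit. The function names and the peer address are made up for illustration. Note that the payload is 28 bytes smaller than the on-wire packet (20 bytes IPv4 header plus 8 bytes ICMP header), so if 1350 was the value given to `-s`, the largest packet that actually got through was closer to 1378 bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of Proxmox or corosync.

# Convert a target MTU into the ICMP payload size for an IPv4 ping:
# 20 bytes IPv4 header + 8 bytes ICMP header = 28 bytes of overhead.
ping_payload_for_mtu() {
    local mtu="$1"
    echo $((mtu - 28))
}

# Print (do not run) a ping command that tests one MTU value with the
# don't-fragment bit set. The peer address is a placeholder.
mtu_probe_cmd() {
    local peer="$1" mtu="$2"
    echo "ping -M do -c 3 -s $(ping_payload_for_mtu "$mtu") $peer"
}
```

Running the printed command for a few MTU values (for example 1500, 1400, 1350) from each side should show where packets stop getting through without fragmentation.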
 
May 04 10:57:35 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):5405
May 04 10:57:35 pve1 corosync[1665]: [KNET ] rx: Packet rejected from (building 1 public IP):26930
Corosync uses port 5405 + link-id as both the sending and the receiving port. Seeing another source port, like 26930 in this case, indicates that the packets are being rewritten in transit, possibly by a NAT gateway or similar, as @JTY mentioned.
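To see how many of the rejected packets arrive from an unexpected source port, something like the following could help. It is only a sketch: the function name is made up, and it assumes the log lines end in `:PORT` exactly as in the output posted above and that the expected port is 5405 + link-id.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read corosync log lines on stdin and count rejected
# knet packets whose source port differs from the expected 5405 + link id.
unexpected_knet_ports() {
    local link_id="${1:-0}"
    local expected=$((5405 + link_id))
    awk -v want="$expected" '
        /Packet rejected from/ {
            n = split($0, parts, ":")   # the port is the text after the last ":"
            port = parts[n] + 0         # force numeric, drops trailing whitespace
            if (port != want) count[port]++
        }
        END { for (p in count) print p, count[p] }
    ' | sort -n
}
```

It could be fed from the journal, for example `journalctl -u corosync --since "1 hour ago" | unexpected_knet_ports 0`. Each output line is a source port followed by the number of rejected packets seen from it.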
 
What kind of network connection is there between the buildings? Are there any firewalls or other layer 3 devices in between the nodes?
Each building has its own Xfinity service, with a Ubiquiti UDM Pro. The setup was configured by a contractor long before me, so apologies for my lack of knowledge of how it was specifically configured. There are some NAT rules set up, and I can get you any information on those that you need.
 
Worst case scenario, I can also just set it up as two separate clusters, with the problem node being separate from the other two. Not ideal, but at least an option.
 
What is the network latency between the nodes on the network used for corosync?
Is this network on a shared medium? Could there be congestion?
96 packets transmitted, 96 received, 0% packet loss, time 95145ms
rtt min/avg/max/mdev = 15.667/25.435/33.892/3.466 ms

Not a shared medium as far as I'm aware; it's wired directly into an Ethernet switch > UniFi router > ISP modem. It could definitely get congested, as I run frequent backups from that server to the backup server in the other building (and Xfinity sucks, 35 Mbps upload limit).
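To repeat the latency check while a backup is running, the ping summary line can be parsed and compared against a limit. This is a sketch only: the function names are made up, the parsing assumes the iputils summary format shown above, and the limit is whatever you pass in. The 5 ms in the usage line below is an assumed placeholder to check against the Proxmox cluster network requirements, not a value taken from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading an iputils ping summary.

# Print the average RTT from a line such as
# "rtt min/avg/max/mdev = 15.667/25.435/33.892/3.466 ms" read on stdin.
ping_avg_rtt() {
    # Split on "/": the fifth field is the average value.
    awk -F'/' '/^rtt / { print $5 }'
}

# Succeed (exit 0) when the given average RTT is above the given limit in ms.
rtt_exceeds() {
    local avg="$1" limit_ms="$2"
    awk -v a="$avg" -v l="$limit_ms" 'BEGIN { exit !(a > l) }'
}
```

Usage could look like `avg=$(ping -c 20 PEER | ping_avg_rtt); rtt_exceeds "$avg" 5 && echo "latency above limit: ${avg} ms"`, where PEER stands for the other node's address.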