Cluster node disconnected randomly

Adnan

Renowned Member
Oct 4, 2012
Paris, France
Hi, I have 4 servers across 2 physical sites, named PVE1, PVE2, PVE3 (on site 1) and PVE4 (on site 2).

Sites 1 and 2 are physically across the street, currently connected together via VPN over WAN and we are installing a radio PtP link between the sites. Anyway, the issue is something else.

Sometimes, PVE4 appears disconnected in the cluster, whether I'm connected to PVE1, PVE2, or PVE3, BUT PVE4 is still online, can be pinged, can be accessed via SSH or the web GUI, and sees the other 3 as disconnected.

PVE4 only has test VMs so usually we simply reboot it and it’s back in the cluster, and rarely we need to reboot all of them to “reconnect” them all together. I know that it’s not a recommended or advised way to connect nodes in a cluster… now I’m looking for ideas or solutions.

Is it possible to reconnect the nodes without rebooting them? Maybe a corosync recheck, or something equivalent, please? (Something we could program in Zabbix to trigger automatically, for instance.)

Thanks
 
I do things like this. For fun. Not in Prod.
You clearly know the following bit already.

https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network

The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably. While on setups with a small node count a network with higher latencies may work, this is not guaranteed and gets rather unlikely with more than three nodes and latencies above around 10 ms.
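If you want a quick number to compare against that requirement, a plain ping from a site-1 node towards the site-2 node over the same path corosync uses will do (substitute your own remote cluster address for the placeholder):

Code:
# 100 pings, summary only; the avg/max round-trip should stay well under 5 ms
ping -c 100 -q <remote-node-corosync-address>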

I'd try
systemctl stop pve-cluster
systemctl stop corosync
And start em again.
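If it helps, the full sequence I'd run on the affected node. The corosync-before-pve-cluster ordering on start is just my habit, not an official requirement:

Code:
# stop the cluster filesystem first, then the membership layer
systemctl stop pve-cluster
systemctl stop corosync
# bring the membership layer back, then the filesystem on top of it
systemctl start corosync
systemctl start pve-cluster
# and verify
systemctl --no-pager status corosync pve-cluster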

(I don't mean to sound snide. I wish this did work. I would love to use storage replication to do DR at a remote site. But it doesn't work. And they are pretty darn clear that it won't.)
 
Sites 1 and 2 are physically across the street, currently connected together via VPN over WAN and we are installing a radio PtP link between the sites. Anyway, the issue is something else.

Actually, this is your issue. :)

Sometimes, PVE4 appears disconnected in the cluster, whether I'm connected to PVE1, PVE2, or PVE3, BUT PVE4 is still online, can be pinged, can be accessed via SSH or the web GUI, and sees the other 3 as disconnected.

I would be very skeptical of the GUI, i.e. it might be lagging; what might actually be happening is that you are losing quorum on and off in quick succession, something the GUI does not reflect. I would definitely want to check first how often this happens, with journalctl -u corosync on the odd one out on "the other side".
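For example, something like this on pve4 gives a rough idea of how often it really flapped lately (just journalctl and grep; the patterns match the usual corosync membership/knet messages):

Code:
# how many times corosync formed a new membership in the last week
journalctl -u corosync --since "7 days ago" | grep -c "A new membership"
# and when the knet links actually went down
journalctl -u corosync --since "7 days ago" | grep "link: 0 is down"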

PVE4 only has test VMs so usually we simply reboot it and it’s back in the cluster, and rarely we need to reboot all of them to “reconnect” them all together.

It's a bigger issue than you think: every time a node leaves or re-appears, it basically disrupts the entire cluster. The remaining nodes have to form a new "membership" to keep exchanging messages (this is the toll of quorum, as opposed to some master-slave system), and while they can't exchange messages they can't update files in /etc/pve (it appears read-only), so the cluster is not really capable of doing anything. This is all despite 3 nodes being enough for quorum.
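You can see this from the shell too, independently of the GUI (pvecm is standard PVE tooling; the touch test is just a crude probe for the read-only state, pick any throwaway filename):

Code:
# "Quorate: Yes" / "No" shows whether this node is in the primary partition
pvecm status | grep -i quorate
# /etc/pve goes read-only without quorum, so a write attempt fails immediately
touch /etc/pve/quorum-probe && rm /etc/pve/quorum-probe && echo writable || echo read-only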

I know that it’s not a recommended or advised way to connect nodes in a cluster… now I’m looking for ideas or solutions.

Well, as @tcabernoch pitched (I don't even want to repost it ;) ... I mean for this particular case), restarting the said services is really what matters (it's basically just that, during a reboot, that gets you back up), but as you discovered, you sometimes need to do it on them all. Yes, in that case you have to do it on all of them, but together: off on all nodes first, then on, one by one; basically you are coaxing them to catch up with one another. Start with the 3 in the same place, then add the 4th.
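Roughly like this, assuming root SSH between the nodes and your hostnames resolving; adapt to your own names and pacing:

Code:
# 1) stop the stack everywhere first
for n in pve1 pve2 pve3 pve4; do ssh root@$n 'systemctl stop pve-cluster corosync'; done
# 2) start the three co-located nodes one by one, letting each settle
for n in pve1 pve2 pve3; do ssh root@$n 'systemctl start corosync pve-cluster'; sleep 10; done
# 3) finally bring the remote node back in
ssh root@pve4 'systemctl start corosync pve-cluster'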

Is it possible to reconnect the nodes without rebooting them? Maybe a corosync recheck, or something equivalent, please? (Something we could program in Zabbix to trigger automatically, for instance.)

Yeah, so ... you basically asked for something I have yet to add to my "tutorial" [1]. I did not get to it yet because in the process I ran into some interesting situations and do not want to be taken apart for posting it (just yet). :) You basically have to have /var/lib/pve-cluster/config.db in a consistent state across all nodes and launch them from that state. For that you need a sort of "control" node (which I am lucky to have in that Ansible scenario), but this goes against the Proxmox ideology of no-masters, so ...


The official support reply for this scenario would always be to take that node 4 out of the cluster and find another way.

Post the corosync log if you want to have an idea what you are setting yourself up for.

[1] https://forum.proxmox.com/threads/dhcp-cluster-deployment.154780/#post-706594
 
Sorry for the delay, guys; I was waiting for the cluster to fail again, which happened today.

I'd try
systemctl stop pve-cluster
systemctl stop corosync
And start em again.
I did that on PVE1 and PVE4 at the same time (stop 1, stop 4, start 1, start 4), and the Proxmox web GUI was empty (datacenter with no node) for 2 or 3 minutes. But PVE4 was still disconnected.

Actually, this is your issue. :)
I meant the issue is not that we are installing a PtP wireless link, etc… :D

I would be very skeptical of the GUI, i.e. it might be lagging; what might actually be happening is that you are losing quorum on and off in quick succession, something the GUI does not reflect. I would definitely want to check first how often this happens, with journalctl -u corosync on the odd one out on "the other side".
Here's what I have:

From PVE1:
Code:
Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.190.0.8
0x00000002          1 10.190.0.6
0x00000004          1 10.190.0.4 (local)

From PVE4:
Code:
Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 10.133.0.4 (local)

corosync.log PVE1: https://pastebin.com/pBNK4fzb

corosync.log PVE4: https://pastebin.com/rR3Q8F4z

I have cut the repeated parts and replaced them with [...]

off on all nodes first, then on, one by one; basically you are coaxing them to catch up with one another. Start with the 3 in the same place, then add the 4th.
I didn't do it in that order. I did the single off/on first because sometimes it's enough. Then PVE1; when it came back online, PVE4 was still disconnected. Then PVE2: PVE2 stayed offline alongside PVE4. Then PVE3: PVE2 came back online along with PVE4, and then PVE3 came back online.

You basically have to have /var/lib/pve-cluster/config.db in a consistent state across all nodes and launch them from that state. For that you need a sort of "control" node (which I am lucky to have in that Ansible scenario), but this goes against the Proxmox ideology of no-masters, so ...
My "master" is PVE1, so I could get its config.db on all the other nodes and how do I launch them? I didn't get this.

The official support reply for this scenario would always be to take that node 4 out of the cluster and find another way.
Fortunately, I'm not using any HA functionality. I'm only managing all of the PVEs from one interface, that's it. Sometimes I clone or migrate, but no failover.

Thanks
 
Before proceeding any further, can you set your MTU to <=1397, e.g. 1392? :)
Set the MTU of which device? The PVEs, 1, 4, all? The switch connected to them, or the router to the internet?

All the PVEs are connected to a switch (not sure if it's manageable), then to a MikroTik router to which the ISP router is connected. Our MikroTik handles an EoIP tunnel between the 2 networks. If you want me to simulate fragmentation between the PVEs, the easiest is to lower the MTU on our MikroTik router on either side.

PS: The logs show the public IPs when packets are rejected, but their "link" goes through a VPN with private addresses (10.190.0.4 -> 10.0.0.1 -> 10.0.0.2 -> 10.133.0.4); I don't get how the public IPs end up being used.
 
Set the MTU of which device? The PVEs, 1, 4, all? The switch connected to them, or the router to the internet?

In /etc/network/interfaces for everything that corosync uses on all nodes.
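Something along these lines on each node, assuming corosync runs over the default bridge; the interface names, addresses and gateway below are placeholders, keep your own values and only add/change the mtu line:

Code:
auto vmbr0
iface vmbr0 inet static
        address 10.190.0.8/16
        gateway 10.190.0.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        mtu 1392

ifreload -a should apply it on current PVE (ifupdown2); otherwise a reboot will.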

All the PVEs are connected to a switch (not sure if it's manageable), then to a MikroTik router to which the ISP router is connected. Our MikroTik handles an EoIP tunnel between the 2 networks. If you want me to simulate fragmentation between the PVEs, the easiest is to lower the MTU on our MikroTik router on either side.

No, you already have fragmentation going on, no need to simulate. :)
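If you want to see it rather than simulate it, a DF-bit ping across the tunnel shows what actually fits without fragmentation (1364 bytes of payload + 28 bytes of IP/ICMP headers = 1392 on the wire):

Code:
# run from pve1 towards pve4; it fails with "message too long" / frag-needed if the path MTU is smaller
ping -M do -s 1364 -c 4 10.133.0.4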

PS: The logs show the public IPs when packets are rejected, but their "link" goes through a VPN with private addresses (10.190.0.4 -> 10.0.0.1 -> 10.0.0.2 -> 10.133.0.4); I don't get how the public IPs end up being used.

This is why I originally reacted that:

Actually, this is your issue. :)

But let's do away with the MTU issue first; then we'll have a cleaner picture in the logs of what comes next.
 
What happened there? :) I am a bit lost here now. I thought you'd get your MTU fixed and wait for the first time it has a hiccup, then have a look at the logs.
Sorry, I didn't know if you wanted to see what happens immediately. I'm still waiting for a hiccup, but as the logs show «strange» things to me, and as it takes 9 days to hiccup, better to send something now.

Anyway, I’ll wait for the next issue to post the logs again
 
New hiccup. I ran journalctl -u corosync --since "05:00:00" > corosync_pveX.log; nothing much is shown until the loss of the link between them.
Excellent. :D

Code:
Oct 15 20:19:29 pve1 corosync[1588]:   [KNET  ] link: host: 3 link: 0 is down
Oct 15 20:19:29 pve1 corosync[1588]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 15 20:19:29 pve1 corosync[1588]:   [KNET  ] host: host: 3 has no active links
Oct 15 20:19:30 pve1 corosync[1588]:   [TOTEM ] Token has not been received in 3225 ms
Oct 15 20:19:31 pve1 corosync[1588]:   [TOTEM ] A processor failed, forming new configuration: token timed out (4300ms), waiting 5160ms for consensus.
Oct 15 20:19:36 pve1 corosync[1588]:   [QUORUM] Sync members[3]: 1 2 4
Oct 15 20:19:36 pve1 corosync[1588]:   [QUORUM] Sync left[1]: 3
Oct 15 20:19:36 pve1 corosync[1588]:   [TOTEM ] A new membership (1.1a99f) was formed. Members left: 3
Oct 15 20:19:36 pve1 corosync[1588]:   [TOTEM ] Failed to receive the leave message. failed: 3
Oct 15 20:19:36 pve1 corosync[1588]:   [QUORUM] Members[3]: 1 2 4
Oct 15 20:19:36 pve1 corosync[1588]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 15 20:20:33 pve1 corosync[1588]:   [KNET  ] rx: Packet rejected from PVE4_PUBLIC_IP:5405
[...]

So your 3 in one site (pve{1,2,3} ~ IDs 1, 2, 4) lost the odd one out (ID 3 ~ pve4), and lost it in a not-so-orderly way (no leave message).

And then it's rejecting it back, and the node stays orphaned.

Code:
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] link: host: 1 link: 0 is down
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] link: host: 2 link: 0 is down
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] link: host: 4 link: 0 is down
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 1 has no active links
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 2 has no active links
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 4 has no active links
Oct 15 20:19:30 pve4 corosync[863]:   [TOTEM ] Token has not been received in 3225 ms
Oct 15 20:19:31 pve4 corosync[863]:   [TOTEM ] A processor failed, forming new configuration: token timed out (4300ms), waiting 5160ms for consensus.
Oct 15 20:19:36 pve4 corosync[863]:   [QUORUM] Sync members[1]: 3
Oct 15 20:19:36 pve4 corosync[863]:   [QUORUM] Sync left[3]: 1 2 4
Oct 15 20:19:36 pve4 corosync[863]:   [TOTEM ] A new membership (3.1a99f) was formed. Members left: 1 2 4
Oct 15 20:19:36 pve4 corosync[863]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 4
Oct 15 20:19:36 pve4 corosync[863]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 15 20:19:36 pve4 corosync[863]:   [QUORUM] Members[1]: 3
Oct 15 20:19:36 pve4 corosync[863]:   [MAIN  ] Completed service synchronization, ready to provide service.

What's your /etc/corosync/corosync.conf like on these nodes (say pve1 AND pve4 would suffice to share)?
 
So your 3 in one site (pve{1,2,3} ~ IDs 1, 2, 4) lost the odd one out (ID 3 ~ pve4), and lost it in a not-so-orderly way (no leave message).
Correct (so it seems).
What's your /etc/corosync/corosync.conf like on these nodes (say pve1 AND pve4 would suffice to share)?
They're both the same
 

Attachments

Code:
Oct 15 20:19:29 pve1 corosync[1588]:   [KNET  ] link: host: 3 link: 0 is down
Oct 15 20:20:33 pve1 corosync[1588]:   [KNET  ] rx: Packet rejected from PVE4_PUBLIC_IP:5405

Almost sure you lost connectivity on that PtP VPN and now the packet has found its way over from the wrong interface?

That's why it's rejecting the traffic; it's not expected from anywhere other than:

Code:
  node {
    name: pve4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.133.0.4
  }

That's not corosync's problem, you know that, right? :)
 
Almost sure you lost connectivity on that PtP VPN and now the packet has found its way over from the wrong interface?
I understand it's possible, but I can't see how. Losing the PtP link is possible, as it currently goes over the internet, but the routes on our gateways are as follows: 10.133.0.0/16 via 10.0.0.2 (on PVE1's side) and 10.190.0.0/16 via 10.0.0.1 (on PVE4's side). 10.0.0.0/30 on both sides is on the EoIP interface.

The PtP link is an EoIP tunnel; if it's down, it's unreachable, but I don't see how the packets sent by PVE4 get the src-address of its gateway's public IP, while no route mentions the public IP on either side o_O

That's not corosync's problem, you know that, right? :)
This, I get. If I allow PVE4's public IP to connect to corosync, I will be able to track down how the connection is made. The other issue is that the tunnel doesn't show as being down at any point. I have rebooted PVE4 and it's back online; I will do a constant ping or something else to check when the PtP is down.
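Probably something crude like this on PVE4, just to get timestamps for when the tunnel stops passing traffic (pinging PVE1's cluster address so it crosses the EoIP path; the log file name is arbitrary):

Code:
# log every time the far side stops answering
while true; do
    ping -c 1 -W 2 10.190.0.4 >/dev/null 2>&1 || echo "$(date -Is) 10.190.0.4 unreachable" >> /var/log/ptp-watch.log
    sleep 5
done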

I'll let you know, thank you for your assistance
 
I understand it's possible, but I can't see how. Losing the PtP link is possible, as it currently goes over the internet, but the routes on our gateways are as follows: 10.133.0.0/16 via 10.0.0.2 (on PVE1's side) and 10.190.0.0/16 via 10.0.0.1 (on PVE4's side). 10.0.0.0/30 on both sides is on the EoIP interface.

It's certainly an interesting case. I just want to say that I am not against digging this out, but from this point on, you know you are out of spec for the corosync connection anyhow, right? You can't use e.g. HA, and your fourth node will be constantly coming and going, momentarily disrupting the quorum of the remaining 3 - and this is AFTER you even fix this "routing" issue.

The PtP link is an EoIP tunnel; if it's down, it's unreachable, but I don't see how the packets sent by PVE4 get the src-address of its gateway's public IP, while no route mentions the public IP on either side o_O

Can you catch that packet anywhere midway with tshark or such?
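For example (tcpdump with the same filter works just as well; corosync/kronosnet uses UDP 5405 by default):

Code:
# print the source/destination each corosync packet actually carries at that point in the path
tshark -i any -f "udp port 5405" -T fields -e ip.src -e ip.dst
# tcpdump equivalent: tcpdump -ni any udp port 5405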

The other issue is that the tunnel doesn't show as being down at any point. I have rebooted PVE4 and it's back online; I will do a constant ping or something else to check when the PtP is down.

The link does not have to be down as in the interface going down; the traffic is simply not getting through - that's all you know from corosync's log.
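It might also be worth checking, right when it hiccups, which path and source address pve4 itself would pick for its peers, just to confirm the node never switches away from the private addresses and the rewrite happens further along (plain iproute2):

Code:
# on pve4: route and source address the kernel picks for pve1's corosync address
ip route get 10.190.0.4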

I'll let you know, thank you for your assistance

No worries, this one is funny at least. ;)
 
packet has found its way over from the wrong interface?

One other thing that crossed my mind: there's still the possibility (probably even more likely) that something is mangling the packet headers, so that they did not actually originate with that source address as such.
 
You can't use e.g. HA, and your fourth node will be constantly coming and going, momentarily disrupting the quorum of the remaining 3 - and this is AFTER you even fix this "routing" issue.
Of course, I know HA shouldn't be used in this case; as I said, it's more that I'm interested in having everything on one screen.
I have high hopes that our PtP wireless link will give me a better experience with this setup. I've also seen some posts about a central management solution for Proxmox; I'm waiting for that too.

Can you catch that packet anywhere midway with tshark or such?
I'll do that in my gateway, it has a packet sniffer.

One other thing that crossed my mind: there's still the possibility (probably even more likely) that something is mangling the packet headers, so that they did not actually originate with that source address as such.
This is also a possibility; my mangle rules only mangle packets in response to something coming in on a specific interface (my ISP's) and not on another virtual interface (like the EoIP). Anyway, I'll sniff the packets coming into PVE4's gateway and into PVE1's gateway; I'll get something for sure!

Glad you're having fun :D
 
