Cluster node disconnected randomly

Adnan

Renowned Member
Oct 4, 2012
Paris, France
Hi, I have 4 servers across 2 physical sites, named PVE1, PVE2, PVE3 (on site 1) and PVE4 (on site 2).

Sites 1 and 2 are physically across the street, currently connected together via VPN over WAN and we are installing a radio PtP link between the sites. Anyway, the issue is something else.

Sometimes, PVE4 appears disconnected in the cluster, whether I'm connected to PVE1, PVE2, or PVE3, BUT PVE4 is still online, can be pinged, can be accessed via SSH or the web GUI, and sees the other 3 as disconnected.

PVE4 only has test VMs so usually we simply reboot it and it’s back in the cluster, and rarely we need to reboot all of them to “reconnect” them all together. I know that it’s not a recommended or advised way to connect nodes in a cluster… now I’m looking for ideas or solutions.

Is it possible to reconnect the nodes without rebooting them? Maybe a corosync recheck, or something equivalent, please? (Something we could program in Zabbix to trigger automatically, for instance.)

Thanks
 
I do things like this. For fun. Not in Prod.
You clearly know the following bit already.

https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network

The Proxmox VE cluster stack requires a reliable network with latencies under 5 milliseconds (LAN performance) between all nodes to operate stably. While on setups with a small node count a network with higher latencies may work, this is not guaranteed and gets rather unlikely with more than three nodes and latencies above around 10 ms.
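If you want a quick number to compare against that requirement, a plain ping from a site-1 node towards the site-2 node over the same path corosync uses will do (substitute your own remote cluster address for the placeholder):

Code:
# 100 pings, summary only; the avg/max round-trip should stay well under 5 ms
ping -c 100 -q <remote-node-corosync-address>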

I'd try
systemctl stop pve-cluster
systemctl stop corosync
And start em again.
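If it helps, the full sequence I'd run on the affected node. The corosync-before-pve-cluster ordering on start is just my habit, not an official requirement:

Code:
# stop the cluster filesystem first, then the membership layer
systemctl stop pve-cluster
systemctl stop corosync
# bring the membership layer back, then the filesystem on top of it
systemctl start corosync
systemctl start pve-cluster
# and verify
systemctl --no-pager status corosync pve-cluster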

(I don't mean to sound snide. I wish this did work. I would love to use storage replication to do DR at a remote site. But it doesn't work. And they are pretty darn clear that it won't.)
 
Sites 1 and 2 are physically across the street, currently connected together via VPN over WAN and we are installing a radio PtP link between the sites. Anyway, the issue is something else.

Actually, this is your issue. :)

Sometimes, PVE4 appears disconnected in the cluster, whether I'm connected to PVE1, PVE2, or PVE3, BUT PVE4 is still online, can be pinged, can be accessed via SSH or the web GUI, and sees the other 3 as disconnected.

I would be very skeptical of the GUI, i.e. it might be lagging; what might actually be happening is that you are losing quorum on and off in quick succession, something the GUI does not reflect. I would definitely want to check first how often this happens, with journalctl -u corosync on the odd one out on "the other side".
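For example, something like this on pve4 gives a rough idea of how often it really flapped lately (just journalctl and grep; the patterns match the usual corosync membership/knet messages):

Code:
# how many times corosync formed a new membership in the last week
journalctl -u corosync --since "7 days ago" | grep -c "A new membership"
# and when the knet links actually went down
journalctl -u corosync --since "7 days ago" | grep "link: 0 is down"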

PVE4 only has test VMs so usually we simply reboot it and it’s back in the cluster, and rarely we need to reboot all of them to “reconnect” them all together.

It's a bigger issue than you think: every time a node leaves or re-appears, it basically disrupts the entire cluster. The remaining nodes have to form a new "membership" to keep exchanging messages (this is the toll of quorum, as opposed to some master-slave system), and while they can't exchange messages they can't update files in /etc/pve (it appears read-only), so the cluster is not really capable of doing anything. This is all despite 3 nodes being enough for quorum.
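You can see this from the shell too, independently of the GUI (pvecm is standard PVE tooling; the touch test is just a crude probe for the read-only state, pick any throwaway filename):

Code:
# "Quorate: Yes" / "No" shows whether this node is in the primary partition
pvecm status | grep -i quorate
# /etc/pve goes read-only without quorum, so a write attempt fails immediately
touch /etc/pve/quorum-probe && rm /etc/pve/quorum-probe && echo writable || echo read-only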

I know that it’s not a recommended or advised way to connect nodes in a cluster… now I’m looking for ideas or solutions.

Well, as @tcabernoch pitched (I don't even want to repost it ;) ... I mean for this particular case), restarting the said services is really what matters (it's basically just that, during a reboot, that gets you back up), but as you discovered, you sometimes need to do it on them all. Yes, in that case you have to do it on all of them, but together: off on all nodes first, then on, one by one; basically you are coaxing them to catch up with one another. Start with the 3 in the same place, then add the 4th.
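Roughly like this, assuming root SSH between the nodes and your hostnames resolving; adapt to your own names and pacing:

Code:
# 1) stop the stack everywhere first
for n in pve1 pve2 pve3 pve4; do ssh root@$n 'systemctl stop pve-cluster corosync'; done
# 2) start the three co-located nodes one by one, letting each settle
for n in pve1 pve2 pve3; do ssh root@$n 'systemctl start corosync pve-cluster'; sleep 10; done
# 3) finally bring the remote node back in
ssh root@pve4 'systemctl start corosync pve-cluster'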

Is it possible to reconnect the nodes without rebooting them? Maybe a corosync recheck, or something equivalent, please? (Something we could program in Zabbix to trigger automatically, for instance.)

Yeah, so ... you basically asked for something I have yet to add to my "tutorial" [1]. I did not get to it yet because in the process I ran into some interesting situations and do not want to be taken apart for posting it (just yet). :) You basically have to have /var/lib/pve-cluster/config.db in a consistent state across all nodes and launch them from that state. For that you need a sort of "control" node (which I am lucky to have in that Ansible scenario), but this goes against the Proxmox ideology of no-masters, so ...


The official support reply for this scenario would always be to take that node 4 out of the cluster and find another way.

Post the corosync log if you want to have an idea what you are setting yourself up for.

[1] https://forum.proxmox.com/threads/dhcp-cluster-deployment.154780/#post-706594
 
Sorry for the delay, guys; I was waiting for the cluster to fail again, which happened today.

I'd try
systemctl stop pve-cluster
systemctl stop corosync
And start em again.
I did that on PVE1 and PVE4 at the same time (stop 1, stop 4, start 1, start 4), and the Proxmox web GUI was empty (datacenter with no node) for 2 or 3 minutes. But PVE4 was still disconnected.

Actually, this is your issue. :)
I meant the issue is not that we are installing a PtP wireless link, etc… :D

I would be very skeptical of the GUI, i.e. it might be lagging; what might actually be happening is that you are losing quorum on and off in quick succession, something the GUI does not reflect. I would definitely want to check first how often this happens, with journalctl -u corosync on the odd one out on "the other side".
Here's what I have:

From PVE1:
Code:
Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.190.0.8
0x00000002          1 10.190.0.6
0x00000004          1 10.190.0.4 (local)

From PVE4:
Code:
Membership information
----------------------
    Nodeid      Votes Name
0x00000003          1 10.133.0.4 (local)

corosync.log PVE1: https://pastebin.com/pBNK4fzb

corosync.log PVE4: https://pastebin.com/rR3Q8F4z

I have cut the repeated parts and replaced them with [...]

off on all nodes first, then on, one by one; basically you are coaxing them to catch up with one another. Start with the 3 in the same place, then add the 4th.
I didn't do it in that order. I did the single off/on first because sometimes it's enough. Then PVE1; when it came back online, PVE4 was still disconnected. Then PVE2: PVE2 stayed offline alongside PVE4. Then PVE3: PVE2 came back online along with PVE4, and then PVE3 came back online.

You basically have to have /var/lib/pve-cluster/config.db in a consistent state across all nodes and launch them from that state. For that you need a sort of "control" node (which I am lucky to have in that Ansible scenario), but this goes against the Proxmox ideology of no-masters, so ...
My "master" is PVE1, so I could get its config.db on all the other nodes and how do I launch them? I didn't get this.

The official support reply for this scenario would always be to take that node 4 out of the cluster and find another way.
Fortunately, I'm not using any HA functionality. I'm only managing all of the PVEs from one interface, that's it. Sometimes I clone or migrate, but no failover.

Thanks
 
Before proceeding any further, can you set your MTU to <=1397, e.g. 1392? :)
Set the MTU of which device? The PVEs, 1, 4, all? The switch connected to them, or the router to the internet?

All the PVEs are connected to a switch (not sure if it's manageable), then to a MikroTik router to which the ISP router is connected. Our MikroTik handles an EoIP tunnel between the 2 networks. If you want me to simulate fragmentation between the PVEs, the easiest is to lower the MTU on our MikroTik router on either side.

PS: The logs show the public IPs when packets are rejected, but their "link" goes through a VPN with private addresses (10.190.0.4 -> 10.0.0.1 -> 10.0.0.2 -> 10.133.0.4); I don't get how the public IPs end up being used.
 
Set the MTU of which device? The PVEs, 1, 4, all? The switch connected to them, or the router to the internet?

In /etc/network/interfaces for everything that corosync uses on all nodes.
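Something along these lines on each node, assuming corosync runs over the default bridge; the interface names, addresses and gateway below are placeholders, keep your own values and only add/change the mtu line:

Code:
auto vmbr0
iface vmbr0 inet static
        address 10.190.0.8/16
        gateway 10.190.0.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        mtu 1392

ifreload -a should apply it on current PVE (ifupdown2); otherwise a reboot will.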

All the PVEs are connected to a switch (not sure if it's manageable), then to a MikroTik router to which the ISP router is connected. Our MikroTik handles an EoIP tunnel between the 2 networks. If you want me to simulate fragmentation between the PVEs, the easiest is to lower the MTU on our MikroTik router on either side.

No, you already have fragmentation going on, no need to simulate. :)
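If you want to see it rather than simulate it, a DF-bit ping across the tunnel shows what actually fits without fragmentation (1364 bytes of payload + 28 bytes of IP/ICMP headers = 1392 on the wire):

Code:
# run from pve1 towards pve4; it fails with "message too long" / frag-needed if the path MTU is smaller
ping -M do -s 1364 -c 4 10.133.0.4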

PS: The logs show the public IPs when packets are rejected, but their "link" goes through a VPN with private addresses (10.190.0.4 -> 10.0.0.1 -> 10.0.0.2 -> 10.133.0.4); I don't get how the public IPs end up being used.

This is why I originally reacted that:

Actually, this is your issue. :)

But let's do away with the MTU issue first; then we'll have a cleaner picture in the logs of what comes next.
 
What happened there? :) I am a bit lost here now. I thought you'd get your MTU fixed and wait for the first time it has a hiccup, then have a look at the logs.
Sorry, I didn't know if you wanted to see what happens immediately. I'm still waiting for a hiccup, but as the logs show «strange» things to me, and as it takes 9 days to hiccup, better to send something now.

Anyway, I’ll wait for the next issue to post the logs again
 
New hiccup. I ran journalctl -u corosync --since "05:00:00" > corosync_pveX.log; nothing much is shown until the loss of the link between them.
Excellent. :D

Code:
Oct 15 20:19:29 pve1 corosync[1588]:   [KNET  ] link: host: 3 link: 0 is down
Oct 15 20:19:29 pve1 corosync[1588]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Oct 15 20:19:29 pve1 corosync[1588]:   [KNET  ] host: host: 3 has no active links
Oct 15 20:19:30 pve1 corosync[1588]:   [TOTEM ] Token has not been received in 3225 ms
Oct 15 20:19:31 pve1 corosync[1588]:   [TOTEM ] A processor failed, forming new configuration: token timed out (4300ms), waiting 5160ms for consensus.
Oct 15 20:19:36 pve1 corosync[1588]:   [QUORUM] Sync members[3]: 1 2 4
Oct 15 20:19:36 pve1 corosync[1588]:   [QUORUM] Sync left[1]: 3
Oct 15 20:19:36 pve1 corosync[1588]:   [TOTEM ] A new membership (1.1a99f) was formed. Members left: 3
Oct 15 20:19:36 pve1 corosync[1588]:   [TOTEM ] Failed to receive the leave message. failed: 3
Oct 15 20:19:36 pve1 corosync[1588]:   [QUORUM] Members[3]: 1 2 4
Oct 15 20:19:36 pve1 corosync[1588]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 15 20:20:33 pve1 corosync[1588]:   [KNET  ] rx: Packet rejected from PVE4_PUBLIC_IP:5405
[...]

So your 3 in one site (pve{1,2,3} ~ IDs 1, 2, 4) lost the odd one out (ID 3 ~ pve4), and lost it in a not-so-orderly way (no leave message).

And then it's rejecting it back, and the node stays orphaned.

Code:
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] link: host: 1 link: 0 is down
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] link: host: 2 link: 0 is down
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] link: host: 4 link: 0 is down
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 1 has no active links
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 2 has no active links
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Oct 15 20:19:28 pve4 corosync[863]:   [KNET  ] host: host: 4 has no active links
Oct 15 20:19:30 pve4 corosync[863]:   [TOTEM ] Token has not been received in 3225 ms
Oct 15 20:19:31 pve4 corosync[863]:   [TOTEM ] A processor failed, forming new configuration: token timed out (4300ms), waiting 5160ms for consensus.
Oct 15 20:19:36 pve4 corosync[863]:   [QUORUM] Sync members[1]: 3
Oct 15 20:19:36 pve4 corosync[863]:   [QUORUM] Sync left[3]: 1 2 4
Oct 15 20:19:36 pve4 corosync[863]:   [TOTEM ] A new membership (3.1a99f) was formed. Members left: 1 2 4
Oct 15 20:19:36 pve4 corosync[863]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 4
Oct 15 20:19:36 pve4 corosync[863]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 15 20:19:36 pve4 corosync[863]:   [QUORUM] Members[1]: 3
Oct 15 20:19:36 pve4 corosync[863]:   [MAIN  ] Completed service synchronization, ready to provide service.

What's your /etc/corosync/corosync.conf like on these nodes (say pve1 AND pve4 would suffice to share)?
 
So your 3 in one site (pve{1,2,3} ~ IDs 1, 2, 4) lost the odd one out (ID 3 ~ pve4), and lost it in a not-so-orderly way (no leave message).
Correct (so it seems).
What's your /etc/corosync/corosync.conf like on these nodes (say pve1 AND pve4 would suffice to share)?
They're both the same
 

Attachments

Code:
Oct 15 20:19:29 pve1 corosync[1588]:   [KNET  ] link: host: 3 link: 0 is down
Oct 15 20:20:33 pve1 corosync[1588]:   [KNET  ] rx: Packet rejected from PVE4_PUBLIC_IP:5405

Almost sure you lost connectivity on that PtP VPN and now the packet has found its way over from the wrong interface?

That's why it's rejecting the traffic; it's not expected from anywhere other than:

Code:
  node {
    name: pve4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.133.0.4
  }

That's not corosync's problem, you know that, right? :)
 
Almost sure you lost connectivity on that PtP VPN and now the packet has found its way over from the wrong interface?
I understand it's possible, but I can't see how. Losing the PtP link is possible, as it currently goes over the internet, but the routes on our gateways are as follows: 10.133.0.0/16 via 10.0.0.2 (on PVE1's side) and 10.190.0.0/16 via 10.0.0.1 (on PVE4's side). 10.0.0.0/30 on both sides is on the EoIP interface.

The PtP link is an EoIP tunnel; if it's down, it's unreachable, but I don't see how the packets sent by PVE4 get the src-address of its gateway's public IP, while no route mentions the public IP on either side o_O

That's not corosync's problem, you know that, right? :)
This, I get. If I allow PVE4's public IP to connect to corosync, I will be able to track down how the connection is made. The other issue is that the tunnel doesn't show as being down at any point. I have rebooted PVE4 and it's back online; I will do a constant ping or something else to check when the PtP is down.
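Probably something crude like this on PVE4, just to get timestamps for when the tunnel stops passing traffic (pinging PVE1's cluster address so it crosses the EoIP path; the log file name is arbitrary):

Code:
# log every time the far side stops answering
while true; do
    ping -c 1 -W 2 10.190.0.4 >/dev/null 2>&1 || echo "$(date -Is) 10.190.0.4 unreachable" >> /var/log/ptp-watch.log
    sleep 5
done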

I'll let you know, thank you for your assistance
 
I understand it's possible, but I can't see how. Losing the PtP link is possible, as it currently goes over the internet, but the routes on our gateways are as follows: 10.133.0.0/16 via 10.0.0.2 (on PVE1's side) and 10.190.0.0/16 via 10.0.0.1 (on PVE4's side). 10.0.0.0/30 on both sides is on the EoIP interface.

It's certainly an interesting case. I just want to say that I am not against digging this out, but from this point on, you know you are out of spec for the corosync connection anyhow, right? You can't use e.g. HA, and your fourth node will be constantly coming and going, momentarily disrupting the quorum of the remaining 3 - and this is AFTER you even fix this "routing" issue.

The PtP link is an EoIP tunnel; if it's down, it's unreachable, but I don't see how the packets sent by PVE4 get the src-address of its gateway's public IP, while no route mentions the public IP on either side o_O

Can you catch that packet anywhere midway with tshark or such?
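For example (tcpdump with the same filter works just as well; corosync/kronosnet uses UDP 5405 by default):

Code:
# print the source/destination each corosync packet actually carries at that point in the path
tshark -i any -f "udp port 5405" -T fields -e ip.src -e ip.dst
# tcpdump equivalent: tcpdump -ni any udp port 5405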

The other issue is that the tunnel doesn't show as being down at any point. I have rebooted PVE4 and it's back online; I will do a constant ping or something else to check when the PtP is down.

The link does not have to be down as in the interface going down; the traffic is simply not getting through - that's all you know from corosync's log.
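It might also be worth checking, right when it hiccups, which path and source address pve4 itself would pick for its peers, just to confirm the node never switches away from the private addresses and the rewrite happens further along (plain iproute2):

Code:
# on pve4: route and source address the kernel picks for pve1's corosync address
ip route get 10.190.0.4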

I'll let you know, thank you for your assistance

No worries, this one is funny at least. ;)
 
packet has found its way over from the wrong interface?

One other thing that crossed my mind: there's still the possibility (probably even more likely) that something is mangling the packet headers, so that they did not actually originate with that source address as such.
 
You can't use e.g. HA, and your fourth node will be constantly coming and going, momentarily disrupting the quorum of the remaining 3 - and this is AFTER you even fix this "routing" issue.
Of course, I know HA shouldn't be used in this case; as I said, it's more that I'm interested in having everything on one screen.
I have high hopes that our PtP wireless link will give me a better experience with this setup. I've also seen some posts about a central management solution for Proxmox; I'm waiting for that too.

Can you catch that packet anywhere midway with tshark or such?
I'll do that in my gateway, it has a packet sniffer.

One other thing that crossed my mind: there's still the possibility (probably even more likely) that something is mangling the packet headers, so that they did not actually originate with that source address as such.
This is also a possibility; my mangle rules only mangle packets in response to something coming in on a specific interface (my ISP's) and not on another virtual interface (like the EoIP). Anyway, I'll sniff the packets coming into PVE4's gateway and into PVE1's gateway; I'll get something for sure!

Glad you're having fun :D
 
