[SOLVED] Lost quorum restarts servers

Jan 23, 2020
I just need to check whether this is a bug or a feature.
In our testing environment we have a Proxmox cluster with 2 nodes and one Linux machine that acts as a Qdevice. As the documentation would put it: a 2+1 setup.
When we lost the network on one of the nodes we expected the other one to stay functional, but no. Both nodes, the one without network and the one with network, went into a restart. Not just some service restart, but a physical machine, OS-level restart.

We didn't separate the networks, so in our cluster everything goes over the same network. The same thing, with the servers restarting, happened once more when we tried to migrate a VM with its images on local disks.

So, my question is: why do the servers go into a physical restart when the cluster has a network problem? And is there some setting I can change to prevent this?


pvecm status
Cluster information
-------------------
Name: CLUSTERNAME
Config Version: 9
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Feb 27 14:45:41 2020
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.b3c
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 10.0.0.1 (local)
0x00000002          1         NR 10.0.0.2
0x00000000          1            Qdevice
 
Do you have any resources under HA? (see https://pve.proxmox.com/pve-docs/chapter-ha-manager.html)
This would explain the host restarts: if a node loses quorum and has had any HA resource defined on it since the last boot, it gets fenced (check the docs above).
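If you're not sure whether anything is (or was) configured under HA, the quickest check (assuming a default PVE setup) is:

ha-manager status
cat /etc/pve/ha/resources.cfg

No "service ..." lines in the status output and an empty (or missing) resources.cfg would rule HA out.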

More details could maybe be found in the logs from before the reboot (however, since both nodes got reset, I suppose the last log lines have not been synced to disk, so the information is probably incomplete).

We didn't separate the networks, so in our cluster everything goes over the same network.
This is not recommended, especially with HA (a network overload can cause the nodes to fence themselves).

And is there some setting I can change to prevent this?
If you don't need HA - disable it.
In any case, consider separating the networks (especially, put corosync on a (physical) network of its own).
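A rough sketch of both steps (vm:100 and the 10.10.10.0/24 subnet are only examples, adjust to your environment):

ha-manager remove vm:100

removes a VM from HA management without touching the VM itself. For a dedicated corosync network, each node entry in /etc/pve/corosync.conf would point at an address on a subnet that carries nothing but corosync traffic, roughly:

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
}

Remember to bump config_version in the totem section whenever you edit /etc/pve/corosync.conf, so the change gets propagated.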

I hope this helps!
 

Thanks for your answer mate.
We didn't separate the networks because this is just a testing environment. In production we will have a lot more nodes and networks available.

I think I understand fencing and HA but correct me if I am wrong.
This is my ha-manager status:
quorum OK
master node1 (active, Fri Feb 28 09:49:54 2020)
lrm node1 (active, Fri Feb 28 09:49:54 2020)
lrm node2 (active, Fri Feb 28 09:49:55 2020)
service vm:100 (node1, started)
service vm:101 (node2, started)
service vm:103 (node1, started)
service vm:104 (node1, started)
service vm:105 (node1, stopped)
service vm:106 (node2, started)
service vm:107 (node2, started)
service vm:108 (node1, stopped)
service vm:109 (node1, stopped)
service vm:110 (node1, started)

All our VMs are on one network share, nothing on local disks. And from the nodes' active times you can see that they were restarted almost at the same time.
Yesterday I unplugged the network from, let's say, node2. Why would fencing or HA restart node1 completely when there isn't any problem with node1?
I understand that I lost quorum (I am not sure why I lost it, since I had the Qdevice configured per the documentation), but why does losing quorum send my perfectly healthy node1 into a restart? We also had this behaviour when I tried to migrate one VM with its disk on local drives. And to repeat once more: I understand why I lose quorum, but I don't understand why nodes that are online and don't have any kind of problem go into a complete restart.

In the Proxmox documentation there is this example: "For example, in a cluster with 15 nodes 7 could fail before the cluster becomes inquorate."

So, from this example, if 8 nodes fail the cluster becomes inquorate, but what happens with the rest of the nodes that didn't fail? Do they go into a restart like ours, just because they have HA enabled?
 
I would guess that your network did not sustain the traffic burst that happened because node2 was offline (retransmits of corosync packets), hence communication between node1 and the qdevice was interrupted as well -> node1 was not quorate anymore either

check the logs/journal on the nodes and on the qdevice
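For example (assuming journald keeps logs across reboots, i.e. /var/log/journal exists on the nodes):

on each node: journalctl -b -1 -u corosync -u corosync-qdevice -u pve-ha-lrm -u pve-ha-crm
on the qdevice host: journalctl -u corosync-qnetd

The -b -1 selects the previous boot, which is where the lines from right before the fence should be.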

In the Proxmox documentation there is this example: "For example, in a cluster with 15 nodes 7 could fail before the cluster becomes inquorate."
with HA enabled those 7 nodes would fence themselves

So, from this example, if 8 nodes fail the cluster becomes inquorate, but what happens with the rest of the nodes that didn't fail? Do they go into a restart like ours, just because they have HA enabled?
If the cluster is not quorate (fewer than half+1 of the nodes see each other) -> all nodes with HA resources would fence themselves
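To put numbers on it: votequorum needs quorum = floor(total_votes / 2) + 1 votes. In your 2+1 setup total_votes = 3, so quorum = floor(3/2) + 1 = 2 (exactly what pvecm status printed above: Total votes: 3, Quorum: 2) - one node plus the qdevice vote stays quorate, a node on its own does not. In the 15-node example quorum = floor(15/2) + 1 = 8, which is why 7 nodes may fail while the remaining 8 keep running.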
 

Thanks again. Just one more question and then I will stop asking.
I just need clarification about fencing. The logs from both servers are in the attachment.
Documentation quote: "On node failures, fencing ensures that the erroneous node is guaranteed to be offline."
So fencing occurs on erroneous nodes, but I am interested in the nodes that aren't erroneous. Why do healthy nodes that don't have any error go into a restart and disconnect from the network?

Thanks for your help.
 

So fencing occurs on erroneous nodes, but I am interested in the nodes that aren't erroneous. Why do healthy nodes that don't have any error go into a restart and disconnect from the network?

Because that's how quorum-based systems work: if more than half of the nodes are gone, the remaining ones cannot know whether they are indeed the only ones that survived (it could just as well be that the other half+1 are alive and see each other) -> if a node is in a partition with fewer than half(+1) of the cluster nodes, it is not quorate.
In an HA setting, a node which is not quorate needs to fence itself to prevent a split-brain situation (see https://en.wikipedia.org/wiki/Byzantine_fault)
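For reference (roughly, see the ha-manager docs linked above for the exact details): the self-fencing in PVE is watchdog based - pve-ha-lrm keeps a watchdog armed via watchdog-mux (backed by the softdog kernel module unless a hardware watchdog is configured), and once the node loses quorum the watchdog is no longer renewed, so its expiry hard-resets the machine. That is the full OS restart you observed; journalctl -b -1 -u pve-ha-lrm -u watchdog-mux on a fenced node should show it losing quorum right before the reset.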

From the logs it seems that node1 could not connect to the qnetd on the decision node -> it was not quorate and fenced itself

Where is the 3rd node hosted? Anything in its logs (especially from corosync-qnetd)?
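On the qdevice host, besides the journal, corosync-qnetd-tool -s should show whether qnetd is up and how many clients are connected, and corosync-qnetd-tool -l lists the connected cluster nodes (flags from memory, check the man page) - if node1 drops out of that list around the time of the incident, the problem is on the path between node1 and the qdevice.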
 
The 3rd node has corosync-qnetd installed on it and acts as the Qdevice only. I don't have anything in its logs, so I don't know what went wrong there.
We will soon add a 3rd full Proxmox server to the testing environment, so next time I will test it properly.

Thanks a lot for your help, much more is clear now. Sorry if I was asking something that I should have understood from the documentation.
 
No need to apologize! The forum is here to ask questions ;)

Please mark the thread as 'SOLVED' - it could help others with similar questions


Thanks!
 
