[SOLVED] Lost quorum restarts servers

Jan 23, 2020
I just need to check whether this is a bug or a feature.
In our testing environment we have a Proxmox cluster with 2 nodes and one Linux machine that acts as a Qdevice. As the documentation would put it: a 2+1 setup.
When we lost the network on one of the nodes we expected the other one to stay functional, but no. Both nodes, the one without network and the one with network, went into a restart. Not just some service restart, but a physical machine, OS-level restart.

We didn't separate the networks, so in our cluster everything goes over the same network. The same thing, with the servers restarting, happened once more when we tried to migrate a VM with its images on local disks.

So, my question is: why do the servers go into a physical restart when the cluster has a network problem? And is there some setting I can change to prevent this?


pvecm status
Cluster information
-------------------
Name: CLUSTERNAME
Config Version: 9
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Thu Feb 27 14:45:41 2020
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.b3c
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 10.0.0.1 (local)
0x00000002          1         NR 10.0.0.2
0x00000000          1            Qdevice
 
Do you have any resources under HA? (see https://pve.proxmox.com/pve-docs/chapter-ha-manager.html)
This would explain the host restarts: if a node loses quorum and has had any HA resource defined on it since the last boot, it gets fenced (check the docs above).
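If you're not sure whether anything is (or was) configured under HA, the quickest check (assuming a default PVE setup) is:

ha-manager status
cat /etc/pve/ha/resources.cfg

No "service ..." lines in the status output and an empty (or missing) resources.cfg would rule HA out.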

More details could maybe be found in the logs from before the reboot (however, since both nodes got reset, I suppose the last log lines have not been synced to disk, so the information is probably incomplete).

We didn't separate the networks, so in our cluster everything goes over the same network.
This is not recommended, especially with HA (a network overload can cause the nodes to fence themselves).

And is there some setting I can change to prevent this?
If you don't need HA - disable it.
In any case, consider separating the networks (especially, put corosync on a (physical) network of its own).
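A rough sketch of both steps (vm:100 and the 10.10.10.0/24 subnet are only examples, adjust to your environment):

ha-manager remove vm:100

removes a VM from HA management without touching the VM itself. For a dedicated corosync network, each node entry in /etc/pve/corosync.conf would point at an address on a subnet that carries nothing but corosync traffic, roughly:

nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.2
  }
}

Remember to bump config_version in the totem section whenever you edit /etc/pve/corosync.conf, so the change gets propagated.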

I hope this helps!
 

Thanks for your answer mate.
We didn't separate the networks because this is just a testing environment. In production we will have a lot more nodes and networks available.

I think I understand fencing and HA but correct me if I am wrong.
This is my ha-manager status:
quorum OK
master node1 (active, Fri Feb 28 09:49:54 2020)
lrm node1 (active, Fri Feb 28 09:49:54 2020)
lrm node2 (active, Fri Feb 28 09:49:55 2020)
service vm:100 (node1, started)
service vm:101 (node2, started)
service vm:103 (node1, started)
service vm:104 (node1, started)
service vm:105 (node1, stopped)
service vm:106 (node2, started)
service vm:107 (node2, started)
service vm:108 (node1, stopped)
service vm:109 (node1, stopped)
service vm:110 (node1, started)

All our VMs are on one network share, nothing on local disks. And from the nodes' active times you can see that they were restarted almost at the same time.
Yesterday I unplugged the network from, let's say, node2. Why would fencing or HA restart node1 completely when there isn't any problem with node1?
I understand that I lost quorum (I am not sure why I lost it, since I had the Qdevice configured per the documentation), but why does losing quorum send my perfectly healthy node1 into a restart? We also had this behaviour when I tried to migrate one VM with its disk on local drives. And to repeat once more: I understand why I lose quorum, but I don't understand why nodes that are online and don't have any kind of problem go into a complete restart.

In the Proxmox documentation there is this example: "For example, in a cluster with 15 nodes 7 could fail before the cluster becomes inquorate."

So, from this example, if 8 nodes fail the cluster becomes inquorate, but what happens with the rest of the nodes that didn't fail? Do they go into a restart like ours, just because they have HA enabled?
 
I would guess that your network did not sustain the traffic burst that happened because node2 was offline (retransmits of corosync packets), hence communication between node1 and the qdevice was interrupted as well -> node1 was not quorate anymore either

check the logs/journal on the nodes and on the qdevice
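For example (assuming journald keeps logs across reboots, i.e. /var/log/journal exists on the nodes):

on each node: journalctl -b -1 -u corosync -u corosync-qdevice -u pve-ha-lrm -u pve-ha-crm
on the qdevice host: journalctl -u corosync-qnetd

The -b -1 selects the previous boot, which is where the lines from right before the fence should be.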

In the Proxmox documentation there is this example: "For example, in a cluster with 15 nodes 7 could fail before the cluster becomes inquorate."
with HA enabled those 7 nodes would fence themselves

So, from this example, if 8 nodes fail the cluster becomes inquorate, but what happens with the rest of the nodes that didn't fail? Do they go into a restart like ours, just because they have HA enabled?
If the cluster is not quorate (fewer than half+1 of the nodes see each other) -> all nodes with HA resources would fence themselves
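To put numbers on it: votequorum needs quorum = floor(total_votes / 2) + 1 votes. In your 2+1 setup total_votes = 3, so quorum = floor(3/2) + 1 = 2 (exactly what pvecm status printed above: Total votes: 3, Quorum: 2) - one node plus the qdevice vote stays quorate, a node on its own does not. In the 15-node example quorum = floor(15/2) + 1 = 8, which is why 7 nodes may fail while the remaining 8 keep running.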
 

Thanks again. Just one more question and then I will stop asking.
I just need clarification about fencing. The logs from both servers are in the attachment.
Documentation quote: "On node failures, fencing ensures that the erroneous node is guaranteed to be offline."
So fencing occurs on erroneous nodes, but I am interested in the nodes that aren't erroneous. Why do healthy nodes that don't have any error go into a restart and disconnect from the network?

Thanks for your help.
 

So fencing occurs on erroneous nodes, but I am interested in the nodes that aren't erroneous. Why do healthy nodes that don't have any error go into a restart and disconnect from the network?

Because that's how quorum-based systems work: if more than half of the nodes are gone, the remaining ones cannot know whether they are indeed the only ones that survived (it could just as well be that the other half+1 are alive and see each other) -> if a node is in a partition with fewer than half(+1) of the cluster nodes, it is not quorate.
In an HA setting, a node which is not quorate needs to fence itself to prevent a split-brain situation (see https://en.wikipedia.org/wiki/Byzantine_fault)
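For reference (roughly, see the ha-manager docs linked above for the exact details): the self-fencing in PVE is watchdog based - pve-ha-lrm keeps a watchdog armed via watchdog-mux (backed by the softdog kernel module unless a hardware watchdog is configured), and once the node loses quorum the watchdog is no longer renewed, so its expiry hard-resets the machine. That is the full OS restart you observed; journalctl -b -1 -u pve-ha-lrm -u watchdog-mux on a fenced node should show it losing quorum right before the reset.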

From the logs it seems that node1 could not connect to the qnetd on the decision node -> it was not quorate and fenced itself

Where is the 3rd node hosted? Anything in its logs (especially from corosync-qnetd)?
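On the qdevice host, besides the journal, corosync-qnetd-tool -s should show whether qnetd is up and how many clients are connected, and corosync-qnetd-tool -l lists the connected cluster nodes (flags from memory, check the man page) - if node1 drops out of that list around the time of the incident, the problem is on the path between node1 and the qdevice.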
 
The 3rd node has corosync-qnetd installed on it and acts as the Qdevice only. I don't have anything in its logs, so I don't know what went wrong there.
We will soon add a 3rd full Proxmox server to the testing environment, so next time I will test it properly.

Thanks a lot for your help, much more is clear now. Sorry if I was asking something that I should have understood from the documentation.
 
No need to apologize! The forum is here to ask questions ;)

Please mark the thread as 'SOLVED' - it could help others with similar questions


Thanks!
 
