All nodes reboot when one node reboots

Feb 24, 2022
Hi,

We currently have two Dell PowerEdge servers running PVE 7.1.1 in a cluster. We don't have anything in HA and no shared storage. One of the nodes runs almost all of our production servers; the second one is almost empty. After receiving an error when trying to delete a user ("one of the nodes is too old"), I updated both nodes and then wanted to reboot the second (currently non-production) node via the GUI. However, the first (production) node suddenly rebooted too, without any warning, and all of our production servers went offline.
Does anyone know what the issue could be here? I have had Proxmox clusters set up this way before and was always able to reboot one node without issues (which is kind of the reason why we have a cluster in the first place).

thanks,
robl
 
Hi, thanks for the answer. Yes, I know that to run HA and so on you should use three nodes. But is it normal in a two-node cluster that when you reboot one node, the other one reboots as well? We already had a two-node cluster before where I could reboot the nodes independently of each other.
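
For anyone else debugging this: as far as I understand it, a PVE node should only self-reboot on quorum loss if the HA stack has armed the watchdog. A minimal check of that state, assuming a stock PVE install:

Code:
# Any configured HA resources? No services listed means HA is idle.
ha-manager status

# Are the HA services and the watchdog multiplexer running?
systemctl status pve-ha-lrm pve-ha-crm watchdog-mux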

thanks
 
I'm seeing the same thing: rebooting one node causes all nodes to reboot.

We have a Proxmox cluster of three nodes plus a Qdevice. Originally it was a two-node cluster, the two nodes being large and powerful servers with 120 cores and 300+ GB of memory each. In order to be able to use HA we added a Qdevice, which is a VM running on a VMware hypervisor on the same network. Later we added a third, experimental node, mainly for dev systems, which is a blade server with 24 cores and 96 GB of memory.

Q1: We didn't remove the Qdevice when we added the third node, should we have done?

Just now we are upgrading the cluster, first to the latest v7.4, and from there we plan to go to v8. A couple of weeks ago we upgraded the third node first, and that worked faultlessly, including rebooting it to complete the process. No problem. Then we upgraded the first of the two large machines; however, when we rebooted it, both other nodes also rebooted, on their own, completely unbidden. Unfortunately the first machine (the one we had deliberately rebooted) lost its boot disk at that point (the SSD failed with no warning), so it never came back up; both of the other two did.

I then rebuilt the failed machine with new boot disks (a RAID pair this time), brought it back online, and re-added it to the cluster.

So, at the moment we have:
Main Node 1, IP .25: v7.4-3, installed from memory stick.
Main Node 2, IP .26 : v7.0-11, as upgraded from v6 ages ago.
Dev Node, IP .13: v7.4-16, as upgraded via apt dist-upgrade from previous 7.3.
Qdevice: Unchanged since first installed when everything was running v6.

pvecm status gives this output:

Code:
Cluster information
-------------------
Name:             core-1
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Oct 18 21:16:41 2023
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1.2bc5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1         NR 10.10.4.25
0x00000002          1    A,V,NMW 10.10.4.26 (local)
0x00000003          1         NR 10.10.4.13
0x00000000          1            Qdevice
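
(Flag decode for the Qdevice column, per corosync-quorumtool conventions: A/NA = alive or not alive, V/NV = the Qdevice does or does not cast a vote for that node, MW/NMW = master-wins set or not set, NR = not registered.)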

Before I attempt to upgrade Node-2 (the one flagged as (local) here), should I remove the Qdevice? And before doing that, should I be concerned that two of the nodes are showing NR, indicating the Qdevice is unregistered for them?

We currently have most of our production VMs migrated to Node-1, with a couple on Node-3, so if I can upgrade and reboot Node-2 without interfering with Nodes 1 and 3, that would be ideal.
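
In theory, the quorum arithmetic from the pvecm output above says a single reboot should be survivable (assuming the Qdevice keeps voting):

Code:
Expected votes: 4  (3 nodes + Qdevice)
Quorum:         floor(4/2) + 1 = 3
One node down:  4 - 1 = 3 votes remaining -> still quorate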
 
As you have three nodes, a QDevice basically won't help you and is useless.

And then https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support gives some hints regarding additionally possible problems.

So I would remove the QDevice, but (fortunately) I have no experience with actual problems like yours.
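
For reference, removal itself should just be a single command, run on any one cluster node (it updates the cluster-wide corosync config; assuming a current pve-cluster package, no reboot should be needed):

Code:
pvecm qdevice remove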

Best regards

Fair enough, I appreciate the reply. Having reviewed your link, I see what you mean: the Qdevice is effectively a potential single point of failure.

I haven't made any changes yet. I'm not sure whether removing the Qdevice is liable to cause a cluster reboot, so I'm waiting until I'm next on the actual premises and can deal with any hardware issues, just in case.
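
When I do make the change, a sanity check afterwards should just be (stock pve/corosync tools):

Code:
pvecm status            # Expected votes should drop from 4 to 3, Quorum from 3 to 2
corosync-quorumtool -s  # cross-check the quorum state from corosync directly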
 
