All nodes reboot when one node reboots

Feb 24, 2022
Hi,

We currently have two Dell PowerEdge servers running PVE 7.1.1 in a cluster. We don't have anything in HA and no shared storage. One of the nodes runs almost all of our production servers; the second one is almost empty. After receiving an error when trying to delete a user ("one of the nodes is too old"), I updated both nodes and then wanted to reboot the second (currently non-production) node via the GUI. However, the first (production) node suddenly rebooted too, without any warning, and all of our production servers went offline.
Does anyone know what the issue could be here? I have had Proxmox clusters set up this way before and was always able to reboot one node without issues (which is kind of the reason why we have a cluster in the first place).

thanks,
robl
 
Hi, thanks for the answer. Yes, I know that to run HA and so on you should use three nodes. But is it normal in a two-node cluster that when you reboot one node, the other one reboots as well? We already had a two-node cluster before where I could reboot the nodes independently of each other.
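
For anyone else debugging this: as far as I understand it, a PVE node should only self-reboot on quorum loss if the HA stack has armed the watchdog. A minimal check of that state, assuming a stock PVE install:

Code:
# Any configured HA resources? No services listed means HA is idle.
ha-manager status

# Are the HA services and the watchdog multiplexer running?
systemctl status pve-ha-lrm pve-ha-crm watchdog-mux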

thanks
 
I'm seeing the same thing: rebooting one node causes all nodes to reboot.

We have a Proxmox cluster of three nodes plus a Qdevice. Originally it was a two-node cluster, the two nodes being large and powerful servers with 120 cores and 300+ GB of memory each. In order to be able to use HA we added a Qdevice, which is a VM running on a VMware hypervisor on the same network. Later we added a third, experimental node, mainly for dev systems, which is a blade server with 24 cores and 96 GB of memory.

Q1: We didn't remove the Qdevice when we added the third node, should we have done?

Just now we are upgrading the cluster, first to the latest v7.4, and from there we plan to go to v8. A couple of weeks ago we upgraded the third node first, and that worked faultlessly, including rebooting it to complete the process. No problem. Then we upgraded the first of the two large machines; however, when we rebooted it, both other nodes also rebooted, on their own, completely unbidden. Unfortunately the first machine (the one we had deliberately rebooted) lost its boot disk at that point (the SSD failed with no warning), so it never came back up; both of the other two did.

I then rebuilt the failed machine with new boot disks (a RAID pair this time), brought it back online, and re-added it to the cluster.

So, at the moment we have:
Main Node 1, IP .25: v7.4-3, installed from memory stick.
Main Node 2, IP .26 : v7.0-11, as upgraded from v6 ages ago.
Dev Node, IP .13: v7.4-16, as upgraded via apt dist-upgrade from previous 7.3.
Qdevice: Unchanged since first installed when everything was running v6.

pvecm status gives this output:

Code:
Cluster information
-------------------
Name:             core-1
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Oct 18 21:16:41 2023
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1.2bc5
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1         NR 10.10.4.25
0x00000002          1    A,V,NMW 10.10.4.26 (local)
0x00000003          1         NR 10.10.4.13
0x00000000          1            Qdevice
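
(Flag decode for the Qdevice column, per corosync-quorumtool conventions: A/NA = alive or not alive, V/NV = the Qdevice does or does not cast a vote for that node, MW/NMW = master-wins set or not set, NR = not registered.)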

Before I attempt to upgrade Node-2 (the one flagged as (local) here), should I remove the Qdevice? And before doing that, should I be concerned that two of the nodes are showing NR, indicating the Qdevice is unregistered for them?

We currently have most of our production VMs migrated to Node-1, with a couple on Node-3, so if I can upgrade and reboot Node-2 without interfering with Nodes 1 and 3, that would be ideal.
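
In theory, the quorum arithmetic from the pvecm output above says a single reboot should be survivable (assuming the Qdevice keeps voting):

Code:
Expected votes: 4  (3 nodes + Qdevice)
Quorum:         floor(4/2) + 1 = 3
One node down:  4 - 1 = 3 votes remaining -> still quorate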
 
As you have three nodes, a QDevice basically won't help you and is useless.

And then https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support gives some hints regarding additionally possible problems.

So I would remove the QDevice, but (fortunately) I have no experience with actual problems like yours.
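
For reference, removal itself should just be a single command, run on any one cluster node (it updates the cluster-wide corosync config; assuming a current pve-cluster package, no reboot should be needed):

Code:
pvecm qdevice remove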

Best regards

Fair enough, I appreciate the reply. Having reviewed your link, I see what you mean: the Qdevice is effectively a potential single point of failure.

I haven't made any changes yet. I'm not sure whether removing the Qdevice is liable to cause a cluster reboot, so I'm waiting until I'm next on the actual premises and can deal with any hardware issues, just in case.
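
When I do make the change, a sanity check afterwards should just be (stock pve/corosync tools):

Code:
pvecm status            # Expected votes should drop from 4 to 3, Quorum from 3 to 2
corosync-quorumtool -s  # cross-check the quorum state from corosync directly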
 
