Rebooting one node causes other node to reboot

SteveITS

Member
Feb 6, 2025
This is in a lab/test/POC setup. Fortunately.

I discovered that if I issue a "reboot" command on node 2, node 1 also reboots. I can duplicate this. I only have two nodes in this cluster so I don't know if this is "all nodes" or "just one other node."

That seems rather dangerous/poor behavior. What is the reason and how can that be avoided?

root@hn2p:~# reboot

root@hn1p:~# uptime
11:02:36 up 1 min,
 
Hm, yes.

This would imply that if something tragic happened to enough nodes the rest of the cluster would perpetually reboot? And they might all be "permanently" offline because they are all constantly rebooting and (potentially) never see enough nodes?

Or, say, the storage network switch drops out?

In production we'll have 5 nodes.
 
I discovered that if I issue a "reboot" command on node 2, node 1 also reboots. I can duplicate this. I only have two nodes in this cluster so I don't know if this is "all nodes" or "just one other node."
Any cluster requires a "majority" to operate properly. A majority in a two-node cluster is _two_. If you lose a node or lose network connectivity between the nodes, neither node can know whether it is the only surviving member. As such, the proper step is to reboot itself, thereby releasing any clustered services that may be running.
If it does not do so, there is a risk of both nodes trying to run the same service and causing data corruption. This is called "split-brain".

It is stated in many places (documentation, forums, etc.) that you must run a cluster with an _odd_ number of nodes. In a 3-node cluster a reboot of one node leaves two running - a majority.
A 4-node cluster can survive the reboot of one node, but will not survive a network loss that splits it 2+2.
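
For quick reference, the majority is floor(N/2) + 1 votes, which gives:

nodes   majority   node losses tolerated
  2        2              0
  3        2              1
  4        3              1
  5        3              2
  6        4              2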

Some cluster implementations (HyperV) can use a Quorum disk. For PVE you have an option to run a low-budget QDevice.
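
The QDevice setup is roughly: install corosync-qnetd on the external box and corosync-qdevice on the cluster nodes, then register it from one node. A sketch (the IP is a placeholder for your external host):

# on the external QDevice host (a PBS box, a Pi, a small VM, ...)
apt install corosync-qnetd

# on every PVE cluster node
apt install corosync-qdevice

# on one PVE node, pointing at the external host (placeholder IP)
pvecm qdevice setup 192.0.2.10

# verify the extra vote shows up
pvecm status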


 
This would imply that if something tragic happened to enough nodes
There is always a cliff where if something tragic were to happen to enough nodes, the cluster will not survive. Generally, a double-fault is enough.
Your business requirements and budget will dictate where that cliff is. The more failures you want to survive - the more expensive it gets.
the rest of the cluster would perpetually reboot?
Nodes should not be stuck in a perpetual reboot cycle. After a reboot, they should come up and remain in a degraded state until a majority quorum can be established.

However, if your infrastructure or network is unstable (flapping), nodes may repeatedly lose and regain connectivity, triggering additional reboots.

For extreme situations, there are manual methods to force quorum and recover stability.
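
For example, on a node that is genuinely the last one standing, you can temporarily override the expected vote count:

# only if the other nodes are truly down/powered off, not merely unreachable
pvecm expected 1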



 
I just didn't realize a reboot was involved. As opposed to, say, disallowing writes to the ceph storage.
A reboot is the most reliable way to ensure there is no conflict.

Note that Ceph is not the storage the PVE cluster is primarily concerned about. This is the file system that will come up read-only after the reboot until cluster quorum can be formed: https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)
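
You can see the effect for yourself: while a node has no quorum, /etc/pve is read-only and writes are refused, roughly like this:

# on a node without quorum, pmxcfs refuses writes
root@hn1p:~# touch /etc/pve/testfile
touch: cannot touch '/etc/pve/testfile': Permission denied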

Cheers


 
I've realized this creates a "problem" for us. Our general plan was to convert to Proxmox while migrating off Virtuozzo, and repurpose the existing servers. The thought was to add a (second) backup server, add one new physical server, and then gradually install Proxmox on the other five. Virtuozzo can get down to two nodes (2 of 3 MDS).

However if they all get votes in Proxmox we'd end up with six nodes not five.

One option then is a QDevice. Is it recommended/not recommended to run that on a PBS server?
 
One option then is a QDevice. Is it recommended/not recommended to run that on a PBS server?

As long as you realize that the PBS will become an essential/critical part of your PVE cluster, and not just a backup server that can be rebooted at any time, you should be fine.


 
That's a good point. I suppose other options are a VM inside the cluster (ugh) or a Pi (bit less ugh).

Or making one of the other nodes have 2 votes, which works if that one is "more normally" on (the new one?). If I do that the quorum is 4/7 votes so it seems like it could lose either two servers (including the one with 2 votes) or three others (one vote each).
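
(If I'm reading the docs right, that vote change would just be a quorum_votes entry in /etc/pve/corosync.conf, something like the snippet below; hostname and address are placeholders, and config_version has to be bumped when editing.)

# excerpt from /etc/pve/corosync.conf
nodelist {
  node {
    name: hn6p           # placeholder: the node that is "more normally" on
    nodeid: 6
    quorum_votes: 2      # this node carries two votes
    ring0_addr: 10.0.0.6
  }
}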

In practice I'm only intentionally restarting one at a time, of course.
 
However if they all get votes in Proxmox we'd end up with six nodes not five.
Are you planning on shutting down 3 of your servers often, or are you simply planning for a "just in case 3 servers break"? In the first case, use a QDevice somewhere (PBS is fine, although I also install PVE on my PBS servers and prefer to run the QDevice in a VM instead of on bare metal). The other three options you mentioned are far from ideal.

In the second case, for disasters, get familiar with "pvecm expected" to lower the quorum requirement manually, but use it only if the missing nodes are completely off, i.e. not just a corosync network failure. That command is powerful, but prone to introducing PVE database conflicts if not used properly.
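
As a sketch of what that looks like (the vote count is a placeholder for however many nodes actually survive):

# check what corosync currently sees
pvecm status        # look at "Quorate", "Expected votes", "Total votes"

# only when the missing nodes are powered off for good
pvecm expected 3    # example: 3 surviving nodes out of the original 6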

Also, you mention Ceph: remember it has its own quorum, and losing 3 of 6 servers at once will take down the whole cluster unless you use at least a 4/2 replicated pool.
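
For reference, a 4/2 pool is just the size/min_size settings on the pool (pool name is a placeholder):

# keep 4 replicas, keep serving I/O as long as at least 2 are available
ceph osd pool set mypool size 4
ceph osd pool set mypool min_size 2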

I suggest you set up a staging environment using virtual machines and get used to how the cluster will behave with your requirements.
 
Just in case. In normal usage we restart one at a time for updates or some other maintenance, and that's basically all we've had to do.

Some of these are "dual" servers that share one redundant power supply. So in theory two could possibly die together if both power supplies fail (both halves).

I'd already seen references to "pvecm expected", but good reminder.

re: replication, so with 3/2 your point is that when losing 3 servers that might include all 3 replications of a chunk? And using 4/2 ensures a fourth copy?

This is a lab env, hence me wading into it to try to break and discover things as I go along. Bull in a china shop, somewhat intentionally.

I mean, with 5 nodes the cluster can lose 2 and remain stable. With 6 nodes it can still only lose 2, so from that standpoint, that's not really any different. It's just, "not 3."
 
Some of these are "dual" servers that share one redundant power supply. So in theory two could possibly die together if both power supplies fail (both halves).
Supermicro Twin? Ceph-wise, I would create a custom CRUSH map to define a "chassis" level and ensure that a single chassis will not hold two copies.
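
As a sketch (bucket names are placeholders; each twin chassis gets its own bucket and the rule uses the chassis as the failure domain):

# create a bucket per physical chassis and hang the paired hosts under it
ceph osd crush add-bucket twin1 chassis
ceph osd crush move twin1 root=default
ceph osd crush move hn1p chassis=twin1
ceph osd crush move hn2p chassis=twin1
# ...repeat for the other chassis...

# replication rule that spreads copies across chassis instead of hosts
ceph osd crush rule create-replicated replicated_chassis default chassis
ceph osd pool set mypool crush_rule replicated_chassis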

re: replication, so with 3/2 your point is that when losing 3 servers that might include all 3 replications of a chunk? And using 4/2 ensures a fourth copy?
Replace "migh" with "will" for around one third of your PGs. It's a fact, not a chance ;). The fourth copy ensures that you have to lose at least 4 machines to lose data. Keep in mind that if there's enough time between server failures, Ceph will rebalance and recreate copies on other nodes. In practice the chances for that to happend are low, specially if you use a custom map to tell Ceph not to host two copies that depend on the same pair of PSUs.

This is a lab env, hence me wading into it to try to break and discover things as I go along. Bull in a china shop, somewhat intentionally.
That's how it should be: expect the best, prepare for the worst.

I mean, with 5 nodes the cluster can lose 2 and remain stable. With 6 nodes it can still only lose 2, so from that standpoint, that's not really any different. It's just, "not 3."
The point here is that you can easily recover PVE quorum. With Ceph it's nowhere near as easy: e.g. if you lose 3 of your 5 monitors, you will have to enter disaster recovery, inject a custom monmap without the lost monitors, edit config files, hope you have at least one copy of every PG, etc.
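
For the record, the monitor part of that recovery is roughly the procedure from the Ceph docs (mon IDs are placeholders): stop a surviving mon, strip the dead mons from its monmap, inject it back.

# on a surviving monitor node
systemctl stop ceph-mon@hn1p
ceph-mon -i hn1p --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm hn3p
monmaptool /tmp/monmap --rm hn4p
monmaptool /tmp/monmap --rm hn5p
ceph-mon -i hn1p --inject-monmap /tmp/monmap
systemctl start ceph-mon@hn1p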
 
Yes, Twins. So far they have been great. Not entirely sure we'll keep going that way since 1U is not as expensive now, but we will still need lots of drive bays.

I did not realize that was possible with the crush map, will look into that.

Proxmox seems to recommend only 3 monitors? That's what we're used to on Virtuozzo anyway, 3 of what they call MDS servers. We put those on separate chassis now.

Thanks all. Diverged off my initial report, but good discussion to have.