How is this possible?

pant-dm

New Member
Apr 16, 2015
Hello.

It's an amazing story.

I had problems with the cluster: https://forum.proxmox.com/threads/quourum-dissolved.34572/

But suddenly, two weeks later, the cluster recovered on its own. Is that possible?! The only thing I did, shortly before the cluster came back, was power on a blade adjacent to the cluster.
After the recovery I looked at the cluster status, and one thing surprised me very much: where did the 5 active subsystems come from, when there are only 4 blades in the cluster?
root@vnode4:~#
root@vnode4:~# pvecm status
Version: 6.2.0
Config Version: 33
Cluster Name: vcluster
Cluster Id: 28468
Cluster Member: Yes
Cluster Generation: 3080
Membership state: Cluster-Member
Nodes: 4
Expected votes: 4
Total votes: 4
Node votes: 1
Quorum: 3
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: vnode4
Node ID: 3
Multicast addresses: 239.192.111.163
Node addresses: 10.61.5.104
root@vnode4:~#
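As a quick sanity check, the vote arithmetic in the report above is internally consistent. A minimal shell sketch (the variable below just replays the relevant lines of the paste, not live cluster state):

```shell
# Replay the relevant fields of the pasted `pvecm status` output and check
# the vote arithmetic: 4 nodes at 1 vote each gives Total votes 4, and the
# quorum threshold is floor(total/2) + 1 = 3, matching the report.
status='Nodes: 4
Expected votes: 4
Total votes: 4
Node votes: 1
Quorum: 3'

nodes=$(echo "$status" | awk -F': ' '/^Nodes/ {print $2}')
total=$(echo "$status" | awk -F': ' '/^Total votes/ {print $2}')
quorum=$(echo "$status" | awk -F': ' '/^Quorum/ {print $2}')

echo "nodes=$nodes total=$total quorum=$quorum"
if [ "$quorum" -eq $(( total / 2 + 1 )) ]; then
    echo "quorum formula checks out"
fi
```

So the node and vote counts themselves are fine; the odd number is only the "Active subsystems" line.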

It turned out that historically there were six blades in this cluster, but two of them were removed quite a while ago. And in fact I powered on a completely different blade with a different IP address, but in the same slot where one of the removed blades used to be. Could this have affected the recovery of the cluster? How can I clean the old blades out of the cluster configuration?

Grateful for any advice.
 
Any ideas? Perhaps there are hidden configuration files for the cluster, somehow related to fencing or something of the sort, which are not updated when a node is removed from the cluster normally?

PS: the "Datacenter/HA" config is correct.
 
Hi,

I've seen cases where a cluster gets grumpy if one node reboots and then another in rapid sequence and they fail to establish quorum. In your case, if I understand you properly, the first post above the pasted report says "4 nodes", which is correct. You can review the cluster docs to see how to make a cluster forget about a node (i.e., a failed piece of hardware that is removed, not replaced). The blade slot does not matter: Proxmox knows nothing about the hardware's identity other than its IP and hostname, so the physical placement (slot, chassis, etc.) has no impact. Possibly in your case a graceful, slow reboot of the cluster nodes would have been enough for the cluster to regain its senses, rather than a two-week wait.
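For reference, a hedged sketch of the "make the cluster forget a node" step. The names vnode5/vnode6 are hypothetical stand-ins for the two removed blades (run `pvecm nodes` first to see what stale names the cluster actually remembers); the guard makes this a no-op on a machine without Proxmox:

```shell
# Sketch only: vnode5/vnode6 are assumed names for the long-removed blades.
if command -v pvecm >/dev/null 2>&1; then
    pvecm nodes              # list the members the cluster still remembers
    pvecm delnode vnode5     # tell the cluster to forget a removed blade
    pvecm delnode vnode6
    mode="live"
else
    mode="dry-run"
    echo "pvecm not available here; commands shown for reference only ($mode)"
fi
```

On older 3.x (cman-based) clusters you may also need to check that no stale entries remain in the cluster configuration afterwards.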

I.e., all 4 nodes are powered on and the cluster is not at quorum for some reason.
Leave nodes 2, 3, and 4 powered on and power cycle node 1. Once it has booted, power cycle node 2. Rinse and repeat until all nodes have been rebooted. Then observe the logs and cluster status; things may well be better after this.


Tim
 
Thanks for the answer :).

But in all that time, no cluster node was rebooted. The uptime of all the blades stayed the same. The cluster recovered by itself, after an unrelated neighboring blade was powered on.

I see that the fencing configuration file ("Datacenter/HA") knows about the hardware ports of the servers and the password for controlling the entire server chassis. But I checked: it does not know about the port of the neighboring blade.

The question is still open: how is this possible, restoring the cluster without rebooting any of the nodes?

Sincerely.
 
Hi, sorry for my vague reply :) I guess what I meant to say, but didn't say clearly:
- perturbation of the clients (cluster nodes) may trigger a reassessment of the cluster status.
- this may include a reboot of nodes, but also a network hiatus, a NIC unplug, a speck of lint, atomic dust, X-rays, who knows? :)
- i.e., I can believe that it was broken and then fixed itself, just with the passage of time and the opportunity for reality to reassert itself.

Did you put the HA fence configs in place yourself, or were those the work of the prior Proxmox cluster admin? BTW, from the sound of it, I believe you are on an older (3.x?) Proxmox; FYI, clusters are much simpler on Proxmox 4.x and don't require manually building a fence file.

I.e., my experience with fence/HA config on 3.x was that it was not trivial to get working and required manual configuration intervention to put a proper fence config in place. In contrast, HA on 4.x is almost as simple as "I've got a cluster of 3+ nodes, I want HA, enable, done, thanks!"

Anyhow. Sometimes there are mysteries we can't easily understand (at least not without a tremendous amount of time and effort :)

Tim
 