[SOLVED] Cluster lost quorum

Hai

Hello,

After a network outage, one of our clusters lost quorum. There are active VMs running and we do not want to cause any downtime. How can this be fixed?

The output of pvecm status is:
Code:
root@OurNode06:~# pvecm status
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "UTF-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
Cluster information
-------------------
Name:             OUR-CLUSTER
Config Version:   17
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Dec  7 15:04:01 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000009
Ring ID:          9.ee
Quorate:          No

Votequorum information
----------------------
Expected votes:   9
Highest expected: 9
Total votes:      1
Quorum:           5 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000009          1 172.16.xx.xxx (local)
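
For context on these numbers: corosync votequorum needs a strict majority of the expected votes, i.e. floor(9/2) + 1 = 5, which is the "Quorum: 5" line. With only 1 of 9 votes visible the node is non-quorate, so pmxcfs keeps /etc/pve read-only and blocks cluster config changes until enough nodes see each other again.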
 
Any suggestions?

Is there a way to fix the cluster with downtime?
 
root@ds4-node02:~# cat /etc/pve/.members

Code:
{
"nodename": "ds4-node02",
"version": 43,
"cluster": { "name": "DS-RR", "version": 17, "nodes": 9, "quorate": 0 },
"nodelist": {
  "ds2-node03": { "id": 6, "online": 0, "ip": "172.16.xx.x16"},
  "ds4-node03": { "id": 5, "online": 0, "ip": "172.16.xx.x15"},
  "ds2-node02": { "id": 4, "online": 0, "ip": "172.16.xx.x14"},
  "ds4-node02": { "id": 3, "online": 1, "ip": "172.16.xx.x13"},
  "ds4-node06": { "id": 9, "online": 0, "ip": "172.16.xx.x19"},
  "ds2-node01": { "id": 2, "online": 0, "ip": "172.16.xx.x12"},
  "ds4-node05": { "id": 8, "online": 0, "ip": "172.16.xx.x18"},
  "ds4-node04": { "id": 7, "online": 0, "ip": "172.16.xx.x17"},
  "ds4-node01": { "id": 1, "online": 0, "ip": "172.16.xx.x11"}
  }
}

Same output from the other nodes.
 
Hi,
If all nodes (except ds4-node06) show the same, then you can restart corosync and, after that, pve-cluster on ds4-node06.

And check the link status with "corosync-cfgtool -s" on the nodes.


Udo
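
For reference, the restart Udo describes amounts to two service restarts on the affected node; a minimal sketch, assuming ds4-node06 is the node that fell out:

Code:
# on ds4-node06 - order matters: corosync first, then pmxcfs
systemctl restart corosync
systemctl restart pve-cluster

# verify afterwards
corosync-cfgtool -s      # link/ring status per node
pvecm status             # should report Quorate: Yes once a majority sees each other

Restarting these services does not touch running guests, which is why this can normally be done without VM downtime.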
 
BTW, the other nodes can't have exactly the same output
Code:
"cluster": { "name": "DS-RR", "version": 17, "nodes": 9, "quorate": 0 }
The question is: on all the other nodes, is only ds4-node02 shown as offline? And do they all have the same version?
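
One way to answer that is to collect /etc/pve/.members from every node and compare the "version" and "online" fields; a rough sketch, assuming root SSH between the nodes (the masked IPs from the nodelist above are left as placeholders):

Code:
# replace the placeholders with the real node IPs from the nodelist
for ip in 172.16.xx.x11 172.16.xx.x12 172.16.xx.x13; do
    echo "== $ip =="
    ssh root@"$ip" cat /etc/pve/.members
done

At minimum the "nodename" field differs per node, so the outputs cannot be byte-for-byte identical.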
 
First, I did graceful shutdowns of the VMs on each node while connected to the node via SSH:
qm shutdown <vmid>
After that I shut down all the nodes one by one. Once all nodes were offline, I turned them back on one by one, waiting for each node to come fully online. When the second node was up, I noticed that the vote count had increased to 2, suggesting it would work, and it did! The quorum was restored.
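
For anyone repeating this, a rough sketch of the VM shutdown step; the qm list column parsing is an assumption about the output format, so check it on your version first:

Code:
# on each node: gracefully stop every running VM before powering the node off
qm list                                  # columns: VMID NAME STATUS ...
for vmid in $(qm list | awk '$3 == "running" {print $1}'); do
    qm shutdown "$vmid" --timeout 300    # guest shutdown, wait up to 5 minutes
done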
 
Hi,
why did you shut down all the VMs and nodes? You can restart corosync and pve-cluster while the VMs are still running - without downtime.

Udo
 
It seemed like the easiest way to fix it. Our product lets us have downtime on the weekend without major impact. Also, at the same time I updated the nodes to the latest version.
 
