[SOLVED] Cluster lost quorum

Hai

Hello,

After a network outage, one of our clusters lost quorum. There are active VMs running and we do not want to cause any downtime. How can this be fixed?

The output of pvecm status is:
Code:
root@OurNode06:~# pvecm status
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "UTF-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
Cluster information
-------------------
Name:             OUR-CLUSTER
Config Version:   17
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Dec  7 15:04:01 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000009
Ring ID:          9.ee
Quorate:          No

Votequorum information
----------------------
Expected votes:   9
Highest expected: 9
Total votes:      1
Quorum:           5 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000009          1 172.16.xx.xxx (local)
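
For context on these numbers: corosync votequorum needs a strict majority of the expected votes, i.e. floor(9/2) + 1 = 5, which is the "Quorum: 5" line. With only 1 of 9 votes visible the node is non-quorate, so pmxcfs keeps /etc/pve read-only and blocks cluster config changes until enough nodes see each other again.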
 
Any suggestions?

Is there a way to fix the cluster with downtime?
 
root@ds4-node02:~# cat /etc/pve/.members

Code:
{
"nodename": "ds4-node02",
"version": 43,
"cluster": { "name": "DS-RR", "version": 17, "nodes": 9, "quorate": 0 },
"nodelist": {
  "ds2-node03": { "id": 6, "online": 0, "ip": "172.16.xx.x16"},
  "ds4-node03": { "id": 5, "online": 0, "ip": "172.16.xx.x15"},
  "ds2-node02": { "id": 4, "online": 0, "ip": "172.16.xx.x14"},
  "ds4-node02": { "id": 3, "online": 1, "ip": "172.16.xx.x13"},
  "ds4-node06": { "id": 9, "online": 0, "ip": "172.16.xx.x19"},
  "ds2-node01": { "id": 2, "online": 0, "ip": "172.16.xx.x12"},
  "ds4-node05": { "id": 8, "online": 0, "ip": "172.16.xx.x18"},
  "ds4-node04": { "id": 7, "online": 0, "ip": "172.16.xx.x17"},
  "ds4-node01": { "id": 1, "online": 0, "ip": "172.16.xx.x11"}
  }
}

Same output from the other nodes.
 
Hi,
If all nodes (except ds4-node06) show the same, then you can restart corosync and, after that, pve-cluster on ds4-node06.

And check the link status with "corosync-cfgtool -s" on the nodes.


Udo
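
For reference, the restart Udo describes amounts to two service restarts on the affected node; a minimal sketch, assuming ds4-node06 is the node that fell out:

Code:
# on ds4-node06 - order matters: corosync first, then pmxcfs
systemctl restart corosync
systemctl restart pve-cluster

# verify afterwards
corosync-cfgtool -s      # link/ring status per node
pvecm status             # should report Quorate: Yes once a majority sees each other

Restarting these services does not touch running guests, which is why this can normally be done without VM downtime.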
 
BTW, the other nodes can't have exactly the same output
Code:
"cluster": { "name": "DS-RR", "version": 17, "nodes": 9, "quorate": 0 }
The question is: on all the other nodes, is only ds4-node02 shown as offline? And do they all have the same version?
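
One way to answer that is to collect /etc/pve/.members from every node and compare the "version" and "online" fields; a rough sketch, assuming root SSH between the nodes (the masked IPs from the nodelist above are left as placeholders):

Code:
# replace the placeholders with the real node IPs from the nodelist
for ip in 172.16.xx.x11 172.16.xx.x12 172.16.xx.x13; do
    echo "== $ip =="
    ssh root@"$ip" cat /etc/pve/.members
done

At minimum the "nodename" field differs per node, so the outputs cannot be byte-for-byte identical.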
 
First, I did graceful shutdowns of the VMs on each node while connected to the node via SSH:
qm shutdown <vmid>
After that I shut down all the nodes one by one. Once all nodes were offline, I turned them back on one by one, waiting for each node to come fully online. When the second node was up, I noticed that the vote count had increased to 2, suggesting it would work, and it did! The quorum was restored.
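
For anyone repeating this, a rough sketch of the VM shutdown step; the qm list column parsing is an assumption about the output format, so check it on your version first:

Code:
# on each node: gracefully stop every running VM before powering the node off
qm list                                  # columns: VMID NAME STATUS ...
for vmid in $(qm list | awk '$3 == "running" {print $1}'); do
    qm shutdown "$vmid" --timeout 300    # guest shutdown, wait up to 5 minutes
done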
 
Hi,
why did you shut down all the VMs and nodes? You can restart corosync and pve-cluster while the VMs are still running - without downtime.

Udo
 
It seemed like the easiest way to fix it. Our product lets us have downtime on the weekend without major impact. Also, at the same time I updated the nodes to the latest version.
 
