pve6to7 : corosync.conf (5) and pmxcfs (6) don't agree about size of nodelist

PointPubMedia

Hi,

We are planning to upgrade from 6 to 7, but on 1 out of 5 nodes we get this:

Analzying quorum settings and state..
FAIL: 1 nodes are offline!
INFO: configured votes - nodes: 5
INFO: configured votes - qdevice: 0
INFO: current expected votes: 5
INFO: current total votes: 5
FAIL: corosync.conf (5) and pmxcfs (6) don't agree about size of nodelist.
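
The counts in the last line are the actual disagreement: corosync.conf lists 5 nodes, while pmxcfs still tracks 6. A minimal sketch for reproducing the two counts by hand (assuming the default file locations; each node entry has exactly one ring0_addr and one "id"):
Code:
grep -c 'ring0_addr' /etc/pve/corosync.conf   # nodes in corosync.conf
grep -o '"id"' /etc/pve/.members | wc -l      # nodes tracked by pmxcfs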
 
Hi,
is /etc/corosync/corosync.conf the same as /etc/pve/corosync.conf on that node? What is the output of cat /etc/pve/corosync.conf and pvecm status?
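
(A quick way to check is a plain diff; no output means the two files are identical:)
Code:
diff /etc/corosync/corosync.conf /etc/pve/corosync.conf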
 
/etc/corosync/corosync.conf and /etc/pve/corosync.conf are the same!

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve11
    nodeid: 2
    quorum_votes: 1
    ring0_addr: x.x.x.242
    ring1_addr: x.x.y.242
  }
  node {
    name: pve12
    nodeid: 4
    quorum_votes: 1
    ring0_addr: x.x.x.243
    ring1_addr: x.x.y.243
  }
  node {
    name: pve13
    nodeid: 5
    quorum_votes: 1
    ring0_addr: x.x.x.244
    ring1_addr: x.x.y.244
  }
  node {
    name: pve14
    nodeid: 6
    quorum_votes: 1
    ring0_addr: x.x.x.245
    ring1_addr: x.x.y.245
  }
  node {
    name: pve15
    nodeid: 3
    quorum_votes: 1
    ring0_addr: x.x.x.246
    ring1_addr: x.x.y.246
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: QSE-PVE
  config_version: 9
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

Code:
Cluster information
-------------------
Name:             QSE-PVE
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Aug 4 06:47:43 2022
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000004
Ring ID:          2.e4e
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 x.x.x.242
0x00000003          1 x.x.x.246
0x00000004          1 x.x.x.243 (local)
0x00000005          1 x.x.x.244
0x00000006          1 x.x.x.245
 
What is the output of cat /etc/pve/.members and journalctl -b -u pve-cluster.service?

Does systemctl reload-or-restart pve-cluster.service on the problematic node help?

Was there a sixth node in the past? How did you remove it?
 
{ "nodename": "pve12", "version": 7, "cluster": { "name": "QSE-PVE", "version": 8, "nodes": 6, "quorate": 1 }, "nodelist": { "pve15": { "id": 3, "online": 1, "ip": "x.x.x.246"}, "pve01": { "id": 1, "online": 0}, "pve11": { "id": 2, "online": 1, "ip": "x.x.x.242"}, "pve12": { "id": 4, "online": 1, "ip": "x.x.x.243"}, "pve13": { "id": 5, "online": 1, "ip": "x.x.x.244"}, "pve14": { "id": 6, "online": 1, "ip": "x.x.x.245"} } }

Restart of pve-cluster didn't help!
Code:
Aug 4 07:44:29 pve12 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Aug 4 07:44:29 pve12 systemd[1]: Starting The Proxmox VE cluster filesystem...
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: update cluster info (cluster name QSE-PVE, version = 8)
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: node has quorum
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: members: 2/9970, 3/25652, 4/17008, 5/17056, 6/12386
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: starting data syncronisation
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: received sync request (epoch 2/9970/00000005)
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: members: 2/9970, 3/25652, 4/17008, 5/17056, 6/12386
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: starting data syncronisation
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: received sync request (epoch 2/9970/00000005)
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: received all states
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: leader is 2/9970
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: synced members: 2/9970, 3/25652, 4/17008, 5/17056, 6/12386
Aug 4 07:44:29 pve12 pmxcfs[17008]: [dcdb] notice: all data is up to date
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: received all states
Aug 4 07:44:29 pve12 pmxcfs[17008]: [status] notice: all data is up to date
Aug 4 07:44:30 pve12 systemd[1]: Started The Proxmox VE cluster filesystem.
We removed pve01 using the "how to" from the Proxmox website, as we have done many times in the past. If I remember correctly, we removed pve19 the same way, and it works fine on all the other nodes!
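
For reference, that "how to" boils down to a single command, run on a node that stays in the cluster; as far as I know it intentionally leaves /etc/pve/nodes/<name> in place so guest configs can still be recovered:
Code:
pvecm delnode pve01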

Across pve11, 12, 13, 14 and 15, the only issue is that pve12 still "sees" pve01.

In journalctl, we got nothing except a lot of "received log" messages.
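
To narrow it down, here is a rough sketch for comparing each node's pmxcfs view (assuming root SSH between the nodes; it counts the node entries in /etc/pve/.members per host):
Code:
for h in pve11 pve12 pve13 pve14 pve15; do
  printf '%s: ' "$h"
  ssh "root@$h" "grep -o '\"id\"' /etc/pve/.members | wc -l"
done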
 
@fiona

We just upgraded and everything works well... but in the web interface, we are seeing this:

[screenshot: web interface still showing the removed nodes pve01 and pve19]

Is there a way to completely remove pve01 and pve19 ?
 
Do you have any HA services configured currently?

Please provide the output of the following:
Code:
ha-manager status -v
cat /etc/pve/ha/manager_status
cat /etc/pve/nodes/pve19/lrm_status
cat /etc/pve/nodes/pve01/lrm_status
 
Code:
quorum OK
master pve15 (active, Fri Aug 5 06:48:17 2022)
lrm pve01 (maintenance mode, Sat Feb 26 09:37:13 2022)
lrm pve11 (idle, Fri Aug 5 06:48:17 2022)
lrm pve12 (idle, Fri Aug 5 06:48:19 2022)
lrm pve13 (idle, Fri Aug 5 06:48:20 2022)
lrm pve14 (idle, Fri Aug 5 06:48:17 2022)
lrm pve15 (idle, Fri Aug 5 06:48:17 2022)
lrm pve19 (maintenance mode, Sun May 31 16:21:28 2020)
full cluster state:
{
  "lrm_status" : {
    "pve01" : { "mode" : "maintenance", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1645886233 },
    "pve11" : { "mode" : "active", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1659696497 },
    "pve12" : { "mode" : "active", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1659696499 },
    "pve13" : { "mode" : "active", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1659696500 },
    "pve14" : { "mode" : "active", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1659696497 },
    "pve15" : { "mode" : "active", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1659696497 },
    "pve19" : { "mode" : "maintenance", "results" : {}, "state" : "wait_for_agent_lock", "timestamp" : 1590956488 }
  },
  "manager_status" : {
    "master_node" : "pve15",
    "node_status" : {
      "pve01" : "maintenance",
      "pve11" : "online",
      "pve12" : "online",
      "pve13" : "online",
      "pve14" : "online",
      "pve15" : "online",
      "pve19" : "maintenance"
    },
    "service_status" : {},
    "timestamp" : 1659696497
  },
  "quorum" : { "node" : "pve15", "quorate" : "1" }
}

Code:
cat /etc/pve/ha/manager_status
{"timestamp":1659696527,"master_node":"pve15","service_status":{},"node_status":{"pve11":"online","pve15":"online","pve14":"online","pve12":"online","pve13":"online","pve19":"maintenance","pve01":"maintenance"}}

cat /etc/pve/nodes/pve19/lrm_status
{"mode":"maintenance","timestamp":1590956488,"state":"wait_for_agent_lock","results":{}}

cat /etc/pve/nodes/pve01/lrm_status
{"mode":"maintenance","timestamp":1645886233,"results":{},"state":"wait_for_agent_lock"}
 
I guess the HA manager thinks that the LRM for these two nodes still exists because of these left-over files. After removing them, the manager should switch the LRM status to unknown, and the nodes should disappear after a while (IIRC an hour).

You might even want to remove the whole directories for the gone nodes, after a safety check that nothing in there is still needed.
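
Something like this sketch; running it on a single node is enough, since /etc/pve is the cluster-wide pmxcfs mount (keep a backup outside /etc/pve in case anything is still needed):
Code:
# safety copy first
mkdir -p /root/removed-nodes
cp -r /etc/pve/nodes/pve01 /etc/pve/nodes/pve19 /root/removed-nodes/
# remove the stale LRM status files (clears the HA view) ...
rm /etc/pve/nodes/pve01/lrm_status /etc/pve/nodes/pve19/lrm_status
# ... or the whole left-over node directories
rm -r /etc/pve/nodes/pve01 /etc/pve/nodes/pve19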
 
Yeah, I already checked and there's nothing we still need in pve19 and pve01, so I just need to remove /etc/pve/nodes/{pve01,pve19} on each node?