Hi,
We have a cluster of 4 PVE servers; all were working fine.
Corosync runs over the main management interface (vmbr0).
We stopped one node, and after it booted up again corosync failed to work on it (no updates were performed before the shutdown).
The other nodes work, and tcpdump shows UDP packets going to/from port 5405 between all nodes.
The other 3 nodes see each other; this one is not joining the quorum.
I tried restarting corosync and pve-cluster on it, and restarting the whole server, but it is still not working.
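For reference, the restarts were done roughly like this on the failed node (from memory, so treat it as approximate rather than exact):
Code:
# restart corosync and the cluster filesystem on the failed node
systemctl restart corosync
systemctl restart pve-cluster

# then re-check membership and the logs since boot
pvecm status
journalctl --unit=corosync --unit=pve-cluster -b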
journalctl --unit=pve-cluster shows:
Code:
Feb 04 03:05:10 ndi-srv-024 pmxcfs[5065]: [status] crit: cpg_send_message failed: 6
Feb 04 03:05:12 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 10
Feb 04 03:05:13 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 20
Feb 04 03:05:14 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 30
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 40
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [dcdb] notice: members: 2/1455, 4/5065
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [dcdb] notice: starting data syncronisation
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [status] notice: members: 2/1455, 4/5065
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [status] notice: starting data syncronisation
Feb 04 03:05:15 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retried 41 times
Feb 04 03:05:21 ndi-srv-024 pmxcfs[5065]: [dcdb] notice: members: 4/5065
Feb 04 03:05:21 ndi-srv-024 pmxcfs[5065]: [dcdb] notice: all data is up to date
Feb 04 03:05:21 ndi-srv-024 pmxcfs[5065]: [status] notice: members: 4/5065
Feb 04 03:05:21 ndi-srv-024 pmxcfs[5065]: [status] notice: all data is up to date
Feb 04 03:05:22 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 10
Feb 04 03:05:23 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 20
Feb 04 03:05:24 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 30
Feb 04 03:05:25 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 40
Feb 04 03:05:26 ndi-srv-024 pmxcfs[5065]: [status] notice: cpg_send_message retry 50
The corosync log on the same node shows:
Code:
Feb 04 03:06:23 ndi-srv-024 corosync[5071]: [QUORUM] Members[1]: 4
Feb 04 03:06:23 ndi-srv-024 corosync[5071]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 04 03:06:24 ndi-srv-024 corosync[5071]: [TOTEM ] A new membership (1.9db5e) was formed. Members joined: 1 2 3
Feb 04 03:06:26 ndi-srv-024 corosync[5071]: [TOTEM ] FAILED TO RECEIVE
Feb 04 03:06:32 ndi-srv-024 corosync[5071]: [TOTEM ] A new membership (4.9dbca) was formed. Members left: 1 2 3
Feb 04 03:06:32 ndi-srv-024 corosync[5071]: [TOTEM ] Failed to receive the leave message. failed: 1 2 3
Feb 04 03:06:32 ndi-srv-024 corosync[5071]: [QUORUM] Members[1]: 4
Feb 04 03:06:32 ndi-srv-024 corosync[5071]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 04 03:06:32 ndi-srv-024 corosync[5071]: [TOTEM ] A new membership (1.9dbce) was formed. Members joined: 1 2 3
Feb 04 03:06:35 ndi-srv-024 corosync[5071]: [TOTEM ] FAILED TO RECEIVE
Feb 04 03:06:40 ndi-srv-024 corosync[5071]: [TOTEM ] A new membership (4.9dc32) was formed. Members left: 1 2 3
Feb 04 03:06:40 ndi-srv-024 corosync[5071]: [TOTEM ] Failed to receive the leave message. failed: 1 2 3
Feb 04 03:06:40 ndi-srv-024 corosync[5071]: [TOTEM ] A new membership (1.9dc36) was formed. Members joined: 1 2 3
Feb 04 03:06:43 ndi-srv-024 corosync[5071]: [TOTEM ] FAILED TO RECEIVE
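One thing I am considering, to get more detail on this FAILED TO RECEIVE loop, is enabling corosync debug logging. If I read the docs correctly, that would mean editing the logging section of corosync.conf from a quorate node (and bumping config_version so it propagates), roughly like this; please correct me if this is not the right way:
Code:
# in /etc/pve/corosync.conf, edited on a quorate node,
# with config_version in the totem section incremented
logging {
  debug: on
  to_syslog: yes
}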
pvecm status on the failed node:
Code:
Cluster information
-------------------
Name: proxmox
Config Version: 6
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Thu Feb 4 02:58:31 2021
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000004
Ring ID: 1.9c68e
Quorate: No
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 2
Quorum: 3 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000004 1 10.10.10.58 (local)
pvecm status on a working node:
Code:
Quorum information
------------------
Date: Thu Feb 4 03:03:41 2021
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000002
Ring ID: 1/644422
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 3
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.10.10.52
0x00000002 1 10.10.10.54 (local)
0x00000003 1 10.10.10.56
This cluster is composed of nodes running different PVE versions.
Non-working node (originally had 6.1 and worked until this restart; I upgraded it to see if it helps, but no change):
Code:
pve-manager/6.3-3/eee5f901 (running kernel: 5.4.78-2-pve)
Working nodes:
Code:
pve-manager/6.0-4/2a719255 (running kernel: 5.0.15-1-pve)
pve-manager/6.0-4/2a719255 (running kernel: 5.0.15-1-pve)
pve-manager/6.0-6/c71f879f (running kernel: 5.0.21-1-pve)
Now, I am aware that we should upgrade these machines to the same version (although they worked just fine, and this specific node had already been restarted a few times), but I have a few major concerns here:
- I cannot say why this just happened. This is pretty important: I need some clue how to deal with these issues and, at the end of the day, to provide a report of what went wrong. The network seems fine, no updates were performed, and packets (UDP and TCP) travel between the hosts, yet quorum is not working on this particular node. How can I debug this? Obviously packets are sent but discarded for some reason (maybe some are missing?), and corosync-related issues are quite elusive. (My current debugging plan is sketched after this list.)
- If we start upgrading a cluster with older nodes, will we face this issue again? We have this cluster with 6.0 versions, and another 2 clusters with 6.1 (those are all at the same level). What will happen if I start upgrading to 6.3?
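To make the first question more concrete, this is roughly the checklist I am planning to run on the failed node next; it is my own sketch, so please point out anything that is wrong or missing:
Code:
# knet link / ring and quorum status as corosync sees it
corosync-cfgtool -s
corosync-quorumtool -s

# confirm the corosync config and authkey match the working nodes
grep config_version /etc/corosync/corosync.conf
cksum /etc/corosync/corosync.conf /etc/corosync/authkey

# rule out an MTU problem towards a working node (1472 = 1500 minus IP/ICMP headers)
ping -M do -s 1472 -c 4 10.10.10.52

# capture corosync traffic again while the node tries to join
tcpdump -ni vmbr0 udp port 5405

If the config_version or the authkey differs from the working nodes, I assume that could explain why the traffic is visible on the wire but still gets discarded.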
PS We have a Community subscription for all our clusters.