Please help with Corosync/cluster issue

dtiKev

I followed the directions in the docs for adding a new node, and everything worked great. Then, as the first step in removing one of the old nodes, I simply shut it down after all of its VMs had been safely migrated to other nodes.

As soon as it is powered off, the other original nodes lose the new node, and the new node can't see any of the cluster other than itself. As soon as I power the to-be-removed node back on, everything goes back to green.

All nodes are up to date on version 6.4-13, and all of them can ping and SSH to one another. All are on the same subnet with no firewalling between them. I'm not sure what other information to provide, but some examples of things I've looked at are below. The new node is called "pvnBax" in these examples, as its main function will be to act as a replication target. The node I intend to remove is pvnOne, and both pvn2 and pvnQ are healthy members.
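
For reference, here is roughly what I have been running on each node to sanity-check things (standard tools only, nothing custom; output trimmed):

pvnBax:~# pveversion -v | grep -E 'pve-manager|corosync'   # confirm every node runs the same package versions
pvnBax:~# corosync-cfgtool -s                              # knet link status toward every other host, as seen from this node
pvnBax:~# systemctl status corosync pve-cluster            # both services should be active (running) everywhere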

pvn2:~# pvecm status
Cluster information
-------------------
Name: dtCluster
Config Version: 9
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jul 26 16:59:33 2021
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000003
Ring ID: 1.204a
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 x.y.z.244
0x00000002 1 x.y.z.245
0x00000003 1 x.y.z.252 (local)
0x00000004 1 x.y.z.246

pvnBax:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-07-26 16:00:26 EDT; 42min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 1519 (corosync)
Tasks: 9 (limit: 4915)
Memory: 160.2M
CGroup: /system.slice/corosync.service
└─1519 /usr/sbin/corosync -f

Jul 26 16:38:32 pvnBax corosync[1519]: [QUORUM] Sync members[1]: 4
Jul 26 16:38:32 pvnBax corosync[1519]: [TOTEM ] A new membership (4.1f36) was formed. Members
Jul 26 16:38:32 pvnBax corosync[1519]: [QUORUM] Members[1]: 4
Jul 26 16:38:32 pvnBax corosync[1519]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 26 16:38:32 pvnBax corosync[1519]: [QUORUM] Sync members[4]: 1 2 3 4
Jul 26 16:38:32 pvnBax corosync[1519]: [QUORUM] Sync joined[3]: 1 2 3
Jul 26 16:38:32 pvnBax corosync[1519]: [TOTEM ] A new membership (1.1f3a) was formed. Members joined: 1 2 3
Jul 26 16:38:32 pvnBax corosync[1519]: [QUORUM] This node is within the primary component and will provide service.
Jul 26 16:38:32 pvnBax corosync[1519]: [QUORUM] Members[4]: 1 2 3 4
Jul 26 16:38:32 pvnBax corosync[1519]: [MAIN ] Completed service synchronization, ready to provide service.

pvnBax:~# pvecm status
Cluster information
-------------------
Name: dtCluster
Config Version: 9
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jul 26 16:46:00 2021
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000004
Ring ID: 4.1f46
Quorate: No

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 1
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000004 1 x.y.z.246 (local)

pvnQ:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-07-26 14:49:08 EDT; 1h 54min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 814 (corosync)
Tasks: 9 (limit: 4530)
Memory: 147.5M
CGroup: /system.slice/corosync.service
└─814 /usr/sbin/corosync -f

Jul 26 16:38:27 pvnQ corosync[814]: [QUORUM] Sync joined[1]: 1
Jul 26 16:38:27 pvnQ corosync[814]: [TOTEM ] A new membership (1.1f36) was formed. Members joined: 1
Jul 26 16:38:27 pvnQ corosync[814]: [QUORUM] This node is within the primary component and will provide service.
Jul 26 16:38:27 pvnQ corosync[814]: [QUORUM] Members[3]: 1 2 3
Jul 26 16:38:27 pvnQ corosync[814]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 26 16:38:32 pvnQ corosync[814]: [QUORUM] Sync members[4]: 1 2 3 4
Jul 26 16:38:32 pvnQ corosync[814]: [QUORUM] Sync joined[1]: 4
Jul 26 16:38:32 pvnQ corosync[814]: [TOTEM ] A new membership (1.1f3a) was formed. Members joined: 4
Jul 26 16:38:32 pvnQ corosync[814]: [QUORUM] Members[4]: 1 2 3 4
Jul 26 16:38:32 pvnQ corosync[814]: [MAIN ] Completed service synchronization, ready to provide service.

pvnQ:~# pvecm status
Cluster information
-------------------
Name: dtCluster
Config Version: 9
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Jul 26 16:46:20 2021
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 2.1f52
Quorate: No

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 2
Quorum: 3 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 x.y.z.245 (local)
0x00000003 1 x.y.z.252
 
Am I missing something that would make one node essential to the quorum?
 
Still grasping at straws here... I've started looking into the possibility that there's some misconfiguration with corosync and am reviewing those docs, but I'm not 100% sure what I'm looking for.

Looking at its config, I do see that under the totem heading it references a bindnetaddr for an IP that hasn't been in this cluster for over a year, but that wouldn't explain (at least to me) why powering off my degraded server would suddenly cause this "blindness", as if there were two segments.
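
In case my mental model is wrong here: as I understand it, with corosync 3 and the knet transport the addresses that actually get used come from the per-node ring0_addr entries in the nodelist, so a stale bindnetaddr under totem should be largely ignored. What I'd expect to matter in /etc/pve/corosync.conf is something along these lines (just a sketch; the IPs are the placeholders from the pvecm output above, and the other two nodes are omitted):

nodelist {
  node {
    name: pvnOne
    nodeid: 1
    quorum_votes: 1
    ring0_addr: x.y.z.244
  }
  node {
    name: pvnBax
    nodeid: 4
    quorum_votes: 1
    ring0_addr: x.y.z.246
  }
  ...
}

If pvnBax's entry were missing or carried the wrong address on some of the nodes, I could at least imagine the split I'm seeing.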
 
So this morning I once again tried to shut down pvnOne, the node I intend to remove, while monitoring pvnBax, the new node. Once again it sees only itself as soon as pvnOne is gone. Looking at its syslog, this is what I see:

Jul 28 08:08:11 pvnBax corosync[1519]: [CFG ] Node 1 was shut down by sysadmin
Jul 28 08:08:11 pvnBax pmxcfs[1514]: [dcdb] notice: members: 2/769, 3/3360, 4/1514
Jul 28 08:08:11 pvnBax pmxcfs[1514]: [dcdb] notice: starting data syncronisation
Jul 28 08:08:11 pvnBax pmxcfs[1514]: [status] notice: members: 2/769, 3/3360, 4/1514
Jul 28 08:08:11 pvnBax pmxcfs[1514]: [status] notice: starting data syncronisation
Jul 28 08:08:13 pvnBax corosync[1519]: [KNET ] link: host: 1 link: 0 is down
Jul 28 08:08:13 pvnBax corosync[1519]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 28 08:08:13 pvnBax corosync[1519]: [KNET ] host: host: 1 has no active links
Jul 28 08:08:21 pvnBax corosync[1519]: [QUORUM] Sync members[1]: 4
Jul 28 08:08:21 pvnBax corosync[1519]: [QUORUM] Sync left[3]: 1 2 3
Jul 28 08:08:21 pvnBax corosync[1519]: [TOTEM ] A new membership (4.2052) was formed. Members left: 1 2 3
Jul 28 08:08:21 pvnBax corosync[1519]: [TOTEM ] Failed to receive the leave message. failed: 2 3
Jul 28 08:08:21 pvnBax pmxcfs[1514]: [dcdb] notice: members: 4/1514
Jul 28 08:08:21 pvnBax pmxcfs[1514]: [dcdb] notice: all data is up to date
Jul 28 08:08:21 pvnBax pmxcfs[1514]: [status] notice: members: 4/1514
Jul 28 08:08:21 pvnBax pmxcfs[1514]: [status] notice: all data is up to date
Jul 28 08:08:21 pvnBax corosync[1519]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 28 08:08:21 pvnBax corosync[1519]: [QUORUM] Members[1]: 4
Jul 28 08:08:21 pvnBax corosync[1519]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 28 08:08:21 pvnBax pmxcfs[1514]: [status] notice: cpg_send_message retried 1 times
Jul 28 08:08:21 pvnBax pmxcfs[1514]: [status] notice: node lost quorum
Jul 28 08:08:21 pvnBax pmxcfs[1514]: [dcdb] crit: received write while not quorate - trigger resync
Jul 28 08:08:21 pvnBax pmxcfs[1514]: [dcdb] crit: leaving CPG group
Jul 28 08:08:21 pvnBax pve-ha-lrm[1654]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pvnBax/lrm_status.tmp.1654' - Permission denied
Jul 28 08:08:31 pvnBax corosync[1519]: [QUORUM] Sync members[1]: 4
Jul 28 08:08:31 pvnBax corosync[1519]: [TOTEM ] A new membership (4.2056) was formed. Members
Jul 28 08:08:31 pvnBax corosync[1519]: [QUORUM] Members[1]: 4
Jul 28 08:08:31 pvnBax corosync[1519]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 28 08:08:31 pvnBax pmxcfs[1514]: [dcdb] notice: start cluster connection
Jul 28 08:08:31 pvnBax pmxcfs[1514]: [dcdb] crit: cpg_join failed: 14
Jul 28 08:08:31 pvnBax pmxcfs[1514]: [dcdb] crit: can't initialize service
Jul 28 08:08:40 pvnBax corosync[1519]: [QUORUM] Sync members[1]: 4
Jul 28 08:08:40 pvnBax corosync[1519]: [TOTEM ] A new membership (4.205a) was formed. Members
Jul 28 08:08:40 pvnBax corosync[1519]: [QUORUM] Members[1]: 4
Jul 28 08:08:40 pvnBax corosync[1519]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 28 08:08:40 pvnBax pmxcfs[1514]: [dcdb] notice: members: 4/1514
Jul 28 08:08:40 pvnBax pmxcfs[1514]: [dcdb] notice: all data is up to date
Jul 28 08:08:50 pvnBax corosync[1519]: [QUORUM] Sync members[1]: 4
Jul 28 08:08:50 pvnBax corosync[1519]: [TOTEM ] A new membership (4.205e) was formed. Members
Jul 28 08:08:50 pvnBax corosync[1519]: [QUORUM] Members[1]: 4
Jul 28 08:08:50 pvnBax corosync[1519]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 28 08:08:59 pvnBax corosync[1519]: [QUORUM] Sync members[1]: 4
Jul 28 08:08:59 pvnBax corosync[1519]: [TOTEM ] A new membership (4.2062) was formed. Members
Jul 28 08:08:59 pvnBax corosync[1519]: [QUORUM] Members[1]: 4
Jul 28 08:08:59 pvnBax corosync[1519]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 28 08:09:00 pvnBax systemd[1]: Starting Proxmox VE replication runner...
Jul 28 08:09:01 pvnBax pvesr[14115]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 28 08:09:02 pvnBax pvesr[14115]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 28 08:09:03 pvnBax pvesr[14115]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 28 08:09:04 pvnBax pvesr[14115]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 28 08:09:05 pvnBax pvesr[14115]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 28 08:09:06 pvnBax pvesr[14115]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 28 08:09:07 pvnBax pvesr[14115]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 28 08:09:08 pvnBax pvesr[14115]: trying to acquire cfs lock 'file-replication_cfg' ...
Jul 28 08:09:09 pvnBax pvesr[14115]: trying to acquire cfs lock 'file-replication_cfg' ...
 
Does anyone have a suggestion or knowledge of what I am missing?

The more I read, the more I'm convinced it's something with the corosync setup, but the manual only shows straightforward vanilla setups, which doesn't help with troubleshooting a long-lived cluster that has gained new nodes and lost original ones.
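
What I'm planning to compare next, on every node, is whether the cluster-wide config and the local copy that corosync actually reads are in sync (my understanding is that pmxcfs propagates /etc/pve/corosync.conf to /etc/corosync/corosync.conf, so the two should be identical and carry the same config_version):

pvnBax:~# grep config_version /etc/pve/corosync.conf /etc/corosync/corosync.conf   # versions should match, here and on every node
pvnBax:~# md5sum /etc/pve/corosync.conf /etc/corosync/corosync.conf                # the two checksums should be identical
pvnBax:~# corosync-cmapctl | grep -E 'totem.config_version|nodelist.node'          # what the running corosync actually loaded

If any node turns out to be running with an older nodelist, that would at least line up with the one-node-missing behaviour.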
 
And it gets even spookier: I just loaded the GUI from one of the nodes in good standing, and suddenly the new pvnBax node isn't there at all; the Cluster Summary page is back to thinking there are only 3 nodes.

So I loaded the GUI on pvnBax, and it shows the same cluster with 4 nodes and I can see all four. Also, all of the VMs on the current workhorse show that replication has run and that the target was pvnBax, even in the GUI on the side that isn't reporting pvnBax.

When I look at the VM disks on pvnBax in its local-zfs pool, I see disks for all of the VMs.
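
(For completeness, this is just how I'm listing them; local-zfs here is the default rpool/data-backed storage, so the dataset path is an assumption that may differ on other setups:)

pvnBax:~# pvesm list local-zfs                          # storage-level view of the replicated disk images
pvnBax:~# zfs list -t all -r -o name,used rpool/data    # underlying datasets plus their replication snapshots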

I am at a loss here.
 
... I feel like it's odd that the "new" node's address doesn't show up in the membership list of a pvecm status report from one of the original nodes. What would that mean?

# pvecm status
Cluster information
-------------------
Name: dtCluster
Config Version: 9
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed Aug 4 10:39:43 2021
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000002
Ring ID: 1.2120
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate
Unable to get node address for nodeid 4: CS_ERR_NOT_EXIST

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 x.y.z.244
0x00000002 1 x.y.z.245 (local)
0x00000003 1 x.y.z.252
0x00000004 1
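
If anyone has a pointer: my next step is to check, from the node printing that CS_ERR_NOT_EXIST line, whether the running corosync even has an address for nodeid 4 and whether that differs from what is on disk (standard tools only):

# corosync-cmapctl | grep 'nodelist.node.'          # runtime nodelist: every nodeid should have a ring0_addr, including 4
# corosync-cfgtool -s                               # whether this node currently has a live knet link to host 4
# grep -A 5 'name: pvnBax' /etc/pve/corosync.conf   # the on-disk nodelist entry for the new node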
 