[SOLVED] Corosync failed on one node

livpet

Member
Jul 10, 2019
Hi,

I inherited a 6.0 cluster and now node #4 is in trouble. The corosync service fails as follows:

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2021-01-12 15:03:22 EET; 50min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 12385 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
Main PID: 12385 (code=exited, status=8)

Jan 12 15:03:22 PRMX4 systemd[1]: Starting Corosync Cluster Engine...
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] Corosync Cluster Engine 3.0.2 starting up
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] failed to parse node address 'PRMX1'
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1353.
Jan 12 15:03:22 PRMX4 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Jan 12 15:03:22 PRMX4 systemd[1]: corosync.service: Failed with result 'exit-code'.
Jan 12 15:03:22 PRMX4 systemd[1]: Failed to start Corosync Cluster Engine.

On every node, /etc/corosync/corosync.conf looks like this:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: PRMX1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: PRMX1
  }
  node {
    name: PRMX2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: PRMX2
  }
  node {
    name: PRMX3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.5.1.58
  }
  node {
    name: PRMX4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.5.1.65
  }
  node {
    name: PRMX5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.5.1.66
  }
  node {
    name: PRMX6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.5.1.67
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: UGAL
  config_version: 18
  interface {
    bindnetaddr: 10.5.1.53
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

In the syslog on node #4 I have the following:

Jan 12 16:31:00 PRMX4 systemd[1]: Starting Proxmox VE replication runner...
Jan 12 16:31:01 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:02 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:03 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:04 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:04 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:05 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:05 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:05 PRMX4 pvedaemon[12004]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:06 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:06 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:07 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:08 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:09 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:09 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:09 PRMX4 pvedaemon[12004]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:10 PRMX4 pvesr[24893]: error with cfs lock 'file-replication_cfg': no quorum!
Jan 12 16:31:10 PRMX4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jan 12 16:31:10 PRMX4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jan 12 16:31:10 PRMX4 systemd[1]: Failed to start Proxmox VE replication runner.
Jan 12 16:31:10 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!

Can someone please tell me how to fix this?
Thank you!
 
Hi,

I inherited a 6.0 cluster and now node #4 is in trouble. The corosync service fails as follows:
<snip>
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] failed to parse node address 'PRMX1'
<snip>
Hi,

I suspect your PRMX4 doesn't have an entry in /etc/hosts for PRMX1. As a workaround, add proper entries for PRMX1 (and PRMX2) and restart the service.
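
For example, the missing entries on PRMX4 would look something like this (the IPs below are hypothetical, guessed from the 10.5.1.x range the other nodes use; substitute the addresses PRMX1 and PRMX2 actually have):

```
# /etc/hosts on PRMX4
# NOTE: placeholder addresses -- replace with the real IPs of PRMX1/PRMX2
10.5.1.56   PRMX1
10.5.1.57   PRMX2
```

After that, `systemctl restart corosync` on PRMX4 should let it parse the nodelist and rejoin.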

Once all nodes are quorate, the permanent fix is to convert the ring0_addr entries for PRMX1 and PRMX2 to IP addresses (like the other nodes) by carefully editing /etc/pve/corosync.conf.
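
The first two node entries would then look something like this (the IPs are placeholders again; use the real ones). Remember to also increment config_version in the totem section whenever you edit /etc/pve/corosync.conf, so the change propagates to all nodes:

```
nodelist {
  node {
    name: PRMX1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.5.1.56   # placeholder -- real IP of PRMX1
  }
  node {
    name: PRMX2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.5.1.57   # placeholder -- real IP of PRMX2
  }
  ...
}
```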
 
Thanks, RokaKen, for your suggestion.
I apologize for the delay in answering. I added a proper entry for PRMX1 in /etc/hosts on PRMX4, then restarted the corosync service, and all nodes except node #1 turned gray. I then removed that entry, and now node #1 is green, #2 and #3 are gray, and #4, #5 and #6 are red. In the meantime, I found out that there are communication problems on nodes #4 and #5. I will come back after fixing them.
 
Thanks again, RokaKen, for your suggestion. After we solved the communication problems on nodes #3 and #4 and restarted the PVE services, everything returned to normal, except for SSH access to node #2.
I consider this incident closed.