[SOLVED] Corosync failed on one node

livpet
Hi,

I inherited a Proxmox VE 6.0 cluster and now node #4 is in trouble. The corosync service fails as follows:

● corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2021-01-12 15:03:22 EET; 50min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 12385 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
Main PID: 12385 (code=exited, status=8)

Jan 12 15:03:22 PRMX4 systemd[1]: Starting Corosync Cluster Engine...
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] Corosync Cluster Engine 3.0.2 starting up
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] failed to parse node address 'PRMX1'
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1353.
Jan 12 15:03:22 PRMX4 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Jan 12 15:03:22 PRMX4 systemd[1]: corosync.service: Failed with result 'exit-code'.
Jan 12 15:03:22 PRMX4 systemd[1]: Failed to start Corosync Cluster Engine.

On every node, /etc/corosync/corosync.conf looks like this:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: PRMX1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: PRMX1
  }
  node {
    name: PRMX2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: PRMX2
  }
  node {
    name: PRMX3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.5.1.58
  }
  node {
    name: PRMX4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.5.1.65
  }
  node {
    name: PRMX5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.5.1.66
  }
  node {
    name: PRMX6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.5.1.67
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: UGAL
  config_version: 18
  interface {
    bindnetaddr: 10.5.1.53
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}

In the syslog on node #4 I have the following:

Jan 12 16:31:00 PRMX4 systemd[1]: Starting Proxmox VE replication runner...
Jan 12 16:31:01 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:02 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:03 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:04 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:04 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:05 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:05 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:05 PRMX4 pvedaemon[12004]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:06 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:06 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:07 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:08 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:09 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:09 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:09 PRMX4 pvedaemon[12004]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:10 PRMX4 pvesr[24893]: error with cfs lock 'file-replication_cfg': no quorum!
Jan 12 16:31:10 PRMX4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jan 12 16:31:10 PRMX4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jan 12 16:31:10 PRMX4 systemd[1]: Failed to start Proxmox VE replication runner.
Jan 12 16:31:10 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!

Can someone please tell me what I can do to fix this?
Thank you!
 
livpet said:
<snip>
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] failed to parse node address 'PRMX1'
<snip>
Hi,

I suspect your PRMX4 doesn't have an entry in /etc/hosts for PRMX1. As a workaround, add proper entries for PRMX1 (and PRMX2) and restart the service.
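For example, on PRMX4 something along these lines (the addresses for PRMX1 and PRMX2 below are placeholders, since they are not in your posted config; use those nodes' real corosync ring addresses):

# /etc/hosts on PRMX4 -- placeholder addresses, adjust to your network
10.5.1.56   PRMX1
10.5.1.57   PRMX2

# verify the names now resolve locally, then restart corosync
getent hosts PRMX1 PRMX2
systemctl restart corosync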

Once all nodes are quorate, the fix is to convert the ring0_addr entries for PRMX1 and PRMX2 to IP addresses (like the other nodes) by carefully editing /etc/pve/corosync.conf.
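A rough sketch of how those two nodelist entries could end up looking (addresses are again placeholders; also bump config_version in the totem section, e.g. to 19, so the change gets distributed):

node {
  name: PRMX1
  nodeid: 1
  quorum_votes: 1
  ring0_addr: 10.5.1.56    # placeholder -- use PRMX1's real address
}
node {
  name: PRMX2
  nodeid: 2
  quorum_votes: 1
  ring0_addr: 10.5.1.57    # placeholder -- use PRMX2's real address
}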
 
Thanks RokaKen for your suggestion.
I apologize for the delay in answering. I added a proper entry for PRMX1 to /etc/hosts on PRMX4, then restarted the corosync service, and all nodes except node #1 turned gray. Then I removed that entry, and now I have #1 green, #2 and #3 gray, and #4, #5 and #6 red. In the meantime I found out that there are communication problems on nodes #4 and #5. I will come back after fixing them.
 
Thanks again RokaKen for your suggestion. After we solved the communication problems on nodes #3 and #4 and restarted the PVE services, everything returned to normal, except for SSH access to node #2.
I consider this incident closed.
 
