Hi,
I inherited a Proxmox VE 6.0 cluster and now node #4 is in trouble. The corosync service fails as follows:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2021-01-12 15:03:22 EET; 50min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 12385 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=exited, status=8)
 Main PID: 12385 (code=exited, status=8)
Jan 12 15:03:22 PRMX4 systemd[1]: Starting Corosync Cluster Engine...
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] Corosync Cluster Engine 3.0.2 starting up
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] failed to parse node address 'PRMX1'
Jan 12 15:03:22 PRMX4 corosync[12385]: [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1353.
Jan 12 15:03:22 PRMX4 systemd[1]: corosync.service: Main process exited, code=exited, status=8/n/a
Jan 12 15:03:22 PRMX4 systemd[1]: corosync.service: Failed with result 'exit-code'.
Jan 12 15:03:22 PRMX4 systemd[1]: Failed to start Corosync Cluster Engine.
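If I read the "failed to parse node address 'PRMX1'" message correctly, corosync on PRMX4 can no longer resolve the hostname PRMX1. Would checking it like this be the right way to confirm that?

# does PRMX4 still resolve the hostnames used in the config to an IP?
getent hosts PRMX1
getent hosts PRMX2
# and the static entries they would normally come from
cat /etc/hosts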
On every node, /etc/corosync/corosync.conf looks like this:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: PRMX1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: PRMX1
  }
  node {
    name: PRMX2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: PRMX2
  }
  node {
    name: PRMX3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.5.1.58
  }
  node {
    name: PRMX4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.5.1.65
  }
  node {
    name: PRMX5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.5.1.66
  }
  node {
    name: PRMX6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.5.1.67
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: UGAL
  config_version: 18
  interface {
    bindnetaddr: 10.5.1.53
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
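I noticed that only PRMX1 and PRMX2 still use hostnames for ring0_addr, while the other four nodes use plain IP addresses. Would the right fix be to switch those two entries to their IPs as well and bump config_version? Roughly like this, where the addresses are only placeholders for the real IPs of PRMX1 and PRMX2, which I would look up first:

  node {
    name: PRMX1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.5.1.x    # placeholder - actual IP of PRMX1
  }
  node {
    name: PRMX2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.5.1.y    # placeholder - actual IP of PRMX2
  }

And if so, do I edit /etc/pve/corosync.conf on a node that still has quorum and let it sync, or do I have to change /etc/corosync/corosync.conf locally on PRMX4, since it is out of quorum?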
In the syslog on node #4 I have the following:
Jan 12 16:31:00 PRMX4 systemd[1]: Starting Proxmox VE replication runner...
Jan 12 16:31:01 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:02 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:03 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:04 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:04 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:05 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:05 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:05 PRMX4 pvedaemon[12004]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:06 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:06 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:07 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:08 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:09 PRMX4 pvesr[24893]: trying to acquire cfs lock 'file-replication_cfg' ...
Jan 12 16:31:09 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:09 PRMX4 pvedaemon[12004]: Cluster not quorate - extending auth key lifetime!
Jan 12 16:31:10 PRMX4 pvesr[24893]: error with cfs lock 'file-replication_cfg': no quorum!
Jan 12 16:31:10 PRMX4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jan 12 16:31:10 PRMX4 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jan 12 16:31:10 PRMX4 systemd[1]: Failed to start Proxmox VE replication runner.
Jan 12 16:31:10 PRMX4 pveproxy[16914]: Cluster not quorate - extending auth key lifetime!
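Before changing anything, I would like to make sure the other five nodes still have quorum among themselves. Is running something like this on one of the healthy nodes (PRMX3, for example) the right way to check?

pvecm status    # should show Quorate: Yes and the current/expected vote counts
pvecm nodes     # lists the members corosync currently sees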
Can someone please tell me what I should do to fix this?
Thank you!