Proxmox 6.1 error during cfs-locked 'file-replication_cfg' operation: no quorum!

ffrom

Hi
Today I got this error on all 7 nodes of the cluster.
Each node is up but sees only itself up and all other nodes down.
Error:
Code:
Nov 03 17:46:43 kvm38 pvesr[5245]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 03 17:46:44 kvm38 pvesr[5245]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 03 17:46:45 kvm38 pvesr[5245]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 03 17:46:46 kvm38 pvesr[5245]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 03 17:46:47 kvm38 pvesr[5245]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 03 17:46:48 kvm38 pvesr[5245]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 03 17:46:49 kvm38 pvesr[5245]: error during cfs-locked 'file-replication_cfg' operation: no quorum!
Nov 03 17:46:49 kvm38 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Nov 03 17:46:49 kvm38 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Nov 03 17:46:49 kvm38 systemd[1]: Failed to start Proxmox VE replication runner.
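
For reference, the quorum and link state corosync itself reports can be queried directly on a node with standard tools; a minimal sketch (nothing here is specific to this cluster):

Code:
# Vote/quorum summary as corosync sees it
corosync-quorumtool -s
# Per-link knet status for this node
corosync-cfgtool -s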

There are also messages from corosync, like the ones below:
Code:
Nov 3 17:58:20 kvm38 corosync[4717]: [KNET ] rx: host: 4 link: 0 is up
Nov 3 17:58:21 kvm38 corosync[4717]: [KNET ] rx: host: 2 link: 0 is up
Nov 3 17:58:24 kvm38 corosync[4717]: [KNET ] rx: host: 7 link: 0 is up
Nov 3 17:58:24 kvm38 corosync[4717]: [KNET ] link: host: 2 link: 0 is down
Nov 3 17:58:25 kvm38 corosync[4717]: [KNET ] link: host: 4 link: 0 is down
Nov 3 17:58:26 kvm38 corosync[4717]: [KNET ] link: host: 8 link: 0 is down
Nov 3 17:58:27 kvm38 corosync[4717]: [KNET ] link: host: 3 link: 0 is down
Nov 3 17:58:28 kvm38 corosync[4717]: [KNET ] link: host: 7 link: 0 is down

Why would the interfaces go up and down on the hosts?
Physically they are up on the hosts and on the switches; there is no flapping.
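
Worth noting: knet marks a link up or down based on its own heartbeat packets, not on carrier state, so corosync can report flapping while the NIC stays physically up. To cross-check the physical side, a sketch assuming the cluster NIC is named eno1 (the name is an assumption):

Code:
# Carrier state and error counters; eno1 is a placeholder NIC name
ip -s link show eno1
ethtool eno1 | grep 'Link detected'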

All VMs are running and there is no issue with their network. People are working on 100+ VMs, but cluster management is down.
The management network and the VMs are connected to the same switches.

Please advise.
 
Hi,

Can you post the output of pvecm status and cat /etc/pve/corosync.conf?
 
Hi, sure.

Code:
root@kvm39:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: dn13
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 100.64.15.227
  }
  node {
    name: kvm38
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.3.11
  }
  node {
    name: kvm39
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 100.64.15.151
  }
  node {
    name: kvm48
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 100.64.15.154
  }
  node {
    name: kvm50
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 100.64.15.171
  }
  node {
    name: kvm51
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 100.64.15.122
  }
  node {
    name: kvm57
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 100.64.15.212
  }
  node {
    name: kvm67
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.10.3.12
  }
  node {
    name: kvm75
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.10.3.10
  }
  node {
    name: kvm89
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 100.64.15.139
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster01
  config_version: 31
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}

Code:
root@kvm39:~# pvecm status
Cluster information
-------------------
Name:             cluster01
Config Version:   31
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Nov  4 09:35:28 2020
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.3c5fa8
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 100.64.15.151 (local)
root@kvm39:~#


I see that the "Votequorum information" changed during the night on some nodes.
It was:

Code:
Votequorum information
----------------------
Expected votes:   10
Highest expected: 10
Total votes:      1
Quorum:           6 Activity blocked
Flags:
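
For a 10-node cluster, votequorum needs floor(10/2) + 1 = 6 votes, which matches the "Quorum: 6" line; with only 1 vote present, activity is blocked. The later "Expected votes: 1" output suggests expected votes were lowered by hand on some nodes, which is typically done with the standard pvecm command below (shown for context only; it lets a single node regain quorum in an emergency):

Code:
# Force expected votes down so an isolated node becomes quorate (emergency use)
pvecm expected 1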
 
kvm38 has a completely different IP address. Were you trying to change something when this started happening?
 
We spent a day trying to bring the cluster up, without success. This morning we created a new VLAN and started moving the nodes' management interfaces to it. That somehow provided a partial solution, and we managed to bring 3 nodes up in the cluster (they see each other).

The cluster failure is related to the corosync process: from corosync's point of view, the interfaces of all nodes started going up and down at the same time.
In reality there was no issue with the physical interfaces or cables, and pings worked as well.
The question is: how can I debug corosync to see the real reason it marks a host's link up or down?
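
One way to dig deeper is to enable corosync debug logging and watch the knet link decisions live; a minimal sketch (note: /etc/pve/corosync.conf is the normal place to edit and requires bumping config_version, but while quorum is lost the local /etc/corosync/corosync.conf may have to be edited instead):

Code:
# In the logging section of corosync.conf set:
#   debug: on
systemctl restart corosync
# Follow the knet up/down messages live
journalctl -u corosync -f
# Current link status per node
corosync-cfgtool -s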

Also, 2 nodes of the cluster spammed the system log with thousands of messages per second like this:
Code:
corosync[18871]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
corosync[18871]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
corosync[18871]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
corosync[18871]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable
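
A commonly reported mitigation for this knet loopback spam is simply restarting corosync on the affected nodes; a sketch, not a root-cause fix:

Code:
systemctl restart corosync
# If the cluster filesystem (pmxcfs) is wedged as well:
systemctl restart pve-cluster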
 
Are these ranges able to route to each other? All the nodes need to have a route to each other; that's why you're getting the errors.
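
That can be tested per address pair; a sketch from a node with a 100.64.15.x address toward a 10.10.3.x peer (addresses taken from the corosync.conf above):

Code:
# Which route and source interface would be used to reach the peer?
ip route get 10.10.3.11
# Basic reachability; corosync/knet uses UDP port 5405 by default
ping -c 3 10.10.3.11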
 
Hi All
The errors I posted are from the time when all nodes were still in the same VLAN, before we started moving them to the other VLAN to solve the issue.
As I mentioned in the beginning, all VMs kept working; only the cluster management was down. So it cannot be related to routes or the network, because the management and VM traffic were all in the same VLAN. We also checked pings between nodes with different MTU sizes and everything worked.
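
For reference, an MTU-sized ping of the kind described sets the don't-fragment bit; a sketch assuming a standard 1500-byte MTU (1472 bytes of payload = 1500 minus 28 bytes of IP and ICMP headers):

Code:
# Fails if any hop cannot carry a full 1500-byte frame unfragmented
ping -M do -s 1472 -c 3 100.64.15.151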

Even moving the management interfaces of all nodes to the new VLAN did not solve the issue completely: the nodes started fine but failed again after some time.
The solution was switching corosync to the SCTP protocol. At least it has been working for some days now without any issues.
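
For anyone hitting the same problem: the knet transport is selected with the knet_transport option in the totem section of corosync.conf. A minimal sketch of the change described, based on the config posted above (bump config_version and restart corosync on all nodes after editing):

Code:
totem {
  cluster_name: cluster01
  config_version: 32
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  knet_transport: sctp
  secauth: on
  version: 2
}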

Thank you.
 
