issue with cluster

killmasta93

Hi,
I recently connected a few hosts to the cluster. Each host was working fine, but a few minutes ago one host started having issues. I have no replication configured on the cluster. The server I added recently worked for a few minutes and then threw this error, while the other hosts in the cluster keep working with no issues.
This is the error I'm getting:
Code:
trying to acquire cfs lock 'file-replication_cfg'

Code:
Sep 05 15:56:51 prometheus2 corosync[40599]: notice  [TOTEM ] Retransmit List: af8 af9 afa afb
Sep 05 15:56:51 prometheus2 corosync[40599]:  [TOTEM ] Retransmit List: af8 af9 afa afb
Sep 05 15:56:52 prometheus2 corosync[40599]: error   [TOTEM ] FAILED TO RECEIVE
Sep 05 15:56:52 prometheus2 corosync[40599]:  [TOTEM ] FAILED TO RECEIVE
Sep 05 15:56:55 prometheus2 corosync[40599]: notice  [TOTEM ] A new membership (192.168.3.152:32) was formed. Members left: 2 1 4 3
Sep 05 15:56:55 prometheus2 corosync[40599]: notice  [TOTEM ] Failed to receive the leave message. failed: 2 1 4 3
Sep 05 15:56:55 prometheus2 corosync[40599]: warning [CPG   ] downlist left_list: 4 received
Sep 05 15:56:55 prometheus2 corosync[40599]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 05 15:56:55 prometheus2 corosync[40599]: notice  [QUORUM] Members[1]: 5
Sep 05 15:56:55 prometheus2 corosync[40599]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 05 15:56:55 prometheus2 corosync[40599]:  [TOTEM ] A new membership (192.168.3.152:32) was formed. Members left: 2 1 4 3
Sep 05 15:56:55 prometheus2 corosync[40599]:  [TOTEM ] Failed to receive the leave message. failed: 2 1 4 3
Sep 05 15:56:55 prometheus2 corosync[40599]:  [CPG   ] downlist left_list: 4 received
Sep 05 15:56:55 prometheus2 corosync[40599]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 05 15:56:55 prometheus2 corosync[40599]:  [QUORUM] Members[1]: 5
Sep 05 15:56:55 prometheus2 corosync[40599]:  [MAIN  ] Completed service synchronization, ready to provide service.

I checked the service:

Code:
root@prometheus2:~# systemctl status pvesr.service
● pvesr.service - Proxmox VE replication runner
   Loaded: loaded (/lib/systemd/system/pvesr.service; static; vendor preset: enabled)
   Active: activating (start) since Sat 2020-09-05 16:20:00 -05; 7s ago
Main PID: 6870 (pvesr)
    Tasks: 1 (limit: 7372)
   Memory: 66.5M
      CPU: 1.104s
   CGroup: /system.slice/pvesr.service
           └─6870 /usr/bin/perl -T /usr/bin/pvesr run --mail 1

Sep 05 16:20:00 prometheus2 systemd[1]: Starting Proxmox VE replication runner...
Sep 05 16:20:01 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:02 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:03 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:04 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:05 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:06 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:07 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ..
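
(For reference: the cfs lock lives on pmxcfs, the clustered filesystem behind /etc/pve, which turns read-only as soon as a node drops out of quorum, so pvesr keeps retrying the lock while the node sits in the non-primary component. A quick, illustrative way to confirm the quorum state on the affected node:)

Code:
root@prometheus2:~# pvecm status | grep -E 'Quorate|Expected votes|Total votes'
# while the node is split from the rest of the cluster, one would expect
# something like: Quorate: No, Expected votes: 6, Total votes: 1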
 
How stable is the network? Does the corosync (cluster) traffic have its own physical network, or is it sharing it with other services? If the latter is true, how high is the usage on the network?

Corosync really needs low latency and if other services congest the network, the latency for corosync can go up quite a bit.
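
A rough, illustrative way to measure that from the shell (node names and IPs taken from this thread; omping has to be installed and started on all nodes at the same time):

Code:
# round-trip latency from prometheus2 to another cluster node
ping -c 100 -i 0.2 192.168.3.150 | tail -n 2

# longer UDP/multicast test suggested in the Proxmox docs, run in parallel on every node
omping -c 600 -i 1 -q prometheus prometheus2 prometheus4 prometheus6 prometheus11 prometheus12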
 
Thanks for the reply. It was something funky with that port; I changed it and the node started working again. But something odd is also happening: only on the principal host, the one the cluster was created on, can I see prometheus12; on the other hosts I cannot see prometheus12.

Code:
root@prometheus4:~# cat  /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: prometheus
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.3.150
  }
  node {
    name: prometheus11
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.3.216
  }
  node {
    name: prometheus12
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.3.186
  }
  node {
    name: prometheus2
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.3.152
  }
  node {
    name: prometheus4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.3.187
  }
  node {
    name: prometheus6
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.3.99
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: troy
  config_version: 6
  interface {
    bindnetaddr: 192.168.3.150
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
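
If prometheus12 only shows up on the node the cluster was created on, it is worth checking whether every host is actually applying the same membership and config_version; a couple of illustrative checks, run on each node:

Code:
# does corosync on this node list all six members?
pvecm nodes

# the copy in /etc/pve should match the locally applied corosync config
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf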

 
EDIT: I added a VM, and when I check on the host prometheus4 I get this:

Error hostname lookup 'prometheus12' failed - failed to get address info for: prometheus12: Name or service not known
 
So I rechecked whether I made a mistake with the subnet or gateway, but I can't seem to find the issue. Could it be a DNS issue?
From what I was reading, this usually happens when renaming a node, but this node was freshly installed into the cluster.
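
The "Name or service not known" message points at name resolution on prometheus4 rather than at corosync itself, so a quick illustrative check (the IP is the one listed for prometheus12 in the corosync.conf above; whether /etc/hosts or the DNS server is the right place to fix it depends on the setup):

Code:
# how does this node resolve the new member? it should return 192.168.3.186
getent hosts prometheus12

# if it resolves nothing, a host entry on each node is a common workaround
echo '192.168.3.186 prometheus12' >> /etc/hosts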
 