issue with cluster

killmasta93

Hi,
I recently connected a few hosts to the cluster. Each host was working fine, but a few minutes ago one host started having issues. I have no replication configured on the cluster. The server I added recently worked for a few minutes and then threw this error, while the other hosts in the cluster keep working with no issues.
This is the error I'm getting:
Code:
trying to acquire cfs lock 'file-replication_cfg'

Code:
Sep 05 15:56:51 prometheus2 corosync[40599]: notice  [TOTEM ] Retransmit List: af8 af9 afa afb
Sep 05 15:56:51 prometheus2 corosync[40599]:  [TOTEM ] Retransmit List: af8 af9 afa afb
Sep 05 15:56:52 prometheus2 corosync[40599]: error   [TOTEM ] FAILED TO RECEIVE
Sep 05 15:56:52 prometheus2 corosync[40599]:  [TOTEM ] FAILED TO RECEIVE
Sep 05 15:56:55 prometheus2 corosync[40599]: notice  [TOTEM ] A new membership (192.168.3.152:32) was formed. Members left: 2 1 4 3
Sep 05 15:56:55 prometheus2 corosync[40599]: notice  [TOTEM ] Failed to receive the leave message. failed: 2 1 4 3
Sep 05 15:56:55 prometheus2 corosync[40599]: warning [CPG   ] downlist left_list: 4 received
Sep 05 15:56:55 prometheus2 corosync[40599]: notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 05 15:56:55 prometheus2 corosync[40599]: notice  [QUORUM] Members[1]: 5
Sep 05 15:56:55 prometheus2 corosync[40599]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 05 15:56:55 prometheus2 corosync[40599]:  [TOTEM ] A new membership (192.168.3.152:32) was formed. Members left: 2 1 4 3
Sep 05 15:56:55 prometheus2 corosync[40599]:  [TOTEM ] Failed to receive the leave message. failed: 2 1 4 3
Sep 05 15:56:55 prometheus2 corosync[40599]:  [CPG   ] downlist left_list: 4 received
Sep 05 15:56:55 prometheus2 corosync[40599]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 05 15:56:55 prometheus2 corosync[40599]:  [QUORUM] Members[1]: 5
Sep 05 15:56:55 prometheus2 corosync[40599]:  [MAIN  ] Completed service synchronization, ready to provide service.

I checked the service:

Code:
root@prometheus2:~# systemctl status pvesr.service
● pvesr.service - Proxmox VE replication runner
   Loaded: loaded (/lib/systemd/system/pvesr.service; static; vendor preset: enabled)
   Active: activating (start) since Sat 2020-09-05 16:20:00 -05; 7s ago
Main PID: 6870 (pvesr)
    Tasks: 1 (limit: 7372)
   Memory: 66.5M
      CPU: 1.104s
   CGroup: /system.slice/pvesr.service
           └─6870 /usr/bin/perl -T /usr/bin/pvesr run --mail 1

Sep 05 16:20:00 prometheus2 systemd[1]: Starting Proxmox VE replication runner...
Sep 05 16:20:01 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:02 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:03 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:04 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:05 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:06 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ...
Sep 05 16:20:07 prometheus2 pvesr[6870]: trying to acquire cfs lock 'file-replication_cfg' ..
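
(For reference: the cfs lock lives on pmxcfs, the clustered filesystem behind /etc/pve, which turns read-only as soon as a node drops out of quorum, so pvesr keeps retrying the lock while the node sits in the non-primary component. A quick, illustrative way to confirm the quorum state on the affected node:)

Code:
root@prometheus2:~# pvecm status | grep -E 'Quorate|Expected votes|Total votes'
# while the node is split from the rest of the cluster, one would expect
# something like: Quorate: No, Expected votes: 6, Total votes: 1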
 
How stable is the network? Does the corosync (cluster) traffic have its own physical network, or is it sharing it with other services? If the latter is true, how high is the usage on the network?

Corosync really needs low latency and if other services congest the network, the latency for corosync can go up quite a bit.
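
A rough, illustrative way to measure that from the shell (node names and IPs taken from this thread; omping has to be installed and started on all nodes at the same time):

Code:
# round-trip latency from prometheus2 to another cluster node
ping -c 100 -i 0.2 192.168.3.150 | tail -n 2

# longer UDP/multicast test suggested in the Proxmox docs, run in parallel on every node
omping -c 600 -i 1 -q prometheus prometheus2 prometheus4 prometheus6 prometheus11 prometheus12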
 
Thanks for the reply. It was something funky with that port; I changed it and the node started working again. But something odd is also happening: only on the principal host, the one the cluster was created on, can I see prometheus12; on the other hosts I cannot see prometheus12.

Code:
root@prometheus4:~# cat  /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: prometheus
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.3.150
  }
  node {
    name: prometheus11
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.3.216
  }
  node {
    name: prometheus12
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.3.186
  }
  node {
    name: prometheus2
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.3.152
  }
  node {
    name: prometheus4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.3.187
  }
  node {
    name: prometheus6
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.3.99
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: troy
  config_version: 6
  interface {
    bindnetaddr: 192.168.3.150
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
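
If prometheus12 only shows up on the node the cluster was created on, it is worth checking whether every host is actually applying the same membership and config_version; a couple of illustrative checks, run on each node:

Code:
# does corosync on this node list all six members?
pvecm nodes

# the copy in /etc/pve should match the locally applied corosync config
diff /etc/pve/corosync.conf /etc/corosync/corosync.conf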

 
EDIT: I added a VM, and when I check on the host prometheus4 I get this:

Error hostname lookup 'prometheus12' failed - failed to get address info for: prometheus12: Name or service not known
 
So I rechecked whether I made a mistake with the subnet or gateway, but I can't seem to find the issue. Could it be a DNS issue?
From what I was reading, this usually happens when renaming a node, but this node was freshly installed into the cluster.
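
The "Name or service not known" message points at name resolution on prometheus4 rather than at corosync itself, so a quick illustrative check (the IP is the one listed for prometheus12 in the corosync.conf above; whether /etc/hosts or the DNS server is the right place to fix it depends on the setup):

Code:
# how does this node resolve the new member? it should return 192.168.3.186
getent hosts prometheus12

# if it resolves nothing, a host entry on each node is a common workaround
echo '192.168.3.186 prometheus12' >> /etc/hosts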
 