Nodes keep dropping out

wheelsca

New Member
Jan 18, 2019
I am new here and new to Proxmox, and I am having an issue: I have rebuilt a cluster twice now and keep having nodes drop out.

The first time I had a couple of NFS shares, a couple of Windows VMs, and a Vesta web server. The second time I ran the same VMs but no shares.

Hopefully someone has some ideas about what's going on.

Thanks.
 

Attachments

  • droppednodes.jpg (335.2 KB)
Hello,

We are seeing the same problem as wheelsca mentions above. We rebuilt our Proxmox cluster from 4.3 to 5.3, using the same network switches and server hardware, and we see random nodes disappear, which then results in loss of quorum. The way we can get around it without restarting most of the nodes is to restart three services: pvestatd, pvedaemon and pveproxy (the exact commands are shown after the log below). There is one node which we reloaded again yesterday and gave a different IP, hostname, etc., hoping this might help. It runs for about 10 minutes and then goes offline in the web GUI. The logs show the following:

Mar 19 09:15:01 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:02 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:03 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:04 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:05 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:06 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:07 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:08 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:09 pve5n pvesr[6291]: trying to acquire cfs lock 'file-replication_cfg' ...
Mar 19 09:15:10 pve5n pvesr[6291]: error with cfs lock 'file-replication_cfg': no quorum!
Mar 19 09:15:10 pve5n systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Mar 19 09:15:10 pve5n systemd[1]: Failed to start Proxmox VE replication runner.
Mar 19 09:15:10 pve5n systemd[1]: pvesr.service: Unit entered failed state.
Mar 19 09:15:10 pve5n systemd[1]: pvesr.service: Failed with result 'exit-code'.

We do end up rebooting this node, but then it goes offline again after a while, and so on.
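For reference, the workaround mentioned above is nothing more than restarting the three Proxmox services on the affected node (no custom scripts involved); something along these lines:

# on the node that has gone offline in the GUI
systemctl restart pvestatd
systemctl restart pvedaemon
systemctl restart pveproxy

That brings the node back into the web GUI for a while, but it does not address whatever is breaking quorum underneath.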

Any ideas what is causing this? When we were running Proxmox 4.3 on the same server hardware and the same switch we never had any of these issues. We have tried a lot of things to resolve it, but we are now at a bit of a loss...

We have performed the multicast test and everything looks fine - again, we were running Proxmox 4.3 on the same switch without any problems.
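For anyone wanting to reproduce it, the multicast test we mean is the usual omping test from the Proxmox documentation, run on all nodes at the same time. Roughly (the hostnames are just our node names; adjust for your cluster):

# short burst test (~10 seconds), checks basic multicast delivery
omping -c 10000 -i 0.001 -F -q pve1 pve2 pve3 pve4 pve5n pve6n pve7

# longer test (~10 minutes), catches IGMP snooping/querier timeouts
omping -c 600 -i 1 -q pve1 pve2 pve3 pve4 pve5n pve6n pve7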

Any ideas?

Thanks,
 
Hello,

Does anyone have any suggestions on what to try? We have still been unable to resolve the problem.

If you need any more information please let me know.

Thanks,
 
Hi, seeing the same issues with 5.3.

The replication ports are on:
11.0.0.1, 11.0.0.2, 11.0.0.3, 11.0.0.4, 11.0.0.15, 11.0.0.16 and 11.0.0.7, with subnet mask 255.255.255.224 (broadcast 11.0.0.31).
Running through an HP 1820 switch; multicast is working.

corosync.conf

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 194.1.1.81
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 194.1.1.82
  }
  node {
    name: pve3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 194.1.1.83
  }
  node {
    name: pve4
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 194.1.1.84
  }
  node {
    name: pve5n
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 194.1.1.95
  }
  node {
    name: pve6n
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 194.1.1.96
  }
  node {
    name: pve7
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 194.1.1.87
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: hostingcluster
  config_version: 26
  interface {
    bindnetaddr: 194.1.1.81
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
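In case it is useful, the commands we use to sanity-check the ring and quorum state on each node are just the stock tools, nothing custom:

corosync-cfgtool -s     # local ring status for ring 0
corosync-quorumtool -s  # quorum/membership as corosync sees it
pvecm status            # the Proxmox view of the same information

The pvecm status output below was taken with one node (pve5n) already missing.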

pvecm status - pve5n gone missing (194.1.1.95)

root@pve1:/etc/corosync# pvecm status
Quorum information
------------------
Date:             Mon Mar 25 10:46:20 2019
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          0x00000001
Ring ID:          1/295504
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 194.1.1.81 (local)
0x00000002          1 194.1.1.82
0x00000003          1 194.1.1.83
0x00000005          1 194.1.1.84
0x00000004          1 194.1.1.87
0x00000007          1 194.1.1.96
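On the missing node itself, the corosync and pve-cluster logs around the time it disappears can be checked with something like the following (standard systemd units; adjust the time window as needed):

# run on the node that went missing (pve5n)
journalctl -u corosync -u pve-cluster --since "1 hour ago"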

Screenshot: upload_2019-3-25_10-53-48.png

Any suggestions at all??
 
